Changeset 14e1053 for doc/theses/colby_parsons_MMAth/text/waituntil.tex
- Timestamp: Jun 27, 2023, 4:45:40 PM
- Branches: master
- Children: a1f0cb6
- Parents: 917e1fd
- File: 1 edited
doc/theses/colby_parsons_MMAth/text/waituntil.tex
r917e1fd r14e1053 80 80 This enables fully expressive \gls{synch_multiplex} predicates. 81 81 82 There are many other languages that provide \gls{synch_multiplex}, including Rust's @select!@ over futures~\cite{rust:select}, OCaml's @select@ over channels~\cite{ocaml:channe }, and C++14's @when_any@ over futures~\cite{cpp:whenany}.82 There are many other languages that provide \gls{synch_multiplex}, including Rust's @select!@ over futures~\cite{rust:select}, OCaml's @select@ over channels~\cite{ocaml:channel}, and C++14's @when_any@ over futures~\cite{cpp:whenany}. 83 83 Note that while C++14 and Rust provide \gls{synch_multiplex}, their implemetations leave much to be desired as they both rely on busy-waiting polling to wait on multiple resources. 84 84 … … 99 99 All of the \gls{synch_multiplex} features mentioned so far are monomorphic, only supporting one resource to wait on, select(2) supports file descriptors, Go's select supports channel operations, \uC's select supports futures, and Ada's select supports monitor method calls. 100 100 The waituntil statement in \CFA is polymorphic and provides \gls{synch_multiplex} over any objects that satisfy the trait in Figure~\ref{f:wu_trait}. 101 No other language provides a synchronous multiplexing tool polymorphic over resources like \CFA's waituntil. 102 All others them tie themselves to some specific type of resource. 101 103 102 104 \begin{figure} … … 370 372 371 373 \subsection{Channel Benchmark} 372 The channel microbenchmark compares \CFA's waituntil and Go's select, where the resource being waited on is a set of channels. 373 374 %C_TODO explain benchmark 375 376 %C_TODO show results 377 378 %C_TODO discuss results 374 The channel multiplexing microbenchmarks compare \CFA's waituntil and Go's select, where the resource being waited on is a set of channels. 
The basic structure of the microbenchmark splits the cores evenly between producer and consumer threads, \ie, with 8 cores there are 4 producer threads and 4 consumer threads.
The number of clauses @C@ is also varied, with results shown for 2, 4, and 8 clauses.
Each clause operates on its own channel.
Each producer and consumer repeatedly waits to either produce or consume on one of the @C@ clauses and its respective channel.
An example of the work loop in the consumer main with @C = 4@ clauses follows, in \CFA syntax.

\begin{cfa}
for (;;)
	waituntil( val << chans[0] ) {} or waituntil( val << chans[1] ) {}
	or waituntil( val << chans[2] ) {} or waituntil( val << chans[3] ) {}
\end{cfa}
A successful consumption is counted as a channel operation, and the throughput of these operations is measured over 10 seconds.
The first microbenchmark has the producers and consumers waiting synchronously on the channels; the second has the threads waiting asynchronously.
The results are shown in Figures~\ref{f:select_contend_bench} and~\ref{f:select_spin_bench}, respectively.
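For completeness, the producer main's work loop mirrors the consumer loop above; the following is a sketch in \CFA syntax, assuming a mirrored @<<@ insert form (the channel names and operator direction are illustrative, not taken from the benchmark source):

\begin{cfa}
for (;;)
	// block until val can be inserted into one of the 4 channels
	waituntil( chans[0] << val ) {} or waituntil( chans[1] << val ) {}
	or waituntil( chans[2] << val ) {} or waituntil( chans[3] << val ) {}
\end{cfa}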
\begin{figure}
\centering
\captionsetup[subfloat]{labelfont=footnotesize,textfont=footnotesize}
\subfloat[AMD]{
	\resizebox{0.5\textwidth}{!}{\input{figures/nasus_Contend_2.pgf}}
}
\subfloat[Intel]{
	\resizebox{0.5\textwidth}{!}{\input{figures/pyke_Contend_2.pgf}}
}
\bigskip

\subfloat[AMD]{
	\resizebox{0.5\textwidth}{!}{\input{figures/nasus_Contend_4.pgf}}
}
\subfloat[Intel]{
	\resizebox{0.5\textwidth}{!}{\input{figures/pyke_Contend_4.pgf}}
}
\bigskip

\subfloat[AMD]{
	\resizebox{0.5\textwidth}{!}{\input{figures/nasus_Contend_8.pgf}}
}
\subfloat[Intel]{
	\resizebox{0.5\textwidth}{!}{\input{figures/pyke_Contend_8.pgf}}
}
\caption{The synchronous multiplexing channel benchmark comparing Go select and \CFA waituntil statement throughput (higher is better).}
\label{f:select_contend_bench}
\end{figure}

\begin{figure}
\centering
\captionsetup[subfloat]{labelfont=footnotesize,textfont=footnotesize}
\subfloat[AMD]{
	\resizebox{0.5\textwidth}{!}{\input{figures/nasus_Spin_2.pgf}}
}
\subfloat[Intel]{
	\resizebox{0.5\textwidth}{!}{\input{figures/pyke_Spin_2.pgf}}
}
\bigskip

\subfloat[AMD]{
	\resizebox{0.5\textwidth}{!}{\input{figures/nasus_Spin_4.pgf}}
}
\subfloat[Intel]{
	\resizebox{0.5\textwidth}{!}{\input{figures/pyke_Spin_4.pgf}}
}
\bigskip

\subfloat[AMD]{
	\resizebox{0.5\textwidth}{!}{\input{figures/nasus_Spin_8.pgf}}
}
\subfloat[Intel]{
	\resizebox{0.5\textwidth}{!}{\input{figures/pyke_Spin_8.pgf}}
}
\caption{The asynchronous multiplexing channel benchmark comparing Go select and \CFA waituntil statement throughput (higher is better).}
\label{f:select_spin_bench}
\end{figure}

Both Figures~\ref{f:select_contend_bench} and~\ref{f:select_spin_bench} show similar results when comparing @select@ and @waituntil@.
In the AMD benchmarks, the performance of the two is very similar as the number of cores scales.
The AMD machine has been observed to have higher cache-contention costs, which create a bottleneck on the channel locks and result in similar scaling between \CFA and Go.
At low core counts, Go has significantly better performance, likely due to an optimization in its scheduler: Go heavily optimizes thread handoffs on the local run-queue, which can yield very good performance when a small number of threads are parking/unparking each other~\cite{go:sched}.
In the Intel benchmarks, \CFA performs better than Go as the number of cores and the number of clauses scale.
This is likely due to Go's implementation choice of acquiring all channel locks when registering and unregistering channels on a @select@.
Go must hold a lock for every channel, so performance degrades as the number of channels increases.
Since \CFA consolidates races without holding all the locks, it scales much better with both cores and clauses, as more work can occur in parallel.
This scalability difference is more significant on the Intel machine than on the AMD machine, since the Intel machine has been observed to have lower cache-contention costs.

The Go approach of holding all internal channel locks in the select has additional drawbacks, creating pathological cases where Go's channel throughput can greatly suffer.
Consider the case where there are two channels, @A@ and @B@.
A producer thread @P1@ and a consumer thread @C1@ select over both @A@ and @B@.
Additionally, another producer and another consumer, @P2@ and @C2@, operate solely on @B@.
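This pathological setup can be sketched in \CFA syntax as follows (the declarations and values are hypothetical, for illustration only):

\begin{cfa}
channel( int ) A, B;			// hypothetical channel declarations
// P1 loop body: produce into either channel
waituntil( A << 1 ) {} or waituntil( B << 1 ) {}
// C1 loop body: consume from either channel
waituntil( val << A ) {} or waituntil( val << B ) {}
// P2 and C2 loop bodies: operate solely on B, no multiplexing
B << 2;		// P2
val << B;	// C2
\end{cfa}

In Go's implementation, the selects of @P1@ and @C1@ acquire the locks of both @A@ and @B@, so the plain operations of @P2@ and @C2@ on @B@ serialize behind them even though they never touch @A@.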
Compared to \CFA, this setup results in significantly worse performance since @P2@ and @C2@ cannot operate in parallel with @P1@ and @C1@ due to all the locks being acquired.
This case may not be as contrived as it seems: whenever the set of channels belonging to one select overlaps with the set of another, the two selects lose the ability to operate in parallel.
The implementation in \CFA only ever holds a single lock at a time, resulting in finer locking granularity.
A comparison of this pathological case is shown in Table~\ref{t:pathGo}.
The AMD results highlight the worst-case scenario for Go, since contention is more costly on this machine than on the Intel machine.

\begin{table}[t]
\centering
\setlength{\extrarowheight}{2pt}
\setlength{\tabcolsep}{5pt}

\caption{Throughput (channel operations per second) of \CFA and Go for a pathologically bad case for contention in Go's select implementation}
\label{t:pathGo}
\begin{tabular}{*{5}{r|}r}
	& \multicolumn{1}{c|}{\CFA} & \multicolumn{1}{c@{}}{Go} \\
	\hline
	AMD		& \input{data/nasus_Order} \\
	\hline
	Intel	& \input{data/pyke_Order}
\end{tabular}
\end{table}

Another difference between Go and \CFA is the order of clause selection when multiple clauses are available.
Go selects a clause ``randomly''~\cite{go:select}, whereas \CFA chooses among ready clauses in the order they are listed.
This \CFA design decision lets users set implicit priorities, which can result in more predictable behaviour, and even better performance in certain cases, such as the one shown in Table~\ref{}.
If \CFA did not have priorities, the performance difference in Table~\ref{} would be less significant, since @P1@ and @C1@ would compete to operate on @B@ more often under random selection.
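The implicit-priority behaviour can be sketched as follows (hypothetical channels @A@ and @B@): a consumer that should favour @A@ simply lists its clause first.

\begin{cfa}
// when both A and B are ready, the A clause is always chosen,
// giving A implicit priority over B
waituntil( val << A ) {} or waituntil( val << B ) {}
\end{cfa}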
\subsection{Future Benchmark}
The future benchmark compares \CFA's waituntil with \uC's @_Select@, with both utilities waiting on futures.
The two have very similar semantics; however, @_Select@ can only wait on futures, whereas @waituntil@ is polymorphic.
Both support the @and@ and @or@ operators, but the underlying implementations differ: the @waituntil@ statement checks for completion using a predicate function, whereas the @_Select@ statement maintains a tree representing the state of the internal predicate.

\begin{figure}
\centering
\subfloat[AMD Future Synchronization Benchmark]{
	\resizebox{0.5\textwidth}{!}{\input{figures/nasus_Future.pgf}}
	\label{f:futureAMD}
}
\subfloat[Intel Future Synchronization Benchmark]{
	\resizebox{0.5\textwidth}{!}{\input{figures/pyke_Future.pgf}}
	\label{f:futureIntel}
}
\caption{\CFA waituntil and \uC \_Select statement throughput synchronizing on a set of futures with varying wait predicates (higher is better).}
\label{f:futurePerf}
\end{figure}

This microbenchmark measures the impact of various predicates on the performance of the @waituntil@ and @_Select@ statements.
It does not attempt a direct comparison of @waituntil@ and @_Select@, since the performance of futures in \CFA and \uC differs by a significant margin, making the two incomparable.
Results of this benchmark are shown in Figure~\ref{f:futurePerf}.
Each set of columns is marked with a name representing the predicate for that set of columns.
The predicate names and the corresponding waituntil statements are shown below:

\begin{cfa}
#ifdef OR
waituntil( A ) { get( A ); }
or waituntil( B ) { get( B ); }
or waituntil( C ) { get( C ); }
#endif
#ifdef AND
waituntil( A ) { get( A ); }
and waituntil( B ) { get( B ); }
and waituntil( C ) { get( C ); }
#endif
#ifdef ANDOR
waituntil( A ) { get( A ); }
and waituntil( B ) { get( B ); }
or waituntil( C ) { get( C ); }
#endif
#ifdef ORAND
(waituntil( A ) { get( A ); }
or waituntil( B ) { get( B ); })	// parentheses raise the precedence of the or
and waituntil( C ) { get( C ); }
#endif
\end{cfa}

In Figure~\ref{f:futurePerf}, the @OR@ column for \CFA outperforms the other \CFA predicates, likely due to the special-casing of waituntil statements containing only @or@ operators.
For both \uC and \CFA, the @AND@ column is the least performant, which is expected since all three futures must be fulfilled for each statement completion, unlike the other predicates.
Interestingly, \CFA has lower variation across predicates on the AMD (excluding the special-cased @OR@), whereas \uC has lower variation on the Intel.