Changeset 14e1053 for doc/theses/colby_parsons_MMAth/text
- Timestamp:
- Jun 27, 2023, 4:45:40 PM
- Branches:
- master
- Children:
- a1f0cb6
- Parents:
- 917e1fd
- Location:
- doc/theses/colby_parsons_MMAth/text
- Files:
- 1 added
- 2 edited
doc/theses/colby_parsons_MMAth/text/actors.tex
r917e1fd → r14e1053

 % ======================================================================

-% C_TODO: add citations throughout chapter
 Actors are an indirect concurrent feature that abstracts threading away from the programmer, instead providing \gls{actor}s and messages as building blocks for concurrency; message passing means there is no shared data to protect, making actors amenable to a distributed environment.
 Actors are another message-passing concurrency feature, similar to channels but with more abstraction, and are in the realm of \gls{impl_concurrency}, where programmers write concurrent code without dealing with explicit thread creation or interaction.
…
 Push operations are amortized $O(1)$ since pushes may cause doubling reallocations of the underlying dynamic-sized array (like \CC @vector@).

-% C_TODO: maybe make copy_queue diagram
-
 Since the copy queue is an array, envelopes are allocated first on the stack and then copied into the copy queue to persist until they are no longer needed.
 For many workloads, the copy queues grow to fit the average number of messages in flight, after which no further dynamic allocation occurs.
…
 In another example, if the average gulp size is very high, it could indicate that the executor could use more queue sharding.

-% C_TODO cite poison pill messages and add languages
 Another productivity feature that is included is a group of poison-pill messages.
-Poison-pill messages are common across actor systems, including Akka and ProtoActor \cite{}.
+Poison-pill messages are common across actor systems, and are used in the actor libraries Akka and ProtoActor~\cite{Akka,ProtoActor}.
 Poison-pill messages inform an actor to terminate.
 In \CFA, due to the allocation of actors and lack of garbage collection, there needs to be a suite of poison-pills.
…
 & \multicolumn{1}{c|}{\CFA (100M)} & \multicolumn{1}{c|}{CAF (10M)} & \multicolumn{1}{c|}{Akka (100M)} & \multicolumn{1}{c|}{\uC (100M)} & \multicolumn{1}{c@{}}{ProtoActor (100M)} \\
 \hline
-AMD & \input{data/pykeSendStatic} \\
+AMD & \input{data/nasusSendStatic} \\
 \hline
-Intel & \input{data/nasusSendStatic}
+Intel & \input{data/pykeSendStatic}
 \end{tabular}
…
 & \multicolumn{1}{c|}{\CFA (20M)} & \multicolumn{1}{c|}{CAF (2M)} & \multicolumn{1}{c|}{Akka (2M)} & \multicolumn{1}{c|}{\uC (20M)} & \multicolumn{1}{c@{}}{ProtoActor (2M)} \\
 \hline
-AMD & \input{data/pykeSendDynamic} \\
+AMD & \input{data/nasusSendDynamic} \\
 \hline
-Intel & \input{data/nasusSendDynamic}
+Intel & \input{data/pykeSendDynamic}
 \end{tabular}
 \end{table}
…
 The results from the static/dynamic send benchmarks are shown in Figures~\ref{t:StaticActorMessagePerformance} and \ref{t:DynamicActorMessagePerformance} respectively.
 \CFA leads the charts in both benchmarks, largely due to the copy queue removing the majority of the envelope allocations.
+Additionally, the receive routine for every message sent in \CFA is statically known and is determined via a function-pointer cast, which incurs only a compile-time cost.
+All the other systems use their virtual system to find the correct behaviour at message send.
+This requires two virtual-dispatch operations, an additional runtime send cost that \CFA does not have.
+Note that Akka also statically checks message sends, but still uses its virtual system at runtime.
 In the static send benchmark, all systems except CAF have static send costs in the same ballpark, varying by only ~70ns.
 In the dynamic send benchmark, all systems experience slower message sends, as expected due to the extra allocations.
…
 Figure~\ref{t:ExecutorMemory} shows the high memory watermark of the actor systems when running the executor benchmark on 48 cores.
-\CFA has a high watermark relative to the other non-garbage …
+\CFA has a high watermark relative to the other non-garbage-collected systems, \uC and CAF.
 This is a result of the copy-queue data structure, as it over-allocates storage and does not clean up eagerly, whereas per-envelope allocation always allocates exactly the amount of storage needed.
+Despite having a higher watermark, \CFA memory usage remains comparable to the other non-garbage-collected systems.

 \subsection{Matrix Multiply}
doc/theses/colby_parsons_MMAth/text/waituntil.tex
r917e1fd → r14e1053

 This enables fully expressive \gls{synch_multiplex} predicates.

-There are many other languages that provide \gls{synch_multiplex}, including Rust's @select!@ over futures~\cite{rust:select}, OCaml's @select@ over channels~\cite{ocaml:channe}, and C++14's @when_any@ over futures~\cite{cpp:whenany}.
+There are many other languages that provide \gls{synch_multiplex}, including Rust's @select!@ over futures~\cite{rust:select}, OCaml's @select@ over channels~\cite{ocaml:channel}, and C++14's @when_any@ over futures~\cite{cpp:whenany}.
 Note that while C++14 and Rust provide \gls{synch_multiplex}, their implementations leave much to be desired, as both rely on busy-waiting polling to wait on multiple resources.
…
 All of the \gls{synch_multiplex} features mentioned so far are monomorphic, each supporting only one kind of resource: select(2) supports file descriptors, Go's select supports channel operations, \uC's select supports futures, and Ada's select supports monitor method calls.
 The waituntil statement in \CFA is polymorphic and provides \gls{synch_multiplex} over any objects that satisfy the trait in Figure~\ref{f:wu_trait}.
+No other language provides a synchronous multiplexing tool polymorphic over resources like \CFA's waituntil.
+All others tie themselves to some specific type of resource.

 \begin{figure}
…
 \subsection{Channel Benchmark}
-The channel microbenchmark compares \CFA's waituntil and Go's select, where the resource being waited on is a set of channels.
-
-%C_TODO explain benchmark
-
-%C_TODO show results
-
-%C_TODO discuss results
+The channel multiplexing microbenchmarks compare \CFA's waituntil and Go's select, where the resource being waited on is a set of channels.
+The basic structure of the microbenchmark splits the cores evenly between producer and consumer threads, \ie, with 8 cores there are 4 producer threads and 4 consumer threads.
+The number of clauses @C@ is also varied, with results shown for 2, 4, and 8 clauses.
+Each clause has a respective channel that it operates on.
+Each producer and consumer repeatedly waits to either produce to or consume from one of the @C@ clauses and respective channels.
+An example in \CFA syntax of the work loop in the consumer main with @C = 4@ clauses follows.
+
+\begin{cfa}
+for (;;)
+	waituntil( val << chans[0] ) {} or waituntil( val << chans[1] ) {}
+	or waituntil( val << chans[2] ) {} or waituntil( val << chans[3] ) {}
+\end{cfa}
+A successful consumption is counted as a channel operation, and the throughput of these operations is measured over 10 seconds.
+The first microbenchmark measures throughput with the producers and consumers synchronously waiting on the channels; the second has the threads asynchronously wait on the channels.
+The results are shown in Figures~\ref{f:select_contend_bench} and~\ref{f:select_spin_bench} respectively.
+
+\begin{figure}
+\centering
+\captionsetup[subfloat]{labelfont=footnotesize,textfont=footnotesize}
+\subfloat[AMD]{
+	\resizebox{0.5\textwidth}{!}{\input{figures/nasus_Contend_2.pgf}}
+}
+\subfloat[Intel]{
+	\resizebox{0.5\textwidth}{!}{\input{figures/pyke_Contend_2.pgf}}
+}
+\bigskip
+
+\subfloat[AMD]{
+	\resizebox{0.5\textwidth}{!}{\input{figures/nasus_Contend_4.pgf}}
+}
+\subfloat[Intel]{
+	\resizebox{0.5\textwidth}{!}{\input{figures/pyke_Contend_4.pgf}}
+}
+\bigskip
+
+\subfloat[AMD]{
+	\resizebox{0.5\textwidth}{!}{\input{figures/nasus_Contend_8.pgf}}
+}
+\subfloat[Intel]{
+	\resizebox{0.5\textwidth}{!}{\input{figures/pyke_Contend_8.pgf}}
+}
+\caption{The channel synchronous multiplexing benchmark comparing Go select and \CFA waituntil statement throughput (higher is better).}
+\label{f:select_contend_bench}
+\end{figure}
+
+\begin{figure}
+\centering
+\captionsetup[subfloat]{labelfont=footnotesize,textfont=footnotesize}
+\subfloat[AMD]{
+	\resizebox{0.5\textwidth}{!}{\input{figures/nasus_Spin_2.pgf}}
+}
+\subfloat[Intel]{
+	\resizebox{0.5\textwidth}{!}{\input{figures/pyke_Spin_2.pgf}}
+}
+\bigskip
+
+\subfloat[AMD]{
+	\resizebox{0.5\textwidth}{!}{\input{figures/nasus_Spin_4.pgf}}
+}
+\subfloat[Intel]{
+	\resizebox{0.5\textwidth}{!}{\input{figures/pyke_Spin_4.pgf}}
+}
+\bigskip
+
+\subfloat[AMD]{
+	\resizebox{0.5\textwidth}{!}{\input{figures/nasus_Spin_8.pgf}}
+}
+\subfloat[Intel]{
+	\resizebox{0.5\textwidth}{!}{\input{figures/pyke_Spin_8.pgf}}
+}
+\caption{The asynchronous channel multiplexing benchmark comparing Go select and \CFA waituntil statement throughput (higher is better).}
+\label{f:select_spin_bench}
+\end{figure}
+
+Both Figures~\ref{f:select_contend_bench} and~\ref{f:select_spin_bench} have similar results when comparing @select@ and @waituntil@.
+In the AMD benchmarks, the performance is very similar as the number of cores scales.
+The AMD machine has been observed to have higher cache-contention costs, which creates a bottleneck on the channel locks and results in similar scaling between \CFA and Go.
+At low core counts, Go has significantly better performance, which is likely due to an optimization in its scheduler.
+Go heavily optimizes thread handoffs on the local run-queue, which can result in very good performance for low numbers of threads that park/unpark each other~\cite{go:sched}.
+In the Intel benchmarks, \CFA performs better than Go as the number of cores and the number of clauses scale.
+This is likely due to Go's implementation choice of acquiring all channel locks when registering and unregistering channels on a @select@.
+Go must hold a lock for every channel, so it follows that performance worsens as the number of channels increases.
+Since \CFA consolidates races without holding all locks, it scales much better with both cores and clauses, as more work can occur in parallel.
+This scalability difference is more significant on the Intel machine than on the AMD machine, since the Intel machine has been observed to have lower cache-contention costs.
+
+The Go approach of holding all internal channel locks in the select has additional drawbacks.
+It results in pathological cases where Go's channel throughput can greatly suffer.
+Consider the case where there are two channels, @A@ and @B@.
+A producer thread and a consumer thread, @P1@ and @C1@, each select on both @A@ and @B@.
+Additionally, another producer and another consumer thread, @P2@ and @C2@, operate solely on @B@.
+Compared to \CFA, this setup results in significantly worse performance, since @P2@ and @C2@ cannot operate in parallel with @P1@ and @C1@ due to all locks being acquired.
+This case may not be as pathological as it seems.
+If the set of channels belonging to one select overlaps with the set of another select, the two selects lose the ability to operate in parallel.
+The implementation in \CFA only ever holds a single lock at a time, resulting in better locking granularity.
+Comparison of this pathological case is shown in Table~\ref{t:pathGo}.
+The AMD results highlight the worst-case scenario for Go, since contention is more costly on this machine than on the Intel machine.
+
+\begin{table}[t]
+\centering
+\setlength{\extrarowheight}{2pt}
+\setlength{\tabcolsep}{5pt}
+\caption{Throughput (channel operations per second) of \CFA and Go for a pathologically bad case for contention in Go's select implementation.}
+\label{t:pathGo}
+\begin{tabular}{r|r|r}
+	& \multicolumn{1}{c|}{\CFA} & \multicolumn{1}{c@{}}{Go} \\
+	\hline
+	AMD & \input{data/nasus_Order} \\
+	\hline
+	Intel & \input{data/pyke_Order}
+\end{tabular}
+\end{table}
+
+Another difference between Go and \CFA is the order of clause selection when multiple clauses are available.
+Go selects a clause pseudo-randomly, whereas \CFA chooses clauses in the order they are listed~\cite{go:select}.
+This \CFA design decision lets users set implicit priorities, which can result in more predictable behaviour, and even better performance in certain cases, such as the one shown in Table~\ref{t:pathGo}.
+Without priorities, the performance difference in Table~\ref{t:pathGo} would be less significant, since @P1@ and @C1@ would compete with random selection to operate on @B@ more often.
 \subsection{Future Benchmark}
 The future benchmark compares \CFA's waituntil with \uC's @_Select@, with both utilities waiting on futures.
-
-%C_TODO explain benchmark
-
-%C_TODO show results
-
-%C_TODO discuss results
+Both \CFA's @waituntil@ and \uC's @_Select@ have very similar semantics; however, @_Select@ can only wait on futures, whereas @waituntil@ is polymorphic.
+They both support @and@ and @or@ operators, but the underlying implementation of the operators differs between @waituntil@ and @_Select@.
+The @waituntil@ statement checks for statement completion using a predicate function, whereas the @_Select@ statement maintains a tree that represents the state of the internal predicate.
+
+\begin{figure}
+\centering
+\subfloat[AMD Future Synchronization Benchmark]{
+	\resizebox{0.5\textwidth}{!}{\input{figures/nasus_Future.pgf}}
+	\label{f:futureAMD}
+}
+\subfloat[Intel Future Synchronization Benchmark]{
+	\resizebox{0.5\textwidth}{!}{\input{figures/pyke_Future.pgf}}
+	\label{f:futureIntel}
+}
+\caption{\CFA waituntil and \uC \_Select statement throughput synchronizing on a set of futures with varying wait predicates (higher is better).}
+\label{f:futurePerf}
+\end{figure}
+
+This microbenchmark aims to measure the impact of various predicates on the performance of the @waituntil@ and @_Select@ statements.
+It does not attempt to directly compare the @waituntil@ and @_Select@ statements, since the performance of futures in \CFA and \uC differs by a significant margin, making them incomparable.
+Results of this benchmark are shown in Figure~\ref{f:futurePerf}.
+Each set of columns is marked with a name representing the predicate for that set of columns.
+The predicate name and corresponding waituntil statement are shown below:
+
+\begin{cfa}
+#ifdef OR
+waituntil( A ) { get( A ); }
+or waituntil( B ) { get( B ); }
+or waituntil( C ) { get( C ); }
+#endif
+#ifdef AND
+waituntil( A ) { get( A ); }
+and waituntil( B ) { get( B ); }
+and waituntil( C ) { get( C ); }
+#endif
+#ifdef ANDOR
+waituntil( A ) { get( A ); }
+and waituntil( B ) { get( B ); }
+or waituntil( C ) { get( C ); }
+#endif
+#ifdef ORAND
+(waituntil( A ) { get( A ); }
+or waituntil( B ) { get( B ); })	// parentheses raise the precedence of or
+and waituntil( C ) { get( C ); }
+#endif
+\end{cfa}
+
+In Figure~\ref{f:futurePerf}, the @OR@ column for \CFA is more performant than the other \CFA predicates, likely due to the special-casing of waituntil statements with only @or@ operators.
+For both \uC and \CFA, the @AND@ column is the least performant, which is expected since all three futures need to be fulfilled for each statement completion, unlike with the other operators.
+Interestingly, \CFA has lower variation across predicates on the AMD (excluding the special OR case), whereas \uC has lower variation on the Intel.