Changeset 4acf56d for doc/theses/colby_parsons_MMAth/text
- Timestamp:
- Jul 13, 2023, 9:37:22 PM
- Branches:
- master
- Children:
- b7c53a9d
- Parents:
- 09e400e (diff), a3c7bac (diff)
Note: this is a merge changeset; the changes displayed below correspond to the merge itself.
Use the (diff) links above to see all the changes relative to each parent.
- Location:
- doc/theses/colby_parsons_MMAth/text
- Files:
- 2 edited
Legend:
- Unmodified lines are shown with a leading space
- Added lines are marked '+'
- Removed lines are marked '-'
--- doc/theses/colby_parsons_MMAth/text/CFA_intro.tex (r09e400e)
+++ doc/theses/colby_parsons_MMAth/text/CFA_intro.tex (r4acf56d)
@@ -60 +60 @@
 
 
-\section{\lstinline{with} Statement}
+\section{\lstinline{with} Statement}\label{s:with}
 The \CFA @with@ statement is for exposing fields of an aggregate type within a scope, allowing field names without qualification.
 This feature is also implemented in Pascal~\cite{Pascal}.
@@ -82 +82 @@
 
 
-\section{Operators}
+\section{Operators}\label{s:Operators}
 Operators can be overloaded in \CFA with operator routines.
 Operators in \CFA are named using an operator symbol and '@?@' to represent operands.
--- doc/theses/colby_parsons_MMAth/text/actors.tex (r09e400e)
+++ doc/theses/colby_parsons_MMAth/text/actors.tex (r4acf56d)
@@ -5 +5 @@
 % ======================================================================
 
-Actors are an indirect concurrent feature that abstracts threading away from a programmer, and instead provides \gls{actor}s and messages as building blocks for concurrency, where message passing means there is no shared data to protect, making actors amenable in a distributed environment.
-Actors are another message passing concurrency feature, similar to channels but with more abstraction, and are in the realm of \gls{impl_concurrency}, where programmers write concurrent code without dealing with explicit thread creation or interaction.
-The study of actors can be broken into two concepts, the \gls{actor_model}, which describes the model of computation and the \gls{actor_system}, which refers to the implementation of the model.
+Actors are an indirect concurrent feature that abstracts threading away from a programmer, and instead provides \gls{actor}s and messages as building blocks for concurrency.
+Hence, actors are in the realm of \gls{impl_concurrency}, where programmers write concurrent code without dealing with explicit thread creation or interaction.
+Actor message-passing is similar to channels, but with more abstraction, so there is no shared data to protect, making actors amenable in a distributed environment.
+The study of actors can be broken into two concepts, the \gls{actor_model}, which describes the model of computation, and the \gls{actor_system}, which refers to the implementation of the model.
 Before discussing \CFA's actor system in detail, it is important to first describe the actor model, and the classic approach to implementing an actor system.
 
@@ -20 +21 @@
 An actor is executed by an underlying \Newterm{executor} (kernel thread-pool) that fairly invokes each actor, where an actor invocation processes one or more messages from its mailbox.
 The default number of executor threads is often proportional to the number of computer cores to achieve good performance.
-An executor is often tunable with respect to the number of kernel threads and its scheduling algorithm, which optimize for specific actor applications and workloads \see{end of Section~\ref{s:ActorSystem}}.
+An executor is often tunable with respect to the number of kernel threads and its scheduling algorithm, which optimize for specific actor applications and workloads \see{Section~\ref{s:ActorSystem}}.
 
 \subsection{Classic Actor System}
@@ -31 +32 @@
 Some actor systems provide a shared mailbox where multiple actors receive from a common mailbox~\cite{Akka}, which is contrary to the no-sharing design of the basic actor-model (and requires additional locking).
 For non-\gls{fifo} service, some notion of fairness (eventual progress) must exist, otherwise messages have a high latency or starve, \ie never received.
-Finally, some actor systems provide multiple typed-mailboxes, which then lose the actor-\lstinline{become} mechanism \see{Section~\ref{s:SafetyProductivity}}).
-%While the definition of the actor model provides no restrictions on message ordering, actor systems tend to guarantee that messages sent from a given actor $i$ to actor $j$ will arrive at actor $j$ in the order they were sent.
+Finally, some actor systems provide multiple typed-mailboxes, which then lose the actor-\lstinline{become} mechanism \see{Section~\ref{s:SafetyProductivity}}.
+%While the definition of the actor model provides no restrictions on message ordering, actor systems tend to guarantee that messages sent from a given actor $i$ to actor $j$ arrive at actor $j$ in the order they were sent.
 Another way an actor system varies from the model is allowing access to shared global-state.
 When this occurs, it complicates the implementation as this breaks any implicit mutual-exclusion guarantees when only accessing local-state.
@@ -173 +174 @@
 	@actor | finished_msg;@ $\C{// send => terminate actor (deallocation deferred)}$
 	stop_actor_system(); $\C{// waits until actors finish}\CRT$
-} // deallocate int_msg, str_msg, actor
+} // deallocate actor, int_msg, str_msg
 \end{cfa}
 \caption{\CFA Actor Syntax}
@@ -181 +182 @@
 Figure~\ref{f:CFAActor} shows a complete \CFA actor example, which is discussed in detail.
 The actor type @my_actor@ is a @struct@ that inherits from the base @actor@ @struct@ via the @inline@ keyword.
-This inheritance style is the Plan-9 C-style inheritance discussed in Section~\ref{s:Inheritance}.
+This inheritance style is the Plan-9 C-style \see{Section~\ref{s:Inheritance}}.
 Similarly, the message types @str_msg@ and @int_msg@ are @struct@s that inherit from the base @message@ @struct@ via the @inline@ keyword.
 Only @str_msg@ needs a constructor to copy the C string;
 @int_msg@ is initialized using its \CFA auto-generated constructors.
 There are two matching @receive@ (behaviour) routines that process the corresponding typed messages.
-Both @receive@ routines use a @with@ clause so message fields are not qualified and return @Nodelete@ indicating the actor is not finished.
+Both @receive@ routines use a @with@ clause so message fields are not qualified \see{Section~\ref{s:with}} and return @Nodelete@ indicating the actor is not finished \see{Section~\ref{s:ActorBehaviours}}.
 Also, all messages are marked with @Nodelete@ as their default allocation state.
 The program main begins by creating two messages on the stack.
-Then the executor system is started by calling @start_actor_system@.
-Now an actor is created on the stack and four messages are sent to it using operator @?|?@.
-The last message is the builtin @finish_msg@, which returns @Finished@ to an executor thread, causing it to remove the actor from the actor system \see{Section~\ref{s:ActorBehaviours}}.
+Then the executor system is started by calling @start_actor_system@ \see{Section~\ref{s:ActorSystem}}.
+Now an actor is created on the stack and four messages are sent to it using operator @?|?@ \see{Section~\ref{s:Operators}}.
+The last message is the builtin @finish_msg@, which returns @Finished@ to an executor thread, causing it to remove the actor from the actor system \see{end of Section~\ref{s:ActorBehaviours}}.
 The call to @stop_actor_system@ blocks the program main until all actors are finished and removed from the actor system.
-The program main ends by deleting the actor and two messages from the stack.
+The program main ends by deleting the actor and the two messages from the stack.
 The output for the program is:
 \begin{cfa}
@@ -274 +275 @@
 \noindent@void start_actor_system()@
 configures the executor to implicitly use all preallocated kernel-threads (processors), \ie the processors created by the program main prior to starting the actor system.
+For example, the program main declares at the start:
+\begin{cfa}
+processor p[3];
+\end{cfa}
+which provides a total of 4 threads (3 + initial processor) for use by the executor.
 When the number of processors is greater than 1, each executor's message queue is sharded by a factor of 16 to reduce contention, \ie for 4 executor threads (processors), there is a total of 4 $\times$ 16 message queues evenly distributed across the executor threads.
 
@@ -279 +285 @@
 configures the number of executor threads to @num_thds@, with the same message queue sharding.
 
+\begin{sloppypar}
 \noindent@void start_actor_system( executor & this )@
 allows the programmer to explicitly create and configure an executor for use by the actor system.
 Executor configuration options are discussed in Section~\ref{s:executor}.
+\end{sloppypar}
 
 \noindent
@@ -288 +296 @@
 \subsection{Actor Send}\label{s:ActorSend}
 All message sends are done using the vertical-bar (bit-or) operator, @?|?@, similar to the syntax of the \CFA stream I/O.
-One way to provide this operator is through the \CFA type system:
+One way to provide a generic operator is through the \CFA type system:
 \begin{cfa}
 actor & ?|?( actor &, message & ) { // base actor and message types
@@ -366 +374 @@
 \end{figure}
 
-Figure~\ref{f:ConvenienceMessages} shows three builtin convenience messages and receive routines used to terminate actors, depending on how an actor is allocated: @Delete@, @Destroy@ or @Finished@.
+Figure~\ref{f:PoisonPillMessages} shows three builtin \Newterm{poison-pill} messages and receive routines used to terminate actors, depending on how an actor is allocated: @Delete@, @Destroy@ or @Finished@.
+Poison-pill messages are common across actor systems, including Akka and ProtoActor~\cite{Akka,ProtoActor}, to suggest or force actor termination.
 For example, in Figure~\ref{f:CFAActor}, the builtin @finished_msg@ message and receive are used to terminate the actor because the actor is allocated on the stack, so no deallocation actions are performed by the executor.
 Note, assignment is used to initialize these messages rather than constructors because the constructor changes the allocation to @Nodelete@ for error checking
@@ -372 +381 @@
 \begin{figure}
 \begin{cfa}
-message __base_msg_finished $@$= { .allocation_ : Finished };
+message __base_msg_finished $@$= { .allocation_ : Finished }; // use C initialization
 struct delete_msg_t { inline message; } delete_msg = __base_msg_finished;
 struct destroy_msg_t { inline message; } destroy_msg = __base_msg_finished;
@@ -381 +390 @@
 allocation receive( actor & this, finished_msg_t & msg ) { return Finished; }
 \end{cfa}
-\caption{Builtin Convenience Messages}
-\label{f:ConvenienceMessages}
+\caption{Builtin Poison-Pill Messages}
+\label{f:PoisonPillMessages}
 \end{figure}
 
@@ -390 +399 @@
 After the receive routine is done, the executor must clean up the actor and message according to their allocation status.
 If the allocation status is @Delete@ or @Destroy@, the appropriate destructor must be called by the executor.
-This poses a problem;
-the derived type of the actor or message is not available to the executor, but it needs to call the derived destructor.
+This requirement poses a problem;
+the derived type of the actor or message is not available to the executor, but it needs to call the derived destructor.
 This requires downcasting from the base type to the derived type, which requires a virtual system.
 To accomplish the downcast, I implemented a rudimentary destructor-only virtual system in \CFA.
@@ -418 +427 @@
 // explicit destructor calls
 ^d1{}; sout | nl;
-^ri{}; sout | nl;
+^ri{};  sout | nl;
 ^rb{}; sout | nl;
 } // ^i, ^b
@@ -457 +466 @@
 \end{figure}
 
-While this virtual destructor system was built for this work, it is general and can be used in any type in \CFA.
+While this virtual destructor system was built for this work, it is general and can be used with any type in \CFA.
 Actors and messages opt into this system by inheriting the @virtual_dtor@ type, which allows the executor to call the right destructor without knowing the derived actor or message type.
+Again, it should be possible to seamlessly transition this workaround into any updated version of the \CFA type-system.
 
 \section{\CFA Executor}\label{s:executor}
@@ -479 +489 @@
 Each executor thread iterates over its own message queues until it finds one with messages.
 At this point, the executor thread atomically \gls{gulp}s the queue, meaning it moves the contents of the message queue to a local queue of the executor thread.
-An example of the queue gulping operation is shown in the right side of Figure~\ref{f:gulp}, where an executor thread gulps queue 0 and begins to process it locally.
-This step allows an executor thread to process the local queue without any atomics until the next gulp.
-Other executor threads can continue adding to the ends of executor thread's message queues.
+An example of the queue gulping operation is shown in the right side of Figure~\ref{f:gulp}, where an executor thread gulps queue 0 and begins to process it locally.
+This step allows the executor thread to process the local queue without any atomics until the next gulp.
+Other executor threads can continue adding to the ends of the executor thread's message queues.
 In detail, an executor thread performs a test-and-gulp, non-atomically checking if a queue is non-empty, before attempting to gulp it.
 If an executor misses a non-empty queue due to a race, it eventually finds the queue after cycling through its message queues.
 This approach minimizes costly lock acquisitions.
 
-Processing a local queue involves: removing a unit of work from the queue, dereferencing the actor pointed-to by the work-unit, running the actor's behaviour on the work-unit message, examining the returned allocation status from the @receive@ routine for the actor and internal status in the delivered message, and taking the appropriate actions.
+Processing a local queue involves: removing a unit of work from the queue, dereferencing the actor pointed-to by the work unit, running the actor's behaviour on the work-unit message, examining the returned allocation status from the @receive@ routine for the actor and internal status in the delivered message, and taking the appropriate actions.
 Since all messages to a given actor are in the same queue, this guarantees atomicity across behaviours of that actor since it can only execute on one thread at a time.
-As each actor is created or terminated by an executor thread, it increments/decrements a global counter.
+As each actor is created or terminated by an executor thread, it atomically increments/decrements a global counter.
 When an executor decrements the counter to zero, it sets a global boolean variable that is checked by each executor thread when it has no work.
 Once an executor thread sees the flag is set, it stops running.
@@ -496 +506 @@
 Unfortunately, the frequent allocation of envelopes for each send results in heavy contention on the memory allocator.
 This contention is reduced using a novel data structure, called a \Newterm{copy queue}.
-The copy queue is a thin layer over a dynamically sized array that is designed with the envelope use case in mind.
+The copy queue is a thin layer over a dynamically sized array that is designed with the envelope use-case in mind.
 A copy queue supports the typical queue operations of push/pop but in a different way from a typical array-based queue.
@@ -508 +518 @@
 Since the copy queue is an array, envelopes are allocated first on the stack and then copied into the copy queue to persist until they are no longer needed.
 For many workloads, the copy queues grow in size to facilitate the average number of messages in flight and there are no further dynamic allocations.
-One downside of this approach that more storage is allocated than needed, \ie each copy queue is only partially full.
+The downside of this approach is that more storage is allocated than needed, \ie each copy queue is only partially full.
 Comparatively, the individual envelope allocations of a list-based queue mean that the actor system always uses the minimum amount of heap space and cleans up eagerly.
-Additionally, bursty workloads can cause the copy queues to allocate large amounts of space to accommodate the peaks of the throughput, even if most of that storage is not needed for the rest of the workload's execution.
+Additionally, bursty workloads can cause the copy queues to allocate a large amount of space to accommodate the throughput peak, even if most of that storage is not needed for the rest of the workload's execution.
 
 To mitigate memory wastage, a reclamation scheme is introduced.
@@ -560 +570 @@
 
 The outline for lazy-stealing by a thief is: select a victim, scan its queues once, and return immediately if a queue is stolen.
-The thief then returns to normal operation and conducts a regular scan over its own queues looking for work, where stolen work is placed at the end of the scan.
+The thief then assumes its normal operation of scanning over its own queues looking for work, where stolen work is placed at the end of the scan.
 Hence, only one victim is affected and there is a reasonable delay between stealing events as the thief scans its ready queue looking for its own work before potentially stealing again.
 This lazy examination by the thief has a low perturbation cost for victims, while still finding work in a moderately loaded system.
@@ -568 +578 @@
 
 In more detail, the \CFA work-stealing algorithm begins by iterating over its message queues twice without finding any work before it tries to steal a queue from another worker.
 Stealing a queue is done wait-free (\ie no busy waiting) with a few atomic instructions that only create contention with other stealing workers, not the victim.
 The complexity in the implementation is that victim gulping does not take the mailbox queue;
 rather it atomically transfers the mailbox nodes to another queue, leaving the mailbox empty, as discussed in Section~\ref{s:executor}.
 Hence, the original mailbox is always available for new message deliveries.
 However, this transfer logically subdivides the mailbox into two separate queues, and during this period, the mailbox cannot be stolen;
-otherwise there are two threads simultaneously running messages on actors in the two parts of the mailbox queue.
+otherwise there are two threads simultaneously running messages on an actor in the two parts of the mailbox queue.
 To solve this problem, an atomic gulp also marks the mailbox queue as subdivided, making it ineligible for stealing.
 Hence, a thief checks if a queue is eligible and non-empty before attempting an atomic steal of a queue.
578 588 … … 673 683 There is a final case where the race occurs and is resolved with \emph{both} gulps occurring. 674 684 Here, the winner of the race finishes processing the queue and resets the @being_processed@ flag. 675 Then the loser unblocks and completes its gulp from the same mailbox and atomically sets the \snake{being_processed} flag.685 Then the loser unblocks from its preemption and completes its gulp from the same mailbox and atomically sets the \snake{being_processed} flag. 676 686 The loser is now processing messages from a temporarily shared mailbox, which is safe because the winner ignores this mailbox, if it attempts another gulp since @being_processed@ is set. 677 687 The victim never attempts to gulp from the stolen mailbox again because its next cycle sees the swapped mailbox from the thief (which may or may not be empty at this point). … … 683 693 It is straightforward to count the number of missed gulps due to the @being_processed@ flag and this counter is added to all benchmarks presented in Section~\ref{s:actor_perf}. 684 694 The results show the median count of missed gulps for each experiment is \emph{zero}, except for the repeat benchmark. 685 The repeat benchmark is an example the pathological case described earlier where there is too little work and too many workers.695 The repeat benchmark is an example of the pathological case described earlier where there is too little work and too many workers. 686 696 In the repeat benchmark, one actor has the majority of the workload, and no other actor has a consistent workload, which results in rampant stealing. 687 697 None of the work-stealing actor-systems examined in this work perform well on the repeat benchmark. 
@@ -731 +741 @@
 DCAS( x, y, x, y, y, x );
 \end{cfa}
-A restrictive form of \gls{dcas} can be simulated using \gls{ll}/\gls{sc}~\cite{Brown13} or more expensive transactional memory the same progress property problems as LL/SC.
+A restrictive form of \gls{dcas} can be simulated using \gls{ll}/\gls{sc}~\cite{Brown13} or more expensive transactional memory with the same progress property problems as LL/SC.
 (There is waning interest in transactional memory and it seems to be fading away.)
 
@@ -738 +748 @@
 In this case, there is a race between loading the register and performing the swap (discussed shortly).
 
-Either a true memory/memory swap instruction or a \gls{dcas} would provide the ability to atomically swap two memory locations, but unfortunately neither of these instructions are supported on the architectures used in this work, and would require simulation.
-Hence, a novel swap for this use case is constructed, called \gls{dcasw}.
+Either a true memory/memory swap instruction or a \gls{dcas} would provide the ability to atomically swap two memory locations, but unfortunately neither of these instructions is supported on the architectures used in this work.
+Hence, a novel atomic swap for this use case is simulated, called \gls{dcasw}.
 The \gls{dcasw} is effectively a \gls{dcas} special cased in two ways:
 \begin{enumerate}
@@ -804 +814 @@
 }
 // Step 4: Successfully swapped.
-// Thief's ptr is 0p so no one will touch it
+// Thief's ptr is 0p so no one touches it
 // Write back without CAS is safe
 mailboxes[my_idx] = vic_queue;
@@ -849 +859 @@
 \begin{itemize}
 \item
-Step 0 and 1 do not write and as such they cannot invalidate the invariant of any other thieves.
+Steps 0 and 1 do not write, and as such, they cannot invalidate the invariant of any other thieves.
 \item
 In step 2, a thief attempts to write @0p@ to one of their queue pointers.
@@ -857 +867 @@
 \item
 In step 3, the thief attempts to write @my_queue@ to the victim's queue pointer.
-If the current value of the victim's queue pointer is @0p@, then the CAS fails since @vic_queue@ cannot be equal to @0p@ because of the check in step 1.
+If the current value of the victim's queue pointer is @0p@, then the @CAS@ fails since @vic_queue@ cannot be equal to @0p@ because of the check in step 1.
 Therefore, when the @CAS@ succeeds, the value of the victim's queue pointer must not be @0p@.
 As such, the write never overwrites a value of @0p@, hence the invariant is held in the @CAS@ of step 3.
@@ -893 +903 @@
 A graph of the $M$ thieves swapping with one victim discussed in this theorem is presented in Figure~\ref{f:M_one_swap}.
 \\
-First it is important to state that a thief will not attempt to steal from themselves.
+First, it is important to state that a thief does not attempt to steal from itself.
 As such, the victim here is not also a thief.
-Stepping through the code in \ref{f:dcaswImpl}, for all thieves steps 0-1 succeed since the victim is not stealing and will have no queue pointers set to be @0p@.
-Similarly for all thieves step 2 will succeed since no one is stealing from any of the thieves.
-In step 3 the first thief to @CAS@ will win the race and successfully swap the queue pointer.
+Stepping through the code in \ref{f:dcaswImpl}, for all thieves, steps 0-1 succeed since the victim is not stealing and has no queue pointers set to @0p@.
+Similarly, for all thieves, step 2 succeeds since no one is stealing from any of the thieves.
+In step 3, the first thief to @CAS@ wins the race and successfully swaps the queue pointer.
 Since it is the first one to @CAS@ and @CAS@ is atomic, there is no way for the @CAS@ to fail since no other thief could have written to the victim's queue pointer and the victim did not write to the pointer since they aren't stealing.
 Hence at least one swap is guaranteed to succeed in this case.
@@ -929 +939 @@
 Hence all thieves must successfully complete step 2 and fail at step 3.
 However, since the first worker, thief $0$, is solely a victim and not a thief, it does not change the state of any of its queue pointers.
-Hence, in this case thief $1$ will always succeed in step 3 if all thieves succeed in step 2.
+Hence, in this case thief $1$ always succeeds in step 3 if all thieves succeed in step 2.
 Thus, by contradiction with the earlier assumption that no swaps occur, at least one swap must succeed.
@@ -975 +985 @@
 Now consider the case where all thieves successfully complete steps 0-1, and then they all complete step 2.
 At this point all thieves are attempting to swap with a queue pointer whose value has changed to @0p@.
-If all thieves attempt the @CAS@ before any write backs, then they will all fail.
+If all thieves attempt the @CAS@ before any write backs, then they all fail.
 Thus, by contrapositive, if the graph contains a cycle then there exists a situation where no swaps occur.
 Hence, at least one swap is guaranteed to succeed if and only if the graph does not contain a cycle.
@@ -992 +1002 @@
 The longest-victim heuristic maintains a timestamp per executor thread that is updated every time a worker attempts to steal work.
 The timestamps are generated using @rdtsc@~\cite{IntelManual} and are stored in a shared array, with one index per worker.
-Thieves then attempt to steal from the worker with the oldest timestamp.
-The intuition behind this heuristic is that the slowest worker will receive help via work stealing until it becomes a thief, which indicates that it has caught up to the pace of the rest of the workers.
+Thieves then attempt to steal from the worker with the oldest timestamp, which is found by performing a linear search across the array of timestamps.
+The intuition behind this heuristic is that the slowest worker receives help via work stealing until it becomes a thief, which indicates that it has caught up to the pace of the rest of the workers.
 This heuristic should ideally result in lowered latency for message sends to victim workers that are overloaded with work.
-However, a side-effect of this heuristic is that if two thieves look to steal at the same time, they likely attempt to steal from the same victim.
-This approach consequently does increase the chance at contention among thieves;
-however, given that workers have multiple queues, often in the tens or hundreds of queues, it is rare for two thieves to attempt stealing from the same queue.
+A negative side-effect of this heuristic is that if multiple thieves steal at the same time, they likely steal from the same victim, which increases the chance of contention.
+However, given that workers have multiple queues, often in the tens or hundreds of queues, it is rare for two thieves to attempt stealing from the same queue.
 This approach may seem counter-intuitive, but in cases with not enough work to steal, the contention among thieves can result in less stealing, due to failed swaps.
 This can be beneficial when there is not enough work for all the stealing to be productive.
-This heuristic does not boast better performance than randomized victim selection, but it is comparable.
-However, it constitutes an interesting contribution as it shows that adding some complexity to the heuristic of the stealing fast path does not impact mainline performance, paving the way for more involved victim selection heuristics.
+This heuristic does not boast performance over randomized victim selection, but it is comparable \see{Section~\ref{s:steal_perf}}.
+However, it constitutes an interesting contribution as it shows that adding some complexity to the heuristic of the stealing fast-path does not affect mainline performance, paving the way for more involved victim selection heuristics.
 
 % Furthermore, in the case they attempt to steal the same queue, at least one of them is guaranteed to successfully steal the queue as shown in Theorem~\ref{t:one_vic}.
@@ -1012 +1021 @@
 
 \CFA's actor system comes with a suite of safety and productivity features.
-Most of these features are only present in \CFA's debug mode, and hence, have have zero-cost in nodebug mode.
+Most of these features are only present in \CFA's debug mode, and hence, have zero-cost in nodebug mode.
 The suite of features includes the following.
 \begin{itemize}
@@ -1024 +1033 @@
 
 \item Actors cannot be created before the executor starts:
-Since the executor distributes mailbox tickets, correctness implies it must be created before an actors so it can give out the tickets.
+Since the executor distributes mailbox tickets, correctness implies it must be created \emph{before} any actors so it can give out the tickets.
 
 \item When an executor is configured, $M >= N$.
@@ -1070 +1079 @@
 \end{description}
 
-These statistics enable a user of the \CFA's actor system to make informed choices about how to configure their executor or how to structure their actor program.
+These statistics enable a user to make informed choices about how to configure the executor or how to structure the actor program.
 For example, if there are a lot of messages being stolen relative to the number of messages sent, it indicates that the workload is heavily imbalanced across executor threads.
 Another example is if the average gulp size is very high, it indicates the executor needs more queue sharding, \ie increase $M$.
 
-Another productivity feature is a group of \Newterm{poison-pill} messages.
-Poison-pill messages are common across actor systems, including Akka and ProtoActor~\cite{Akka,ProtoActor} to inform an actor to terminate.
-In \CFA, due to the allocation of actors and lack of garbage collection, there needs to be a suite of poison-pills.
-The messages that \CFA provides are @DeleteMsg@, @DestroyMsg@, and @FinishedMsg@.
-These messages are supported on all actor types via inheritance.
-These were shown earlier in Figure~\ref{f:ConvenienceMessages}, and can be overloaded by users to have specific behaviour for derived actor types.
+Finally, the poison-pill messages and receive routines, shown earlier in Figure~\ref{f:PoisonPillMessages}, are a convenience for programmers and can be overloaded to have specific behaviour for derived actor types.
 
 \section{Performance}\label{s:actor_perf}
 
-The performance of \CFA's actor system is tested using a suite of microbenchmarks, and compared with other actor systems.
+The performance of the \CFA actor system is tested using a suite of microbenchmarks, and compared with other actor systems.
 Most of the benchmarks are the same as those presented in \cite{Buhr22}, with a few additions.
 This work compares with the following actor systems: \CFA 1.0, \uC 7.0.0, Akka Typed 2.7.0, CAF 0.18.6, and ProtoActor-Go v0.0.0-20220528090104-f567b547ea07.
@@ -1102 +1106 @@
 All benchmarks are run 5 times and the median is taken.
 Error bars showing the 95\% confidence intervals appear on each point in the graphs.
-If the confidence bars are small enough, they may be obscured by the point.
+If the confidence bars are small enough, they may be obscured by the data point.
 In this section, \uC is compared to \CFA frequently, as the actor system in \CFA is heavily based off of the \uC actor system.
As such, the performance differences that arise are largely due to the contributions of this work.
…
Message sending is the key component of actor communication.
As such, latency of a single message send is the fundamental unit of fast-path performance for an actor system.
- The static and dynamic microbenchmarks evaluate the average latency for a static actor/message send and a dynamic actor/message send.
+ The static and dynamic send-benchmarks evaluate the average latency for a static actor/message send and a dynamic actor/message send.
In the static-send benchmark, a message and actor are allocated once and then the message is sent to the same actor 100 million (100M) times.
The average latency per message send is then calculated by dividing the duration by the number of sends.
…
However, Akka and ProtoActor, slow down by two-orders of magnitude.
This difference is likely a result of Akka and ProtoActor's garbage collection, which results in performance delays for allocation-heavy workloads, whereas \uC and \CFA have explicit allocation/deallocation.
- Tuning off the garage collection might reduce garbage-collection cost, but this exercise is beyond the scope of this work.
+ Tuning the garage collection might reduce garbage-collection cost, but this exercise is beyond the scope of this work.

\subsection{Executor}\label{s:executorPerf}

- The microbenchmarks in this section are designed to stress the executor.
+ The benchmarks in this section are designed to stress the executor.
The executor is the scheduler of an actor system and is responsible for organizing the interaction of executor threads to service the needs of an actor workload.
Three benchmarks are run: executor, repeat, and high-memory watermark.
…
Figures~\ref{f:ExecutorIntel} and~\ref{f:ExecutorAMD} show the results of the AMD and Intel executor benchmark.
There are three groupings of results, and the difference between AMD and Intel is small.
- CAF is significantly slower than the other actor systems; followed by a tight grouping of uC++, ProroActor, and Akka; and finally \CFA with the lowest runtime relative to its peers.
+ CAF is significantly slower than the other actor systems; followed by a tight grouping of \uC, ProroActor, and Akka; and finally \CFA with the lowest runtime relative to its peers.
The difference in runtime between \uC and \CFA is largely due to the copy queue described in Section~\ref{s:copyQueue}.
The copy queue both reduces and consolidates allocations, heavily reducing contention on the memory allocator.
…
Note, while dynamic cast is relatively inexpensive, the remaining send cost in both \uC and \CFA is small;
hence, the relative cost for the RTTI in \uC is significant.
-
- \begin{figure}
- \centering
- \subfloat[AMD Repeat Benchmark]{
- \resizebox{0.5\textwidth}{!}{\input{figures/nasusRepeat.pgf}}
- \label{f:RepeatAMD}
- }
- \subfloat[Intel Repeat Benchmark]{
- \resizebox{0.5\textwidth}{!}{\input{figures/pykeRepeat.pgf}}
- \label{f:RepeatIntel}
- }
- \caption{The repeat benchmark comparing actor systems (lower is better).}
- \end{figure}

The repeat benchmark also evaluates the executor.
…
The results are spread out more, and there is a difference between AMD and Intel.
Again, CAF is significantly slower than the other actor systems.
- On the AMD there is a tight grouping of uC++, ProroActor, and Akka;
- on the Intel, uC++, ProroActor, and Akka are spread out.
+ To keep the graphs readable, the y-axis was cut at 100 seconds; as the core count increases from 8-32, CAF ranges around 200 seconds on AMD and between 300-1000 seconds on the Intel.
+ On the AMD there is a tight grouping of uC++, ProtoActor, and Akka;
+ on the Intel, uC++, ProtoActor, and Akka are spread out.
Finally, \CFA runs consistently on both of the AMD and Intel, and is faster than \uC on the AMD, but slightly slower on the Intel.
- This benchmark is a pathological case for work stealing actor systems, as the majority of work is being performed by the single actor conducting the scatter/gather.
- The impact of work stealing on this benchmark are discussed further in Section~\ref{s:steal_perf}.
- Here, gains from using the copy queue are much less apparent, due to the costs of stealing.
+ Here, gains from using the copy queue are much less apparent.
+
+ \begin{figure}
+ \centering
+ \subfloat[AMD Repeat Benchmark]{
+ \resizebox{0.5\textwidth}{!}{\input{figures/nasusRepeat.pgf}}
+ \label{f:RepeatAMD}
+ }
+ \subfloat[Intel Repeat Benchmark]{
+ \resizebox{0.5\textwidth}{!}{\input{figures/pykeRepeat.pgf}}
+ \label{f:RepeatIntel}
+ }
+ \caption{The repeat benchmark comparing actor systems (lower is better).}
+ \end{figure}
+
+ Table~\ref{t:ExecutorMemory} shows the high memory watermark of the actor systems when running the executor benchmark on 48 cores measured using the @time@ command.
+ \CFA's high watermark is slightly higher than the other non-garbage collected systems \uC and CAF.
+ This increase is from the over-allocation in the copy-queue data-structure with lazy deallocation.
+ Whereas, the per envelope allocations of \uC and CFA allocate exactly the amount of storage needed and eagerly deallocate.
+ The extra storage is the standard tradeoff of time versus space, where \CFA shows better performance.
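As context for the copy-queue tradeoff discussed in the hunk above, the idea can be sketched in plain C. This is a hypothetical illustration, not the actual \CFA implementation; the names @copy_queue@, @cq_push@, and @cq_gulp@ are invented here.

```c
#include <assert.h>
#include <stdlib.h>

// Hypothetical sketch of a copy queue (not the CFA code): envelopes are
// copied into one contiguous, growable buffer instead of being allocated
// individually, and the buffer is kept across gulps (lazy deallocation),
// so later sends reuse it.  This trades a higher memory watermark for
// fewer allocator calls -- the time-versus-space tradeoff described above.
typedef struct { int payload; } envelope;

typedef struct {
	envelope * buf;    // contiguous storage, grown but never shrunk
	size_t len, cap;   // current count and capacity
} copy_queue;

void cq_init( copy_queue * q ) { q->buf = NULL; q->len = 0; q->cap = 0; }

void cq_push( copy_queue * q, envelope e ) {
	if ( q->len == q->cap ) {  // grow geometrically, keeping storage live
		q->cap = q->cap ? 2 * q->cap : 16;
		q->buf = realloc( q->buf, q->cap * sizeof( envelope ) );
	}
	q->buf[q->len++] = e;      // copy the envelope into the buffer
}

// A gulp hands a worker the whole batch at once by swapping buffers, so
// the worker drains messages without further contention on the shared queue.
void cq_gulp( copy_queue * shared, copy_queue * local ) {
	copy_queue tmp = *shared;
	*shared = *local;
	*local = tmp;
	shared->len = 0;           // shared side reuses the swapped-in buffer
}
```

Eagerly freeing the buffer after each gulp would lower the high watermark at the cost of repeated allocation, which is the tradeoff the memory-watermark discussion above is making.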

\begin{table}
…
\end{table}

- Table~\ref{t:ExecutorMemory} shows the high memory watermark of the actor systems when running the executor benchmark on 48 cores measured using the @time@ command..
- \CFA's high watermark is slightly higher than the other non-garbage collected systems \uC and CAF.
- This increase is from the over-allocation in the copy-queue data-structure with lazy deallocation.
- Whereas, the per envelope allocations of \uC and CFA allocate exactly the amount of storage needed and eagerly deallocate.
- The extra storage is the standard tradeoff of time versus space, where \CFA shows better performance.
-
\subsection{Matrix Multiply}

…
X_{i,j} \cdot Y_{j,k} = \left( \sum_{c=1}^{j} X_{row,c}Y_{c,column} \right)_{i,k}
\end{displaymath}
- The majority of the computation in this benchmark involves computing the final matrix, so this benchmark stresses the actor systems' ability to have actors run work, rather than stressing the message sending system.
+ The majority of the computation in this benchmark involves computing the final matrix, so this benchmark stresses the actor systems' ability to have actors run work, rather than stressing the message sending system, and might trigger some work stealing if a worker finishes early.

The matrix-multiply benchmark uses input matrices $X$ and $Y$, which are both $3072$ by $3072$ in size.
An actor is made for each row of $X$ and sent a message indicating the row of $X$ and the column of $Y$ to calculate a row of the result matrix $Z$.
+ Because $Z$ is contiguous in memory, there can be small cache write-contention at the row boundaries.

Figures~\ref{f:MatrixAMD} and \ref{f:MatrixIntel} show the matrix multiple results.
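The per-actor work unit described in the hunk above (one actor computes one row of $Z$) can be sketched as a plain C routine. This is an illustration only: a real actor system would deliver the matrices and row index in a message, and the benchmark uses $3072 \times 3072$ matrices, not the small size assumed here.

```c
#include <assert.h>

// Sketch of the per-row work unit: the actor for `row` computes
// Z[row][*] = X[row][*] * Y.  N = 4 is illustrative only; the benchmark
// uses 3072x3072 matrices.
enum { N = 4 };

void multiply_row( int X[N][N], int Y[N][N], int Z[N][N], int row ) {
	for ( int k = 0; k < N; k += 1 ) {       // each column of the result row
		int sum = 0;
		for ( int c = 0; c < N; c += 1 )     // inner product of row and column
			sum += X[row][c] * Y[c][k];
		Z[row][k] = sum;
	}
}
```

Because each actor writes only its own row of $Z$, the rows are independent; the cache write-contention mentioned above can arise only where adjacent rows share a cache line.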
Given that the bottleneck of this benchmark is the computation of the result matrix, it follows that the results are tightly clustered across all actor systems.
\uC and \CFA have identical performance and in Figure~\ref{f:MatrixIntel} \uC pulls ahead of \CFA after 24 cores likely due to costs associated with work stealing while hyperthreading.
- It is hypothesized that CAF performs better in this benchmark compared to others due to its eager work stealing implementation, which will be discussed further in Section~\ref{s:steal_perf}.
+ It is hypothesized that CAF performs better in this benchmark compared to others due to its eager work stealing implementation, which is discussed further in Section~\ref{s:steal_perf}.

\begin{figure}
…

\CFA's work stealing mechanism uses the longest-victim heuristic, introduced in Section~\ref{s:victimSelect}.
- In this performance section, \CFA's approach is first tested in isolation on pathological unbalanced benchmarks, then with other actor systems on general benchmarks.
+ In this performance section, \CFA's approach is first tested in isolation on a pathological unbalanced benchmark, then with other actor systems on general benchmarks.

Two pathological unbalanced cases are created, and compared using vanilla and randomized work stealing in \CFA.
- These benchmarks adversarially takes advantage of the round-robin assignment of actors to workers by loading the receive actors on even cores and the send actors on the odd cores.
+ These benchmarks adversarially takes advantage of the round-robin assignment of actors to workers by loading actors only on specific cores (there is one worker per core).
The workload on the loaded cores is the same as the executor benchmark described in \ref{s:executorPerf}, but with fewer rounds.

The balance-one benchmark loads all the work on a single core, whereas the balance-multi loads all the work on half the cores (every other core).
- Given this layout, one expects the ideal speedup of work stealing in the balance-one case to be $N / N - 1$ where $N$ is the number of threads.
- In the balance-multi case the ideal speedup is 0.5.
- Note that in the balance-one benchmark the workload is fixed so decreasing runtime is expected.
- In the balance-multi experiment, the workload increases with the number of cores so an increasing or constant runtime is expected.
+ Given this layout, the ideal speedup of work stealing in the balance-one case should be $N / N - 1$ where $N$ is the number of threads;
+ in the balance-multi case, the ideal speedup is 0.5.
+ Note, in the balance-one benchmark, the workload is fixed so decreasing runtime is expected;
+ in the balance-multi experiment, the workload increases with the number of cores so an increasing or constant runtime is expected.

\begin{figure}
…
\end{figure}

- On both balance microbenchmarks slightly less than ideal speedup compared to the non stealing variation is achieved by both the random and longest victim stealing heuristics.
- On the balance-multi benchmark \ref{f:BalanceMultiAMD},\ref{f:BalanceMultiIntel} the random heuristic outperforms the longest victim.
- This is likely a result of the longest victim heuristic having a higher stealing cost as it needs to maintain timestamps and look at all timestamps before stealing.
- Additionally, a performance cost can be observed when hyperthreading kicks in in Figure~\ref{f:BalanceMultiIntel}.
-
- In the balance-one benchmark on AMD \ref{f:BalanceOneAMD}, the performance bottoms out at 32 cores onwards likely due to the amount of work becoming less than the cost to steal it and move it across cores and cache.
- On Intel \ref{f:BalanceOneIntel}, above 32 cores the performance gets worse for all variants due to hyperthreading.
- Note that the non stealing variation of balance-one will slow down marginally as the cores increase due to having to create more dummy actors on the inactive cores during startup.
+ % On both balance benchmarks, slightly less than ideal speedup compared to the non-stealing variation is achieved by both the random and longest victim stealing heuristics.
+
+ For the balance-one benchmark on AMD in Figure~\ref{f:BalanceOneAMD}, the performance bottoms out at 32 cores onwards likely due to the amount of work becoming less than the cost to steal it and move it across cores and cache.
+ On Intel in Figure~\ref{f:BalanceOneIntel}, above 32 cores the performance gets worse for all variants due to hyperthreading.
+ Here, the longest-victim and random heuristic are the same.
+ Note, the non-stealing variation of balance-one slows down slightly (no decrease in graph) as the cores increase, since a few ``dummy'' actors need to be made for each of the extra cores beyond the first to adversarially layout all loaded actors on the first core.
+
+ For the balance-multi benchmark in Figures~\ref{f:BalanceMultiAMD} and~\ref{f:BalanceMultiIntel}, the random heuristic outperforms the longest victim.
+ This result is because the longest victim heuristic has a higher stealing cost as it needs to maintain timestamps and look at all timestamps before stealing.
+ Additionally, a performance cost on the Intel is observed when hyperthreading kicks in after 24 cores in Figure~\ref{f:BalanceMultiIntel}.
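As context for the longest-victim discussion in the hunk above, the victim scan can be sketched as follows. This is a simplified, hypothetical version (not the \CFA code): each worker stamps when it last began processing, and the thief scans all stamps and picks the worker with the oldest one; the real executor must also handle concurrent timestamp updates and failed steals.

```c
#include <assert.h>

// Simplified sketch of longest-victim selection: pick the worker whose
// timestamp is oldest, i.e. the one that has been processing the longest,
// on the assumption it is the most overloaded.  The scan over all workers
// is the extra cost, relative to random victim selection, noted above.
int longest_victim( const long stamp[], int nworkers, int thief ) {
	int victim = -1;
	long oldest = 0;
	for ( int w = 0; w < nworkers; w += 1 ) {
		if ( w == thief ) continue;            // never pick yourself
		if ( victim == -1 || stamp[w] < oldest ) {
			victim = w;
			oldest = stamp[w];
		}
	}
	return victim;                             // -1 only if nworkers <= 1
}
```

Random selection, by contrast, needs no timestamps and no scan, which matches the balance-multi result above where the cheaper random heuristic wins.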

\begin{figure}
…
}
\caption{Executor benchmark comparing \CFA stealing heuristics (lower is better).}
- \end{figure}
+ \label{f:ExecutorBenchmark}
+ \end{figure}

- When comparing the \CFA stealing heuristics in Figure~\ref{f:cfaExecutorAMD} it can be seen that the random heuristic falls slightly behind the other two, but in Figure~\ref{f:cfaExecutorIntel} the runtime of all heuristics are nearly identical to each other.
+ Figures~\ref{f:cfaExecutorAMD} and~\ref{f:cfaExecutorIntel} show the effects of the stealing heuristics for the executor benchmark.
+ For the AMD, in Figure~\ref{f:cfaExecutorAMD}, the random heuristic falls slightly behind the other two, but for the Intel, in Figure~\ref{f:cfaExecutorIntel}, the runtime of all heuristics are nearly identical to each other, except after crossing the 24-core boundary.

\begin{figure}
…
}
\caption{The repeat benchmark comparing \CFA stealing heuristics (lower is better).}
- \end{figure}
-
- This result is shown in Figure~\ref{f:cfaRepeatAMD} and \ref{f:cfaRepeatIntel} where the no-stealing version of \CFA performs better than both stealing variations.
- As mentioned earlier, the repeat benchmark is a pathological case for work stealing systems since there is one actor with the majority of the work, and not enough other work to go around.
- If that actor or it's mail queue is stolen by the work stealing system, it incurs a huge cost to move the work as the single actor touches a lot of memory and will need to refill their local cache.
- This steal is likely to happen since there is little other work in the system between scatter/gather rounds.
- In particular on the Intel machine in Figure~\ref{f:cfaRepeatIntel}, the cost of stealing is higher, which can be seen in the vertical shift of Akka, CAF and \CFA results in Figure~\ref{f:RepeatIntel} (\uC and ProtoActor do not have work stealing).
- The shift for CAF is particularly large, which further supports the hypothesis that CAF's work stealing is particularly eager.
- In both the executor and the repeat benchmark CAF performs poorly.
- It is hypothesized that CAF has an aggressive work stealing algorithm, that eagerly attempts to steal.
- This results in poor performance in benchmarks with small messages containing little work per message.
- On the other hand, in \ref{f:MatrixAMD} CAF performs much better since each message has a large amount of work, and few messages are sent, so the eager work stealing allows for the clean up of loose ends to occur faster.
+ \label{f:RepeatBenchmark}
+ \end{figure}
+
+ Figures~\ref{f:cfaRepeatAMD} and~\ref{f:cfaRepeatIntel} show the effects of the stealing heuristics for the repeat benchmark.
+ This benchmark is a pathological case for work stealing actor systems, as the majority of work is being performed by the single actor conducting the scatter/gather.
+ The single actor (the client) of this experiment is long running and maintains a lot of state, as it needs to know the handles of all the servers.
+ When stealing the client or its respective queue (in \CFA's inverted model), moving the client incurs a high cost due to cache invalidation.
+ This worst-case steal is likely to happen since there is little other work in the system between scatter/gather rounds.
+ However, all heuristics are comparable in performance on the repeat benchmark.
+ This result is surprising especially for the No-Stealing variant, which one would expect to have better performance than the stealing variants.
+ This is not the case, since the stealing happens lazily and fails fast, the queue containing the long-running client actor is rarely stolen.
+
+ Work stealing performance can be further analyzed by reexamining the executor and repeat benchmarks in Figures~\ref{f:ExecutorBenchmark} and \ref{f:RepeatBenchmark}, respectively.
+ In both benchmarks, CAF performs poorly.
+ It is hypothesized that CAF has an aggressive work stealing algorithm that eagerly attempts to steal.
+ This results in the poor performance with small messages containing little work per message in both of these benchmarks.
+ In comparison with the other systems, \uC does well on both benchmarks since it does not have work stealing.
+
+ Finally, Figures~\ref{f:cfaMatrixAMD} and~\ref{f:cfaMatrixIntel} show the effects of the stealing heuristics for the matrix-multiply benchmark.
+ Here, there is negligible performance difference across stealing heuristics, likely due to the long running workload of each message.
+
+ Stealing can still improve performance marginally in the matrix-multiply benchmark.
+ In \ref{f:MatrixAMD} CAF performs better; few messages are sent, so the eager work stealing allows for the clean up of loose ends to occur faster.
This hypothesis stems from experimentation with \CFA.
CAF uses a randomized work stealing heuristic.
Tuning the \CFA actor system to steal work much more eagerly with randomized victim selection heuristics provided similar results to what CAF achieved in the matrix benchmark.
- This experimental tuning performed much worse on all other microbenchmarks that we present, since they all perform a small amount of work per message.
-
- In comparison with the other systems \uC does well on the repeat benchmark since it does not have work stealing.
- The client of this experiment is long running and maintains a lot of state, as it needs to know the handles of all the servers.
- When stealing the client or its respective queue (in \CFA's inverted model), moving the client incurs a high cost due to cache invalidation.
- As such stealing the client can result in a hit in performance.
-
- In Figures~\ref{f:cfaMatrixAMD} and \ref{f:cfaMatrixIntel} there is little negligible performance difference across \CFA stealing heuristics.
+ This experimental tuning performed much worse on all other microbenchmarks that we present, since they all perform a small amount of work per message, which may partially explain CAF's poor performance on other benchmarks.

\begin{figure}