Changeset ed274fe for doc/theses/colby_parsons_MMAth/text
- Timestamp:
- Jul 12, 2023, 11:39:32 AM (21 months ago)
- Branches:
- master
- Children:
- 1d9dc9c
- Parents:
- 68db00e
- File:
-
- 1 edited
Legend:
- Unmodified
- Added
- Removed
-
TabularUnified doc/theses/colby_parsons_MMAth/text/actors.tex ¶
r68db00e red274fe 5 5 % ====================================================================== 6 6 7 Actors are an indirect concurrent feature that abstracts threading away from a programmer, and instead provides \gls{actor}s and messages as building blocks for concurrency, where message passing means there is no shared data to protect, making actors amenable in a distributed environment. 8 Actors are another message passing concurrency feature, similar to channels but with more abstraction, and are in the realm of \gls{impl_concurrency}, where programmers write concurrent code without dealing with explicit thread creation or interaction. 9 The study of actors can be broken into two concepts, the \gls{actor_model}, which describes the model of computation and the \gls{actor_system}, which refers to the implementation of the model. 7 Actors are an indirect concurrent feature that abstracts threading away from a programmer, and instead provides \gls{actor}s and messages as building blocks for concurrency. 8 Hence, actors are in the realm of \gls{impl_concurrency}, where programmers write concurrent code without dealing with explicit thread creation or interaction. 9 Actor message-passing is similar to channels, but with more abstraction, so there is no shared data to protect, making actors amenable in a distributed environment. 10 The study of actors can be broken into two concepts, the \gls{actor_model}, which describes the model of computation, and the \gls{actor_system}, which refers to the implementation of the model. 10 11 Before discussing \CFA's actor system in detail, it is important to first describe the actor model, and the classic approach to implementing an actor system. 11 12 … … 20 21 An actor is executed by an underlying \Newterm{executor} (kernel thread-pool) that fairly invokes each actor, where an actor invocation processes one or more messages from its mailbox. 21 22 The default number of executor threads is often proportional to the number of computer cores to achieve good performance. 22 An executor is often tunable with respect to the number of kernel threads and its scheduling algorithm, which optimize for specific actor applications and workloads \see{ end ofSection~\ref{s:ActorSystem}}.23 An executor is often tunable with respect to the number of kernel threads and its scheduling algorithm, which optimize for specific actor applications and workloads \see{Section~\ref{s:ActorSystem}}. 23 24 24 25 \subsection{Classic Actor System} … … 31 32 Some actor systems provide a shared mailbox where multiple actors receive from a common mailbox~\cite{Akka}, which is contrary to the no-sharing design of the basic actor-model (and requires additional locking). 32 33 For non-\gls{fifo} service, some notion of fairness (eventual progress) must exist, otherwise messages have a high latency or starve, \ie never received. 33 Finally, some actor systems provide multiple typed-mailboxes, which then lose the actor-\lstinline{become} mechanism \see{Section~\ref{s:SafetyProductivity}} ).34 %While the definition of the actor model provides no restrictions on message ordering, actor systems tend to guarantee that messages sent from a given actor $i$ to actor $j$ willarrive at actor $j$ in the order they were sent.34 Finally, some actor systems provide multiple typed-mailboxes, which then lose the actor-\lstinline{become} mechanism \see{Section~\ref{s:SafetyProductivity}}. 35 %While the definition of the actor model provides no restrictions on message ordering, actor systems tend to guarantee that messages sent from a given actor $i$ to actor $j$ arrive at actor $j$ in the order they were sent. 35 36 Another way an actor system varies from the model is allowing access to shared global-state. 36 37 When this occurs, it complicates the implementation as this breaks any implicit mutual-exclusion guarantees when only accessing local-state. … … 173 174 @actor | finished_msg;@ $\C{// send => terminate actor (deallocation deferred)}$ 174 175 stop_actor_system(); $\C{// waits until actors finish}\CRT$ 175 } // deallocate int_msg, str_msg, actor176 } // deallocate actor, int_msg, str_msg 176 177 \end{cfa} 177 178 \caption{\CFA Actor Syntax} … … 181 182 Figure~\ref{f:CFAActor} shows a complete \CFA actor example, which is discussed in detail. 182 183 The actor type @my_actor@ is a @struct@ that inherits from the base @actor@ @struct@ via the @inline@ keyword. 183 This inheritance style is the Plan-9 C-style inheritance discussed in Section~\ref{s:Inheritance}.184 This inheritance style is the Plan-9 C-style \see{Section~\ref{s:Inheritance}}. 184 185 Similarly, the message types @str_msg@ and @int_msg@ are @struct@s that inherits from the base @message@ @struct@ via the @inline@ keyword. 185 186 Only @str_msg@ needs a constructor to copy the C string; 186 187 @int_msg@ is initialized using its \CFA auto-generated constructors. 187 188 There are two matching @receive@ (behaviour) routines that process the corresponding typed messages. 188 Both @receive@ routines use a @with@ clause so message fields are not qualified and return @Nodelete@ indicating the actor is not finished.189 Both @receive@ routines use a @with@ clause so message fields are not qualified \see{Section~\ref{s:with}} and return @Nodelete@ indicating the actor is not finished \see{Section~\ref{s:ActorBehaviours}}. 189 190 Also, all messages are marked with @Nodelete@ as their default allocation state. 190 191 The program main begins by creating two messages on the stack. 191 Then the executor system is started by calling @start_actor_system@ .192 Now an actor is created on the stack and four messages are sent to it using operator @?|?@ .193 The last message is the builtin @finish_msg@, which returns @Finished@ to an executor thread, causing it to remove the actor from the actor system \see{ Section~\ref{s:ActorBehaviours}}.192 Then the executor system is started by calling @start_actor_system@ \see{Section~\ref{s:ActorSystem}}. 193 Now an actor is created on the stack and four messages are sent to it using operator @?|?@ \see{Section~\ref{s:Operators}}. 194 The last message is the builtin @finish_msg@, which returns @Finished@ to an executor thread, causing it to remove the actor from the actor system \see{end of Section~\ref{s:ActorBehaviours}}. 194 195 The call to @stop_actor_system@ blocks the program main until all actors are finished and removed from the actor system. 195 The program main ends by deleting the actor and t wo messages from the stack.196 The program main ends by deleting the actor and the two messages from the stack. 196 197 The output for the program is: 197 198 \begin{cfa} … … 274 275 \noindent@void start_actor_system()@ 275 276 configures the executor to implicitly use all preallocated kernel-threads (processors), \ie the processors created by the program main prior to starting the actor system. 277 For example, the program main declares at the start: 278 \begin{cfa} 279 processor p[3]; 280 \end{cfa} 281 which provides a total of 4 threads (3 + initial processor) for use by the executor. 276 282 When the number of processors is greater than 1, each executor's message queue is sharded by a factor of 16 to reduce contention, \ie for 4 executor threads (processors), there is a total of 4 $\times$ 16 message queues evenly distributed across the executor threads. 277 283 … … 279 285 configures the number of executor threads to @num_thds@, with the same message queue sharding. 280 286 287 \begin{sloppypar} 281 288 \noindent@void start_actor_system( executor & this )@ 282 289 allows the programmer to explicitly create and configure an executor for use by the actor system. 283 290 Executor configuration options are discussed in Section~\ref{s:executor}. 291 \end{sloppypar} 284 292 285 293 \noindent … … 288 296 \subsection{Actor Send}\label{s:ActorSend} 289 297 All message sends are done using the vertical-bar (bit-or) operator, @?|?@, similar to the syntax of the \CFA stream I/O. 290 One way to provide thisoperator is through the \CFA type system:298 One way to provide a generic operator is through the \CFA type system: 291 299 \begin{cfa} 292 300 actor & ?|?( actor &, message & ) { // base actor and message types … … 366 374 \end{figure} 367 375 368 Figure~\ref{f:ConvenienceMessages} shows three builtin convenience messages and receive routines used to terminate actors, depending on how an actor is allocated: @Delete@, @Destroy@ or @Finished@. 376 Figure~\ref{f:PoisonPillMessages} shows three builtin \Newterm{poison-pill} messages and receive routines used to terminate actors, depending on how an actor is allocated: @Delete@, @Destroy@ or @Finished@. 377 Poison-pill messages are common across actor systems, including Akka and ProtoActor~\cite{Akka,ProtoActor} to suggest or force actor termination. 369 378 For example, in Figure~\ref{f:CFAActor}, the builtin @finished_msg@ message and receive are used to terminate the actor because the actor is allocated on the stack, so no deallocation actions are performed by the executor. 370 379 Note, assignment is used to initialize these messages rather than constructors because the constructor changes the allocation to @Nodelete@ for error checking … … 372 381 \begin{figure} 373 382 \begin{cfa} 374 message __base_msg_finished $@$= { .allocation_ : Finished }; 383 message __base_msg_finished $@$= { .allocation_ : Finished }; // use C initialization 375 384 struct delete_msg_t { inline message; } delete_msg = __base_msg_finished; 376 385 struct destroy_msg_t { inline message; } destroy_msg = __base_msg_finished; … … 381 390 allocation receive( actor & this, finished_msg_t & msg ) { return Finished; } 382 391 \end{cfa} 383 \caption{Builtin ConvenienceMessages}384 \label{f: ConvenienceMessages}392 \caption{Builtin Poison-Pill Messages} 393 \label{f:PoisonPillMessages} 385 394 \end{figure} 386 395 … … 390 399 After the receive routine is done, the executor must clean up the actor and message according to their allocation status. 391 400 If the allocation status is @Delete@ or @Destroy@, the appropriate destructor must be called by the executor. 392 This poses a problem;393 the derived type of the actor or message is not available to the executor, but it needs to call the derived destructor. !401 This requirement poses a problem; 402 the derived type of the actor or message is not available to the executor, but it needs to call the derived destructor. 394 403 This requires downcasting from the base type to the derived type, which requires a virtual system. 395 404 To accomplish the dowcast, I implemented a rudimentary destructor-only virtual system in \CFA. … … 418 427 // explicit destructor calls 419 428 ^d1{}; sout | nl; 420 ^ri{}; sout | nl;429 ^ri{}; sout | nl; 421 430 ^rb{}; sout | nl; 422 431 } // ^i, ^b … … 457 466 \end{figure} 458 467 459 While this virtual destructor system was built for this work, it is general and can be used inany type in \CFA.468 While this virtual destructor system was built for this work, it is general and can be used with any type in \CFA. 460 469 Actors and messages opt into this system by inheriting the @virtual_dtor@ type, which allows the executor to call the right destructor without knowing the derived actor or message type. 470 Again, it should be possible to seamlessly transition this workaround into any updated version of the \CFA type-system. 461 471 462 472 \section{\CFA Executor}\label{s:executor} … … 479 489 Each executor thread iterates over its own message queues until it finds one with messages. 480 490 At this point, the executor thread atomically \gls{gulp}s the queue, meaning it moves the contents of message queue to a local queue of the executor thread. 481 An example of the queue gulping operation is shown in the right side of Figure \ref{f:gulp}, where a executor thread sgulps queue 0 and begins to process it locally.482 This step allows anexecutor thread to process the local queue without any atomics until the next gulp.483 Other executor threads can continue adding to the ends of executor thread's message queues.491 An example of the queue gulping operation is shown in the right side of Figure \ref{f:gulp}, where a executor thread gulps queue 0 and begins to process it locally. 492 This step allows the executor thread to process the local queue without any atomics until the next gulp. 493 Other executor threads can continue adding to the ends of the executor thread's message queues. 484 494 In detail, an executor thread performs a test-and-gulp, non-atomically checking if a queue is non-empty, before attempting to gulp it. 485 495 If an executor misses an non-empty queue due to a race, it eventually finds the queue after cycling through its message queues. 486 496 This approach minimizes costly lock acquisitions. 487 497 488 Processing a local queue involves: removing a unit of work from the queue, dereferencing the actor pointed-to by the work -unit, running the actor's behaviour on the work-unit message, examining the returned allocation status from the @receive@ routine for the actor and internal status in the delivered message, and taking the appropriate actions.498 Processing a local queue involves: removing a unit of work from the queue, dereferencing the actor pointed-to by the work unit, running the actor's behaviour on the work-unit message, examining the returned allocation status from the @receive@ routine for the actor and internal status in the delivered message, and taking the appropriate actions. 489 499 Since all messages to a given actor are in the same queue, this guarantees atomicity across behaviours of that actor since it can only execute on one thread at a time. 490 As each actor is created or terminated by an executor thread, it increments/decrements a global counter.500 As each actor is created or terminated by an executor thread, it atomically increments/decrements a global counter. 491 501 When an executor decrements the counter to zero, it sets a global boolean variable that is checked by each executor thread when it has no work. 492 502 Once a executor threads sees the flag is set it stops running. … … 496 506 Unfortunately, the frequent allocation of envelopes for each send results in heavy contention on the memory allocator. 497 507 This contention is reduced using a novel data structure, called a \Newterm{copy queue}. 498 The copy queue is a thin layer over a dynamically sized array that is designed with the envelope use 508 The copy queue is a thin layer over a dynamically sized array that is designed with the envelope use-case in mind. 499 509 A copy queue supports the typical queue operations of push/pop but in a different way from a typical array-based queue. 500 510 … … 508 518 Since the copy queue is an array, envelopes are allocated first on the stack and then copied into the copy queue to persist until they are no longer needed. 509 519 For many workload, the copy queues grow in size to facilitate the average number of messages in flight and there is no further dynamic allocations. 510 One downside of this approachthat more storage is allocated than needed, \ie each copy queue is only partially full.520 The downside of this approach is that more storage is allocated than needed, \ie each copy queue is only partially full. 511 521 Comparatively, the individual envelope allocations of a list-based queue mean that the actor system always uses the minimum amount of heap space and cleans up eagerly. 512 Additionally, bursty workloads can cause the copy queues to allocate a large amount s of space to accommodate the peaks of the throughput, even if most of that storage is not needed for the rest of the workload's execution.522 Additionally, bursty workloads can cause the copy queues to allocate a large amount of space to accommodate the throughput peak, even if most of that storage is not needed for the rest of the workload's execution. 513 523 514 524 To mitigate memory wastage, a reclamation scheme is introduced. … … 560 570 561 571 The outline for lazy-stealing by a thief is: select a victim, scan its queues once, and return immediately if a queue is stolen. 562 The thief then returns to normal operation and conducts a regular scanover its own queues looking for work, where stolen work is placed at the end of the scan.572 The thief then assumes it normal operation of scanning over its own queues looking for work, where stolen work is placed at the end of the scan. 563 573 Hence, only one victim is affected and there is a reasonable delay between stealing events as the thief scans its ready queue looking for its own work before potentially stealing again. 564 574 This lazy examination by the thief has a low perturbation cost for victims, while still finding work in a moderately loaded system. … … 568 578 569 579 In more detail, the \CFA work-stealing algorithm begins by iterating over its message queues twice without finding any work before it tries to steal a queue from another worker. 570 Stealing a queue is done wait-free (\ie no busy waiting) with a few atomic instructions that only create contention with other stealing workers ,not the victim.580 Stealing a queue is done wait-free (\ie no busy waiting) with a few atomic instructions that only create contention with other stealing workers not the victim. 571 581 The complexity in the implementation is that victim gulping does not take the mailbox queue; 572 582 rather it atomically transfers the mailbox nodes to another queue leaving the mailbox empty, as discussed in Section~\ref{s:executor}. 573 583 Hence, the original mailbox is always available for new message deliveries. 574 584 However, this transfer logically subdivides the mailbox into two separate queues, and during this period, the mailbox cannot be stolen; 575 otherwise there are two threads simultaneously running messages on a ctorsin the two parts of the mailbox queue.576 To solve this problem, an atomic gulp also marks the mailbox queue as subdivided ,making it ineligible for stealing.585 otherwise there are two threads simultaneously running messages on an actor in the two parts of the mailbox queue. 586 To solve this problem, an atomic gulp also marks the mailbox queue as subdivided making it ineligible for stealing. 577 587 Hence, a thief checks if a queue is eligible and non-empty before attempting an atomic steal of a queue. 578 588 … … 673 683 There is a final case where the race occurs and is resolved with \emph{both} gulps occurring. 674 684 Here, the winner of the race finishes processing the queue and resets the @being_processed@ flag. 675 Then the loser unblocks and completes its gulp from the same mailbox and atomically sets the \snake{being_processed} flag.685 Then the loser unblocks from its preemption and completes its gulp from the same mailbox and atomically sets the \snake{being_processed} flag. 676 686 The loser is now processing messages from a temporarily shared mailbox, which is safe because the winner ignores this mailbox, if it attempts another gulp since @being_processed@ is set. 677 687 The victim never attempts to gulp from the stolen mailbox again because its next cycle sees the swapped mailbox from the thief (which may or may not be empty at this point). … … 683 693 It is straightforward to count the number of missed gulps due to the @being_processed@ flag and this counter is added to all benchmarks presented in Section~\ref{s:actor_perf}. 684 694 The results show the median count of missed gulps for each experiment is \emph{zero}, except for the repeat benchmark. 685 The repeat benchmark is an example the pathological case described earlier where there is too little work and too many workers.695 The repeat benchmark is an example of the pathological case described earlier where there is too little work and too many workers. 686 696 In the repeat benchmark, one actor has the majority of the workload, and no other actor has a consistent workload, which results in rampant stealing. 687 697 None of the work-stealing actor-systems examined in this work perform well on the repeat benchmark. … … 731 741 DCAS( x, y, x, y, y, x ); 732 742 \end{cfa} 733 A restrictive form of \gls{dcas} can be simulated using \gls{ll}/\gls{sc}~\cite{Brown13} or more expensive transactional memory the same progress property problems as LL/SC.743 A restrictive form of \gls{dcas} can be simulated using \gls{ll}/\gls{sc}~\cite{Brown13} or more expensive transactional memory with the same progress property problems as LL/SC. 734 744 (There is waning interest in transactional memory and it seems to be fading away.) 735 745 … … 738 748 In this case, there is a race between loading the register and performing the swap (discussed shortly). 739 749 740 Either a true memory/memory swap instruction or a \gls{dcas} would provide the ability to atomically swap two memory locations, but unfortunately neither of these instructions are supported on the architectures used in this work , and would require simulation.741 Hence, a novel swap for this use case is constructed, called \gls{dcasw}.750 Either a true memory/memory swap instruction or a \gls{dcas} would provide the ability to atomically swap two memory locations, but unfortunately neither of these instructions are supported on the architectures used in this work. 751 Hence, a novel atomic swap for this use case is simulated, called \gls{dcasw}. 742 752 The \gls{dcasw} is effectively a \gls{dcas} special cased in two ways: 743 753 \begin{enumerate} … … 804 814 } 805 815 // Step 4: Successfully swapped. 806 // Thief's ptr is 0p so no one will touchit816 // Thief's ptr is 0p so no one touches it 807 817 // Write back without CAS is safe 808 818 mailboxes[my_idx] = vic_queue; … … 849 859 \begin{itemize} 850 860 \item 851 Step 0 and 1 do not write and as suchthey cannot invalidate the invariant of any other thieves.861 Step 0 and 1 do not write, and as such, they cannot invalidate the invariant of any other thieves. 852 862 \item 853 863 In step 2, a thief attempts to write @0p@ to one of their queue pointers. … … 857 867 \item 858 868 In step 3, the thief attempts to write @my_queue@ to the victim's queue pointer. 859 If the current value of the victim's queue pointer is @0p@, then the CASfails since @vic_queue@ cannot be equal to @0p@ because of the check in step 1.869 If the current value of the victim's queue pointer is @0p@, then the @CAS@ fails since @vic_queue@ cannot be equal to @0p@ because of the check in step 1. 860 870 Therefore, when the @CAS@ succeeds, the value of the victim's queue pointer must not be @0p@. 861 871 As such, the write never overwrites a value of @0p@, hence the invariant is held in the @CAS@ of step 3. … … 893 903 A graph of the $M$ thieves swapping with one victim discussed in this theorem is presented in Figure~\ref{f:M_one_swap}. 894 904 \\ 895 First it is important to state that a thief willnot attempt to steal from themselves.905 First it is important to state that a thief does not attempt to steal from themselves. 896 906 As such, the victim here is not also a thief. 897 Stepping through the code in \ref{f:dcaswImpl}, for all thieves steps 0-1 succeed since the victim is not stealing and will haveno queue pointers set to be @0p@.898 Similarly for all thieves step 2 willsucceed since no one is stealing from any of the thieves.899 In step 3 the first thief to @CAS@ will win the race and successfully swapthe queue pointer.907 Stepping through the code in \ref{f:dcaswImpl}, for all thieves, steps 0-1 succeed since the victim is not stealing and has no queue pointers set to be @0p@. 908 Similarly, for all thieves, step 2 succeed since no one is stealing from any of the thieves. 909 In step 3, the first thief to @CAS@ wins the race and successfully swaps the queue pointer. 900 910 Since it is the first one to @CAS@ and @CAS@ is atomic, there is no way for the @CAS@ to fail since no other thief could have written to the victim's queue pointer and the victim did not write to the pointer since they aren't stealing. 901 911 Hence at least one swap is guaranteed to succeed in this case. … … 929 939 Hence all thieves must successfully complete step 2 and fail at step 3. 930 940 However, since the first worker, thief $0$, is solely a victim and not a thief, it does not change the state of any of its queue pointers. 931 Hence, in this case thief $1$ will always succeedin step 3 if all thieves succeed in step 2.941 Hence, in this case thief $1$ always succeeds in step 3 if all thieves succeed in step 2. 932 942 Thus, by contradiction with the earlier assumption that no swaps occur, at least one swap must succeed. 933 943 … … 975 985 Now consider the case where all thieves successfully complete step 0-1, and then they all complete step 2. 976 986 At this point all thieves are attempting to swap with a queue pointer whose value has changed to @0p@. 977 If all thieves attempt the @CAS@ before any write backs, then they willall fail.987 If all thieves attempt the @CAS@ before any write backs, then they all fail. 978 988 Thus, by contrapositive, if the graph contains a cycle then there exists a situation where no swaps occur. 979 989 Hence, at least one swap is guaranteed to succeed if and only if the graph does not contain a cycle. … … 993 1003 The timestamps are generated using @rdtsc@~\cite{IntelManual} and are stored in a shared array, with one index per worker. 994 1004 Thieves then attempt to steal from the worker with the oldest timestamp. 995 The intuition behind this heuristic is that the slowest worker will receive help via work stealing until it becomes a thief, which indicates that it has caught up to the pace of the rest of the workers. 1005 \PAB{How does a thief find the oldest timestamp?} 1006 The intuition behind this heuristic is that the slowest worker receives help via work stealing until it becomes a thief, which indicates that it has caught up to the pace of the rest of the workers. 996 1007 This heuristic should ideally result in lowered latency for message sends to victim workers that are overloaded with work. 997 However, a side-effect of this heuristic is that if two thieves look to steal at the same time, they likely attempt to steal from the same victim. 998 This approach consequently does increase the chance at contention among thieves; 999 however, given that workers have multiple queues, often in the tens or hundreds of queues, it is rare for two thieves to attempt stealing from the same queue. 1008 A negative side-effect of this heuristic is that if multiple thieves steal at the same time, they likely steal from the same victim, which increases the chance of contention. 1009 However, given that workers have multiple queues, often in the tens or hundreds of queues, it is rare for two thieves to attempt stealing from the same queue. 1000 1010 This approach may seem counter-intuitive, but in cases with not enough work to steal, the contention among thieves can result in less stealing, due to failed swaps. 1001 1011 This can be beneficial when there is not enough work for all the stealing to be productive. 1002 This heuristic does not boast better performance than randomized victim selection, but it is comparable.1003 However, it constitutes an interesting contribution as it shows that adding some complexity to the heuristic of the stealing fast path does not impact mainline performance, paving the way for more involved victim selection heuristics.1012 This heuristic does not boast performance over randomized victim selection, but it is comparable \see{Section~\ref{s:steal_perf}}. 1013 However, it constitutes an interesting contribution as it shows that adding some complexity to the heuristic of the stealing fast-path does not affect mainline performance, paving the way for more involved victim selection heuristics. 1004 1014 1005 1015 % Furthermore, in the case they attempt to steal the same queue, at least one of them is guaranteed to successfully steal the queue as shown in Theorem~\ref{t:one_vic}. … … 1012 1022 1013 1023 \CFA's actor system comes with a suite of safety and productivity features. 1014 Most of these features are only present in \CFA's debug mode, and hence, have havezero-cost in nodebug mode.1024 Most of these features are only present in \CFA's debug mode, and hence, have zero-cost in nodebug mode. 1015 1025 The suit of features include the following. 1016 1026 \begin{itemize} … … 1024 1034 1025 1035 \item Actors cannot be created before the executor starts: 1026 Since the executor distributes mailbox tickets, correctness implies it must be created before anactors so it can give out the tickets.1036 Since the executor distributes mailbox tickets, correctness implies it must be created \emph{before} any actors so it can give out the tickets. 1027 1037 1028 1038 \item When an executor is configured, $M >= N$. … … 1070 1080 \end{description} 1071 1081 1072 These statistics enable a user of the \CFA's actor system to make informed choices about how to configure their executor or how to structure theiractor program.1082 These statistics enable a user to make informed choices about how to configure the executor or how to structure the actor program. 1073 1083 For example, if there are a lot of messages being stolen relative to the number of messages sent, it indicates that the workload is heavily imbalanced across executor threads. 1074 1084 Another example is if the average gulp size is very high, it indicates the executor needs more queue sharding, \ie increase $M$. 1075 1085 1076 Another productivity feature is a group of \Newterm{poison-pill} messages. 1077 Poison-pill messages are common across actor systems, including Akka and ProtoActor~\cite{Akka,ProtoActor} to inform an actor to terminate. 1078 In \CFA, due to the allocation of actors and lack of garbage collection, there needs to be a suite of poison-pills. 1079 The messages that \CFA provides are @DeleteMsg@, @DestroyMsg@, and @FinishedMsg@. 1080 These messages are supported on all actor types via inheritance. 1081 These were shown earlier in Figure~\ref{f:ConvenienceMessages}, and can be overloaded by users to have specific behaviour for derived actor types. 1086 Finally, the poison-pill messages and receive routines, shown earlier in Figure~\ref{f:PoisonPillMessages}, are a convenience for programmers and can be overloaded to have specific behaviour for derived actor types. 1082 1087 1083 1088 \section{Performance}\label{s:actor_perf} 1084 1089 1085 The performance of \CFA's actor system is tested using a suite of microbenchmarks, and compared with other actor systems.1090 The performance of the \CFA's actor system is tested using a suite of microbenchmarks, and compared with other actor systems. 1086 1091 Most of the benchmarks are the same as those presented in \cite{Buhr22}, with a few additions. 1087 1092 This work compares with the following actor systems: \CFA 1.0, \uC 7.0.0, Akka Typed 2.7.0, CAF 0.18.6, and ProtoActor-Go v0.0.0-20220528090104-f567b547ea07. … … 1102 1107 All benchmarks are run 5 times and the median is taken. 1103 1108 Error bars showing the 95\% confidence intervals appear on each point in the graphs. 1104 If the confidence bars are small enough, they may be obscured by the point.1109 If the confidence bars are small enough, they may be obscured by the data point. 1105 1110 In this section, \uC is compared to \CFA frequently, as the actor system in \CFA is heavily based off of the \uC's actor system. 1106 1111 As such, the performance differences that arise are largely due to the contributions of this work. … … 1111 1116 Message sending is the key component of actor communication. 1112 1117 As such, latency of a single message send is the fundamental unit of fast-path performance for an actor system. 1113 The static and dynamic microbenchmarks evaluate the average latency for a static actor/message send and a dynamic actor/message send.1118 The static and dynamic send-benchmarks evaluate the average latency for a static actor/message send and a dynamic actor/message send. 1114 1119 In the static-send benchmark, a message and actor are allocated once and then the message is sent to the same actor 100 million (100M) times. 1115 1120 The average latency per message send is then calculated by dividing the duration by the number of sends. … … 1159 1164 However, Akka and ProtoActor, slow down by two-orders of magnitude. 1160 1165 This difference is likely a result of Akka and ProtoActor's garbage collection, which results in performance delays for allocation-heavy workloads, whereas \uC and \CFA have explicit allocation/deallocation. 1161 Tuning offthe garage collection might reduce garbage-collection cost, but this exercise is beyond the scope of this work.1166 Tuning the garage collection might reduce garbage-collection cost, but this exercise is beyond the scope of this work. 1162 1167 1163 1168 \subsection{Executor}\label{s:executorPerf} 1164 1169 1165 The microbenchmarks in this section are designed to stress the executor.1170 The benchmarks in this section are designed to stress the executor. 1166 1171 The executor is the scheduler of an actor system and is responsible for organizing the interaction of executor threads to service the needs of an actor workload. 1167 1172 Three benchmarks are run: executor, repeat, and high-memory watermark. … … 1186 1191 Figures~\ref{f:ExecutorIntel} and~\ref{f:ExecutorAMD} show the results of the AMD and Intel executor benchmark. 1187 1192 There are three groupings of results, and the difference between AMD and Intel is small. 1188 CAF is significantly slower than the other actor systems; followed by a tight grouping of uC++, ProroActor, and Akka; and finally \CFA with the lowest runtime relative to its peers.1193 CAF is significantly slower than the other actor systems; followed by a tight grouping of \uC, ProroActor, and Akka; and finally \CFA with the lowest runtime relative to its peers. 1189 1194 The difference in runtime between \uC and \CFA is largely due to the copy queue described in Section~\ref{s:copyQueue}. 1190 1195 The copy queue both reduces and consolidates allocations, heavily reducing contention on the memory allocator. … … 1192 1197 Note, while dynamic cast is relatively inexpensive, the remaining send cost in both \uC and \CFA is small; 1193 1198 hence, the relative cost for the RTTI in \uC is significant. 1194 1195 \begin{figure}1196 \centering1197 \subfloat[AMD Repeat Benchmark]{1198 \resizebox{0.5\textwidth}{!}{\input{figures/nasusRepeat.pgf}}1199 \label{f:RepeatAMD}1200 }1201 \subfloat[Intel Repeat Benchmark]{1202 \resizebox{0.5\textwidth}{!}{\input{figures/pykeRepeat.pgf}}1203 \label{f:RepeatIntel}1204 }1205 \caption{The repeat benchmark comparing actor systems (lower is better).}1206 \end{figure}1207 1199 1208 1200 The repeat benchmark also evaluates the executor. … … 1222 1214 The impact of work stealing on this benchmark are discussed further in Section~\ref{s:steal_perf}. 1223 1215 Here, gains from using the copy queue are much less apparent, due to the costs of stealing. 1216 1217 \begin{figure} 1218 \centering 1219 \subfloat[AMD Repeat Benchmark]{ 1220 \resizebox{0.5\textwidth}{!}{\input{figures/nasusRepeat.pgf}} 1221 \label{f:RepeatAMD} 1222 } 1223 \subfloat[Intel Repeat Benchmark]{ 1224 \resizebox{0.5\textwidth}{!}{\input{figures/pykeRepeat.pgf}} 1225 \label{f:RepeatIntel} 1226 } 1227 \caption{The repeat benchmark comparing actor systems (lower is better).} 1228 \end{figure} 1229 1230 Table~\ref{t:ExecutorMemory} shows the high memory watermark of the actor systems when running the executor benchmark on 48 cores measured using the @time@ command. 1231 \CFA's high watermark is slightly higher than the other non-garbage collected systems \uC and CAF. 1232 This increase is from the over-allocation in the copy-queue data-structure with lazy deallocation. 1233 Whereas, the per envelope allocations of \uC and CFA allocate exactly the amount of storage needed and eagerly deallocate. 1234 The extra storage is the standard tradeoff of time versus space, where \CFA shows better performance. 1224 1235 1225 1236 \begin{table} … … 1239 1250 \end{table} 1240 1251 1241 Table~\ref{t:ExecutorMemory} shows the high memory watermark of the actor systems when running the executor benchmark on 48 cores measured using the @time@ command..1242 \CFA's high watermark is slightly higher than the other non-garbage collected systems \uC and CAF.1243 This increase is from the over-allocation in the copy-queue data-structure with lazy deallocation.1244 Whereas, the per envelope allocations of \uC and CFA allocate exactly the amount of storage needed and eagerly deallocate.1245 The extra storage is the standard tradeoff of time versus space, where \CFA shows better performance.1246 1247 1252 \subsection{Matrix Multiply} 1248 1253 … … 1252 1257 X_{i,j} \cdot Y_{j,k} = \left( \sum_{c=1}^{j} X_{row,c}Y_{c,column} \right)_{i,k} 1253 1258 \end{displaymath} 1254 The majority of the computation in this benchmark involves computing the final matrix, so this benchmark stresses the actor systems' ability to have actors run work, rather than stressing the message sending system .1259 The majority of the computation in this benchmark involves computing the final matrix, so this benchmark stresses the actor systems' ability to have actors run work, rather than stressing the message sending system, and might trigger some work stealing if a worker finishes early. 1255 1260 1256 1261 The matrix-multiply benchmark uses input matrices $X$ and $Y$, which are both $3072$ by $3072$ in size. 1257 1262 An actor is made for each row of $X$ and sent a message indicating the row of $X$ and the column of $Y$ to calculate a row of the result matrix $Z$. 1263 Because $Z$ is contiguous in memory, there can be small cache write-contention at the row boundaries. 1258 1264 1259 1265 Figures~\ref{f:MatrixAMD} and \ref{f:MatrixIntel} show the matrix multiple results. 1260 1266 Given that the bottleneck of this benchmark is the computation of the result matrix, it follows that the results are tightly clustered across all actor systems. 1261 1267 \uC and \CFA have identical performance and in Figure~\ref{f:MatrixIntel} \uC pulls ahead of \CFA after 24 cores likely due to costs associated with work stealing while hyperthreading. 1262 It is hypothesized that CAF performs better in this benchmark compared to others due to its eager work stealing implementation, which will bediscussed further in Section~\ref{s:steal_perf}.1268 It is hypothesized that CAF performs better in this benchmark compared to others due to its eager work stealing implementation, which is discussed further in Section~\ref{s:steal_perf}. 1263 1269 1264 1270 \begin{figure} … … 1278 1284 1279 1285 \CFA's work stealing mechanism uses the longest-victim heuristic, introduced in Section~\ref{s:victimSelect}. 1280 In this performance section, \CFA's approach is first tested in isolation on pathological unbalanced benchmarks, then with other actor systems on general benchmarks.1286 In this performance section, \CFA's approach is first tested in isolation on a pathological unbalanced benchmark, then with other actor systems on general benchmarks. 1281 1287 1282 1288 Two pathological unbalanced cases are created, and compared using vanilla and randomized work stealing in \CFA. 1283 These benchmarks adversarially takes advantage of the round-robin assignment of actors to workers by loading the receive actors on even cores and the send actors on the oddcores.1289 These benchmarks adversarially takes advantage of the round-robin assignment of actors to workers by isolating the receive and send actors on different cores. 1284 1290 The workload on the loaded cores is the same as the executor benchmark described in \ref{s:executorPerf}, but with fewer rounds. 1285 1291 1286 1292 The balance-one benchmark loads all the work on a single core, whereas the balance-multi loads all the work on half the cores (every other core). 1287 Given this layout, one expects the ideal speedup of work stealing in the balance-one case to be $N / N - 1$ where $N$ is the number of threads.1288 In the balance-multi casethe ideal speedup is 0.5.1289 Note that in the balance-one benchmark the workload is fixed so decreasing runtime is expected.1290 In the balance-multi experiment, the workload increases with the number of cores so an increasing or constant runtime is expected.1293 Given this layout, the ideal speedup of work stealing in the balance-one case should be $N / N - 1$ where $N$ is the number of threads; 1294 in the balance-multi case, the ideal speedup is 0.5. 1295 Note, in the balance-one benchmark, the workload is fixed so decreasing runtime is expected; 1296 in the balance-multi experiment, the workload increases with the number of cores so an increasing or constant runtime is expected. 1291 1297 1292 1298 \begin{figure} … … 1316 1322 \end{figure} 1317 1323 1318 On both balance microbenchmarks slightly less than ideal speedup compared to the non stealing variation is achieved by both the random and longest victim stealing heuristics. 1319 On the balance-multi benchmark \ref{f:BalanceMultiAMD},\ref{f:BalanceMultiIntel} the random heuristic outperforms the longest victim. 1320 This is likely a result of the longest victim heuristic having a higher stealing cost as it needs to maintain timestamps and look at all timestamps before stealing. 1321 Additionally, a performance cost can be observed when hyperthreading kicks in in Figure~\ref{f:BalanceMultiIntel}. 1322 1323 In the balance-one benchmark on AMD \ref{f:BalanceOneAMD}, the performance bottoms out at 32 cores onwards likely due to the amount of work becoming less than the cost to steal it and move it across cores and cache. 1324 On Intel \ref{f:BalanceOneIntel}, above 32 cores the performance gets worse for all variants due to hyperthreading. 1325 Note that the non stealing variation of balance-one will slow down marginally as the cores increase due to having to create more dummy actors on the inactive cores during startup. 1324 % On both balance benchmarks, slightly less than ideal speedup compared to the non-stealing variation is achieved by both the random and longest victim stealing heuristics. 1325 1326 For the balance-one benchmark on AMD in Figure~\ref{f:BalanceOneAMD}, the performance bottoms out at 32 cores onwards likely due to the amount of work becoming less than the cost to steal it and move it across cores and cache. 1327 On Intel in Figure~\ref{f:BalanceOneIntel}, above 32 cores the performance gets worse for all variants due to hyperthreading. 1328 Here, the longest-victim and random heuristic are the same. 1329 Note, the non-stealing variation of balance-one slows down (no decrease in graph) as the cores increase \PAB{not sure I understand this part: due to having to create more dummy actors on the inactive cores during startup}. 1330 1331 For the balance-multi benchmark in Figures~\ref{f:BalanceMultiAMD} and~\ref{f:BalanceMultiIntel}, the random heuristic outperforms the longest victim. 1332 This result is because the longest victim heuristic has a higher stealing cost as it needs to maintain timestamps and look at all timestamps before stealing. 1333 Additionally, a performance cost on the Intel is observed when hyperthreading kicks in after 24 cores in Figure~\ref{f:BalanceMultiIntel}. 1326 1334 1327 1335 \begin{figure} … … 1338 1346 \end{figure} 1339 1347 1340 When comparing the \CFA stealing heuristics in Figure~\ref{f:cfaExecutorAMD} it can be seen that the random heuristic falls slightly behind the other two, but in Figure~\ref{f:cfaExecutorIntel} the runtime of all heuristics are nearly identical to each other. 1348 Figures~\ref{f:cfaExecutorAMD} and~\ref{f:cfaExecutorIntel} show the effects of the stealing heuristics for the executor benchmark. 1349 For the AMD, in Figure~\ref{f:cfaExecutorAMD}, the random heuristic falls slightly behind the other two, but for the Intel, in Figure~\ref{f:cfaExecutorIntel}, the runtime of all heuristics are nearly identical to each other, except after crossing the 24-core boundary. 1341 1350 1342 1351 \begin{figure} … … 1353 1362 \end{figure} 1354 1363 1355 This result is shown in Figure~\ref{f:cfaRepeatAMD} and \ref{f:cfaRepeatIntel} where the no-stealing version of \CFA performs better than both stealing variations. 1356 As mentioned earlier, the repeat benchmark is a pathological case for work stealing systems since there is one actor with the majority of the work, and not enough other work to go around. 1357 If that actor or it's mail queue is stolen by the work stealing system, it incurs a huge cost to move the work as the single actor touches a lot of memory and will need to refill their local cache. 1358 This steal is likely to happen since there is little other work in the system between scatter/gather rounds. 1364 \PAB{Something is wrong here because there are no graphs containing all the benchmarks. 1365 Figures~\ref{f:cfaRepeatAMD} and~\ref{f:cfaRepeatIntel} show the effects of the stealing heuristics for the repeat benchmark. 1366 % Here, the no-stealing version of \CFA performs better than both stealing variations. 1367 As mentioned, the repeat benchmark is a pathological case for work stealing systems since there is one actor with the majority of the work, and not enough other work to go around. 1368 The worst-case scenario is if the actor doing the majority of work or its mail queue is stolen by the work stealing system, as this a huge cost to move the work and refill the local cache. 1369 This worst-case steal is likely to happen since there is little other work in the system between scatter/gather rounds. 1370 1359 1371 In particular on the Intel machine in Figure~\ref{f:cfaRepeatIntel}, the cost of stealing is higher, which can be seen in the vertical shift of Akka, CAF and \CFA results in Figure~\ref{f:RepeatIntel} (\uC and ProtoActor do not have work stealing). 1360 1372 The shift for CAF is particularly large, which further supports the hypothesis that CAF's work stealing is particularly eager. … … 1371 1383 The client of this experiment is long running and maintains a lot of state, as it needs to know the handles of all the servers. 1372 1384 When stealing the client or its respective queue (in \CFA's inverted model), moving the client incurs a high cost due to cache invalidation. 1373 As such stealing the client can result in a hit in performance. 1374 1375 In Figures~\ref{f:cfaMatrixAMD} and \ref{f:cfaMatrixIntel} there is little negligible performance difference across \CFA stealing heuristics. 1385 As such stealing the client can result in a hit in performance.} 1386 1387 Finally, Figures~\ref{f:cfaMatrixAMD} and~\ref{f:cfaMatrixIntel} show the effects of the stealing heuristics for the matrix-multiple benchmark. 1388 Here, there is negligible performance difference across stealing heuristics. 1376 1389 1377 1390 \begin{figure}
Note: See TracChangeset
for help on using the changeset viewer.