Changeset 4acf56d for doc/theses/colby_parsons_MMAth/text
- Timestamp:
- Jul 13, 2023, 9:37:22 PM
- Branches:
- master
- Children:
- b7c53a9d
- Parents:
- 09e400e (diff), a3c7bac (diff)
Note: this is a merge changeset; the changes displayed below correspond to the merge itself.
Use the (diff) links above to see all the changes relative to each parent.
- Location:
- doc/theses/colby_parsons_MMAth/text
- Files:
- 2 edited
Legend:
- Unmodified lines are shown with a leading space
- Added lines are marked '+'
- Removed lines are marked '-'
--- doc/theses/colby_parsons_MMAth/text/CFA_intro.tex (r09e400e)
+++ doc/theses/colby_parsons_MMAth/text/CFA_intro.tex (r4acf56d)
@@ -60 +60 @@
 
 
-\section{\lstinline{with} Statement}
+\section{\lstinline{with} Statement}\label{s:with}
 The \CFA @with@ statement is for exposing fields of an aggregate type within a scope, allowing field names without qualification.
 This feature is also implemented in Pascal~\cite{Pascal}.
@@ -82 +82 @@
 
 
-\section{Operators}
+\section{Operators}\label{s:Operators}
 Operators can be overloaded in \CFA with operator routines.
 Operators in \CFA are named using an operator symbol and '@?@' to represent operands.
--- doc/theses/colby_parsons_MMAth/text/actors.tex (r09e400e)
+++ doc/theses/colby_parsons_MMAth/text/actors.tex (r4acf56d)
@@ -5 +5 @@
 % ======================================================================
 
-Actors are an indirect concurrent feature that abstracts threading away from a programmer, and instead provides \gls{actor}s and messages as building blocks for concurrency, where message passing means there is no shared data to protect, making actors amenable in a distributed environment.
-Actors are another message passing concurrency feature, similar to channels but with more abstraction, and are in the realm of \gls{impl_concurrency}, where programmers write concurrent code without dealing with explicit thread creation or interaction.
-The study of actors can be broken into two concepts, the \gls{actor_model}, which describes the model of computation and the \gls{actor_system}, which refers to the implementation of the model.
+Actors are an indirect concurrent feature that abstracts threading away from a programmer, and instead provides \gls{actor}s and messages as building blocks for concurrency.
+Hence, actors are in the realm of \gls{impl_concurrency}, where programmers write concurrent code without dealing with explicit thread creation or interaction.
+Actor message-passing is similar to channels, but with more abstraction, so there is no shared data to protect, making actors amenable in a distributed environment.
+The study of actors can be broken into two concepts, the \gls{actor_model}, which describes the model of computation, and the \gls{actor_system}, which refers to the implementation of the model.
 Before discussing \CFA's actor system in detail, it is important to first describe the actor model, and the classic approach to implementing an actor system.
 
@@ -20 +21 @@
 An actor is executed by an underlying \Newterm{executor} (kernel thread-pool) that fairly invokes each actor, where an actor invocation processes one or more messages from its mailbox.
 The default number of executor threads is often proportional to the number of computer cores to achieve good performance.
-An executor is often tunable with respect to the number of kernel threads and its scheduling algorithm, which optimize for specific actor applications and workloads \see{end of Section~\ref{s:ActorSystem}}.
+An executor is often tunable with respect to the number of kernel threads and its scheduling algorithm, which optimize for specific actor applications and workloads \see{Section~\ref{s:ActorSystem}}.
 
 \subsection{Classic Actor System}
@@ -31 +32 @@
 Some actor systems provide a shared mailbox where multiple actors receive from a common mailbox~\cite{Akka}, which is contrary to the no-sharing design of the basic actor-model (and requires additional locking).
 For non-\gls{fifo} service, some notion of fairness (eventual progress) must exist, otherwise messages have a high latency or starve, \ie never received.
-Finally, some actor systems provide multiple typed-mailboxes, which then lose the actor-\lstinline{become} mechanism \see{Section~\ref{s:SafetyProductivity}}).
-%While the definition of the actor model provides no restrictions on message ordering, actor systems tend to guarantee that messages sent from a given actor $i$ to actor $j$ will arrive at actor $j$ in the order they were sent.
+Finally, some actor systems provide multiple typed-mailboxes, which then lose the actor-\lstinline{become} mechanism \see{Section~\ref{s:SafetyProductivity}}.
+%While the definition of the actor model provides no restrictions on message ordering, actor systems tend to guarantee that messages sent from a given actor $i$ to actor $j$ arrive at actor $j$ in the order they were sent.
 Another way an actor system varies from the model is allowing access to shared global-state.
 When this occurs, it complicates the implementation as this breaks any implicit mutual-exclusion guarantees when only accessing local-state.
@@ -173 +174 @@
 	@actor | finished_msg;@ $\C{// send => terminate actor (deallocation deferred)}$
 	stop_actor_system(); $\C{// waits until actors finish}\CRT$
-} // deallocate int_msg, str_msg, actor
+} // deallocate actor, int_msg, str_msg
 \end{cfa}
 \caption{\CFA Actor Syntax}
@@ -181 +182 @@
 Figure~\ref{f:CFAActor} shows a complete \CFA actor example, which is discussed in detail.
 The actor type @my_actor@ is a @struct@ that inherits from the base @actor@ @struct@ via the @inline@ keyword.
-This inheritance style is the Plan-9 C-style inheritance discussed in Section~\ref{s:Inheritance}.
+This inheritance style is the Plan-9 C-style \see{Section~\ref{s:Inheritance}}.
 Similarly, the message types @str_msg@ and @int_msg@ are @struct@s that inherit from the base @message@ @struct@ via the @inline@ keyword.
 Only @str_msg@ needs a constructor to copy the C string;
 @int_msg@ is initialized using its \CFA auto-generated constructors.
 There are two matching @receive@ (behaviour) routines that process the corresponding typed messages.
-Both @receive@ routines use a @with@ clause so message fields are not qualified and return @Nodelete@ indicating the actor is not finished.
+Both @receive@ routines use a @with@ clause so message fields are not qualified \see{Section~\ref{s:with}} and return @Nodelete@ indicating the actor is not finished \see{Section~\ref{s:ActorBehaviours}}.
 Also, all messages are marked with @Nodelete@ as their default allocation state.
 The program main begins by creating two messages on the stack.
-Then the executor system is started by calling @start_actor_system@.
-Now an actor is created on the stack and four messages are sent to it using operator @?|?@.
-The last message is the builtin @finish_msg@, which returns @Finished@ to an executor thread, causing it to remove the actor from the actor system \see{Section~\ref{s:ActorBehaviours}}.
+Then the executor system is started by calling @start_actor_system@ \see{Section~\ref{s:ActorSystem}}.
+Now an actor is created on the stack and four messages are sent to it using operator @?|?@ \see{Section~\ref{s:Operators}}.
+The last message is the builtin @finish_msg@, which returns @Finished@ to an executor thread, causing it to remove the actor from the actor system \see{end of Section~\ref{s:ActorBehaviours}}.
 The call to @stop_actor_system@ blocks the program main until all actors are finished and removed from the actor system.
-The program main ends by deleting the actor and two messages from the stack.
+The program main ends by deleting the actor and the two messages from the stack.
 The output for the program is:
 \begin{cfa}
@@ -274 +275 @@
 \noindent@void start_actor_system()@
 configures the executor to implicitly use all preallocated kernel-threads (processors), \ie the processors created by the program main prior to starting the actor system.
+For example, the program main declares at the start:
+\begin{cfa}
+processor p[3];
+\end{cfa}
+which provides a total of 4 threads (3 + initial processor) for use by the executor.
 When the number of processors is greater than 1, each executor's message queue is sharded by a factor of 16 to reduce contention, \ie for 4 executor threads (processors), there is a total of 4 $\times$ 16 message queues evenly distributed across the executor threads.
 
@@ -279 +285 @@
 configures the number of executor threads to @num_thds@, with the same message queue sharding.
 
+\begin{sloppypar}
 \noindent@void start_actor_system( executor & this )@
 allows the programmer to explicitly create and configure an executor for use by the actor system.
 Executor configuration options are discussed in Section~\ref{s:executor}.
+\end{sloppypar}
 
 \noindent
@@ -288 +296 @@
 \subsection{Actor Send}\label{s:ActorSend}
 All message sends are done using the vertical-bar (bit-or) operator, @?|?@, similar to the syntax of the \CFA stream I/O.
-One way to provide this operator is through the \CFA type system:
+One way to provide a generic operator is through the \CFA type system:
 \begin{cfa}
 actor & ?|?( actor &, message & ) { // base actor and message types
@@ -366 +374 @@
 \end{figure}
 
-Figure~\ref{f:ConvenienceMessages} shows three builtin convenience messages and receive routines used to terminate actors, depending on how an actor is allocated: @Delete@, @Destroy@ or @Finished@.
+Figure~\ref{f:PoisonPillMessages} shows three builtin \Newterm{poison-pill} messages and receive routines used to terminate actors, depending on how an actor is allocated: @Delete@, @Destroy@ or @Finished@.
+Poison-pill messages are common across actor systems, including Akka and ProtoActor~\cite{Akka,ProtoActor}, to suggest or force actor termination.
 For example, in Figure~\ref{f:CFAActor}, the builtin @finished_msg@ message and receive are used to terminate the actor because the actor is allocated on the stack, so no deallocation actions are performed by the executor.
 Note, assignment is used to initialize these messages rather than constructors because the constructor changes the allocation to @Nodelete@ for error checking
@@ -372 +381 @@
 \begin{figure}
 \begin{cfa}
-message __base_msg_finished $@$= { .allocation_ : Finished };
+message __base_msg_finished $@$= { .allocation_ : Finished }; // use C initialization
 struct delete_msg_t { inline message; } delete_msg = __base_msg_finished;
 struct destroy_msg_t { inline message; } destroy_msg = __base_msg_finished;
@@ -381 +390 @@
 allocation receive( actor & this, finished_msg_t & msg ) { return Finished; }
 \end{cfa}
-\caption{Builtin Convenience Messages}
-\label{f:ConvenienceMessages}
+\caption{Builtin Poison-Pill Messages}
+\label{f:PoisonPillMessages}
 \end{figure}
 
@@ -390 +399 @@
 After the receive routine is done, the executor must clean up the actor and message according to their allocation status.
 If the allocation status is @Delete@ or @Destroy@, the appropriate destructor must be called by the executor.
-This poses a problem;
-the derived type of the actor or message is not available to the executor, but it needs to call the derived destructor.
+This requirement poses a problem;
+the derived type of the actor or message is not available to the executor, but it needs to call the derived destructor.
 This requires downcasting from the base type to the derived type, which requires a virtual system.
 To accomplish the downcast, I implemented a rudimentary destructor-only virtual system in \CFA.
@@ -418 +427 @@
 // explicit destructor calls
 ^d1{}; sout | nl;
-^ri{}; sout | nl;
+^ri{};  sout | nl;
 ^rb{}; sout | nl;
 } // ^i, ^b
@@ -457 +466 @@
 \end{figure}
 
-While this virtual destructor system was built for this work, it is general and can be used in any type in \CFA.
+While this virtual destructor system was built for this work, it is general and can be used with any type in \CFA.
 Actors and messages opt into this system by inheriting the @virtual_dtor@ type, which allows the executor to call the right destructor without knowing the derived actor or message type.
+Again, it should be possible to seamlessly transition this workaround into any updated version of the \CFA type-system.
 
 \section{\CFA Executor}\label{s:executor}
@@ -479 +489 @@
 Each executor thread iterates over its own message queues until it finds one with messages.
 At this point, the executor thread atomically \gls{gulp}s the queue, meaning it moves the contents of the message queue to a local queue of the executor thread.
-An example of the queue gulping operation is shown in the right side of Figure~\ref{f:gulp}, where an executor thread gulps queue 0 and begins to process it locally.
-This step allows an executor thread to process the local queue without any atomics until the next gulp.
-Other executor threads can continue adding to the ends of executor thread's message queues.
+An example of the queue gulping operation is shown in the right side of Figure~\ref{f:gulp}, where an executor thread gulps queue 0 and begins to process it locally.
+This step allows the executor thread to process the local queue without any atomics until the next gulp.
+Other executor threads can continue adding to the ends of the executor thread's message queues.
 In detail, an executor thread performs a test-and-gulp, non-atomically checking if a queue is non-empty, before attempting to gulp it.
 If an executor misses a non-empty queue due to a race, it eventually finds the queue after cycling through its message queues.
 This approach minimizes costly lock acquisitions.
 
-Processing a local queue involves: removing a unit of work from the queue, dereferencing the actor pointed-to by the work-unit, running the actor's behaviour on the work-unit message, examining the returned allocation status from the @receive@ routine for the actor and internal status in the delivered message, and taking the appropriate actions.
+Processing a local queue involves: removing a unit of work from the queue, dereferencing the actor pointed-to by the work unit, running the actor's behaviour on the work-unit message, examining the returned allocation status from the @receive@ routine for the actor and internal status in the delivered message, and taking the appropriate actions.
 Since all messages to a given actor are in the same queue, this guarantees atomicity across behaviours of that actor since it can only execute on one thread at a time.
-As each actor is created or terminated by an executor thread, it increments/decrements a global counter.
+As each actor is created or terminated by an executor thread, it atomically increments/decrements a global counter.
 When an executor decrements the counter to zero, it sets a global boolean variable that is checked by each executor thread when it has no work.
 Once an executor thread sees the flag is set, it stops running.
@@ -496 +506 @@
 Unfortunately, the frequent allocation of envelopes for each send results in heavy contention on the memory allocator.
 This contention is reduced using a novel data structure, called a \Newterm{copy queue}.
-The copy queue is a thin layer over a dynamically sized array that is designed with the envelope use case in mind.
+The copy queue is a thin layer over a dynamically sized array that is designed with the envelope use-case in mind.
 A copy queue supports the typical queue operations of push/pop but in a different way from a typical array-based queue.
@@ -508 +518 @@
 Since the copy queue is an array, envelopes are allocated first on the stack and then copied into the copy queue to persist until they are no longer needed.
 For many workloads, the copy queues grow in size to facilitate the average number of messages in flight and there are no further dynamic allocations.
-One downside of this approach that more storage is allocated than needed, \ie each copy queue is only partially full.
+The downside of this approach is that more storage is allocated than needed, \ie each copy queue is only partially full.
 Comparatively, the individual envelope allocations of a list-based queue mean that the actor system always uses the minimum amount of heap space and cleans up eagerly.
-Additionally, bursty workloads can cause the copy queues to allocate large amounts of space to accommodate the peaks of the throughput, even if most of that storage is not needed for the rest of the workload's execution.
+Additionally, bursty workloads can cause the copy queues to allocate a large amount of space to accommodate the throughput peak, even if most of that storage is not needed for the rest of the workload's execution.
 
 To mitigate memory wastage, a reclamation scheme is introduced.
@@ -560 +570 @@
 
 The outline for lazy-stealing by a thief is: select a victim, scan its queues once, and return immediately if a queue is stolen.
-The thief then returns to normal operation and conducts a regular scan over its own queues looking for work, where stolen work is placed at the end of the scan.
+The thief then assumes its normal operation of scanning over its own queues looking for work, where stolen work is placed at the end of the scan.
 Hence, only one victim is affected and there is a reasonable delay between stealing events as the thief scans its ready queue looking for its own work before potentially stealing again.
 This lazy examination by the thief has a low perturbation cost for victims, while still finding work in a moderately loaded system.
@@ -568 +578 @@
 
 In more detail, the \CFA work-stealing algorithm begins by iterating over its message queues twice without finding any work before it tries to steal a queue from another worker.
 Stealing a queue is done wait-free (\ie no busy waiting) with a few atomic instructions that only create contention with other stealing workers, not the victim.
 The complexity in the implementation is that victim gulping does not take the mailbox queue;
 rather it atomically transfers the mailbox nodes to another queue, leaving the mailbox empty, as discussed in Section~\ref{s:executor}.
 Hence, the original mailbox is always available for new message deliveries.
 However, this transfer logically subdivides the mailbox into two separate queues, and during this period, the mailbox cannot be stolen;
-otherwise there are two threads simultaneously running messages on actors in the two parts of the mailbox queue.
+otherwise there are two threads simultaneously running messages on an actor in the two parts of the mailbox queue.
 To solve this problem, an atomic gulp also marks the mailbox queue as subdivided, making it ineligible for stealing.
 Hence, a thief checks if a queue is eligible and non-empty before attempting an atomic steal of a queue.
578 588 … … 673 683 There is a final case where the race occurs and is resolved with \emph{both} gulps occurring. 674 684 Here, the winner of the race finishes processing the queue and resets the @being_processed@ flag. 675 Then the loser unblocks and completes its gulp from the same mailbox and atomically sets the \snake{being_processed} flag.685 Then the loser unblocks from its preemption and completes its gulp from the same mailbox and atomically sets the \snake{being_processed} flag. 676 686 The loser is now processing messages from a temporarily shared mailbox, which is safe because the winner ignores this mailbox, if it attempts another gulp since @being_processed@ is set. 677 687 The victim never attempts to gulp from the stolen mailbox again because its next cycle sees the swapped mailbox from the thief (which may or may not be empty at this point). … … 683 693 It is straightforward to count the number of missed gulps due to the @being_processed@ flag and this counter is added to all benchmarks presented in Section~\ref{s:actor_perf}. 684 694 The results show the median count of missed gulps for each experiment is \emph{zero}, except for the repeat benchmark. 685 The repeat benchmark is an example the pathological case described earlier where there is too little work and too many workers.695 The repeat benchmark is an example of the pathological case described earlier where there is too little work and too many workers. 686 696 In the repeat benchmark, one actor has the majority of the workload, and no other actor has a consistent workload, which results in rampant stealing. 687 697 None of the work-stealing actor-systems examined in this work perform well on the repeat benchmark. 
@@ -731 +741 @@
 DCAS( x, y, x, y, y, x );
 \end{cfa}
-A restrictive form of \gls{dcas} can be simulated using \gls{ll}/\gls{sc}~\cite{Brown13} or more expensive transactional memory the same progress property problems as LL/SC.
+A restrictive form of \gls{dcas} can be simulated using \gls{ll}/\gls{sc}~\cite{Brown13} or more expensive transactional memory with the same progress property problems as LL/SC.
 (There is waning interest in transactional memory and it seems to be fading away.)
 
@@ -738 +748 @@
 In this case, there is a race between loading the register and performing the swap (discussed shortly).
 
-Either a true memory/memory swap instruction or a \gls{dcas} would provide the ability to atomically swap two memory locations, but unfortunately neither of these instructions are supported on the architectures used in this work, and would require simulation.
-Hence, a novel swap for this use case is constructed, called \gls{dcasw}.
+Either a true memory/memory swap instruction or a \gls{dcas} would provide the ability to atomically swap two memory locations, but unfortunately neither of these instructions is supported on the architectures used in this work.
+Hence, a novel atomic swap for this use case is simulated, called \gls{dcasw}.
 The \gls{dcasw} is effectively a \gls{dcas} special cased in two ways:
 \begin{enumerate}
@@ -804 +814 @@
 }
 // Step 4: Successfully swapped.
-// Thief's ptr is 0p so no one will touch it
+// Thief's ptr is 0p so no one touches it
 // Write back without CAS is safe
 mailboxes[my_idx] = vic_queue;
@@ -849 +859 @@
 \begin{itemize}
 \item
-Step 0 and 1 do not write and as such they cannot invalidate the invariant of any other thieves.
+Steps 0 and 1 do not write, and as such, they cannot invalidate the invariant of any other thieves.
 \item
 In step 2, a thief attempts to write @0p@ to one of their queue pointers.
@@ -857 +867 @@
 \item
 In step 3, the thief attempts to write @my_queue@ to the victim's queue pointer.
-If the current value of the victim's queue pointer is @0p@, then the CAS fails since @vic_queue@ cannot be equal to @0p@ because of the check in step 1.
+If the current value of the victim's queue pointer is @0p@, then the @CAS@ fails since @vic_queue@ cannot be equal to @0p@ because of the check in step 1.
 Therefore, when the @CAS@ succeeds, the value of the victim's queue pointer must not be @0p@.
 As such, the write never overwrites a value of @0p@, hence the invariant is held in the @CAS@ of step 3.
@@ -893 +903 @@
 A graph of the $M$ thieves swapping with one victim discussed in this theorem is presented in Figure~\ref{f:M_one_swap}.
 \\
-First it is important to state that a thief will not attempt to steal from themselves.
+First, it is important to state that a thief does not attempt to steal from itself.
 As such, the victim here is not also a thief.
-Stepping through the code in \ref{f:dcaswImpl}, for all thieves steps 0-1 succeed since the victim is not stealing and will have no queue pointers set to be @0p@.
-Similarly for all thieves step 2 will succeed since no one is stealing from any of the thieves.
-In step 3 the first thief to @CAS@ will win the race and successfully swap the queue pointer.
+Stepping through the code in \ref{f:dcaswImpl}, for all thieves, steps 0-1 succeed since the victim is not stealing and has no queue pointers set to @0p@.
+Similarly, for all thieves, step 2 succeeds since no one is stealing from any of the thieves.
+In step 3, the first thief to @CAS@ wins the race and successfully swaps the queue pointer.
 Since it is the first one to @CAS@ and @CAS@ is atomic, there is no way for the @CAS@ to fail since no other thief could have written to the victim's queue pointer and the victim did not write to the pointer since they aren't stealing.
 Hence at least one swap is guaranteed to succeed in this case.
@@ -929 +939 @@
 Hence all thieves must successfully complete step 2 and fail at step 3.
 However, since the first worker, thief $0$, is solely a victim and not a thief, it does not change the state of any of its queue pointers.
-Hence, in this case thief $1$ will always succeed in step 3 if all thieves succeed in step 2.
+Hence, in this case thief $1$ always succeeds in step 3 if all thieves succeed in step 2.
 Thus, by contradiction with the earlier assumption that no swaps occur, at least one swap must succeed.
@@ -975 +985 @@
 Now consider the case where all thieves successfully complete steps 0-1, and then they all complete step 2.
 At this point all thieves are attempting to swap with a queue pointer whose value has changed to @0p@.
-If all thieves attempt the @CAS@ before any write backs, then they will all fail.
+If all thieves attempt the @CAS@ before any write backs, then they all fail.
 Thus, by contrapositive, if the graph contains a cycle then there exists a situation where no swaps occur.
 Hence, at least one swap is guaranteed to succeed if and only if the graph does not contain a cycle.
@@ -992 +1002 @@
 The longest-victim heuristic maintains a timestamp per executor thread that is updated every time a worker attempts to steal work.
 The timestamps are generated using @rdtsc@~\cite{IntelManual} and are stored in a shared array, with one index per worker.
-Thieves then attempt to steal from the worker with the oldest timestamp.
-The intuition behind this heuristic is that the slowest worker will receive help via work stealing until it becomes a thief, which indicates that it has caught up to the pace of the rest of the workers.
+Thieves then attempt to steal from the worker with the oldest timestamp, which is found by performing a linear search across the array of timestamps.
+The intuition behind this heuristic is that the slowest worker receives help via work stealing until it becomes a thief, which indicates that it has caught up to the pace of the rest of the workers.
 This heuristic should ideally result in lowered latency for message sends to victim workers that are overloaded with work.
-However, a side-effect of this heuristic is that if two thieves look to steal at the same time, they likely attempt to steal from the same victim.
-This approach consequently does increase the chance at contention among thieves;
-however, given that workers have multiple queues, often in the tens or hundreds of queues, it is rare for two thieves to attempt stealing from the same queue.
+A negative side-effect of this heuristic is that if multiple thieves steal at the same time, they likely steal from the same victim, which increases the chance of contention.
+However, given that workers have multiple queues, often in the tens or hundreds of queues, it is rare for two thieves to attempt stealing from the same queue.
 This approach may seem counter-intuitive, but in cases with not enough work to steal, the contention among thieves can result in less stealing, due to failed swaps.
 This can be beneficial when there is not enough work for all the stealing to be productive.
-This heuristic does not boast better performance than randomized victim selection, but it is comparable.
-However, it constitutes an interesting contribution as it shows that adding some complexity to the heuristic of the stealing fast path does not impact mainline performance, paving the way for more involved victim selection heuristics.
+This heuristic does not boast performance over randomized victim selection, but it is comparable \see{Section~\ref{s:steal_perf}}.
+However, it constitutes an interesting contribution as it shows that adding some complexity to the heuristic of the stealing fast-path does not affect mainline performance, paving the way for more involved victim selection heuristics.
 
 % Furthermore, in the case they attempt to steal the same queue, at least one of them is guaranteed to successfully steal the queue as shown in Theorem~\ref{t:one_vic}.
@@ -1012 +1021 @@
 
 \CFA's actor system comes with a suite of safety and productivity features.
-Most of these features are only present in \CFA's debug mode, and hence, have have zero-cost in nodebug mode.
+Most of these features are only present in \CFA's debug mode, and hence, have zero-cost in nodebug mode.
 The suite of features includes the following.
 \begin{itemize}
@@ -1024 +1033 @@
 
 \item Actors cannot be created before the executor starts:
-Since the executor distributes mailbox tickets, correctness implies it must be created before an actors so it can give out the tickets.
+Since the executor distributes mailbox tickets, correctness implies it must be created \emph{before} any actors so it can give out the tickets.
 
 \item When an executor is configured, $M >= N$.
@@ -1070 +1079 @@
 \end{description}
 
-These statistics enable a user of the \CFA's actor system to make informed choices about how to configure their executor or how to structure their actor program.
+These statistics enable a user to make informed choices about how to configure the executor or how to structure the actor program.
 For example, if there are a lot of messages being stolen relative to the number of messages sent, it indicates that the workload is heavily imbalanced across executor threads.
 Another example is if the average gulp size is very high, it indicates the executor needs more queue sharding, \ie increase $M$.
 
-Another productivity feature is a group of \Newterm{poison-pill} messages.
-Poison-pill messages are common across actor systems, including Akka and ProtoActor~\cite{Akka,ProtoActor} to inform an actor to terminate.
-In \CFA, due to the allocation of actors and lack of garbage collection, there needs to be a suite of poison-pills.
-The messages that \CFA provides are @DeleteMsg@, @DestroyMsg@, and @FinishedMsg@.
-These messages are supported on all actor types via inheritance.
-These were shown earlier in Figure~\ref{f:ConvenienceMessages}, and can be overloaded by users to have specific behaviour for derived actor types.
+Finally, the poison-pill messages and receive routines, shown earlier in Figure~\ref{f:PoisonPillMessages}, are a convenience for programmers and can be overloaded to have specific behaviour for derived actor types.
 
 \section{Performance}\label{s:actor_perf}
 
-The performance of \CFA's actor system is tested using a suite of microbenchmarks, and compared with other actor systems.
+The performance of the \CFA actor system is tested using a suite of microbenchmarks, and compared with other actor systems.
 Most of the benchmarks are the same as those presented in \cite{Buhr22}, with a few additions.
 This work compares with the following actor systems: \CFA 1.0, \uC 7.0.0, Akka Typed 2.7.0, CAF 0.18.6, and ProtoActor-Go v0.0.0-20220528090104-f567b547ea07.
@@ -1102 +1106 @@
 All benchmarks are run 5 times and the median is taken.
 Error bars showing the 95\% confidence intervals appear on each point in the graphs.
-If the confidence bars are small enough, they may be obscured by the point.
+If the confidence bars are small enough, they may be obscured by the data point.
 In this section, \uC is compared to \CFA frequently, as the actor system in \CFA is heavily based off of the \uC actor system.
As such, the performance differences that arise are largely due to the contributions of this work.
…
Message sending is the key component of actor communication.
As such, latency of a single message send is the fundamental unit of fast-path performance for an actor system.
- The static and dynamic microbenchmarks evaluate the average latency for a static actor/message send and a dynamic actor/message send.
+ The static and dynamic send-benchmarks evaluate the average latency for a static actor/message send and a dynamic actor/message send.
In the static-send benchmark, a message and actor are allocated once and then the message is sent to the same actor 100 million (100M) times.
The average latency per message send is then calculated by dividing the duration by the number of sends.
…
However, Akka and ProtoActor, slow down by two-orders of magnitude.
This difference is likely a result of Akka and ProtoActor's garbage collection, which results in performance delays for allocation-heavy workloads, whereas \uC and \CFA have explicit allocation/deallocation.
- Tuning off the garage collection might reduce garbage-collection cost, but this exercise is beyond the scope of this work.
+ Tuning the garage collection might reduce garbage-collection cost, but this exercise is beyond the scope of this work.

\subsection{Executor}\label{s:executorPerf}

- The microbenchmarks in this section are designed to stress the executor.
+ The benchmarks in this section are designed to stress the executor.
The executor is the scheduler of an actor system and is responsible for organizing the interaction of executor threads to service the needs of an actor workload.
Three benchmarks are run: executor, repeat, and high-memory watermark.
…
Figures~\ref{f:ExecutorIntel} and~\ref{f:ExecutorAMD} show the results of the AMD and Intel executor benchmark.
There are three groupings of results, and the difference between AMD and Intel is small.
- CAF is significantly slower than the other actor systems; followed by a tight grouping of uC++, ProroActor, and Akka; and finally \CFA with the lowest runtime relative to its peers.
+ CAF is significantly slower than the other actor systems; followed by a tight grouping of \uC, ProroActor, and Akka; and finally \CFA with the lowest runtime relative to its peers.
The difference in runtime between \uC and \CFA is largely due to the copy queue described in Section~\ref{s:copyQueue}.
The copy queue both reduces and consolidates allocations, heavily reducing contention on the memory allocator.
…
Note, while dynamic cast is relatively inexpensive, the remaining send cost in both \uC and \CFA is small;
hence, the relative cost for the RTTI in \uC is significant.
-
- \begin{figure}
- \centering
- \subfloat[AMD Repeat Benchmark]{
- \resizebox{0.5\textwidth}{!}{\input{figures/nasusRepeat.pgf}}
- \label{f:RepeatAMD}
- }
- \subfloat[Intel Repeat Benchmark]{
- \resizebox{0.5\textwidth}{!}{\input{figures/pykeRepeat.pgf}}
- \label{f:RepeatIntel}
- }
- \caption{The repeat benchmark comparing actor systems (lower is better).}
- \end{figure}

The repeat benchmark also evaluates the executor.
…
The results are spread out more, and there is a difference between AMD and Intel.
Again, CAF is significantly slower than the other actor systems.
- On the AMD there is a tight grouping of uC++, ProroActor, and Akka;
- on the Intel, uC++, ProroActor, and Akka are spread out.
+ To keep the graphs readable, the y-axis was cut at 100 seconds; as the core count increases from 8-32, CAF ranges around 200 seconds on AMD and between 300-1000 seconds on the Intel.
+ On the AMD there is a tight grouping of uC++, ProtoActor, and Akka;
+ on the Intel, uC++, ProtoActor, and Akka are spread out.
Finally, \CFA runs consistently on both of the AMD and Intel, and is faster than \uC on the AMD, but slightly slower on the Intel.
- This benchmark is a pathological case for work stealing actor systems, as the majority of work is being performed by the single actor conducting the scatter/gather.
- The impact of work stealing on this benchmark are discussed further in Section~\ref{s:steal_perf}.
- Here, gains from using the copy queue are much less apparent, due to the costs of stealing.
+ Here, gains from using the copy queue are much less apparent.
+
+ \begin{figure}
+ \centering
+ \subfloat[AMD Repeat Benchmark]{
+ \resizebox{0.5\textwidth}{!}{\input{figures/nasusRepeat.pgf}}
+ \label{f:RepeatAMD}
+ }
+ \subfloat[Intel Repeat Benchmark]{
+ \resizebox{0.5\textwidth}{!}{\input{figures/pykeRepeat.pgf}}
+ \label{f:RepeatIntel}
+ }
+ \caption{The repeat benchmark comparing actor systems (lower is better).}
+ \end{figure}
+
+ Table~\ref{t:ExecutorMemory} shows the high memory watermark of the actor systems when running the executor benchmark on 48 cores measured using the @time@ command.
+ \CFA's high watermark is slightly higher than the other non-garbage collected systems \uC and CAF.
+ This increase is from the over-allocation in the copy-queue data-structure with lazy deallocation.
+ Whereas, the per envelope allocations of \uC and CFA allocate exactly the amount of storage needed and eagerly deallocate.
+ The extra storage is the standard tradeoff of time versus space, where \CFA shows better performance.
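As context for the copy-queue tradeoff discussed in the hunk above, the idea can be sketched in plain C. This is a hypothetical illustration, not the actual \CFA implementation; the names @copy_queue@, @cq_push@, and @cq_gulp@ are invented here.

```c
#include <assert.h>
#include <stdlib.h>

// Hypothetical sketch of a copy queue (not the CFA code): envelopes are
// copied into one contiguous, growable buffer instead of being allocated
// individually, and the buffer is kept across gulps (lazy deallocation),
// so later sends reuse it.  This trades a higher memory watermark for
// fewer allocator calls -- the time-versus-space tradeoff described above.
typedef struct { int payload; } envelope;

typedef struct {
	envelope * buf;    // contiguous storage, grown but never shrunk
	size_t len, cap;   // current count and capacity
} copy_queue;

void cq_init( copy_queue * q ) { q->buf = NULL; q->len = 0; q->cap = 0; }

void cq_push( copy_queue * q, envelope e ) {
	if ( q->len == q->cap ) {  // grow geometrically, keeping storage live
		q->cap = q->cap ? 2 * q->cap : 16;
		q->buf = realloc( q->buf, q->cap * sizeof( envelope ) );
	}
	q->buf[q->len++] = e;      // copy the envelope into the buffer
}

// A gulp hands a worker the whole batch at once by swapping buffers, so
// the worker drains messages without further contention on the shared queue.
void cq_gulp( copy_queue * shared, copy_queue * local ) {
	copy_queue tmp = *shared;
	*shared = *local;
	*local = tmp;
	shared->len = 0;           // shared side reuses the swapped-in buffer
}
```

Eagerly freeing the buffer after each gulp would lower the high watermark at the cost of repeated allocation, which is the tradeoff the memory-watermark discussion above is making.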

\begin{table}
…
\end{table}

- Table~\ref{t:ExecutorMemory} shows the high memory watermark of the actor systems when running the executor benchmark on 48 cores measured using the @time@ command..
- \CFA's high watermark is slightly higher than the other non-garbage collected systems \uC and CAF.
- This increase is from the over-allocation in the copy-queue data-structure with lazy deallocation.
- Whereas, the per envelope allocations of \uC and CFA allocate exactly the amount of storage needed and eagerly deallocate.
- The extra storage is the standard tradeoff of time versus space, where \CFA shows better performance.
-
\subsection{Matrix Multiply}

…
X_{i,j} \cdot Y_{j,k} = \left( \sum_{c=1}^{j} X_{row,c}Y_{c,column} \right)_{i,k}
\end{displaymath}
- The majority of the computation in this benchmark involves computing the final matrix, so this benchmark stresses the actor systems' ability to have actors run work, rather than stressing the message sending system.
+ The majority of the computation in this benchmark involves computing the final matrix, so this benchmark stresses the actor systems' ability to have actors run work, rather than stressing the message sending system, and might trigger some work stealing if a worker finishes early.

The matrix-multiply benchmark uses input matrices $X$ and $Y$, which are both $3072$ by $3072$ in size.
An actor is made for each row of $X$ and sent a message indicating the row of $X$ and the column of $Y$ to calculate a row of the result matrix $Z$.
+ Because $Z$ is contiguous in memory, there can be small cache write-contention at the row boundaries.

Figures~\ref{f:MatrixAMD} and \ref{f:MatrixIntel} show the matrix multiple results.
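The per-actor work unit described in the hunk above (one actor computes one row of $Z$) can be sketched as a plain C routine. This is an illustration only: a real actor system would deliver the matrices and row index in a message, and the benchmark uses $3072 \times 3072$ matrices, not the small size assumed here.

```c
#include <assert.h>

// Sketch of the per-row work unit: the actor for `row` computes
// Z[row][*] = X[row][*] * Y.  N = 4 is illustrative only; the benchmark
// uses 3072x3072 matrices.
enum { N = 4 };

void multiply_row( int X[N][N], int Y[N][N], int Z[N][N], int row ) {
	for ( int k = 0; k < N; k += 1 ) {       // each column of the result row
		int sum = 0;
		for ( int c = 0; c < N; c += 1 )     // inner product of row and column
			sum += X[row][c] * Y[c][k];
		Z[row][k] = sum;
	}
}
```

Because each actor writes only its own row of $Z$, the rows are independent; the cache write-contention mentioned above can arise only where adjacent rows share a cache line.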
Given that the bottleneck of this benchmark is the computation of the result matrix, it follows that the results are tightly clustered across all actor systems.
\uC and \CFA have identical performance and in Figure~\ref{f:MatrixIntel} \uC pulls ahead of \CFA after 24 cores likely due to costs associated with work stealing while hyperthreading.
- It is hypothesized that CAF performs better in this benchmark compared to others due to its eager work stealing implementation, which will be discussed further in Section~\ref{s:steal_perf}.
+ It is hypothesized that CAF performs better in this benchmark compared to others due to its eager work stealing implementation, which is discussed further in Section~\ref{s:steal_perf}.

\begin{figure}
…

\CFA's work stealing mechanism uses the longest-victim heuristic, introduced in Section~\ref{s:victimSelect}.
- In this performance section, \CFA's approach is first tested in isolation on pathological unbalanced benchmarks, then with other actor systems on general benchmarks.
+ In this performance section, \CFA's approach is first tested in isolation on a pathological unbalanced benchmark, then with other actor systems on general benchmarks.

Two pathological unbalanced cases are created, and compared using vanilla and randomized work stealing in \CFA.
- These benchmarks adversarially takes advantage of the round-robin assignment of actors to workers by loading the receive actors on even cores and the send actors on the odd cores.
+ These benchmarks adversarially takes advantage of the round-robin assignment of actors to workers by loading actors only on specific cores (there is one worker per core).
The workload on the loaded cores is the same as the executor benchmark described in \ref{s:executorPerf}, but with fewer rounds.

The balance-one benchmark loads all the work on a single core, whereas the balance-multi loads all the work on half the cores (every other core).
- Given this layout, one expects the ideal speedup of work stealing in the balance-one case to be $N / N - 1$ where $N$ is the number of threads.
- In the balance-multi case the ideal speedup is 0.5.
- Note that in the balance-one benchmark the workload is fixed so decreasing runtime is expected.
- In the balance-multi experiment, the workload increases with the number of cores so an increasing or constant runtime is expected.
+ Given this layout, the ideal speedup of work stealing in the balance-one case should be $N / N - 1$ where $N$ is the number of threads;
+ in the balance-multi case, the ideal speedup is 0.5.
+ Note, in the balance-one benchmark, the workload is fixed so decreasing runtime is expected;
+ in the balance-multi experiment, the workload increases with the number of cores so an increasing or constant runtime is expected.

\begin{figure}
…
\end{figure}

- On both balance microbenchmarks slightly less than ideal speedup compared to the non stealing variation is achieved by both the random and longest victim stealing heuristics.
- On the balance-multi benchmark \ref{f:BalanceMultiAMD},\ref{f:BalanceMultiIntel} the random heuristic outperforms the longest victim.
- This is likely a result of the longest victim heuristic having a higher stealing cost as it needs to maintain timestamps and look at all timestamps before stealing.
- Additionally, a performance cost can be observed when hyperthreading kicks in in Figure~\ref{f:BalanceMultiIntel}.
-
- In the balance-one benchmark on AMD \ref{f:BalanceOneAMD}, the performance bottoms out at 32 cores onwards likely due to the amount of work becoming less than the cost to steal it and move it across cores and cache.
- On Intel \ref{f:BalanceOneIntel}, above 32 cores the performance gets worse for all variants due to hyperthreading.
- Note that the non stealing variation of balance-one will slow down marginally as the cores increase due to having to create more dummy actors on the inactive cores during startup.
+ % On both balance benchmarks, slightly less than ideal speedup compared to the non-stealing variation is achieved by both the random and longest victim stealing heuristics.
+
+ For the balance-one benchmark on AMD in Figure~\ref{f:BalanceOneAMD}, the performance bottoms out at 32 cores onwards likely due to the amount of work becoming less than the cost to steal it and move it across cores and cache.
+ On Intel in Figure~\ref{f:BalanceOneIntel}, above 32 cores the performance gets worse for all variants due to hyperthreading.
+ Here, the longest-victim and random heuristic are the same.
+ Note, the non-stealing variation of balance-one slows down slightly (no decrease in graph) as the cores increase, since a few ``dummy'' actors need to be made for each of the extra cores beyond the first to adversarially layout all loaded actors on the first core.
+
+ For the balance-multi benchmark in Figures~\ref{f:BalanceMultiAMD} and~\ref{f:BalanceMultiIntel}, the random heuristic outperforms the longest victim.
+ This result is because the longest victim heuristic has a higher stealing cost as it needs to maintain timestamps and look at all timestamps before stealing.
+ Additionally, a performance cost on the Intel is observed when hyperthreading kicks in after 24 cores in Figure~\ref{f:BalanceMultiIntel}.
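As context for the longest-victim discussion in the hunk above, the victim scan can be sketched as follows. This is a simplified, hypothetical version (not the \CFA code): each worker stamps when it last began processing, and the thief scans all stamps and picks the worker with the oldest one; the real executor must also handle concurrent timestamp updates and failed steals.

```c
#include <assert.h>

// Simplified sketch of longest-victim selection: pick the worker whose
// timestamp is oldest, i.e. the one that has been processing the longest,
// on the assumption it is the most overloaded.  The scan over all workers
// is the extra cost, relative to random victim selection, noted above.
int longest_victim( const long stamp[], int nworkers, int thief ) {
	int victim = -1;
	long oldest = 0;
	for ( int w = 0; w < nworkers; w += 1 ) {
		if ( w == thief ) continue;            // never pick yourself
		if ( victim == -1 || stamp[w] < oldest ) {
			victim = w;
			oldest = stamp[w];
		}
	}
	return victim;                             // -1 only if nworkers <= 1
}
```

Random selection, by contrast, needs no timestamps and no scan, which matches the balance-multi result above where the cheaper random heuristic wins.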

\begin{figure}
…
}
\caption{Executor benchmark comparing \CFA stealing heuristics (lower is better).}
- \end{figure}
+ \label{f:ExecutorBenchmark}
+ \end{figure}

- When comparing the \CFA stealing heuristics in Figure~\ref{f:cfaExecutorAMD} it can be seen that the random heuristic falls slightly behind the other two, but in Figure~\ref{f:cfaExecutorIntel} the runtime of all heuristics are nearly identical to each other.
+ Figures~\ref{f:cfaExecutorAMD} and~\ref{f:cfaExecutorIntel} show the effects of the stealing heuristics for the executor benchmark.
+ For the AMD, in Figure~\ref{f:cfaExecutorAMD}, the random heuristic falls slightly behind the other two, but for the Intel, in Figure~\ref{f:cfaExecutorIntel}, the runtime of all heuristics are nearly identical to each other, except after crossing the 24-core boundary.

\begin{figure}
…
}
\caption{The repeat benchmark comparing \CFA stealing heuristics (lower is better).}
- \end{figure}
-
- This result is shown in Figure~\ref{f:cfaRepeatAMD} and \ref{f:cfaRepeatIntel} where the no-stealing version of \CFA performs better than both stealing variations.
- As mentioned earlier, the repeat benchmark is a pathological case for work stealing systems since there is one actor with the majority of the work, and not enough other work to go around.
- If that actor or it's mail queue is stolen by the work stealing system, it incurs a huge cost to move the work as the single actor touches a lot of memory and will need to refill their local cache.
- This steal is likely to happen since there is little other work in the system between scatter/gather rounds.
- In particular on the Intel machine in Figure~\ref{f:cfaRepeatIntel}, the cost of stealing is higher, which can be seen in the vertical shift of Akka, CAF and \CFA results in Figure~\ref{f:RepeatIntel} (\uC and ProtoActor do not have work stealing).
- The shift for CAF is particularly large, which further supports the hypothesis that CAF's work stealing is particularly eager.
- In both the executor and the repeat benchmark CAF performs poorly.
- It is hypothesized that CAF has an aggressive work stealing algorithm, that eagerly attempts to steal.
- This results in poor performance in benchmarks with small messages containing little work per message.
- On the other hand, in \ref{f:MatrixAMD} CAF performs much better since each message has a large amount of work, and few messages are sent, so the eager work stealing allows for the clean up of loose ends to occur faster.
+ \label{f:RepeatBenchmark}
+ \end{figure}
+
+ Figures~\ref{f:cfaRepeatAMD} and~\ref{f:cfaRepeatIntel} show the effects of the stealing heuristics for the repeat benchmark.
+ This benchmark is a pathological case for work stealing actor systems, as the majority of work is being performed by the single actor conducting the scatter/gather.
+ The single actor (the client) of this experiment is long running and maintains a lot of state, as it needs to know the handles of all the servers.
+ When stealing the client or its respective queue (in \CFA's inverted model), moving the client incurs a high cost due to cache invalidation.
+ This worst-case steal is likely to happen since there is little other work in the system between scatter/gather rounds.
+ However, all heuristics are comparable in performance on the repeat benchmark.
+ This result is surprising especially for the No-Stealing variant, which one would expect to have better performance than the stealing variants.
+ This is not the case, since the stealing happens lazily and fails fast, the queue containing the long-running client actor is rarely stolen.
+
+ Work stealing performance can be further analyzed by reexamining the executor and repeat benchmarks in Figures~\ref{f:ExecutorBenchmark} and \ref{f:RepeatBenchmark}, respectively.
+ In both benchmarks, CAF performs poorly.
+ It is hypothesized that CAF has an aggressive work stealing algorithm that eagerly attempts to steal.
+ This results in the poor performance with small messages containing little work per message in both of these benchmarks.
+ In comparison with the other systems, \uC does well on both benchmarks since it does not have work stealing.
+
+ Finally, Figures~\ref{f:cfaMatrixAMD} and~\ref{f:cfaMatrixIntel} show the effects of the stealing heuristics for the matrix-multiply benchmark.
+ Here, there is negligible performance difference across stealing heuristics, likely due to the long running workload of each message.
+
+ Stealing can still improve performance marginally in the matrix-multiply benchmark.
+ In \ref{f:MatrixAMD} CAF performs better; few messages are sent, so the eager work stealing allows for the clean up of loose ends to occur faster.
This hypothesis stems from experimentation with \CFA.
CAF uses a randomized work stealing heuristic.
Tuning the \CFA actor system to steal work much more eagerly with randomized victim selection heuristics provided similar results to what CAF achieved in the matrix benchmark.
- This experimental tuning performed much worse on all other microbenchmarks that we present, since they all perform a small amount of work per message.
-
- In comparison with the other systems \uC does well on the repeat benchmark since it does not have work stealing.
- The client of this experiment is long running and maintains a lot of state, as it needs to know the handles of all the servers.
- When stealing the client or its respective queue (in \CFA's inverted model), moving the client incurs a high cost due to cache invalidation.
- As such stealing the client can result in a hit in performance.
-
- In Figures~\ref{f:cfaMatrixAMD} and \ref{f:cfaMatrixIntel} there is little negligible performance difference across \CFA stealing heuristics.
+ This experimental tuning performed much worse on all other microbenchmarks that we present, since they all perform a small amount of work per message, which may partially explain CAF's poor performance on other benchmarks.

\begin{figure}