Index: doc/theses/colby_parsons_MMAth/text/actors.tex
===================================================================
--- doc/theses/colby_parsons_MMAth/text/actors.tex	(revision 1ae3ac46904e0b164dcec0e13b996849dc2b67a6)
+++ doc/theses/colby_parsons_MMAth/text/actors.tex	(revision 8909a2df40a8f6d082a08016ddafb750b80ac34d)
@@ -11,5 +11,5 @@
 
 \section{Actor Model}
-The actor model is a concurrent paradigm where computation is broken into units of work called actors, and the data for computation is distributed to actors in the form of messages~\cite{Hewitt73}.
+The \Newterm{actor model} is a concurrent paradigm where computation is broken into units of work called actors, and the data for computation is distributed to actors in the form of messages~\cite{Hewitt73}.
 An actor is composed of a \Newterm{mailbox} (message queue) and a set of \Newterm{behaviours} that receive from the mailbox to perform work.
 Actors execute asynchronously upon receiving a message and can modify their own state, make decisions, spawn more actors, and send messages to other actors.
@@ -20,16 +20,16 @@
 An actor is executed by an underlying \Newterm{executor} (kernel thread-pool) that fairly invokes each actor, where an actor invocation processes one or more messages from its mailbox.
 The default number of executor threads is often proportional to the number of computer cores to achieve good performance.
-An executor is often tunable with respect to the number of kernel threads and its scheduling algorithm, which optimize for specific actor applications and workloads \see{end of Section~\ref{s:CFAActor}}.
+An executor is often tunable with respect to the number of kernel threads and its scheduling algorithm, which optimize for specific actor applications and workloads \see{end of Section~\ref{s:ActorSystem}}.
 
 \subsection{Classic Actor System}
-An implementation of the actor model with a community of actors is called an actor system.
+An implementation of the actor model with a community of actors is called an \Newterm{actor system}.
 Actor systems largely follow the actor model, but can differ in some ways.
 While the semantics of message \emph{send} is asynchronous, the implementation may be synchronous or a combination.
-The default semantics for message \emph{receive} is FIFO, so an actor receives messages from its mailbox in temporal (arrival) order;
+The default semantics for message \emph{receive} is \gls{fifo}, so an actor receives messages from its mailbox in temporal (arrival) order;
 however, messages sent among actors arrive in any order.
 Some actor systems provide priority-based mailboxes and/or priority-based message-selection within a mailbox, where custom message dispatchers search among or within a mailbox(es) with a predicate for specific kinds of actors and/or messages.
 Some actor systems provide a shared mailbox where multiple actors receive from a common mailbox~\cite{Akka}, which is contrary to the no-sharing design of the basic actor-model (and requires additional locking).
-For non-FIFO service, some notion of fairness (eventual progress) must exist, otherwise messages have a high latency or starve, \ie never received.
-Finally, some actor systems provide multiple typed-mailboxes, which then lose the actor-\lstinline{become} mechanism (see Section~\ref{s:SafetyProductivity}).
+For non-\gls{fifo} service, some notion of fairness (eventual progress) must exist, otherwise messages have a high latency or starve, \ie never received.
+Finally, some actor systems provide multiple typed-mailboxes, which then lose the actor-\lstinline{become} mechanism \see{Section~\ref{s:SafetyProductivity}}).
 %While the definition of the actor model provides no restrictions on message ordering, actor systems tend to guarantee that messages sent from a given actor $i$ to actor $j$ will arrive at actor $j$ in the order they were sent.
 Another way an actor system varies from the model is allowing access to shared global-state.
@@ -49,14 +49,15 @@
 The more common design is to \Newterm{shard} the single queue among the executor threads, where actors are permanently assigned or can float among the queues.
 Sharding significantly decreases contention among executor threads adding and removing actors to/from a queue.
-Finally, each actor has a receive queue of messages (mailbox), which is a single consumer, multi-producer queue, \ie only the actor removes from the mailbox but multiple actors can attach messages.
-When an actor receives a message in its mailbox, the actor is marked ready and scheduled by a thread to run the actor's current work unit on the message(s).
+Finally, each actor has a receive queue of messages (mailbox), which is a single consumer, multi-producer queue, \ie only the actor removes from the mailbox but multiple actors add messages.
+When an actor receives a message in its mailbox, the actor is marked ready and scheduled by a thread to run the actor's current behaviour on the message(s).
 
 % cite parallel theatre and our paper
 Figure \ref{f:inverted_actor} shows an actor system designed as \Newterm{message-centric}, where a set of messages are scheduled and run on underlying executor threads~\cite{uC++,Nigro21}.
+This design is \Newterm{inverted} because actors belong to a message queue, whereas in the classic approach a message queue belongs to each actor.
+Now a message send must queries the actor to know which message queue to post the message.
 Again, the simplest design has a single global queue of messages accessed by the executor threads, but this approach has the same contention problem by the executor threads.
 Therefore, the messages (mailboxes) are sharded and executor threads schedule each message, which points to its corresponding actor.
-Here, an actor's messages are permanently assigned to one queue to ensure FIFO receiving and/or reduce searching for specific actor/messages.
-Since multiple actors belong to each message queue, actor messages are interleaved on a queue.
-This design is \Newterm{inverted} because actors belong to a message queue, whereas in the classic approach a message queue belongs to each actor.
+Here, an actor's messages are permanently assigned to one queue to ensure \gls{fifo} receiving and/or reduce searching for specific actor/messages.
+Since multiple actors belong to each message queue, actor messages are interleaved on a queue, but individually in FIFO order.
 % In this inverted actor system instead of each executor threads owning a queue of actors, they each own a queue of messages.
 % In this scheme work is consumed from their queue and executed by underlying threads.
@@ -65,8 +66,8 @@
 % The arrows from the message queues to the actors in the diagram indicate interleaved messages addressed to each actor.
 
-The actor system in \CFA uses a message-centric design, adopts several features from my prior actor work in \uC~\cite{}, and adds the following contributions related to \CFA:
+The actor system in \CFA uses a message-centric design, adopts several features from my prior actor work in \uC~\cite{Buhr22} but implemented in \CFA, and adds the following new \CFA contributions:
 \begin{enumerate}[topsep=5pt,itemsep=3pt,parsep=0pt]
 \item
-Provide insight into the impact of envelope allocation in actor systems.
+Provide insight into the impact of envelope allocation in actor systems \see{Section~\ref{s:envelope}}.
 In all actor systems, dynamic allocation is needed to ensure the lifetime of a unit of work persists from its creation until the unit of work is executed.
 This allocation is often called an \Newterm{envelope} as it ``packages'' the information needed to run the unit of work, alongside any other information needed to send the unit of work, such as an actor's address or link fields.
@@ -126,4 +127,5 @@
 \end{cfa}
 \end{lrbox}
+
 \subfloat[dynamic typing]{\label{l:dynamic_style}\usebox\myboxA}
 \hspace*{10pt}
@@ -177,7 +179,8 @@
 \end{figure}
 
-Figure~\ref{f:CFAActor} shows a complete \CFA actor example starting with the actor type @my_actor@ created by defining a @struct@ that inherits from the base @actor@ @struct@ via the @inline@ keyword.
+Figure~\ref{f:CFAActor} shows a complete \CFA actor example, which is discussed in detail.
+The actor type @my_actor@ is a @struct@ that inherits from the base @actor@ @struct@ via the @inline@ keyword.
 This inheritance style is the Plan-9 C-style inheritance discussed in Section~\ref{s:Inheritance}.
-Similarly, the message types @str_msg@ and @int_msg@ are created by defining a @struct@ that inherits from the base @message@ @struct@ via the @inline@ keyword.
+Similarly, the message types @str_msg@ and @int_msg@ are @struct@s that inherits from the base @message@ @struct@ via the @inline@ keyword.
 Only @str_msg@ needs a constructor to copy the C string;
 @int_msg@ is initialized using its \CFA auto-generated constructors.
@@ -187,6 +190,6 @@
 The program main begins by creating two messages on the stack.
 Then the executor system is started by calling @start_actor_system@.
-Now an actor is created on the stack and four messages are sent it using operator @?|?@.
-The last message is the builtin @finish_msg@, which returns @Finished@ to an executor thread, causing it to removes the actor from the actor system \see{Section~\ref{s:ActorBehaviours}}.
+Now an actor is created on the stack and four messages are sent to it using operator @?|?@.
+The last message is the builtin @finish_msg@, which returns @Finished@ to an executor thread, causing it to remove the actor from the actor system \see{Section~\ref{s:ActorBehaviours}}.
 The call to @stop_actor_system@ blocks the program main until all actors are finished and removed from the actor system.
 The program main ends by deleting the actor and two messages from the stack.
@@ -199,5 +202,5 @@
 
 \subsection{Actor Behaviours}\label{s:ActorBehaviours}
-In general, a behaviour for some derived actor and derived message type is defined with following signature:
+In general, a behaviour for some derived actor and derived message type is defined with the following signature:
 \begin{cfa}
 allocation receive( my_actor & receiver, my_msg & msg )
@@ -213,5 +216,5 @@
 Message state is updated via a call to:
 \begin{cfa}
-void set_allocation( message & this, allocation state )
+void set_allocation( message & this, allocation state );
 \end{cfa}
 
@@ -220,5 +223,5 @@
 \noindent@Nodelete@
 tells the executor that no action is to be taken with regard to an actor or message.
-This status is used when an actor continues receiving messages or a message may be reused.
+This status is used when an actor continues receiving messages or a message is reused.
 
 \noindent@Delete@
@@ -233,17 +236,22 @@
 tells the executor to mark the respective actor as finished executing, but not call the object's destructor nor deallocate the object.
 This status is used when actors or messages are global or stack allocated, or a programmer wants to manage deallocation themselves.
+Note, for messages, there is no difference between allocations @Nodelete@ and @Finished@ because both tell the executor to do nothing to the message.
+Hence, @Finished@ is implicitly changed to @Nodelete@ in a message constructor, and @Nodelete@ is used internally for message error-checking \see{Section~\ref{s:SafetyProductivity}}.
+Therefore, reading a message's allocation status after setting to @Finished@ may be either @Nodelete@ (after construction) or @Finished@ (after explicitly setting using @set_allocation@).
 
 For the actor system to terminate, all actors must have returned a status other than @Nodelete@.
 After an actor is terminated, it is erroneous to send messages to it.
 Similarly,  after a message is terminated, it cannot be sent to an actor.
-Note, it is safe to construct an actor or message with a status other than @Nodelete@, since the executor only examines the allocation action after a behaviour returns.
+Note, it is safe to construct an actor or message with a status other than @Nodelete@, since the executor only examines the allocation action \emph{after} a behaviour returns.
 
 \subsection{Actor Envelopes}\label{s:envelope}
 As stated, each message, regardless of where it is allocated, can be sent to an arbitrary number of actors, and hence, appear on an arbitrary number of message queues.
-Because a C program manages message lifetime, messages cannot be copied for each send, otherwise who manages the copies.
-Therefore, it up to the actor program to manage message life-time across receives.
+Because a C program manages message lifetime, messages cannot be copied for each send, otherwise who manages the copies?
+Therefore, it is up to the actor program to manage message life-time across receives.
 However, for a message to appear on multiple message queues, it needs an arbitrary number of associated destination behaviours.
 Hence, there is the concept of an envelop, which is dynamically allocated on each send, that wraps a message with any extra implementation fields needed to persist between send and receive.
 Managing the envelop is straightforward because it is created at the send and deleted after the receive, \ie there is 1:1 relationship for an envelop and a many to one relationship for a message.
+
+\PAB{Do you need to say more here?}
 
 % In actor systems, messages are sent and received by actors.
@@ -275,5 +283,5 @@
 \noindent@void start_actor_system( executor & this )@
 allows the programmer to explicitly create and configure an executor for use by the actor system.
-Executor configuration options include are discussed in Section~\ref{s:executor}.
+Executor configuration options are discussed in Section~\ref{s:executor}.
 
 \noindent
@@ -282,23 +290,60 @@
 \subsection{Actor Send}\label{s:ActorSend}
 All message sends are done using the vertical-bar (bit-or) operator, @?|?@, similar to the syntax of the \CFA stream I/O.
-Hence, programmers must write a matching @?|?@ routine for each @receive@ routine, which is awkward and generates a maintenance problem that must be solved.
-\CFA's current best approach for creating a generic @?|?@ routine requires users to create routines for their actor and message types that access the base type.
-Since these routines are not complex, they can be generated using macros that are used to annotate the user-defined actor and message types.
-This approach is used in \CFA's intrusive list data structure.
-This is not much better than asking users to write the @?|?@ routine themselves in terms of maintenance, since the user needs to remember the boilerplate macro annotation.
-
-As stated, \CFA does not have named inheritance with RTTI.
-\CFA does have a preliminary form of virtual routines, but it is not mature enough for use in this work.
-Virtuals would provide a mechanism to write a single generic @?|?@ routine taking a base actor and message type, that would dynamically select the @receive@ routine from the actor argument.
-Note, virtuals are not needed for the send; Plan-9 inheritance is sufficient because only the inherited fields are needed during the message send (only upcasting is needed).
-
-Therefore, a template-like approach was chosen, where the compiler generates a matching @?|?@ routine for each @receive@ routine it finds with the correct actor/message type-signature.
-This approach requires no annotation or additional code to be written by users, thus it resolves the maintenance problem.
-(When the \CFA virtual routines mature, it should be possible to seamlessly transition to it from the template approach.)
+One way to provide this operator is through the \CFA type system:
+\begin{cfa}
+actor & ?|?( actor &, message & ) { // base actor and message types
+	// boilerplate to send message to executor mail queue
+}
+actor | str_msg | int_msg;   // rewritten: ?|?( ?|?( actor, int_msg ), str_msg )
+\end{cfa}
+In the \CFA type system, calls to this routine work for any pair of parameters that inherit from the @actor@ and @message@ types via Plan-9 inheritance.
+However, within the body the routine, all type information about the derived actor and message is lost (type erasure), so this approach is unable to find the right @receive@ routine to put in the envelope.
+
+If \CFA had a fully-fledged virtual system, the generic @?|?@ routine would work, since the virtual system could dynamically select the derived @receive@ routine via virtual dispatch.
+\CFA does have a preliminary form of virtual routines, but it is not mature enough for use in this work, so a different approach is needed.
+
+Without virtuals, the idiomatic \CFA way to create the generic @?|?@ routine is using @forall@:
+\begin{cfa}
+// forall types A, M that have a receive that returns allocation
+forall( A &, M & | { allocation receive( A &, M & ); } ) 
+A & ?|?( A &, M & ) { // actor and message types
+	// boilerplate to send message to executor mail queue
+}
+\end{cfa}
+This approach should work.
+However, the \CFA type system is still a work in progress, and there is a nontrivial bug where inherited routines are not recognized by @forall@.
+For example, Figure~\ref{f:type_problem} shows type @B@ has an inherited @foo@ routine through type @A@ and should find the @bar@ routine defined via the @forall@, but does not due the type-system bug.
+
+\begin{figure}
+\begin{cfa}
+struct A {};
+struct B { inline A; }
+void foo( A & a ) { ... }
+
+// for all types that have a foo routine here is a bar routine
+forall( T & | { void foo( T & ); } ) 
+void bar( T & t ) { ... }
+
+int main() {
+	B b;
+	foo( b ); // B has a foo so it should find a bar via the forall
+	bar( b ); // compilation error, no bar found for type B
+}
+\end{cfa}
+\caption{\CFA Type-System Problem}
+\label{f:type_problem}
+\end{figure}
+
+Users could be expected to write the @?|?@ routines, but this approach is error prone and creates maintenance issues.
+Until the \CFA type-system matures, I created a workaround using a template-like approach, where the compiler generates a matching @?|?@ routine for each @receive@ routine it finds with the correct actor/message type-signature.
+This workaround is outside of the type system, but performing a type-system like action.
+The workaround requires no annotation or additional code to be written by users;
+thus, it resolves the maintenance and error problems.
+It should be possible to seamlessly transition the workaround into any updated version of the \CFA type-system.
 
 Figure~\ref{f:send_gen} shows the generated send routine for the @int_msg@ receive in Figure~\ref{f:CFAActor}.
 Operator @?|?@ has the same parameter signature as the corresponding @receive@ routine and returns an @actor@ so the operator can be cascaded.
 The routine sets @rec_fn@ to the matching @receive@ routine using the left-hand type to perform the selection.
-Then the routine packages the base and derived actor and message and actor, along with the receive routine into an \hyperref[s:envelope]{envelope}.
+Then the routine packages the actor and message, along with the receive routine into an envelope.
 Finally, the envelop is added to the executor queue designated by the actor using the executor routine @send@.
 
@@ -306,6 +351,6 @@
 \begin{cfa}
 $\LstCommentStyle{// from Figure~\ref{f:CFAActor}}$
-struct my_actor { inline actor; }; 						$\C[3.75in]{// actor}$
-struct int_msg { inline message; int i; }; 				$\C{// message}$
+struct my_actor { inline actor; };						$\C[3.75in]{// actor}$
+struct int_msg { inline message; int i; };				$\C{// message}$
 allocation receive( @my_actor &, int_msg & msg@ ) {...}	$\C{// receiver}$
 
@@ -314,6 +359,6 @@
 actor & ?|?( @my_actor & receiver, int_msg & msg@ ) {
 	allocation (*rec_fn)( my_actor &, int_msg & ) = @receive@; // deduce receive routine
-	request req{ &receiver, (actor *)&receiver, &msg, (message *)&msg, (receive_t)rec_fn };
-	send( receiver, req ); 								$\C{// queue message for execution}\CRT$
+	request req{ (actor *)&receiver, (message *)&msg, (receive_t)rec_fn };
+	send( receiver, req );					 			$\C{// queue message for execution}\CRT$
 	return receiver;
 }
@@ -323,56 +368,97 @@
 \end{figure}
 
+Figure~\ref{f:ConvenienceMessages} shows three builtin convenience messages and receive routines used to terminate actors, depending on how an actor is allocated: @Delete@, @Destroy@ or @Finished@.
+For example, in Figure~\ref{f:CFAActor}, the builtin @finished_msg@ message and receive are used to terminate the actor because the actor is allocated on the stack, so no deallocation actions are performed by the executor.
+Note, assignment is used to initialize these messages rather than constructors because the constructor changes the allocation to @Nodelete@ for error checking
+
+\begin{figure}
+\begin{cfa}
+message __base_msg_finished $@$= { .allocation_ : Finished };
+struct delete_msg_t { inline message; } delete_msg = __base_msg_finished;
+struct destroy_msg_t { inline message; } destroy_msg = __base_msg_finished;
+struct finished_msg_t { inline message; } finished_msg = __base_msg_finished;
+
+allocation receive( actor & this, delete_msg_t & msg ) { return Delete; }
+allocation receive( actor & this, destroy_msg_t & msg ) { return Destroy; }
+allocation receive( actor & this, finished_msg_t & msg ) { return Finished; }
+\end{cfa}
+\caption{Builtin Convenience Messages}
+\label{f:ConvenienceMessages}
+\end{figure}
+
 \subsection{Actor Termination}\label{s:ActorTerm}
 During a message send, the receiving actor and message being sent are stored via pointers in the envelope.
-These pointers have the base actor and message types, so type information of the actor and message is lost and then recovered later when the typed receive routine is called.
+These pointers are the base actor and message types, so type information of the derived actor and message is lost and must be recovered later when the typed receive routine is called.
 After the receive routine is done, the executor must clean up the actor and message according to their allocation status.
 If the allocation status is @Delete@ or @Destroy@, the appropriate destructor must be called by the executor.
-This poses a problem; the correct type of the actor or message is not available to the executor, but it needs to call the right destructor!
-This requires downcasting from the base type to derived type, which requires a virtual system.
-Thus, a rudimentary destructor-only virtual system was added to \CFA as part of this work.
+This poses a problem;
+the derived type of the actor or message is not available to the executor, but it needs to call the derived destructor.!
+This requires downcasting from the base type to the derived type, which requires a virtual system.
+To accomplish the dowcast, I implemented a rudimentary destructor-only virtual system in \CFA.
 This virtual system is used via Plan-9 inheritance of the @virtual_dtor@ type, shown in Figure~\ref{f:VirtDtor}.
 The @virtual_dtor@ type maintains a pointer to the start of the object, and a pointer to the correct destructor.
-When a type inherits the @virtual_dtor@ type, the compiler adds code to its destructor to make sure that any destructor calls along this segment of the inheritance tree is called are intercepted, and restart at the appropriate destructor for that object.
-
-\begin{figure}
-\begin{cfa}
-struct base_type { inline virtual_dtor; };
-struct intermediate_type { inline base_type; };
-struct derived_type { inline intermediate_type; };
+When a type inherits @virtual_dtor@, the compiler adds code to its destructor to intercepted any destructor calls along this segment of the inheritance tree and restart at the appropriate destructor for that object.
+
+\begin{figure}
+\centering
+
+\begin{lrbox}{\myboxA}
+\begin{cfa}
+struct base { inline virtual_dtor; };
+void ^?{}( base & ) { sout | "^base"; }
+struct intermediate { inline base; };
+void ^?{}( intermediate & ) { sout | "^intermediate"; }
+struct derived { inline intermediate; };
+void ^?{}( derived & ) { sout | "^derived"; }
 
 int main() {
-    derived_type d1, d2, d3;
-    intermediate_type & i = d2;
-    base_type & b = d3;
-
-    // these will call the destructors in the correct order
-    ^d1{}; ^i{}; ^b{}; 
-}
-
-\end{cfa}
+	base & b;
+	intermediate i;
+	derived d1, d2, d3;
+	intermediate & ri = d2;
+	base & rb = d3;
+	// explicit destructor calls
+	^d1{};	sout | nl;
+	^ri{};	sout | nl;
+	^rb{}; 	sout | nl;
+} // ^i, ^b
+\end{cfa}
+\end{lrbox}
+
+\begin{lrbox}{\myboxB}
+\begin{cfa}
+^derived
+^intermediate
+^base
+
+^derived
+^intermediate
+^base
+
+^derived
+^intermediate
+^base
+
+^intermediate
+^base
+
+
+
+
+\end{cfa}
+
+\end{lrbox}
+\subfloat[Destructor calls]{\label{l:destructor_calls}\usebox\myboxA}
+\hspace*{10pt}
+\vrule
+\hspace*{10pt}
+\subfloat[Output]{\usebox\myboxB}
+
 \caption{\CFA Virtual Destructor}
 \label{f:VirtDtor}
 \end{figure}
 
-This virtual destructor system was built for this work, but is general and can be used in any type in \CFA.
+While this virtual destructor system was built for this work, it is general and can be used in any type in \CFA.
 Actors and messages opt into this system by inheriting the @virtual_dtor@ type, which allows the executor to call the right destructor without knowing the derived actor or message type.
-
-Figure~\ref{f:ConvenienceMessages} shows three builtin convenience messages and receive routines used to terminate actors, depending on how an actor is allocated: @Delete@, @Destroy@ or @Finished@.
-For example, in Figure~\ref{f:CFAActor}, the builtin @finished_msg@ message and receive are used to terminate the actor because the actor is allocated on the stack, so no deallocation actions are performed by the executor.
-
-\begin{figure}
-\begin{cfa}
-message __base_msg_finished $@$= { .allocation_ : Finished }; // no auto-gen constructors
-struct __delete_msg_t { inline message; } delete_msg = __base_msg_finished;
-struct __destroy_msg_t { inline message; } destroy_msg = __base_msg_finished;
-struct __finished_msg_t { inline message; } finished_msg = __base_msg_finished;
-
-allocation receive( actor & this, __delete_msg_t & msg ) { return Delete; }
-allocation receive( actor & this, __destroy_msg_t & msg ) { return Destroy; }
-allocation receive( actor & this, __finished_msg_t & msg ) { return Finished; }
-\end{cfa}
-\caption{Builtin Convenience Messages}
-\label{f:ConvenienceMessages}
-\end{figure}
 
 \section{\CFA Executor}\label{s:executor}
@@ -380,5 +466,5 @@
 An executor of an actor system is the scheduler that organizes where actor behaviours are run and how messages are sent and delivered.
 In \CFA, the executor is message-centric \see{Figure~\ref{f:inverted_actor}}, but extended by over sharding of a message queue \see{left side of Figure~\ref{f:gulp}}, \ie there are $M$ message queues where $M$ is greater than the number of executor threads $N$ (usually a multiple of $N$).
-This approach reduces contention by spreading message delivery among the $M$ queues rather than $N$, while still maintaining actor FIFO message-delivery semantics.
+This approach reduces contention by spreading message delivery among the $M$ queues rather than $N$, while still maintaining actor \gls{fifo} message-delivery semantics.
 The only extra overhead is each executor cycling (usually round-robin) through its $M$/$N$ queues.
 The goal is to achieve better performance and scalability for certain kinds of actor applications by reducing executor locking.
@@ -444,72 +530,127 @@
 \section{Work Stealing}\label{s:steal}
 Work stealing is a scheduling strategy to provide \Newterm{load balance}.
-The goal is to increase resource utilization by having idle threads steal work from working threads.
-While there are multiple parts in work-stealing scheduler, the two important components are victim selection and the stealing mechanism.
+The goal is to increase resource utilization by having an idle thread steal work from a working thread.
+While there are multiple parts in a work-stealing scheduler, the two important components are the stealing mechanism and victim selection.
 
 \subsection{Stealing Mechanism}
-In work stealing, the stealing worker is called the \Newterm{thief} and the stolen-from worker is called the \Newterm{victim}.
-The stealing mechanism presented here differs from existing work-stealing actor-systems because of the message-centric (inverted) actor-system.
-Other actor systems, such as Akka~\cite{Akka} and CAF~\cite{CAF}, have work stealing, but use an actor-centric system where stealing is dequeuing from a non-empty ready-queue to an empty ready-queue.
-As an example, in CAF, the sharded actor queue is a set of double-ended queues (dequeues).
-When an actor has messages, it is inserted into a worker's dequeue (ready queue).
-Workers then consume actors from the dequeue and execute their behaviours.
-To steal work, thieves take one or more actors from a victim's dequeue.
-By the pigeon hole principle, there are three dequeue operations (push/victim pop/thief pop) that can occur concurrently and only two ends to a dequeue, so work stealing in a dequeue-based system always results in a potential increase in contention on the dequeues.
+In work stealing, the stealing worker is called the \Newterm{thief} and the stolen worker is called the \Newterm{victim}.
+% Workers consume actors from their ready queue and execute their behaviours.
+% Through these behaviours, a worker inserts messages onto its own and other worker ready-queues.
+To steal, a thieve takes one or more actors from a victim's ready queue, so work stealing always results in a potential increase in contention on ready queues between the victim gulping from a queue and the thief stealing the queue.
 This contention can slows down the victim's throughput.
-Note, which end of the dequeue is used for stealing, consuming, and inserting is not discussed since the largest cost is the mutual exclusion and its duration for safely performing the queue operations.
-
-Work steal now becomes queue stealing, where an entire actor/message queue is stolen, which trivially preserves message ordering in a queue \see{Section~\ref{s:steal}}.
+Note, the data structure used for the ready queue is not discussed since the largest cost is the mutual exclusion and its duration for safely performing the queue operations.
 
 % C_TODO: maybe insert stealing diagram
 
+The stealing mechanism in this work differs from most work-stealing actor-systems because of the message-centric (inverted) actor-system.
+Actor systems, such as Akka~\cite{Akka} and CAF~\cite{CAF} using actor-centric systems, steal by dequeuing from a non-empty actor ready-queue and enqueue\-ing to an empty ready-queue.
+% As an example, in CAF, the sharded actor ready queue is a set of double-ended queues (dequeues).
 In \CFA, the actor work-stealing implementation is unique because of the message-centric system.
-In this system, it is impractical to steal actors because an actor's messages are distributed in temporal order along the message queue.
-To ensure sequential actor execution and FIFO message delivery, actor stealing requires finding and removing all of an actor's messages, and inserting them consecutively in another message queue.
+With this approach, it is impractical to steal actors because an actor's messages are distributed in temporal order along the message queue.
+To ensure sequential actor execution and \gls{fifo} message delivery, actor stealing requires finding and removing \emph{all} of an actor's messages, and inserting them consecutively in another message queue.
 This operation is $O(N)$ with a non-trivial constant.
-The only way for work stealing to become practical is to shard the message queue, which also reduces contention, and steal queues to eliminate queue searching.
-
-Given queue stealing, the goal is to have a zero-victim-cost stealing mechanism, which does not mean stealing has no cost.
-It means work stealing does not affect the performance of the victim worker.
+The only way for work stealing to become practical is to shard each worker's message queue, which also reduces contention, and steal queues to eliminate queue searching.
+
+Given queue stealing, my goal is to have an essentially zero-contention-cost stealing mechanism.
+This goal means work stealing has minimal affect on the performance of the victim.
 The implication is that thieves cannot contend with a victim, and that a victim should perform no stealing related work unless it becomes a thief.
-In theory, this goal is not achievable, but results show the goal is achieved in practice.
-
-In \CFA's actor system, workers own a set of sharded queues, which they iterate over and gulp.
-If a worker has iterated over its message queues twice without finding any work, it tries to steal a queue from another worker.
-Stealing a queue is done wait-free with a few atomic instructions that can only create contention with other stealing workers, not the victim.
-To steal a queue, a worker does the following:
+In theory, this goal is not achievable, but practical results show the goal is virtually achieved.
+
+One important lesson I learned working on \uC actors~\cite{} and talking with fellow student Thierry Delisle, who examined work-stealing for user-threads in his Ph.D., is \emph{not} to aggressively steal.
+With reasonable workloads, being a thief should be a temporary state, \ie eventually work appears on the thief's ready-queues and it returns to normal operation.
+Furthermore, the act of \emph{looking} to find work is invasive (Heisenberg uncertainty principle), possibly disrupting multiple victims.
+Therefore, stealing should be done lazily in case work appears for the thief and to minimize disruption of victims.
+Note, the cost of stealing is not crucial for the thief because it does not have anything else to do except poll or block.
+
+My lazy-stealing approach is to select a victim, scan its queues once, and return immediately if a queue is stolen.
+Then perform a regular scan looking for work, where stolen work is placed at the end of the scan.
+Hence, only one victim is effected and there is a reasonable delay between stealing events by scanning all the thief's ready queue looking for its own work.
+If no work is found in the thief's queues because a steal is unsuccessful, it performs another steal from a different victim.
+This lazy examination by the thief has a low perturbation cost for victims, while still finding work in a moderately loaded system.
+In all work-stealing algorithms, there is the pathological case where there is too little work and too many workers;
+this scenario can only be handled by putting workers to sleep or deleting them.
+This case is not examined in this work.
+
+In detail, the \CFA work-stealing algorithm begins by iterating over its message queues twice without finding any work before it tries to steal a queue from another worker.
+Stealing a queue is done wait-free (\ie no busy waiting) with a few atomic instructions that only create contention with other stealing workers, not the victim.
+The complexity in the implementation is that victim gulping does not take the mailbox queue;
+rather it atomically transfers the mailbox queue to another list leaving the mailbox empty.
+Hence, the original list is available for new message deliveries.
+However, this transfer logically subdivides the mailbox, and during this period, the mailbox cannot be stolen;
+otherwise there are two threads simultaneously running messages on actors in the two parts of the mailbox queue.
+To solve this problem, an atomic gulp also marks the mailbox queue as subdivided, making it ineligible for stealing.
+Hence, a thief checks for ineligible and non-empty before attempting an atomic steal of a queue,
+
+
+
+To steal a queue, a thief does the following:
 \begin{enumerate}[topsep=5pt,itemsep=3pt,parsep=0pt]
 \item
-The thief chooses a victim, which is trivial because all workers are stored in a shared array.
-
-\item
-The thief starts at a random index in the array of the victim's queues and searches for a candidate queue.
-A candidate queue is any non-empty queue not being processed by the victim and not being stolen by another thief.
-These rules are not strictly enforced.
-A candidate is identified non-atomically, and as such, queues that do not satisfy these rules may be stolen.
-However, steals not meeting the rules do not affect correctness and do not constitute failed steals as the queue is always swapped.
-
-\item
-Once a candidate queue is chosen, the thief attempts a wait-free swap of a victim's queue to a random empty thief queue.
-If the swap successes, the steal is completed.
-If the swap fails, the victim may have been gulping that message queue or another thief must have attempted to steal the victim's queue.
-In either case, that message queue is highly likely to be empty.
-
-\item
-Once a thief fails or succeeds in stealing a queue, it iterates over its messages queues again because new messages may have arrived during stealing.
+chooses a victim, which is trivial because there is an array of $N$ pointers to the mailboxes (copy queues) that is subdivided into $M$/$N$ worker ranges \see{Section~\ref{s:executor}}.
+
+\item
+scan the victim's mailbox-range and test each mail queue in that range for a non-empty queue by performing a test-for-non-empty and swap.
+The conditional check reduces the number of atomic operations.
+The swap races to interchange the non-empty mail-queue pointer from the victim's range with an empty mail-queue pointer in the thief's range.
+Regardless, either a thief swaps or the victim gulps the mail queue;
+correctness is guaranteed, no matter which thread wins the race, because both threads are cycling through appropriate lists.
+Note, a thief can never exceeds its $M$/$N$ worker range because it is always exchanging queues with other workers.
+
+\item
+stops searching after a non-empty queue steal or all queues in the random worker's range are examined.
+The thief then returns to its scheduler and iterates over its messages queues, because new messages may have arrived during stealing, including a stolen queue.
 Stealing is only repeated after two consecutive iterations over its owned queues without finding work.
 \end{enumerate}
-
-The key to the stealing mechanism is that the queues can still be operated on while they are being swapped.
-This functionality eliminates any contention among thieves and victims.
-
-The first key to this is that actors and workers maintain two distinct arrays of references to queues.
-Actors will always receive messages via the same queues.
-Workers, on the other hand will swap the pointers to queues in their shared array and operate on queues in the range of that array that they own.
-Swapping queues is a matter of atomically swapping two pointers in the worker array.
-As such pushes to the queues can happen concurrently during the swap since pushes happen via the actor queue references.
-
-Gulping can also occur during queue swapping, but the implementation requires more nuance than the pushes.
-When a worker is not stealing it iterates across its own range of queues and gulps them one by one.
-When a worker operates on a queue it first copies the current pointer from the worker array of references to a local variable.
+This approach largely eliminates contention among thieves and victims except for the rare moment when a thief and victim attempt to steal or gulp the same queue.
+As well, pushes to the queues by other workers can happen concurrently during the swap since pushes happen via the actor queue references.
+
+
+
+\begin{enumerate}[topsep=5pt,itemsep=3pt,parsep=0pt]
+\item
+selects a random worker's mailbox-range and tests each mail queue in that range for a non-empty queue.
+
+A queue can be stolen and gulped at the same time since they are independent operations.
+However, two workers cannot gulp from the same queue at the same time since that would violate the ordering and mutual exclusion guarantees of actors, so an avail flag is used on queues to stop this from happening.
+If a worker tries to gulp a queue and sees an avail flag up, the gulp fails, and it will move on to other queues and try to gulp it again later.
+stops searching after a non-empty queue steal or all queues in the random worker's range are examined.
+It stops searching after any queue steal (even an empty one) or all queues are examined.
+
+Stealing is only repeated after two consecutive iterations over its owned queues without finding work.
+Note, a thief can never exceeds its $M$/$N$ worker range because it is always exchanging queues with other workers.
+\end{enumerate}
+This approach largely eliminates contention among thieves and victims except for the rare moment when a thief and victim attempt to steal or gulp the same queue.
+The first and last sentence here are correct. I'm not sure I understand the second sentence.
+
+
+
+As an aside I think I need a diagram here since it is clear that the algorithm is not clear through my writing.
+I made a very quick one that I'll use here to explain, but I will work on one to add to the chapter.
+
+
+To give some accompanying code there are two arrays, the array of copy queues, and the array of pointers to the copy queues:
+\begin{cfa}
+work_queue * request_queues;                   // master array of work request queues
+work_queue ** worker_req_queues;                // secondary array of work queues to allow for swapping
+\end{cfa}
+\includegraphics[width=4in]{../figures/steal_diagram.eps}
+When a send happens, it inserts a request directly into one of the queues in @request_queues@, when a steal happens two pointers in @worker_req_queues@ are swapped.
+Each worker owns a section of @worker_req_queues@.
+When a worker gulps, it dereferences one of the pointers in its section of @worker_req_queues@ and then gulps from the queue it points at.
+As such, operations can happen on the queues directly independent of stealing, which avoids almost all contention between stealing threads and busy threads.
+
+If you want to poke some of the code, in @actor.hfa@ @steal_work()@ selects the victim, @choose_queue()@ iterates over the queues looking for non-empty queues and @try_swap_queues()@ performs the thread safe pointer swap.
+
+% The first key to this is that actors and workers maintain two distinct arrays of references to queues.
+% Actors will always receive messages via the same queues.
+% Workers, on the other hand will swap the pointers to queues in their shared array and operate on queues in the range of that array that they own.
+% Swapping queues is a matter of atomically swapping two pointers in the worker array.
+% As such pushes to the queues can happen concurrently during the swap since pushes happen via the actor queue references.
+
+% Gulping can also occur during queue swapping, but the implementation requires more nuance than the pushes.
+% When a worker is not stealing it iterates across its own range of queues and gulps them one by one.
+
+Correctness of gulping is slightly more complex with stealing.
+When a worker operates on a queue, it first copies the queue pointer from the mailbox array to a local variable.
 It then uses that local variable for all queue operations until it moves to the next index of its range of the queue array.
 This ensures that any swaps do not interrupt gulping operations, however this introduces a correctness issue.
@@ -650,4 +791,5 @@
 In the failed case of step 3 the program state is safely restored to its state it had prior to the @0p@ write in step 2, thanks to the invariant that makes the write back to the @0p@ pointer safe.
 
+\begin{comment}
 \subsection{Stealing Guarantees}
 Given that the stealing operation can potentially fail, it is important to discuss the guarantees provided by the stealing implementation.
@@ -754,4 +896,5 @@
 Thus, by contrapositive, if the graph contains a cycle then there exists a situation where no swaps occur.
 Hence, at least one swap is guaranteed to succeed if and only if the graph does not contain a cycle.
+\end{comment}
 
 % C_TODO: go through and use \paragraph to format to make it look nicer
@@ -832,5 +975,5 @@
 
 Another productivity feature that is included is a group of poison-pill messages.
-Poison-pill messages are common across actor systems, and are used in actor libraries Akka and ProtoActor~\cite{Akka,ProtoActor}.
+Poison-pill messages are common across actor systems, including Akka and ProtoActor \cite{Akka,ProtoActor}.
 Poison-pill messages inform an actor to terminate.
 In \CFA, due to the allocation of actors and lack of garbage collection, there needs to be a suite of poison-pills.
@@ -876,7 +1019,7 @@
 	& \multicolumn{1}{c|}{\CFA (100M)} & \multicolumn{1}{c|}{CAF (10M)} & \multicolumn{1}{c|}{Akka (100M)} & \multicolumn{1}{c|}{\uC (100M)} & \multicolumn{1}{c@{}}{ProtoActor (100M)} \\
 	\hline
-	AMD		& \input{data/nasusSendStatic} \\
+	AMD		& \input{data/pykeSendStatic} \\
 	\hline
-	Intel	& \input{data/pykeSendStatic}
+	Intel	& \input{data/nasusSendStatic}
 \end{tabular}
 
@@ -889,7 +1032,7 @@
 	& \multicolumn{1}{c|}{\CFA (20M)} & \multicolumn{1}{c|}{CAF (2M)} & \multicolumn{1}{c|}{Akka (2M)} & \multicolumn{1}{c|}{\uC (20M)} & \multicolumn{1}{c@{}}{ProtoActor (2M)} \\
 	\hline
-	AMD		& \input{data/nasusSendDynamic} \\
+	AMD		& \input{data/pykeSendDynamic} \\
 	\hline
-	Intel	& \input{data/pykeSendDynamic}
+	Intel	& \input{data/nasusSendDynamic}
 \end{tabular}
 \end{table}
@@ -911,5 +1054,5 @@
 The results from the static/dynamic send benchmarks are shown in Figures~\ref{t:StaticActorMessagePerformance} and \ref{t:DynamicActorMessagePerformance} respectively.
 \CFA leads the charts in both benchmarks, largely due to the copy queue removing the majority of the envelope allocations.
-Additionally, the receive of all messages sent in \CFA is statically known and is determined via a function pointer cast, which incurrs a compile-time cost.
+Additionally, the receive of all messages sent in \CFA is statically known and is determined via a function pointer cast, which incurs a compile-time cost.
 All the other systems use their virtual system to find the correct behaviour at message send.
 This requires two virtual dispatch operations, which is an additional runtime send cost that \CFA does not have.
@@ -1056,5 +1199,5 @@
 
 This result is shown in Figure~\ref{f:cfaRepeatAMD} and \ref{f:cfaRepeatIntel} where the no-stealing version of \CFA performs better than both stealing variations.
-In particular on the Intel machine in Figure~\ref{f:cfaRepeatIntel}, the cost of stealing is higher, which can be seen in the vertical shift of Akka, CAF and CFA results in Figure~\ref{f:RepeatIntel} (\uC and ProtoActor do not have work stealing).
+In particular on the Intel machine in Figure~\ref{f:cfaRepeatIntel}, the cost of stealing is higher, which can be seen in the vertical shift of Akka, CAF and \CFA results in Figure~\ref{f:RepeatIntel} (\uC and ProtoActor do not have work stealing).
 The shift for CAF is particularly large, which further supports the hypothesis that CAF's work stealing is particularly eager.
 In both the executor and the repeat benchmark CAF performs poorly.
@@ -1083,5 +1226,5 @@
 
 Figure~\ref{t:ExecutorMemory} shows the high memory watermark of the actor systems when running the executor benchmark on 48 cores.
-\CFA has a high watermark relative to the other non-garbage-collected systems \uC, and CAF.
+\CFA has a high watermark relative to the other non-garbage collected systems \uC, and CAF.
 This is a result of the copy queue data structure, as it will over-allocate storage and not clean up eagerly, whereas the per envelope allocations will always allocate exactly the amount of storage needed.
 Despite having a higher watermark, the \CFA memory usage remains comparable to other non-garbage-collected systems.