Index: doc/theses/colby_parsons_MMAth/text/actors.tex
===================================================================
--- doc/theses/colby_parsons_MMAth/text/actors.tex	(revision 8909a2df40a8f6d082a08016ddafb750b80ac34d)
+++ doc/theses/colby_parsons_MMAth/text/actors.tex	(revision f519bd80c6e089c8b3b09b5c2c400428aebde72b)
@@ -529,17 +529,15 @@
 
 \section{Work Stealing}\label{s:steal}
-Work stealing is a scheduling strategy to provide \Newterm{load balance}.
+Work stealing is a scheduling strategy to provide \Newterm{load balancing}.
 The goal is to increase resource utilization by having an idle thread steal work from a working thread.
-While there are multiple parts in a work-stealing scheduler, the two important components are the stealing mechanism and victim selection.
+While there are multiple parts in a work-stealing scheduler, two important components are the stealing mechanism and victim selection.
 
 \subsection{Stealing Mechanism}
-In work stealing, the stealing worker is called the \Newterm{thief} and the stolen worker is called the \Newterm{victim}.
+In work stealing, the stealing worker is called the \Newterm{thief} and the worker being stolen from is called the \Newterm{victim}.
 % Workers consume actors from their ready queue and execute their behaviours.
 % Through these behaviours, a worker inserts messages onto its own and other worker ready-queues.
-To steal, a thieve takes one or more actors from a victim's ready queue, so work stealing always results in a potential increase in contention on ready queues between the victim gulping from a queue and the thief stealing the queue.
+To steal, a thief takes work from a victim's ready queue, so work stealing always results in a potential increase in contention on ready queues between the victim gulping from a queue and the thief stealing the queue.
 This contention can slows down the victim's throughput.
 Note, the data structure used for the ready queue is not discussed since the largest cost is the mutual exclusion and its duration for safely performing the queue operations.
-
-% C_TODO: maybe insert stealing diagram
 
 The stealing mechanism in this work differs from most work-stealing actor-systems because of the message-centric (inverted) actor-system.
@@ -548,14 +546,22 @@
 In \CFA, the actor work-stealing implementation is unique because of the message-centric system.
 With this approach, it is impractical to steal actors because an actor's messages are distributed in temporal order along the message queue.
-To ensure sequential actor execution and \gls{fifo} message delivery, actor stealing requires finding and removing \emph{all} of an actor's messages, and inserting them consecutively in another message queue.
+To ensure sequential actor execution and \gls{fifo} message delivery in a message-centric system,  stealing requires finding and removing \emph{all} of an actor's messages, and inserting them consecutively in another message queue.
 This operation is $O(N)$ with a non-trivial constant.
-The only way for work stealing to become practical is to shard each worker's message queue, which also reduces contention, and steal queues to eliminate queue searching.
-
-Given queue stealing, my goal is to have an essentially zero-contention-cost stealing mechanism.
-This goal means work stealing has minimal affect on the performance of the victim.
+The only way for work stealing to become practical is to shard each worker's message queue, which also reduces contention, and to steal queues to eliminate queue searching.
+
+\begin{figure}
+\begin{center}
+\input{diagrams/steal.tikz}
+\end{center}
+\caption{Queue Stealing Mechanism}
+\label{f:steal}
+\end{figure}
+
+Given queue stealing, the goal of the presented work stealing implementation is to have an essentially zero-contention-cost stealing mechanism.
+Achieving this goal requires work stealing to have minimal (practically no) effect on the performance of the victim.
 The implication is that thieves cannot contend with a victim, and that a victim should perform no stealing related work unless it becomes a thief.
 In theory, this goal is not achievable, but practical results show the goal is virtually achieved.
 
-One important lesson I learned working on \uC actors~\cite{} and talking with fellow student Thierry Delisle, who examined work-stealing for user-threads in his Ph.D., is \emph{not} to aggressively steal.
+One important lesson learned while working on \uC actors~\cite{} and through discussions with fellow student Thierry Delisle, who examined work-stealing for user-threads in his Ph.D., is \emph{not} to aggressively steal.
 With reasonable workloads, being a thief should be a temporary state, \ie eventually work appears on the thief's ready-queues and it returns to normal operation.
 Furthermore, the act of \emph{looking} to find work is invasive (Heisenberg uncertainty principle), possibly disrupting multiple victims.
@@ -563,8 +569,7 @@
 Note, the cost of stealing is not crucial for the thief because it does not have anything else to do except poll or block.
 
-My lazy-stealing approach is to select a victim, scan its queues once, and return immediately if a queue is stolen.
-Then perform a regular scan looking for work, where stolen work is placed at the end of the scan.
-Hence, only one victim is effected and there is a reasonable delay between stealing events by scanning all the thief's ready queue looking for its own work.
-If no work is found in the thief's queues because a steal is unsuccessful, it performs another steal from a different victim.
+As such a lazy-stealing approach is used; a thief select a victim, scan its queues once, and return immediately if a queue is stolen.
+The thief then returns to normal operation and conducts a regular scan over its own queues looking for work, where stolen work is placed at the end of the scan.
+Hence, only one victim is affected and there is a reasonable delay between stealing events by scanning all the thief's ready queue looking for its own work.
 This lazy examination by the thief has a low perturbation cost for victims, while still finding work in a moderately loaded system.
 In all work-stealing algorithms, there is the pathological case where there is too little work and too many workers;
@@ -572,5 +577,5 @@
 This case is not examined in this work.
 
-In detail, the \CFA work-stealing algorithm begins by iterating over its message queues twice without finding any work before it tries to steal a queue from another worker.
+In more detail, the \CFA work-stealing algorithm begins by iterating over its message queues twice without finding any work before it tries to steal a queue from another worker.
 Stealing a queue is done wait-free (\ie no busy waiting) with a few atomic instructions that only create contention with other stealing workers, not the victim.
 The complexity in the implementation is that victim gulping does not take the mailbox queue;
@@ -580,96 +585,105 @@
 otherwise there are two threads simultaneously running messages on actors in the two parts of the mailbox queue.
 To solve this problem, an atomic gulp also marks the mailbox queue as subdivided, making it ineligible for stealing.
-Hence, a thief checks for ineligible and non-empty before attempting an atomic steal of a queue,
-
-
+Hence, a thief checks if a queue is eligible and non-empty before attempting an atomic steal of a queue.
+
+Internally, the mailbox queues are accessed through two arrays of size $N$, that are shared among all workers.
+There is an array of mailbox queue objects @mailboxes@, and an array of pointers to the mailbox objects, @worker_queues@:
+\begin{cfa}
+work_queue * mailboxes;                   // master array of work request queues
+work_queue ** worker_queues;                // secondary array of work queues to allow for swapping
+\end{cfa}
+A diagram of the queue architecture and stealing mechanism is shown in Figure~\ref{f:steal}.
+
+When a send happens, it inserts a request directly into one of the @mailboxes@, but when a steal happens, two pointers in @worker_queues@ are swapped.
+Conceptually, @worker_queues@ represents the ownership relation between mailboxes and workers.
+Given $M$ workers and $N$ mailboxes, each worker owns a contiguous $M$/$N$ block of pointers in @worker_queues@.
+When a worker gulps, it dereferences one of the pointers in its section of @worker_queues@ and then gulps from the queue it points at.
+To transfer ownership of a mailbox from one worker to another, a pointer from each of the workers' ranges are swapped.
+This structure provides near-complete separation of stealing and gulping/send operations.
+As such, operations can happen on the queues independent of stealing, which avoids almost all contention between thief threads and victim threads.
 
 To steal a queue, a thief does the following:
 \begin{enumerate}[topsep=5pt,itemsep=3pt,parsep=0pt]
 \item
-chooses a victim, which is trivial because there is an array of $N$ pointers to the mailboxes (copy queues) that is subdivided into $M$/$N$ worker ranges \see{Section~\ref{s:executor}}.
-
-\item
-scan the victim's mailbox-range and test each mail queue in that range for a non-empty queue by performing a test-for-non-empty and swap.
-The conditional check reduces the number of atomic operations.
-The swap races to interchange the non-empty mail-queue pointer from the victim's range with an empty mail-queue pointer in the thief's range.
-Regardless, either a thief swaps or the victim gulps the mail queue;
-correctness is guaranteed, no matter which thread wins the race, because both threads are cycling through appropriate lists.
-Note, a thief can never exceeds its $M$/$N$ worker range because it is always exchanging queues with other workers.
-
-\item
-stops searching after a non-empty queue steal or all queues in the random worker's range are examined.
-The thief then returns to its scheduler and iterates over its messages queues, because new messages may have arrived during stealing, including a stolen queue.
-Stealing is only repeated after two consecutive iterations over its owned queues without finding work.
+chooses a victim. Victim selection heuristics are independent of the stealing mechanism and will be discussed in Section~\ref{s:victimSelect}.
+
+\item
+scan the victim's $M$/$N$ range of @worker_queues@ and non-atomically checks each mail queue to see if it is eligible and non-empty.
+If one such mail queue is found, the thief attempts to steal the queue by swapping two pointers in @worker_queues@.
+The swap races to interchange the non-empty mail-queue pointer from the victim's @worker_queues@ range with an empty mail-queue pointer in the thief's @worker_queues@ range.
+This swap can fail if another thief steals the queue, which will be discussed further in Section~\ref{s:swap}.
+Note, a thief never exceeds its $M$/$N$ worker range because it is always exchanging queues with other workers.
+If no such mail queue is found, no swap is attempted.
+
+\item
+stops searching after a successful queue steal, a failed queue steal or all queues in the victim's range are examined.
+The thief then resumes normal execution and ceases being a thief; it returns to its own queues and iterates over them, because new messages may have arrived during stealing, including ones in the potentially stolen queue.
+Stealing is only repeated after the worker completes two consecutive iterations over its owned queues without finding work.
 \end{enumerate}
-This approach largely eliminates contention among thieves and victims except for the rare moment when a thief and victim attempt to steal or gulp the same queue.
-As well, pushes to the queues by other workers can happen concurrently during the swap since pushes happen via the actor queue references.
-
-
-
-\begin{enumerate}[topsep=5pt,itemsep=3pt,parsep=0pt]
-\item
-selects a random worker's mailbox-range and tests each mail queue in that range for a non-empty queue.
-
-A queue can be stolen and gulped at the same time since they are independent operations.
-However, two workers cannot gulp from the same queue at the same time since that would violate the ordering and mutual exclusion guarantees of actors, so an avail flag is used on queues to stop this from happening.
-If a worker tries to gulp a queue and sees an avail flag up, the gulp fails, and it will move on to other queues and try to gulp it again later.
-stops searching after a non-empty queue steal or all queues in the random worker's range are examined.
-It stops searching after any queue steal (even an empty one) or all queues are examined.
-
-Stealing is only repeated after two consecutive iterations over its owned queues without finding work.
-Note, a thief can never exceeds its $M$/$N$ worker range because it is always exchanging queues with other workers.
-\end{enumerate}
-This approach largely eliminates contention among thieves and victims except for the rare moment when a thief and victim attempt to steal or gulp the same queue.
-The first and last sentence here are correct. I'm not sure I understand the second sentence.
-
-
-
-As an aside I think I need a diagram here since it is clear that the algorithm is not clear through my writing.
-I made a very quick one that I'll use here to explain, but I will work on one to add to the chapter.
-
-
-To give some accompanying code there are two arrays, the array of copy queues, and the array of pointers to the copy queues:
-\begin{cfa}
-work_queue * request_queues;                   // master array of work request queues
-work_queue ** worker_req_queues;                // secondary array of work queues to allow for swapping
-\end{cfa}
-\includegraphics[width=4in]{../figures/steal_diagram.eps}
-When a send happens, it inserts a request directly into one of the queues in @request_queues@, when a steal happens two pointers in @worker_req_queues@ are swapped.
-Each worker owns a section of @worker_req_queues@.
-When a worker gulps, it dereferences one of the pointers in its section of @worker_req_queues@ and then gulps from the queue it points at.
-As such, operations can happen on the queues directly independent of stealing, which avoids almost all contention between stealing threads and busy threads.
-
-If you want to poke some of the code, in @actor.hfa@ @steal_work()@ selects the victim, @choose_queue()@ iterates over the queues looking for non-empty queues and @try_swap_queues()@ performs the thread safe pointer swap.
-
-% The first key to this is that actors and workers maintain two distinct arrays of references to queues.
-% Actors will always receive messages via the same queues.
-% Workers, on the other hand will swap the pointers to queues in their shared array and operate on queues in the range of that array that they own.
-% Swapping queues is a matter of atomically swapping two pointers in the worker array.
-% As such pushes to the queues can happen concurrently during the swap since pushes happen via the actor queue references.
-
-% Gulping can also occur during queue swapping, but the implementation requires more nuance than the pushes.
-% When a worker is not stealing it iterates across its own range of queues and gulps them one by one.
-
-Correctness of gulping is slightly more complex with stealing.
-When a worker operates on a queue, it first copies the queue pointer from the mailbox array to a local variable.
+
+This approach largely eliminates contention among thieves and victims except for the rare moment when a thief steals a queue and the victim and thief concurrently attempt to gulp the same queue.
+When a victim operates on a queue, it first copies the queue pointer from @worker_queues@ to a local variable.
 It then uses that local variable for all queue operations until it moves to the next index of its range of the queue array.
 This ensures that any swaps do not interrupt gulping operations, however this introduces a correctness issue.
+There is a window for a race condition between the victim and a thief.
+Once a victim copies the queue pointer from @worker_queues@, a thief could steal that pointer and both may try to gulp from the same queue.
+These two gulps cannot be allowed to happen concurrently.
 If any behaviours from a queue are run by two workers at a time it violates both mutual exclusion and the actor ordering guarantees.
-As such this must be avoided.
-To avoid this each queue has a @being_processed@ flag that is atomically set to @true@ when a queue is gulped.
-The flag indicates that a queue is being processed locally and is set back to @false@ once the local processing is finished.
-If a worker attempts to gulp a queue and finds that the @being_processed@ flag is @true@, it does not gulp the queue and moves on to the next queue in its range.
+To avoid concurrent gulps, each queue stores a @being_processed@ flag that is atomically set to @true@ when a queue is gulped.
+The flag indicates that a queue is being processed by a worker and is set back to @false@ once the local processing is finished.
+If a worker, either victim or thief, attempts to gulp a queue and finds that the @being_processed@ flag is @true@, it does not gulp the queue and moves on to the next queue in its range.
 This is a source of contention between victims and thieves since a thief may steal a queue and set @being_processed@ to @true@ between a victim saving a pointer to a queue and gulping it.
 However, the window for this race is very small, making this contention rare.
-This is why the claim is made that this mechanism is zero-victim-cost in practice but not in theory.
-By collecting statistics on failed gulps due to the @being_processed@ flag, it is found that this contention occurs ~0.05\% of the time when a gulp occurs.
+This is why this mechanism is zero-victim-cost in practice but not in theory.
+By collecting statistics on failed gulps due to the @being_processed@ flag, it is found that this contention occurs ~0.05\% of the time when a gulp occurs (1 in every 2000 gulps). 
 Hence, the claim is made that this stealing mechanism has zero-victim-cost in practice.
 
-
-\subsection{Queue Swap Correctness}
+\subsection{Queue Pointer Swap}\label{s:swap}
+Two atomically swap two pointers in @worker_queues@, a novel wait-free swap algorithm is used.
+The novel wait-free swap is effectively a special case of a DCAS.
+DCAS stands for Double Compare-And-Swap, which is a more general version of the Compare-And-Swap (CAS) atomic operation~\cite{Doherty04}.
+CAS compares is a read-modify-write operation available on most modern architectures which atomically compares an expected value with a memory location.
+If the expected value and value in memory are equal it then writes a new value into the memory location.
+A sample software implemention of CAS follows.
+\begin{cfa}
+// assume this routine executes atomically
+bool CAS( val * ptr, val expected, val new ) {
+	if ( *ptr != expected )
+		return false;
+	*ptr = new;
+	return true;
+}
+\end{cfa}
+As shown CAS only operates on one memory location.
+Where CAS operates on a single memory location and some values, DCAS operates on two memory locations.
+\begin{cfa}
+// assume this routine executes atomically
+bool DCAS( val * addr1, val * addr2, val old1, val old2, val new1, val new2 ) {
+	if ( ( *addr1 == old1 ) && ( *addr2 == old2 ) ) {
+        *addr1 = new1;
+        *addr2 = new2;
+        return true;
+    }
+    return false;
+}
+\end{cfa}
+The DCAS implemented in this work is special cased in two ways.
+First of all, a DCAS is more powerful than what is needed to swap two pointers.
+A double atomic swap (DAS) is all that is needed for this work.
+The atomic swap provided by most modern hardware requires that at least one operand is a register.
+A DAS would relax that restriction so that both operands of the swap could be memory locations.
+As such a DAS can be written in terms of the DCAS above as follows.
+\begin{cfa}
+bool DAS( val * addr1, val * addr2 ) {
+    return DCAS( addr1, addr2, *addr1, *addr2, *addr2, *addr1 );
+}
+\end{cfa}
+The other special case is that the values being swapped will never null pointers.
+This allows the DAS implementation presented to use null pointers as intermediate values during the swap.
+
 Given the wait-free swap used is novel, it is important to show that it is correct.
-Firstly, it is clear to show that the swap is wait-free since all workers will fail or succeed in swapping the queues in a finite number of steps since there are no locks or looping.
+It is clear to show that the swap is wait-free since all thieves will fail or succeed in swapping the queues in a finite number of steps since there are no locks or looping.
 There is no retry mechanism in the case of a failed swap, since a failed swap either means the work was already stolen, or that work was stolen from the thief.
-In both cases it is apropos for a thief to given up on stealing.
+In both cases it is apropos for a thief to give up on stealing.
 \CFA-style pseudocode for the queue swap is presented below.
 The swap uses compare-and-swap (@CAS@) which is just pseudocode for C's @__atomic_compare_exchange_n@.
@@ -681,50 +695,43 @@
 void swap( uint victim_idx, uint my_idx ) {
 	// Step 0:
-	work_queue * my_queue = request_queues[my_idx];
-	work_queue * vic_queue = request_queues[victim_idx];
+	work_queue * my_queue = mailboxes[my_idx];
+	work_queue * vic_queue = mailboxes[victim_idx];
 	// Step 2:
-	request_queues[my_idx] = 0p;
+	mailboxes[my_idx] = 0p;
 	// Step 3:
-	request_queues[victim_idx] = my_queue;
+	mailboxes[victim_idx] = my_queue;
 	// Step 4:
-	request_queues[my_idx] = vic_queue;
-}
-\end{cfa}
-
-Step 1 is missing in the sequential example since in only matter in the concurrent context presented later.
+	mailboxes[my_idx] = vic_queue;
+}
+\end{cfa}
+
+Step 1 is missing in the sequential example since it only matters in the concurrent context presented later.
 By looking at the sequential swap it is easy to see that it is correct.
 Temporary copies of each pointer being swapped are stored, and then the original values of each pointer are set using the copy of the other pointer.
 
 \begin{cfa}
-// This routine is atomic
-bool CAS( work_queue ** ptr, work_queue ** old, work_queue * new ) {
-	if ( *ptr != *old )
-		return false;
-	*ptr = new;
-	return true;
-}
 
 bool try_swap_queues( worker & this, uint victim_idx, uint my_idx ) with(this) {
 	// Step 0:
-	// request_queues is the shared array of all sharded queues
-	work_queue * my_queue = request_queues[my_idx];
-	work_queue * vic_queue = request_queues[victim_idx];
+	// mailboxes is the shared array of all sharded queues
+	work_queue * my_queue = mailboxes[my_idx];
+	work_queue * vic_queue = mailboxes[victim_idx];
 
 	// Step 1:
-	// If either queue is 0p then they are in the process of being stolen
+	// If the victim queue is 0p then they are in the process of stealing so return false
 	// 0p is Cforall's equivalent of C++'s nullptr
 	if ( vic_queue == 0p ) return false;
 
 	// Step 2:
-	// Try to set thief's queue ptr to be 0p.
-	// If this CAS fails someone stole thief's queue so return false
-	if ( !CAS( &request_queues[my_idx], &my_queue, 0p ) )
+	// Try to set our own (thief's) queue ptr to be 0p.
+	// If this CAS fails someone stole our (thief's) queue so return false
+	if ( !CAS( &mailboxes[my_idx], &my_queue, 0p ) )
 		return false;
 
 	// Step 3:
-	// Try to set victim queue ptr to be thief's queue ptr.
+	// Try to set victim queue ptr to be our (thief's) queue ptr.
 	// If it fails someone stole the other queue, so fix up then return false
-	if ( !CAS( &request_queues[victim_idx], &vic_queue, my_queue ) ) {
-		request_queues[my_idx] = my_queue; // reset queue ptr back to prev val
+	if ( !CAS( &mailboxes[victim_idx], &vic_queue, my_queue ) ) {
+		mailboxes[my_idx] = my_queue; // reset queue ptr back to prev val
 		return false;
 	}
@@ -734,5 +741,5 @@
 	// Thief's ptr is 0p so no one will touch it
 	// Write back without CAS is safe
-	request_queues[my_idx] = vic_queue;
+	mailboxes[my_idx] = vic_queue;
 	return true;
 }