Index: doc/theses/colby_parsons_MMAth/text/actors.tex
===================================================================
--- doc/theses/colby_parsons_MMAth/text/actors.tex	(revision 0aa77e6ab48fb3e593de734c3b00e37bf7ece4d1)
+++ doc/theses/colby_parsons_MMAth/text/actors.tex	(revision a2eb21a3e75fa9dd8fd465b9179ac734db266da5)
@@ -1161,4 +1161,116 @@
 Tuning the garage collection might reduce garbage-collection cost, but this exercise is beyond the scope of this work.
 
+\subsection{Executor}\label{s:executorPerf}
+
+The microbenchmarks in this section are designed to stress the executor.
+The executor is the scheduler of an actor system and is responsible for organizing the interaction of executor threads to service the needs of an actor workload.
+Three benchmarks are run: executor, repeat, and high-memory watermark.
+
+The executor benchmark creates 40,000 actors, organizes the actors into adjacent groups of 100, where an actor sends a message to each group member, including itself, in round-robin order, and repeats the sending cycle 400 times.
+This microbenchmark is designed to flood the executor with a large number of messages flowing among actors.
+Given there is no work associated with each message, other than sending more messages, the intended bottleneck of this experiment is the executor message send process.
+
+\begin{figure}
+	\centering
+	\subfloat[AMD Executor Benchmark]{
+		\resizebox{0.5\textwidth}{!}{\input{figures/nasusExecutor.pgf}}
+		\label{f:ExecutorAMD}
+	}
+	\subfloat[Intel Executor Benchmark]{
+		\resizebox{0.5\textwidth}{!}{\input{figures/pykeExecutor.pgf}}
+		\label{f:ExecutorIntel}
+	}
+	\caption{Executor benchmark comparing actor systems (lower is better).}
+\end{figure}
+
+Figures~\ref{f:ExecutorIntel} and~\ref{f:ExecutorAMD} show the results of the AMD and Intel executor benchmark.
+There are three groupings of results, and the difference between AMD and Intel is small.
+CAF is significantly slower than the other actor systems; followed by a tight grouping of uC++, ProroActor, and Akka; and finally \CFA with the lowest runtime relative to its peers.
+The difference in runtime between \uC and \CFA is largely due to the copy queue described in Section~\ref{s:copyQueue}.
+The copy queue both reduces and consolidates allocations, heavily reducing contention on the memory allocator.
+Additionally, due to the static typing in \CFA's actor system, there is no expensive dynamic (RTTI) casts that occur in \uC to discriminate messages types.
+Note, while dynamic cast is relatively inexpensive, the remaining send cost in both \uC and \CFA is small;
+hence, the relative cost for the RTTI in \uC is significant.
+
+\begin{figure}
+	\centering
+	\subfloat[AMD Repeat Benchmark]{
+		\resizebox{0.5\textwidth}{!}{\input{figures/nasusRepeat.pgf}}
+		\label{f:RepeatAMD}
+	}
+	\subfloat[Intel Repeat Benchmark]{
+		\resizebox{0.5\textwidth}{!}{\input{figures/pykeRepeat.pgf}}
+		\label{f:RepeatIntel}
+	}
+	\caption{The repeat benchmark comparing actor systems (lower is better).}
+\end{figure}
+
+The repeat benchmark also evaluates the executor.
+It stresses the executor's ability to withstand contention on queues.
+The repeat benchmark repeatedly fans out messages from a single client to 100,000 servers who then respond back to the client.
+The scatter and gather are repeats 200 times.
+The messages from the servers to the client all come to the same mailbox queue associated with the client, resulting in high contention among servers.
+As such, this benchmark does not scale with the number of processors, since more processors result in higher contention on the single mailbox queue.
+
+Figures~\ref{f:RepeatAMD} and~\ref{f:RepeatIntel} show the results of the AMD and Intel repeat benchmark.
+The results are spread out more, and there is a difference between AMD and Intel.
+Again, CAF is significantly slower than the other actor systems.
+On the AMD there is a tight grouping of uC++, ProroActor, and Akka;
+on the Intel, uC++, ProroActor, and Akka are spread out.
+Finally, \CFA runs consistently on both of the AMD and Intel, and is faster than \uC on the AMD, but slightly slower on the Intel.
+Here, gains from using the copy queue are much less apparent.
+
+\begin{table}
+	\centering
+	\setlength{\extrarowheight}{2pt}
+	\setlength{\tabcolsep}{5pt}
+
+	\caption{Executor Program Memory High Watermark}
+	\label{t:ExecutorMemory}
+	\begin{tabular}{*{5}{r|}r}
+		& \multicolumn{1}{c|}{\CFA} & \multicolumn{1}{c|}{CAF} & \multicolumn{1}{c|}{Akka} & \multicolumn{1}{c|}{\uC} & \multicolumn{1}{c@{}}{ProtoActor} \\
+		\hline
+		AMD		& \input{data/pykeExecutorMem} \\
+		\hline
+		Intel	& \input{data/nasusExecutorMem}
+	\end{tabular}
+\end{table}
+
+Table~\ref{t:ExecutorMemory} shows the high memory watermark of the actor systems when running the executor benchmark on 48 cores measured using the @time@ command..
+\CFA's high watermark is slightly higher than the other non-garbage collected systems \uC and CAF.
+This increase is from the over-allocation in the copy-queue data-structure with lazy deallocation.
+Whereas, the per envelope allocations of \uC and CFA allocate exactly the amount of storage needed and eagerly deallocate.
+The extra storage is the standard tradeoff of time versus space, where \CFA shows better performance.
+
+\subsection{Matrix Multiply}
+
+The matrix-multiply benchmark evaluates the actor systems in a practical application, where actors concurrently multiply two matrices.
+In detail, given $Z_{m,r} = X_{m,n} \cdot Y_{n,r}$, the matrix multiply is defined as:
+\begin{displaymath}
+X_{i,j} \cdot Y_{j,k} = \left( \sum_{c=1}^{j} X_{row,c}Y_{c,column} \right)_{i,k}
+\end{displaymath}
+The majority of the computation in this benchmark involves computing the final matrix, so this benchmark stresses the actor systems' ability to have actors run work, rather than stressing the message sending system.
+
+The matrix-multiply benchmark uses input matrices $X$ and $Y$, which are both $3072$ by $3072$ in size.
+An actor is made for each row of $X$ and sent a message indicating the row of $X$ and the column of $Y$ to calculate a row of the result matrix $Z$.
+
+Figures~\ref{f:MatrixAMD} and \ref{f:MatrixIntel} show the matrix multiple results.
+Given that the bottleneck of this benchmark is the computation of the result matrix, it follows that the results are tightly clustered across all actor systems.
+\uC and \CFA have identical performance and in Figure~\ref{f:MatrixIntel} \uC pulls ahead of \CFA after 24 cores likely due to costs associated with work stealing while hyperthreading.
+As mentioned in \ref{s:executorPerf}, it is hypothesized that CAF performs better in this benchmark compared to others due to its eager work stealing implementation.
+
+\begin{figure}
+	\centering
+	\subfloat[AMD Matrix Benchmark]{
+		\resizebox{0.5\textwidth}{!}{\input{figures/nasusMatrix.pgf}}
+		\label{f:MatrixAMD}
+	}
+	\subfloat[Intel Matrix Benchmark]{
+		\resizebox{0.5\textwidth}{!}{\input{figures/pykeMatrix.pgf}}
+		\label{f:MatrixIntel}
+	}
+	\caption{The matrix benchmark comparing actor systems (lower is better).}
+\end{figure}
+
 \subsection{Work Stealing}
 
@@ -1211,33 +1323,4 @@
 Note that the non stealing variation of balance-one will slow down marginally as the cores increase due to having to create more dummy actors on the inactive cores during startup.
 
-\subsection{Executor}\label{s:executorPerf}
-
-The microbenchmarks in this section are designed to stress the executor.
-The executor is the scheduler of an actor system and is responsible for organizing the interaction of executor threads to service the needs of a workload.
-In the executor benchmark, 40,000 actors are created and each actor is placed into a group of 100, who send and receive 100 messages to/from each actors in their group.
-Each time an actor completes all their sends and receives, they are done a round.
-After all groups have completed 400 rounds the system terminates.
-This microbenchmark is designed to flood the executor with a large number of messages flowing between actors.
-Given there is no work associated with each message, other than sending more messages, the intended bottleneck of this experiment is the executor message send process.
-
-\begin{figure}
-	\centering
-	\subfloat[AMD Executor Benchmark]{
-		\resizebox{0.5\textwidth}{!}{\input{figures/nasusExecutor.pgf}}
-		\label{f:ExecutorAMD}
-	}
-	\subfloat[Intel Executor Benchmark]{
-		\resizebox{0.5\textwidth}{!}{\input{figures/pykeExecutor.pgf}}
-		\label{f:ExecutorIntel}
-	}
-	\caption{The executor benchmark comparing actor systems (lower is better).}
-\end{figure}
-
-The results of the executor benchmark in Figures~\ref{f:ExecutorIntel} and \ref{f:ExecutorAMD} show \CFA with the lowest runtime relative to its peers.
-The difference in runtime between \uC and \CFA is largely due to the usage of the copy queue described in Section~\ref{s:copyQueue}.
-The copy queue both reduces and consolidates allocations, heavily reducing contention on the memory allocator.
-Additionally, due to the static typing in \CFA's actor system, it is able to get rid of expensive dynamic casts that occur in \uC to discriminate messages by type.
-Note that dynamic casts are usually not very expensive, but relative to the high performance of the rest of the implementation of the \uC actor system, the cost is significant.
-
 \begin{figure}
 	\centering
@@ -1257,26 +1340,4 @@
 \begin{figure}
 	\centering
-	\subfloat[AMD Repeat Benchmark]{
-		\resizebox{0.5\textwidth}{!}{\input{figures/nasusRepeat.pgf}}
-		\label{f:RepeatAMD}
-	}
-	\subfloat[Intel Repeat Benchmark]{
-		\resizebox{0.5\textwidth}{!}{\input{figures/pykeRepeat.pgf}}
-		\label{f:RepeatIntel}
-	}
-	\caption{The repeat benchmark comparing actor systems (lower is better).}
-\end{figure}
-
-The repeat microbenchmark also evaluates the executor.
-It stresses the executor's ability to withstand contention on queues, as it repeatedly fans out messages from a single client to 100000 servers who then all respond to the client.
-After this scatter and gather repeats 200 times the benchmark terminates.
-The messages from the servers to the client will likely all come in on the same queue, resulting in high contention.
-As such this benchmark will not scale with the number of processors, since more processors will result in higher contention.
-In Figure~\ref{f:RepeatAMD} we can see that \CFA performs well compared to \uC, however by less of a margin than the executor benchmark.
-One factor in this result is that the contention on the queues poses a significant bottleneck.
-As such the gains from using the copy queue are much less apparent.
-
-\begin{figure}
-	\centering
 	\subfloat[AMD \CFA Repeat Benchmark]{
 		\resizebox{0.5\textwidth}{!}{\input{figures/nasusCFARepeat.pgf}}
@@ -1289,10 +1350,4 @@
 	\caption{The repeat benchmark comparing \CFA stealing heuristics (lower is better).}
 \end{figure}
-
-In Figure~\ref{f:RepeatIntel} \uC and \CFA are very comparable.
-In comparison with the other systems \uC does well on the repeat benchmark since it does not have work stealing.
-The client of this experiment is long running and maintains a lot of state, as it needs to know the handles of all the servers.
-When stealing the client or its respective queue (in \CFA's inverted model), moving the client incurs a high cost due to cache invalidation.
-As such stealing the client can result in a hit in performance.
 
 This result is shown in Figure~\ref{f:cfaRepeatAMD} and \ref{f:cfaRepeatIntel} where the no-stealing version of \CFA performs better than both stealing variations.
@@ -1307,55 +1362,11 @@
 In \CFA if the system is tuned so that it steals work much more eagerly with a randomized it was able to replicate the results that CAF achieves in the matrix benchmark, but this tuning performed much worse on all other microbenchmarks that we present, since they all perform a small amount of work per message.
 
-\begin{table}[t]
-	\centering
-	\setlength{\extrarowheight}{2pt}
-	\setlength{\tabcolsep}{5pt}
-
-	\caption{Executor Program Memory High Watermark}
-	\label{t:ExecutorMemory}
-	\begin{tabular}{*{5}{r|}r}
-		& \multicolumn{1}{c|}{\CFA} & \multicolumn{1}{c|}{CAF} & \multicolumn{1}{c|}{Akka} & \multicolumn{1}{c|}{\uC} & \multicolumn{1}{c@{}}{ProtoActor} \\
-		\hline
-		AMD		& \input{data/pykeExecutorMem} \\
-		\hline
-		Intel	& \input{data/nasusExecutorMem}
-	\end{tabular}
-\end{table}
-
-Figure~\ref{t:ExecutorMemory} shows the high memory watermark of the actor systems when running the executor benchmark on 48 cores.
-\CFA has a high watermark relative to the other non-garbage collected systems \uC, and CAF.
-This is a result of the copy queue data structure, as it will over-allocate storage and not clean up eagerly, whereas the per envelope allocations will always allocate exactly the amount of storage needed.
-Despite having a higher watermark, the \CFA memory usage remains comparable to other non-garbage-collected systems.
-
-\subsection{Matrix Multiply}
-The matrix benchmark evaluates the actor systems in a practical application, where actors concurrently multiplies two matrices.
-The majority of the computation in this benchmark involves computing the final matrix, so this benchmark stresses the actor systems' ability to have actors run work, rather than stressing the executor or message sending system.
-
-Given $Z_{m,r} = X_{m,n} \cdot Y_{n,r}$, the matrix multiply is defined as:
-\begin{displaymath}
-X_{i,j} \cdot Y_{j,k} = \left( \sum_{c=1}^{j} X_{row,c}Y_{c,column} \right)_{i,k}
-\end{displaymath}
-
-The benchmark uses input matrices $X$ and $Y$ that are both $3072$ by $3072$ in size.
-An actor is made for each row of $X$ and is passed via message the information needed to calculate a row of the result matrix $Z$.
-
-
-Given that the bottleneck of the benchmark is the computation of the result matrix, it follows that the results in Figures~\ref{f:MatrixAMD} and \ref{f:MatrixIntel} are clustered closer than other experiments.
-In Figure~\ref{f:MatrixAMD} \uC and \CFA have identical performance and in Figure~\ref{f:MatrixIntel} \uC pulls ahead of \CFA after 24 cores likely due to costs associated with work stealing while hyperthreading.
-As mentioned in \ref{s:executorPerf}, it is hypothesized that CAF performs better in this benchmark compared to others due to its eager work stealing implementation.
+
+In comparison with the other systems \uC does well on the repeat benchmark since it does not have work stealing.
+The client of this experiment is long running and maintains a lot of state, as it needs to know the handles of all the servers.
+When stealing the client or its respective queue (in \CFA's inverted model), moving the client incurs a high cost due to cache invalidation.
+As such stealing the client can result in a hit in performance.
+
 In Figures~\ref{f:cfaMatrixAMD} and \ref{f:cfaMatrixIntel} there is little negligible performance difference across \CFA stealing heuristics.
-
-\begin{figure}
-	\centering
-	\subfloat[AMD Matrix Benchmark]{
-		\resizebox{0.5\textwidth}{!}{\input{figures/nasusMatrix.pgf}}
-		\label{f:MatrixAMD}
-	}
-	\subfloat[Intel Matrix Benchmark]{
-		\resizebox{0.5\textwidth}{!}{\input{figures/pykeMatrix.pgf}}
-		\label{f:MatrixIntel}
-	}
-	\caption{The matrix benchmark comparing actor systems (lower is better).}
-\end{figure}
 
 \begin{figure}