Index: doc/theses/colby_parsons_MMAth/text/actors.tex
===================================================================
--- doc/theses/colby_parsons_MMAth/text/actors.tex	(revision f80e0f1ec5c1d6ca89e521bef2589278610f15b4)
+++ doc/theses/colby_parsons_MMAth/text/actors.tex	(revision 614868b3cb2ae1807fc194c27193d919b549415a)
@@ -1084,5 +1084,4 @@
 The performance of \CFA's actor system is tested using a suite of microbenchmarks, and compared with other actor systems.
 Most of the benchmarks are the same as those presented in \cite{Buhr22}, with a few additions.
-% C_TODO cite actor paper
 This work compares with the following actor systems: \CFA 1.0, \uC 7.0.0, Akka Typed 2.7.0, CAF 0.18.6, and ProtoActor-Go v0.0.0-20220528090104-f567b547ea07.
 Akka Classic is omitted as Akka Typed is their newest version and seems to be the direction they are headed.
@@ -1096,7 +1095,7 @@
 
 The benchmarks are run on 1--48 cores.
-On the Intel, with 24 core sockets, there is the choice to either hopping sockets or using hyperthreads on the same socket.
+On the Intel, with 24 core sockets, there is the choice to either hop sockets or use hyperthreads on the same socket.
 Either choice causes a blip in performance, which is seen in the subsequent performance graphs.
-The choice is to use hyperthreading instead of hopping sockets for experiments with more than 24 cores.
+The choice in this work is to use hyperthreading instead of hopping sockets for experiments with more than 24 cores.
 
 All benchmarks are run 5 times and the median is taken.
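The median-of-5 aggregation used above can be made explicit with a short C helper. This is a hypothetical sketch of the aggregation step only, not part of the actual benchmark harness:

```c
/* Illustrative median-of-5: insertion-sort the five run times in place
 * and return the middle element. The median is preferred over the mean
 * here because it is robust to a single outlier run. */
double median5(double t[5]) {
    for (int i = 1; i < 5; i++) {                  /* insertion sort */
        double key = t[i];
        int j = i - 1;
        while (j >= 0 && t[j] > key) { t[j + 1] = t[j]; j--; }
        t[j + 1] = key;
    }
    return t[2];                                   /* middle of 5 sorted values */
}
```

For example, hypothetical run times of 3.1, 2.9, 3.0, 3.4, and 2.8 seconds yield a reported median of 3.0.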
@@ -1159,5 +1158,5 @@
 However, Akka and ProtoActor slow down by two orders of magnitude.
 This difference is likely a result of Akka and ProtoActor's garbage collection, which results in performance delays for allocation-heavy workloads, whereas \uC and \CFA have explicit allocation/deallocation.
-Tuning the garage collection might reduce garbage-collection cost, but this exercise is beyond the scope of this work.
+Turning off the garbage collection might reduce garbage-collection cost, but this exercise is beyond the scope of this work.
 
 \subsection{Executor}\label{s:executorPerf}
@@ -1209,5 +1208,5 @@
 It stresses the executor's ability to withstand contention on queues.
 The repeat benchmark repeatedly fans out messages from a single client to 100,000 servers who then respond back to the client.
-The scatter and gather are repeats 200 times.
+The scatter and gather repeat 200 times.
 The messages from the servers to the client all come to the same mailbox queue associated with the client, resulting in high contention among servers.
 As such, this benchmark does not scale with the number of processors, since more processors result in higher contention on the single mailbox queue.
@@ -1219,5 +1218,7 @@
 on the Intel, \uC, ProtoActor, and Akka are spread out.
 Finally, \CFA runs consistently on both of the AMD and Intel, and is faster than \uC on the AMD, but slightly slower on the Intel.
-Here, gains from using the copy queue are much less apparent.
+This benchmark is a pathological case for work stealing actor systems, as the majority of work is being performed by the single actor conducting the scatter/gather.
+The impact of work stealing on this benchmark is discussed further in Section~\ref{s:steal_perf}.
+Here, gains from using the copy queue are much less apparent, due to the costs of stealing.
 
 \begin{table}
@@ -1258,5 +1259,5 @@
 Given that the bottleneck of this benchmark is the computation of the result matrix, it follows that the results are tightly clustered across all actor systems.
 \uC and \CFA have nearly identical performance, although in Figure~\ref{f:MatrixIntel} \uC pulls ahead of \CFA after 24 cores, likely due to costs associated with work stealing while hyperthreading.
-As mentioned in \ref{s:executorPerf}, it is hypothesized that CAF performs better in this benchmark compared to others due to its eager work stealing implementation.
+It is hypothesized that CAF performs better in this benchmark compared to others due to its eager work stealing implementation, which will be discussed further in Section~\ref{s:steal_perf}.
 
 \begin{figure}
@@ -1273,5 +1274,5 @@
 \end{figure}
 
-\subsection{Work Stealing}
+\subsection{Work Stealing}\label{s:steal_perf}
 
 \CFA's work stealing mechanism uses the longest-victim heuristic, introduced in Section~\ref{s:victimSelect}.
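One plausible reading of a longest-victim heuristic can be sketched in C: steal from the worker that has been busy with its current work the longest, on the assumption that a long-running worker has unprocessed work queued behind it. The struct fields, function name, and tie-breaking below are illustrative assumptions for this sketch, not \CFA's actual implementation (which is defined in Section~\ref{s:victimSelect}):

```c
#include <stddef.h>

/* Illustrative worker state: when each worker started its current work. */
typedef struct {
    long start_time;   /* timestamp the worker began its current work */
    int  has_work;     /* nonzero if the worker's queues are nonempty */
} worker_t;

/* Sketch of a longest-victim selection: scan the workers and pick the
 * one that has been busy the longest among those with stealable work. */
int pick_victim(const worker_t workers[], int n, long now) {
    int victim = -1;
    long longest = -1;
    for (int i = 0; i < n; i++) {
        if (!workers[i].has_work) continue;
        long busy = now - workers[i].start_time;
        if (busy > longest) { longest = busy; victim = i; }
    }
    return victim;     /* -1 if no worker currently has stealable work */
}
```

The contrast with CAF's randomized selection discussed later is that this scan targets a specific victim rather than probing workers at random.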
@@ -1352,4 +1353,7 @@
 
 This result is shown in Figures~\ref{f:cfaRepeatAMD} and \ref{f:cfaRepeatIntel}, where the no-stealing version of \CFA performs better than both stealing variations.
+As mentioned earlier, the repeat benchmark is a pathological case for work stealing systems since there is one actor with the majority of the work, and not enough other work to go around.
+If that actor or its mail queue is stolen by the work stealing system, it incurs a huge cost to move the work, as the single actor touches a lot of memory and needs to refill its local cache.
+This steal is likely to happen since there is little other work in the system between scatter/gather rounds.
 In particular on the Intel machine in Figure~\ref{f:cfaRepeatIntel}, the cost of stealing is higher, which can be seen in the vertical shift of Akka, CAF and \CFA results in Figure~\ref{f:RepeatIntel} (\uC and ProtoActor do not have work stealing).
 The shift for CAF is particularly large, which further supports the hypothesis that CAF's work stealing is particularly eager.
@@ -1360,6 +1364,6 @@
 This hypothesis stems from experimentation with \CFA.
 CAF uses a randomized work stealing heuristic.
-In \CFA if the system is tuned so that it steals work much more eagerly with a randomized it was able to replicate the results that CAF achieves in the matrix benchmark, but this tuning performed much worse on all other microbenchmarks that we present, since they all perform a small amount of work per message.
-
+Tuning the \CFA actor system to steal work much more eagerly with a randomized victim-selection heuristic produced results similar to CAF's in the matrix benchmark.
+However, this experimental tuning performed much worse on all the other microbenchmarks presented, since they all perform a small amount of work per message.
 
 In comparison with the other systems, \uC does well on the repeat benchmark since it does not have work stealing.
