Timestamp: Jul 17, 2023, 9:22:03 AM
Author: Peter A. Buhr <pabuhr@…>
Branches: master
Children: 847ab8f
Parents: 432e1de
Message: final proofread of actor chapter
File: 1 edited
  • doc/theses/colby_parsons_MMAth/text/actors.tex

r432e1de → rbcc56c9
  1107 1107  Error bars showing the 95\% confidence intervals appear on each point in the graphs.
  1108 1108  If the confidence bars are small enough, they may be obscured by the data point.
- 1109       In this section, \uC is compared to \CFA frequently, as the actor system in \CFA is heavily based off of the \uC's actor system.
+      1109  In this section, \uC is compared to \CFA frequently, as the actor system in \CFA is heavily based off of \uC's actor system.
  1110 1110  As such, the performance differences that arise are largely due to the contributions of this work.
  1111 1111  Future work is to port some of the new \CFA work back to \uC.
     
  1231 1231  Whereas, the per envelope allocations of \uC and CAF allocate exactly the amount of storage needed and eagerly deallocate.
  1232 1232  The extra storage is the standard tradeoff of time versus space, where \CFA shows better performance.
+      1233  As future work, tuning parameters can be provided to adjust the frequency and/or size of the copy-queue expansion.
  1233 1234  
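The copy-queue tradeoff above can be sketched as follows: a minimal C sketch of a contiguous buffer that grows geometrically and is never shrunk, so later sends reuse the over-allocated storage. The names, the doubling policy, and the layout are illustrative assumptions, not the thesis's actual \CFA implementation.

```c
#include <stdlib.h>
#include <string.h>

// Envelopes are copied into one contiguous buffer that doubles when full and
// is never shrunk, trading extra space for fewer allocations (the tradeoff
// described above).  All names here are assumptions for illustration.
typedef struct {
    char *buf;
    size_t used, cap;
} copy_queue;

void cq_init(copy_queue *q, size_t cap) {
    q->buf = malloc(cap);
    q->used = 0;
    q->cap = cap;
}

// Append an envelope of `size` bytes, growing the buffer geometrically;
// the growth factor is one of the tuning parameters suggested above.
void cq_push(copy_queue *q, const void *envelope, size_t size) {
    while (q->used + size > q->cap) {
        q->cap *= 2;                       // never shrinks back
        q->buf = realloc(q->buf, q->cap);
    }
    memcpy(q->buf + q->used, envelope, size);
    q->used += size;
}
```

A per-envelope allocator (as in \uC and CAF) would instead `malloc`/`free` each envelope exactly, using less space but paying an allocation per message.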
    12341235\begin{table}
     
  1240 1241      \label{t:ExecutorMemory}
  1241 1242      \begin{tabular}{*{5}{r|}r}
- 1242               & \multicolumn{1}{c|}{\CFA} & \multicolumn{1}{c|}{CAF} & \multicolumn{1}{c|}{Akka} & \multicolumn{1}{c|}{\uC} & \multicolumn{1}{c@{}}{ProtoActor} \\
+      1243          & \multicolumn{1}{c|}{\CFA} & \multicolumn{1}{c|}{\uC} & \multicolumn{1}{c|}{CAF} & \multicolumn{1}{c|}{Akka} & \multicolumn{1}{c@{}}{ProtoActor} \\
  1243 1244      \hline
  1244 1245      AMD      & \input{data/pykeExecutorMem} \\
     
  1257 1258  The majority of the computation in this benchmark involves computing the final matrix, so this benchmark stresses the actor systems' ability to have actors run work, rather than stressing the message sending system, and might trigger some work stealing if a worker finishes early.
  1258 1259  
- 1259       The matrix-multiply benchmark uses input matrices $X$ and $Y$, which are both $3072$ by $3072$ in size.
+      1260  The matrix-multiply benchmark has input matrices $X$ and $Y$, which are both $3072$ by $3072$ in size.
  1260 1261  An actor is made for each row of $X$ and sent a message indicating the row of $X$ and the column of $Y$ to calculate a row of the result matrix $Z$.
  1261 1262  Because $Z$ is contiguous in memory, there can be minor cache write-contention at the row boundaries.
  1262 1263  
  1263 1264  Figures~\ref{f:MatrixAMD} and \ref{f:MatrixIntel} show the matrix-multiply results.
- 1264       Given that the bottleneck of this benchmark is the computation of the result matrix, it follows that the results are tightly clustered across all actor systems.
- 1265       \uC and \CFA have identical performance and in Figure~\ref{f:MatrixIntel} \uC pulls ahead of \CFA after 24 cores likely due to costs associated with work stealing while hyperthreading.
- 1266       It is hypothesized that CAF performs better in this benchmark compared to others due to its eager work stealing implementation, which is discussed further in Section~\ref{s:steal_perf}.
+      1265  There are two groupings, with Akka and ProtoActor being slightly slower than \uC, \CFA, and CAF.
+      1266  On the Intel, there is an unknown divergence between \uC and \CFA/CAF at 24 cores.
+      1267  Given that the bottleneck of this benchmark is the computation of the result matrix, all executors perform well on this embarrassingly parallel application.
+      1268  Hence, the results are tightly clustered across all actor systems.
+      1269  This result also suggests CAF has a good executor but poor message passing, which results in its poor performance in the other message-passing benchmarks.
  1267 1270  
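The per-actor work in this benchmark can be sketched as follows: each row actor, on receiving its message, computes one row of $Z$. This is a minimal C sketch of the benchmark's computation only, not the thesis's actual \CFA actor code, and the names are illustrative.

```c
#define N 4  // the benchmark uses 3072 by 3072; N is small here for illustration

// Work performed by one row actor: upon receiving a message naming row `r`,
// compute row r of the result, Z[r][c] = sum over k of X[r][k] * Y[k][c].
void compute_row(int r, const double X[N][N], const double Y[N][N], double Z[N][N]) {
    for (int c = 0; c < N; c++) {
        double sum = 0.0;
        for (int k = 0; k < N; k++)
            sum += X[r][k] * Y[k][c];
        Z[r][c] = sum;  // writes near row boundaries of the contiguous Z
                        // can contend in the cache, as noted above
    }
}
```

Because each message carries an entire row's worth of multiply-adds, the computation dominates message-passing cost, which is why the systems cluster tightly here.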
  1268 1271  \begin{figure}
     
  1325 1328  On Intel in Figure~\ref{f:BalanceOneIntel}, above 32 cores the performance gets worse for all variants due to hyperthreading.
  1326 1329  Here, the longest-victim and random heuristics perform the same.
- 1327       Note, the non-stealing variation of balance-one slows down slightly (no decrease in graph) as the cores increase, since a few ``dummy'' actors need to be made for each of the extra cores beyond the first to adversarially layout all loaded actors on the first core.
+      1330  Note, the non-stealing variation of balance-one slows down slightly (no decrease in graph) as the cores increase, since a few \emph{dummy} actors are created for each of the extra cores beyond the first to adversarially lay out all loaded actors on the first core.
  1328 1331  
  1329 1332  For the balance-multi benchmark in Figures~\ref{f:BalanceMultiAMD} and~\ref{f:BalanceMultiIntel}, the random heuristic outperforms the longest victim.
- 1330       This result is because the longest victim heuristic has a higher stealing cost as it needs to maintain timestamps and look at all timestamps before stealing.
+      1333  The reason is that the longest-victim heuristic has a higher stealing cost, as it needs to maintain timestamps and look at all timestamps before stealing.
  1331 1334  Additionally, a performance cost on the Intel is observed when hyperthreading kicks in after 24 cores in Figure~\ref{f:BalanceMultiIntel}.
  1332 1335  
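The cost difference between the two heuristics discussed above can be sketched in C. The function names and data layout are assumptions for illustration, not the thesis's actual implementation: the point is only that longest-victim pays an O(n) timestamp scan per steal attempt, while random selection is constant time with no bookkeeping.

```c
#include <stddef.h>

// Longest-victim: scan every worker's timestamp of last activity and steal
// from the oldest (the queue unserviced the longest).  The O(n) scan plus
// the timestamp maintenance on every dequeue is the extra cost noted above.
// Assumes at least two workers.
size_t longest_victim(const unsigned long timestamp[], size_t n_workers, size_t self) {
    size_t victim = (self + 1) % n_workers;
    for (size_t i = 0; i < n_workers; i++)
        if (i != self && timestamp[i] < timestamp[victim])
            victim = i;
    return victim;
}

// Random heuristic: constant-time choice with no per-queue bookkeeping.
size_t random_victim(size_t n_workers, size_t self, unsigned *seed) {
    *seed = *seed * 1103515245u + 12345u;   // simple LCG for illustration
    size_t victim = *seed % n_workers;
    return victim == self ? (victim + 1) % n_workers : victim;
}
```

With many loaded queues, as in balance-multi, the cheaper random choice still finds work most of the time, which matches the result above.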
     
  1366 1369  The single actor (the client) of this experiment is long running and maintains a lot of state, as it needs to know the handles of all the servers.
  1367 1370  When stealing the client or its respective queue (in \CFA's inverted model), moving the client incurs a high cost due to cache invalidation.
- 1368       This worst-case steal is likely to happen since there is little other work in the system between scatter/gather rounds.
+      1371  This worst-case steal is likely to happen since there is no other work in the system between scatter/gather rounds.
  1369 1372  However, all heuristics are comparable in performance on the repeat benchmark.
- 1370       This result is surprising especially for the No-Stealing variant, which one would expect to have better performance than the stealing variants.
- 1371       This is not the case, since the stealing happens lazily and fails fast, the queue containing the long-running client actor is rarely stolen.
- 1372       
- 1373       Work stealing performance can be further analyzed by reexamining the executor and repeat benchmarks in Figures~\ref{f:ExecutorBenchmark} and \ref{f:RepeatBenchmark}, respectively.
- 1374       In both benchmarks, CAF performs poorly.
- 1375       It is hypothesized that CAF has an aggressive work stealing algorithm that eagerly attempts to steal.
- 1376       This results in the poor performance with small messages containing little work per message in both of these benchmarks.
- 1377       In comparison with the other systems, \uC does well on both benchmarks since it does not have work stealing.
+      1373  This result is surprising, especially for the No-Stealing variant, which should have better performance than the stealing variants.
+      1374  However, stealing happens lazily and fails fast, hence the queue containing the long-running client actor is rarely stolen.
+      1375  
+      1376  % Work stealing performance can be further analyzed by \emph{reexamining} the executor and repeat benchmarks in Figures~\ref{f:ExecutorBenchmark} and \ref{f:RepeatBenchmark}.
+      1377  % In both benchmarks, CAF performs poorly.
+      1378  % It is hypothesized that CAF has an aggressive work stealing algorithm that eagerly attempts to steal.
+      1379  % This results in the poor performance with small messages containing little work per message in both of these benchmarks.
+      1380  % In comparison with the other systems, \uC does well on both benchmarks since it does not have work stealing.
  1378 1381  
  1379 1382  Finally, Figures~\ref{f:cfaMatrixAMD} and~\ref{f:cfaMatrixIntel} show the effects of the stealing heuristics for the matrix-multiply benchmark.
- 1380       Here, there is negligible performance difference across stealing heuristics, likely due to the long running workload of each message.
- 1381       
- 1382       Stealing can still improve performance marginally in the matrix-multiply benchmark.
- 1383       In \ref{f:MatrixAMD} CAF performs better; few messages are sent, so the eager work stealing allows for the clean up of loose ends to occur faster.
- 1384       This hypothesis stems from experimentation with \CFA.
- 1385       CAF uses a randomized work stealing heuristic.
- 1386       Tuning the \CFA actor system to steal work much more eagerly with randomized victim selection heuristics provided similar results to what CAF achieved in the matrix benchmark.
- 1387       This experimental tuning performed much worse on all other microbenchmarks that we present, since they all perform a small amount of work per message, which may partially explain CAF's poor performance on other benchmarks.
+      1383  Here, there is negligible performance difference across stealing heuristics, because of the long-running workload of each message.
+      1384  
+      1385  In theory, work stealing might improve performance marginally for the matrix-multiply benchmark.
+      1386  Since all row actors cannot be created simultaneously at startup, they correspondingly do not shut down simultaneously.
+      1387  Hence, there is a small window at the start and end with idle workers, so work stealing might improve performance.
+      1388  For example, in Figure~\ref{f:MatrixAMD}, CAF is slightly better than \uC and \CFA, but not on the Intel.
+      1389  Hence, it is difficult to attribute the AMD gain to the aggressive work stealing in CAF.
  1388 1390  
  1389 1391  \begin{figure}