Timestamp: Jul 17, 2023, 9:22:03 AM
Author: Peter A. Buhr <pabuhr@…>
Branches: master
Children: 847ab8f
Parents: 432e1de
Message: final proofread of actor chapter
File: 1 edited
  • doc/theses/colby_parsons_MMAth/text/actors.tex

r432e1de → rbcc56c9
  1107 1107  Error bars showing the 95\% confidence intervals appear on each point in the graphs.
  1108 1108  If the confidence bars are small enough, they may be obscured by the data point.
- 1109       In this section, \uC is compared to \CFA frequently, as the actor system in \CFA is heavily based off of the \uC's actor system.
+      1109  In this section, \uC is compared to \CFA frequently, as the actor system in \CFA is heavily based off of \uC's actor system.
  1110 1110  As such, the performance differences that arise are largely due to the contributions of this work.
  1111 1111  Future work is to port some of the new \CFA work back to \uC.
     
  1231 1231  Whereas, the per envelope allocations of \uC and CAF allocate exactly the amount of storage needed and eagerly deallocate.
  1232 1232  The extra storage is the standard tradeoff of time versus space, where \CFA shows better performance.
+      1233  As future work, tuning parameters can be provided to adjust the frequency and/or size of the copy-queue expansion.
  1233 1234  
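The copy-queue tradeoff above can be sketched as follows: a minimal C sketch of a contiguous buffer that grows geometrically and is never shrunk, so later sends reuse the over-allocated storage. The names, the doubling policy, and the layout are illustrative assumptions, not the thesis's actual \CFA implementation.

```c
#include <stdlib.h>
#include <string.h>

// Envelopes are copied into one contiguous buffer that doubles when full and
// is never shrunk, trading extra space for fewer allocations (the tradeoff
// described above).  All names here are assumptions for illustration.
typedef struct {
    char *buf;
    size_t used, cap;
} copy_queue;

void cq_init(copy_queue *q, size_t cap) {
    q->buf = malloc(cap);
    q->used = 0;
    q->cap = cap;
}

// Append an envelope of `size` bytes, growing the buffer geometrically;
// the growth factor is one of the tuning parameters suggested above.
void cq_push(copy_queue *q, const void *envelope, size_t size) {
    while (q->used + size > q->cap) {
        q->cap *= 2;                       // never shrinks back
        q->buf = realloc(q->buf, q->cap);
    }
    memcpy(q->buf + q->used, envelope, size);
    q->used += size;
}
```

A per-envelope allocator (as in \uC and CAF) would instead `malloc`/`free` each envelope exactly, using less space but paying an allocation per message.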
    12341235\begin{table}
     
  1240 1241      \label{t:ExecutorMemory}
  1241 1242      \begin{tabular}{*{5}{r|}r}
- 1242               & \multicolumn{1}{c|}{\CFA} & \multicolumn{1}{c|}{CAF} & \multicolumn{1}{c|}{Akka} & \multicolumn{1}{c|}{\uC} & \multicolumn{1}{c@{}}{ProtoActor} \\
+      1243          & \multicolumn{1}{c|}{\CFA} & \multicolumn{1}{c|}{\uC} & \multicolumn{1}{c|}{CAF} & \multicolumn{1}{c|}{Akka} & \multicolumn{1}{c@{}}{ProtoActor} \\
  1243 1244      \hline
  1244 1245      AMD      & \input{data/pykeExecutorMem} \\
     
  1257 1258  The majority of the computation in this benchmark involves computing the final matrix, so this benchmark stresses the actor systems' ability to have actors run work, rather than stressing the message sending system, and might trigger some work stealing if a worker finishes early.
  1258 1259  
- 1259       The matrix-multiply benchmark uses input matrices $X$ and $Y$, which are both $3072$ by $3072$ in size.
+      1260  The matrix-multiply benchmark has input matrices $X$ and $Y$, which are both $3072$ by $3072$ in size.
  1260 1261  An actor is made for each row of $X$ and sent a message indicating the row of $X$ and the column of $Y$ to calculate a row of the result matrix $Z$.
  1261 1262  Because $Z$ is contiguous in memory, there can be minor cache write-contention at the row boundaries.
  1262 1263  
  1263 1264  Figures~\ref{f:MatrixAMD} and \ref{f:MatrixIntel} show the matrix-multiply results.
- 1264       Given that the bottleneck of this benchmark is the computation of the result matrix, it follows that the results are tightly clustered across all actor systems.
- 1265       \uC and \CFA have identical performance and in Figure~\ref{f:MatrixIntel} \uC pulls ahead of \CFA after 24 cores likely due to costs associated with work stealing while hyperthreading.
- 1266       It is hypothesized that CAF performs better in this benchmark compared to others due to its eager work stealing implementation, which is discussed further in Section~\ref{s:steal_perf}.
+      1265  There are two groupings, with Akka and ProtoActor being slightly slower than \uC, \CFA, and CAF.
+      1266  On the Intel, there is an unknown divergence between \uC and \CFA/CAF at 24 cores.
+      1267  Given that the bottleneck of this benchmark is the computation of the result matrix, all executors perform well on this embarrassingly parallel application.
+      1268  Hence, the results are tightly clustered across all actor systems.
+      1269  This result also suggests CAF has a good executor but poor message passing, which results in its poor performance in the other message-passing benchmarks.
  1267 1270  
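The per-actor work in this benchmark can be sketched as follows: each row actor, on receiving its message, computes one row of $Z$. This is a minimal C sketch of the benchmark's computation only, not the thesis's actual \CFA actor code, and the names are illustrative.

```c
#define N 4  // the benchmark uses 3072 by 3072; N is small here for illustration

// Work performed by one row actor: upon receiving a message naming row `r`,
// compute row r of the result, Z[r][c] = sum over k of X[r][k] * Y[k][c].
void compute_row(int r, const double X[N][N], const double Y[N][N], double Z[N][N]) {
    for (int c = 0; c < N; c++) {
        double sum = 0.0;
        for (int k = 0; k < N; k++)
            sum += X[r][k] * Y[k][c];
        Z[r][c] = sum;  // writes near row boundaries of the contiguous Z
                        // can contend in the cache, as noted above
    }
}
```

Because each message carries an entire row's worth of multiply-adds, the computation dominates message-passing cost, which is why the systems cluster tightly here.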
  1268 1271  \begin{figure}
     
  1325 1328  On Intel in Figure~\ref{f:BalanceOneIntel}, above 32 cores the performance gets worse for all variants due to hyperthreading.
  1326 1329  Here, the longest-victim and random heuristics perform the same.
- 1327       Note, the non-stealing variation of balance-one slows down slightly (no decrease in graph) as the cores increase, since a few ``dummy'' actors need to be made for each of the extra cores beyond the first to adversarially layout all loaded actors on the first core.
+      1330  Note, the non-stealing variation of balance-one slows down slightly (no decrease in graph) as the cores increase, since a few \emph{dummy} actors are created for each of the extra cores beyond the first to adversarially lay out all loaded actors on the first core.
  1328 1331  
  1329 1332  For the balance-multi benchmark in Figures~\ref{f:BalanceMultiAMD} and~\ref{f:BalanceMultiIntel}, the random heuristic outperforms the longest victim.
- 1330       This result is because the longest victim heuristic has a higher stealing cost as it needs to maintain timestamps and look at all timestamps before stealing.
+      1333  The reason is that the longest-victim heuristic has a higher stealing cost, as it needs to maintain timestamps and look at all timestamps before stealing.
  1331 1334  Additionally, a performance cost on the Intel is observed when hyperthreading kicks in after 24 cores in Figure~\ref{f:BalanceMultiIntel}.
  1332 1335  
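The cost difference between the two heuristics discussed above can be sketched in C. The function names and data layout are assumptions for illustration, not the thesis's actual implementation: the point is only that longest-victim pays an O(n) timestamp scan per steal attempt, while random selection is constant time with no bookkeeping.

```c
#include <stddef.h>

// Longest-victim: scan every worker's timestamp of last activity and steal
// from the oldest (the queue unserviced the longest).  The O(n) scan plus
// the timestamp maintenance on every dequeue is the extra cost noted above.
// Assumes at least two workers.
size_t longest_victim(const unsigned long timestamp[], size_t n_workers, size_t self) {
    size_t victim = (self + 1) % n_workers;
    for (size_t i = 0; i < n_workers; i++)
        if (i != self && timestamp[i] < timestamp[victim])
            victim = i;
    return victim;
}

// Random heuristic: constant-time choice with no per-queue bookkeeping.
size_t random_victim(size_t n_workers, size_t self, unsigned *seed) {
    *seed = *seed * 1103515245u + 12345u;   // simple LCG for illustration
    size_t victim = *seed % n_workers;
    return victim == self ? (victim + 1) % n_workers : victim;
}
```

With many loaded queues, as in balance-multi, the cheaper random choice still finds work most of the time, which matches the result above.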
     
  1366 1369  The single actor (the client) of this experiment is long running and maintains a lot of state, as it needs to know the handles of all the servers.
  1367 1370  When stealing the client or its respective queue (in \CFA's inverted model), moving the client incurs a high cost due to cache invalidation.
- 1368       This worst-case steal is likely to happen since there is little other work in the system between scatter/gather rounds.
+      1371  This worst-case steal is likely to happen since there is no other work in the system between scatter/gather rounds.
  1369 1372  However, all heuristics are comparable in performance on the repeat benchmark.
- 1370       This result is surprising especially for the No-Stealing variant, which one would expect to have better performance than the stealing variants.
- 1371       This is not the case, since the stealing happens lazily and fails fast, the queue containing the long-running client actor is rarely stolen.
- 1372       
- 1373       Work stealing performance can be further analyzed by reexamining the executor and repeat benchmarks in Figures~\ref{f:ExecutorBenchmark} and \ref{f:RepeatBenchmark}, respectively.
- 1374       In both benchmarks, CAF performs poorly.
- 1375       It is hypothesized that CAF has an aggressive work stealing algorithm that eagerly attempts to steal.
- 1376       This results in the poor performance with small messages containing little work per message in both of these benchmarks.
- 1377       In comparison with the other systems, \uC does well on both benchmarks since it does not have work stealing.
+      1373  This result is surprising, especially for the No-Stealing variant, which should have better performance than the stealing variants.
+      1374  However, stealing happens lazily and fails fast, hence the queue containing the long-running client actor is rarely stolen.
+      1375  
+      1376  % Work stealing performance can be further analyzed by \emph{reexamining} the executor and repeat benchmarks in Figures~\ref{f:ExecutorBenchmark} and \ref{f:RepeatBenchmark}.
+      1377  % In both benchmarks, CAF performs poorly.
+      1378  % It is hypothesized that CAF has an aggressive work stealing algorithm that eagerly attempts to steal.
+      1379  % This results in the poor performance with small messages containing little work per message in both of these benchmarks.
+      1380  % In comparison with the other systems, \uC does well on both benchmarks since it does not have work stealing.
  1378 1381  
  1379 1382  Finally, Figures~\ref{f:cfaMatrixAMD} and~\ref{f:cfaMatrixIntel} show the effects of the stealing heuristics for the matrix-multiply benchmark.
- 1380       Here, there is negligible performance difference across stealing heuristics, likely due to the long running workload of each message.
- 1381       
- 1382       Stealing can still improve performance marginally in the matrix-multiply benchmark.
- 1383       In \ref{f:MatrixAMD} CAF performs better; few messages are sent, so the eager work stealing allows for the clean up of loose ends to occur faster.
- 1384       This hypothesis stems from experimentation with \CFA.
- 1385       CAF uses a randomized work stealing heuristic.
- 1386       Tuning the \CFA actor system to steal work much more eagerly with randomized victim selection heuristics provided similar results to what CAF achieved in the matrix benchmark.
- 1387       This experimental tuning performed much worse on all other microbenchmarks that we present, since they all perform a small amount of work per message, which may partially explain CAF's poor performance on other benchmarks.
+      1383  Here, there is negligible performance difference across stealing heuristics, because of the long-running workload of each message.
+      1384  
+      1385  In theory, work stealing might improve performance marginally for the matrix-multiply benchmark.
+      1386  Since all row actors cannot be created simultaneously at startup, they correspondingly do not shut down simultaneously.
+      1387  Hence, there is a small window at the start and end with idle workers, so work stealing might improve performance.
+      1388  For example, in Figure~\ref{f:MatrixAMD}, CAF is slightly better than \uC and \CFA, but not on the Intel.
+      1389  Hence, it is difficult to attribute the AMD gain to the aggressive work stealing in CAF.
  1388 1390  
  1389 1391  \begin{figure}