Changeset 60a9164


Ignore:
Timestamp:
Jul 13, 2023, 4:04:18 PM (17 months ago)
Author:
Peter A. Buhr <pabuhr@…>
Branches:
master
Children:
a3c7bac
Parents:
cc28153d
Message:

update discussion about work stealing

File:
1 edited

Legend:

Unmodified
Added
Removed
  • doc/theses/colby_parsons_MMAth/text/actors.tex

    rcc28153d r60a9164  
    12861286
    12871287Two pathological unbalanced cases are created, and compared using vanilla and randomized work stealing in \CFA.
    1288 These benchmarks adversarially takes advantage of the round-robin assignment of actors to workers to load actors only on specific cores (there is one worker per core).
     1288These benchmarks adversarially takes advantage of the round-robin assignment of actors to workers by loading actors only on specific cores (there is one worker per core).
    12891289The workload on the loaded cores is the same as the executor benchmark described in \ref{s:executorPerf}, but with fewer rounds.
    12901290
     
    13261326On Intel in Figure~\ref{f:BalanceOneIntel}, above 32 cores the performance gets worse for all variants due to hyperthreading.
    13271327Here, the longest-victim and random heuristic are the same.
    1328 Note, the non-stealing variation of balance-one slows down slightly (no decrease in graph) as the cores increase, since a few "dummy" actors need to be made for each of the extra cores beyond the first to adversarially layout all loaded actors on the first core.
     1328Note, the non-stealing variation of balance-one slows down slightly (no decrease in graph) as the cores increase, since a few ``dummy'' actors need to be made for each of the extra cores beyond the first to adversarially layout all loaded actors on the first core.
    13291329
    13301330For the balance-multi benchmark in Figures~\ref{f:BalanceMultiAMD} and~\ref{f:BalanceMultiIntel}, the random heuristic outperforms the longest victim.
     
    13431343        }
    13441344        \caption{Executor benchmark comparing \CFA stealing heuristics (lower is better).}
     1345        \label{f:ExecutorBenchmark}
    13451346\end{figure}
    13461347
     
    13591360        }
    13601361        \caption{The repeat benchmark comparing \CFA stealing heuristics (lower is better).}
     1362        \label{f:RepeatBenchmark}
    13611363\end{figure}
    13621364
     
    13651367The worst-case scenario is if the actor doing the majority of work or its mail queue is stolen by the work stealing system, as this incurs a huge cost to move the work and refill the local cache.
    13661368This worst-case steal is likely to happen since there is little other work in the system between scatter/gather rounds.
    1367 
    1368 All \CFA variants are comparable in performance on the repeat benchmark.
    1369 This is surprising, since this is a pathological case for work stealing, and one would expect the No-Stealing variant to have better performance.
    1370 It is likely that the \CFA No-Stealing variant may be minorly impacted by other design decisions in the \CFA actor system that were made with work stealing in mind.
    1371 Work stealing performance in the repeat benchmark can be further examined by discussing Figures \ref{f:RepeatAMD} and \ref{f:RepeatIntel}.
    1372 
    1373 In particular, on the Intel machine in Figure~\ref{f:RepeatIntel}, the cost of stealing is higher, which can be seen in the vertical shift of Akka, CAF and \CFA when comparing the AMD results in Figure~\ref{f:RepeatAMD} results in Figure~\ref{f:RepeatIntel} (\uC and ProtoActor do not have work stealing).
     1369However, all heuristics are comparable in performance on the repeat benchmark.
     1370This result is surprising especially for the No-Stealing variant, which should have better performance than the stealing variants.
     1371It is likely the No-Stealing variant is impacted by other design decisions in the \CFA actor system related to work stealing.
     1372
     1373Work stealing performance can be further analyzed by reexamining the executor and repeat benchmarks in Figures~\ref{f:ExecutorBenchmark} and \ref{f:RepeatBenchmark}, respectively.
     1374In both, benchmarks CAF performs poorly.
     1375It is hypothesized that CAF has an aggressive work stealing algorithm that eagerly attempts to steal.
     1376This results in the poor performance with small messages containing little work per message in both of these benchmarks.
     1377In comparison with the other systems, \uC does well on both benchmarks since it does not have work stealing.
     1378
     1379\PAB{In particular, on the Intel machine in Figure~\ref{f:RepeatIntel}, the cost of stealing is significantly higher, which can be seen in the vertical shift of Akka, CAF and \CFA compared to the AMD results in Figure~\ref{f:RepeatAMD} (\uC and ProtoActor do not have work stealing).
    13741380The shift for CAF is particularly large, which supports the hypothesis that CAF's work stealing is particularly eager.
    1375 In both the executor and the repeat benchmark CAF performs poorly.
    1376 It is hypothesized that CAF has an aggressive work stealing algorithm, that eagerly attempts to steal.
    1377 This results in poor performance in benchmarks with small messages containing little work per message.
    1378 In comparison with the other systems \uC does well on the repeat benchmark since it does not have work stealing.
    13791381The client of this experiment is long running and maintains a lot of state, as it needs to know the handles of all the servers.
    13801382When stealing the client or its respective queue (in \CFA's inverted model), moving the client incurs a high cost due to cache invalidation.
    1381 As such stealing the client can result in a hit in performance.
     1383As such stealing the client can result in a hit in performance.}
    13821384
    13831385Finally, Figures~\ref{f:cfaMatrixAMD} and~\ref{f:cfaMatrixIntel} show the effects of the stealing heuristics for the matrix-multiply benchmark.
Note: See TracChangeset for help on using the changeset viewer.