Changeset d6d2136 for doc/theses/colby_parsons_MMAth
- Timestamp: Jul 12, 2023, 12:49:05 PM
- Branches: master
- Children: cc28153d
- Parents: e0069bd
- File: 1 edited
doc/theses/colby_parsons_MMAth/text/actors.tex
--- doc/theses/colby_parsons_MMAth/text/actors.tex (revision e0069bd)
+++ doc/theses/colby_parsons_MMAth/text/actors.tex (revision d6d2136)
@@ -1002,6 +1002,5 @@
 The longest-victim heuristic maintains a timestamp per executor thread that is updated every time a worker attempts to steal work.
 The timestamps are generated using @rdtsc@~\cite{IntelManual} and are stored in a shared array, with one index per worker.
-Thieves then attempt to steal from the worker with the oldest timestamp.
-\PAB{How does a thief find the oldest timestamp?}
+Thieves then attempt to steal from the worker with the oldest timestamp, which is found by performing a linear search across the array of timestamps.
 The intuition behind this heuristic is that the slowest worker receives help via work stealing until it becomes a thief, which indicates that it has caught up to the pace of the rest of the workers.
 This heuristic should ideally result in lowered latency for message sends to victim workers that are overloaded with work.
@@ -1212,4 +1211,4 @@
 Finally, \CFA runs consistently on both of the AMD and Intel, and is faster than \uC on the AMD, but slightly slower on the Intel.
 This benchmark is a pathological case for work stealing actor systems, as the majority of work is being performed by the single actor conducting the scatter/gather.
-The impact of work stealing on this benchmark are discussed further in Section~\ref{s:steal_perf}.
+The impact of work stealing on this benchmark is discussed further in Section~\ref{s:steal_perf}.
 Here, gains from using the copy queue are much less apparent, due to the costs of stealing.
@@ -1287,4 +1286,4 @@
 
 Two pathological unbalanced cases are created, and compared using vanilla and randomized work stealing in \CFA.
-These benchmarks adversarially takes advantage of the round-robin assignment of actors to workers by isolating the receive and send actors on different cores.
+These benchmarks adversarially takes advantage of the round-robin assignment of actors to workers to load actors only on specific cores (there is one worker per core).
 The workload on the loaded cores is the same as the executor benchmark described in \ref{s:executorPerf}, but with fewer rounds.
@@ -1327,5 +1326,5 @@
 On Intel in Figure~\ref{f:BalanceOneIntel}, above 32 cores the performance gets worse for all variants due to hyperthreading.
 Here, the longest-victim and random heuristic are the same.
-Note, the non-stealing variation of balance-one slows down (no decrease in graph) as the cores increase \PAB{not sure I understand this part: due to having to create more dummy actors on the inactive cores during startup}.
+Note, the non-stealing variation of balance-one slows down slightly (no decrease in graph) as the cores increase, since a few "dummy" actors need to be made for each of the extra cores beyond the first to adversarially layout all loaded actors on the first core.
 
 For the balance-multi benchmark in Figures~\ref{f:BalanceMultiAMD} and~\ref{f:BalanceMultiIntel}, the random heuristic outperforms the longest victim.
@@ -1362,29 +1361,33 @@
 \end{figure}
 
-\PAB{Something is wrong here because there are no graphs containing all the benchmarks.
 Figures~\ref{f:cfaRepeatAMD} and~\ref{f:cfaRepeatIntel} show the effects of the stealing heuristics for the repeat benchmark.
-% Here, the no-stealing version of \CFA performs better than both stealing variations.
 As mentioned, the repeat benchmark is a pathological case for work stealing systems since there is one actor with the majority of the work, and not enough other work to go around.
-The worst-case scenario is if the actor doing the majority of work or its mail queue is stolen by the work stealing system, as this a huge cost to move the work and refill the local cache.
+The worst-case scenario is if the actor doing the majority of work or its mail queue is stolen by the work stealing system, as this incurs a huge cost to move the work and refill the local cache.
 This worst-case steal is likely to happen since there is little other work in the system between scatter/gather rounds.
 
-In particular on the Intel machine in Figure~\ref{f:cfaRepeatIntel}, the cost of stealing is higher, which can be seen in the vertical shift of Akka, CAF and \CFA results in Figure~\ref{f:RepeatIntel} (\uC and ProtoActor do not have work stealing).
-The shift for CAF is particularly large, which further supports the hypothesis that CAF's work stealing is particularly eager.
+All \CFA variants are comparable in performance on the repeat benchmark.
+This is surprising, since this is a pathological case for work stealing, and one would expect the No-Stealing variant to have better performance.
+It is likely that the \CFA No-Stealing variant may be minorly impacted by other design decisions in the \CFA actor system that were made with work stealing in mind.
+Work stealing performance in the repeat benchmark can be further examined by discussing Figures \ref{f:RepeatAMD} and \ref{f:RepeatIntel}.
+
+In particular, on the Intel machine in Figure~\ref{f:RepeatIntel}, the cost of stealing is higher, which can be seen in the vertical shift of Akka, CAF and \CFA when comparing the AMD results in Figure~\ref{f:RepeatAMD} with the Intel results in Figure~\ref{f:RepeatIntel} (\uC and ProtoActor do not have work stealing).
+The shift for CAF is particularly large, which supports the hypothesis that CAF's work stealing is particularly eager.
 In both the executor and the repeat benchmark CAF performs poorly.
 It is hypothesized that CAF has an aggressive work stealing algorithm, that eagerly attempts to steal.
 This results in poor performance in benchmarks with small messages containing little work per message.
-On the other hand, in \ref{f:MatrixAMD} CAF performs much better since each message has a large amount of work, and few messages are sent, so the eager work stealing allows for the clean up of loose ends to occur faster.
+In comparison with the other systems \uC does well on the repeat benchmark since it does not have work stealing.
+The client of this experiment is long running and maintains a lot of state, as it needs to know the handles of all the servers.
+When stealing the client or its respective queue (in \CFA's inverted model), moving the client incurs a high cost due to cache invalidation.
+As such stealing the client can result in a hit in performance.
+
+Finally, Figures~\ref{f:cfaMatrixAMD} and~\ref{f:cfaMatrixIntel} show the effects of the stealing heuristics for the matrix-multiply benchmark.
+Here, there is negligible performance difference across stealing heuristics, likely due to the long running workload of each message.
+
+Stealing can still improve performance marginally in the matrix-multiply benchmark.
+In \ref{f:MatrixAMD} CAF performs much better; few messages are sent, so the eager work stealing allows for the clean up of loose ends to occur faster.
 This hypothesis stems from experimentation with \CFA.
 CAF uses a randomized work stealing heuristic.
 Tuning the \CFA actor system to steal work much more eagerly with randomized victim selection heuristics provided similar results to what CAF achieved in the matrix benchmark.
-This experimental tuning performed much worse on all other microbenchmarks that we present, since they all perform a small amount of work per message.
-
-In comparison with the other systems \uC does well on the repeat benchmark since it does not have work stealing.
-The client of this experiment is long running and maintains a lot of state, as it needs to know the handles of all the servers.
-When stealing the client or its respective queue (in \CFA's inverted model), moving the client incurs a high cost due to cache invalidation.
-As such stealing the client can result in a hit in performance.}
-
-Finally, Figures~\ref{f:cfaMatrixAMD} and~\ref{f:cfaMatrixIntel} show the effects of the stealing heuristics for the matrix-multiple benchmark.
-Here, there is negligible performance difference across stealing heuristics.
+This experimental tuning performed much worse on all other microbenchmarks that we present, since they all perform a small amount of work per message, which may partially explain CAF's poor performance on other benchmarks.
 
 \begin{figure}
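To make the revised wording at line 1004 concrete, the following is a minimal C sketch of longest-victim selection under stated assumptions: the worker count, the shared array, and the names (steal_timestamps, read_tsc, choose_victim) are illustrative only and are not taken from the thesis or the \CFA runtime. Each worker stamps its slot with @rdtsc@ when it attempts a steal, and a thief picks the victim whose stamp is oldest by a linear scan of the array.

#include <stdint.h>
#include <stddef.h>

#define NUM_WORKERS 16                          // illustrative executor size, not the thesis value

// one timestamp slot per worker, updated whenever that worker attempts a steal
static volatile uint64_t steal_timestamps[NUM_WORKERS];

// read the x86 time-stamp counter (the rdtsc instruction cited above)
static inline uint64_t read_tsc( void ) {
    uint32_t lo, hi;
    __asm__ __volatile__( "rdtsc" : "=a"(lo), "=d"(hi) );
    return ( (uint64_t)hi << 32 ) | lo;
}

// a thief records its own steal attempt, then linearly scans the shared array
// for the worker with the oldest timestamp, i.e. the presumed slowest victim
static size_t choose_victim( size_t self ) {
    steal_timestamps[self] = read_tsc();        // mark this worker as actively stealing
    size_t victim = ( self + 1 ) % NUM_WORKERS; // start from any worker other than self
    uint64_t oldest = steal_timestamps[victim];
    for ( size_t i = 0; i < NUM_WORKERS; i += 1 ) {
        if ( i == self ) continue;              // never steal from yourself
        if ( steal_timestamps[i] < oldest ) {   // older stamp: longest since last steal attempt
            oldest = steal_timestamps[i];
            victim = i;
        }
    }
    return victim;                              // caller steals work from this victim's queue
}

The scan costs one pass over the timestamp array per steal attempt, which is the linear search the revised sentence describes.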