Changeset bcc56c9
- Timestamp: Jul 17, 2023, 9:22:03 AM
- Branches: master
- Children: 847ab8f
- Parents: 432e1de
- Files: 1 edited
Modified: doc/theses/colby_parsons_MMAth/text/actors.tex
--- doc/theses/colby_parsons_MMAth/text/actors.tex (r432e1de)
+++ doc/theses/colby_parsons_MMAth/text/actors.tex (rbcc56c9)
@@ -1107 +1107 @@
 Error bars showing the 95\% confidence intervals appear on each point in the graphs.
 If the confidence bars are small enough, they may be obscured by the data point.
-In this section, \uC is compared to \CFA frequently, as the actor system in \CFA is heavily based off of the \uC's actor system.
+In this section, \uC is compared to \CFA frequently, as the actor system in \CFA is heavily based off of \uC's actor system.
 As such, the performance differences that arise are largely due to the contributions of this work.
 Future work is to port some of the new \CFA work back to \uC.
@@ -1231 +1231 @@
 Whereas, the per envelope allocations of \uC and CFA allocate exactly the amount of storage needed and eagerly deallocate.
 The extra storage is the standard tradeoff of time versus space, where \CFA shows better performance.
+As future work, tuning parameters can be provided to adjust the frequency and/or size of the copy-queue expansion.
 
 \begin{table}
@@ -1240 +1241 @@
 \label{t:ExecutorMemory}
 \begin{tabular}{*{5}{r|}r}
-& \multicolumn{1}{c|}{\CFA} & \multicolumn{1}{c|}{CAF} & \multicolumn{1}{c|}{Akka} & \multicolumn{1}{c|}{\uC} & \multicolumn{1}{c@{}}{ProtoActor} \\
+& \multicolumn{1}{c|}{\CFA} & \multicolumn{1}{c|}{\uC} & \multicolumn{1}{c|}{CAF} & \multicolumn{1}{c|}{Akka} & \multicolumn{1}{c@{}}{ProtoActor} \\
 \hline
 AMD & \input{data/pykeExecutorMem} \\
@@ -1257 +1258 @@
 The majority of the computation in this benchmark involves computing the final matrix, so this benchmark stresses the actor systems' ability to have actors run work, rather than stressing the message sending system, and might trigger some work stealing if a worker finishes early.
 
-The matrix-multiply benchmark uses input matrices $X$ and $Y$, which are both $3072$ by $3072$ in size.
+The matrix-multiply benchmark has input matrices $X$ and $Y$, which are both $3072$ by $3072$ in size.
 An actor is made for each row of $X$ and sent a message indicating the row of $X$ and the column of $Y$ to calculate a row of the result matrix $Z$.
 Because $Z$ is contiguous in memory, there can be small cache write-contention at the row boundaries.
 
 Figures~\ref{f:MatrixAMD} and \ref{f:MatrixIntel} show the matrix multiple results.
-Given that the bottleneck of this benchmark is the computation of the result matrix, it follows that the results are tightly clustered across all actor systems.
-\uC and \CFA have identical performance and in Figure~\ref{f:MatrixIntel} \uC pulls ahead of \CFA after 24 cores likely due to costs associated with work stealing while hyperthreading.
-It is hypothesized that CAF performs better in this benchmark compared to others due to its eager work stealing implementation, which is discussed further in Section~\ref{s:steal_perf}.
+There are two groupings with Akka and ProtoActor being slightly slower than \uC, \CFA, and CAF.
+On the Intel, there is an unknown divergence between \uC and \CFA/CAF at 24 cores.
+Given that the bottleneck of this benchmark is the computation of the result matrix, all executors perform well on this embarrassingly parallel application.
+Hence, the results are tightly clustered across all actor systems.
+This result also suggests CAF has a good executor but poor message passing, which results in its poor performance in the other message-passing benchmarks.
 
 \begin{figure}
@@ -1325 +1328 @@
 On Intel in Figure~\ref{f:BalanceOneIntel}, above 32 cores the performance gets worse for all variants due to hyperthreading.
 Here, the longest-victim and random heuristic are the same.
-Note, the non-stealing variation of balance-one slows down slightly (no decrease in graph) as the cores increase, since a few ``dummy'' actors need to be made for each of the extra cores beyond the first to adversarially layout all loaded actors on the first core.
+Note, the non-stealing variation of balance-one slows down slightly (no decrease in graph) as the cores increase, since a few \emph{dummy} actors are created for each of the extra cores beyond the first to adversarially layout all loaded actors on the first core.
 
 For the balance-multi benchmark in Figures~\ref{f:BalanceMultiAMD} and~\ref{f:BalanceMultiIntel}, the random heuristic outperforms the longest victim.
-This result is because the longest victim heuristic has a higher stealing cost as it needs to maintain timestamps and look at all timestamps before stealing.
+The reason is that the longest-victim heuristic has a higher stealing cost as it needs to maintain timestamps and look at all timestamps before stealing.
 Additionally, a performance cost on the Intel is observed when hyperthreading kicks in after 24 cores in Figure~\ref{f:BalanceMultiIntel}.
 
@@ -1366 +1369 @@
 The single actor (the client) of this experiment is long running and maintains a lot of state, as it needs to know the handles of all the servers.
 When stealing the client or its respective queue (in \CFA's inverted model), moving the client incurs a high cost due to cache invalidation.
-This worst-case steal is likely to happen since there is little other work in the system between scatter/gather rounds.
+This worst-case steal is likely to happen since there is no other work in the system between scatter/gather rounds.
 However, all heuristics are comparable in performance on the repeat benchmark.
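The cost gap between the two victim-selection heuristics discussed in this hunk can be sketched in plain C. This is a hypothetical model, not the \CFA scheduler: the function names, the xorshift stand-in for the random generator, and the `timestamp` array are all illustrative. Random selection is O(1) per steal attempt and needs no per-queue metadata, while longest-victim pays an O(nworkers) scan on top of maintaining timestamps.

```c
#include <limits.h>

// Random victim selection: O(1) per attempt, no extra metadata.
// Assumes nworkers > 1; illustrative xorshift32 stands in for the PRNG.
int pick_random_victim(int nworkers, int self, unsigned *seed) {
    unsigned x = *seed ? *seed : 1u;
    int victim;
    do {
        x ^= x << 13; x ^= x >> 17; x ^= x << 5;   // xorshift32 step
        victim = (int)(x % (unsigned)nworkers);
    } while (victim == self);                       // never steal from yourself
    *seed = x;
    return victim;
}

// Longest-victim selection: O(nworkers) scan of timestamps, which must also
// be updated on every dequeue -- the higher stealing cost noted in the text.
int pick_longest_victim(const long timestamp[], int nworkers, int self) {
    int victim = self;
    long oldest = LONG_MAX;
    for (int w = 0; w < nworkers; w += 1)
        if (w != self && timestamp[w] < oldest) {
            oldest = timestamp[w];
            victim = w;                             // idle the longest so far
        }
    return victim;
}
```

The scan also explains why the random heuristic wins on balance-multi: when many queues are loaded, a cheap guess is usually good enough, so the extra precision of longest-victim does not pay for its bookkeeping.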
-This result is surprising especially for the No-Stealing variant, which one would expect to have better performance than the stealing variants.
-This is not the case, since the stealing happens lazily and fails fast, the queue containing the long-running client actor is rarely stolen.
+This result is surprising especially for the No-Stealing variant, which should have better performance than the stealing variants.
+However, stealing happens lazily and fails fast, hence the queue containing the long-running client actor is rarely stolen.
 
-Work stealing performance can be further analyzed by reexamining the executor and repeat benchmarks in Figures~\ref{f:ExecutorBenchmark} and \ref{f:RepeatBenchmark}, respectively.
-In both benchmarks, CAF performs poorly.
-It is hypothesized that CAF has an aggressive work stealing algorithm that eagerly attempts to steal.
-This results in the poor performance with small messages containing little work per message in both of these benchmarks.
-In comparison with the other systems, \uC does well on both benchmarks since it does not have work stealing.
+% Work stealing performance can be further analyzed by \emph{reexamining} the executor and repeat benchmarks in Figures~\ref{f:ExecutorBenchmark} and \ref{f:RepeatBenchmark}.
+% In both benchmarks, CAF performs poorly.
+% It is hypothesized that CAF has an aggressive work stealing algorithm that eagerly attempts to steal.
+% This results in the poor performance with small messages containing little work per message in both of these benchmarks.
+% In comparison with the other systems, \uC does well on both benchmarks since it does not have work stealing.
 
 Finally, Figures~\ref{f:cfaMatrixAMD} and~\ref{f:cfaMatrixIntel} show the effects of the stealing heuristics for the matrix-multiply benchmark.
-Here, there is negligible performance difference across stealing heuristics, likely due to the long running workload of each message.
-
-Stealing can still improve performance marginally in the matrix-multiply benchmark.
-In \ref{f:MatrixAMD} CAF performs better; few messages are sent, so the eager work stealing allows for the clean up of loose ends to occur faster.
-This hypothesis stems from experimentation with \CFA.
-CAF uses a randomized work stealing heuristic.
-Tuning the \CFA actor system to steal work much more eagerly with randomized victim selection heuristics provided similar results to what CAF achieved in the matrix benchmark.
-This experimental tuning performed much worse on all other microbenchmarks that we present, since they all perform a small amount of work per message, which may partially explain CAF's poor performance on other benchmarks.
+Here, there is negligible performance difference across stealing heuristics, because of the long-running workload of each message.
+
+In theory, work stealing might improve performance marginally for the matrix-multiply benchmark.
+Since all row actors cannot be created simultaneously at startup, they correspondingly do not shutdown simultaneously.
+Hence, there is a small window at the start and end with idle workers so work stealing might improve performance.
+For example, in \ref{f:MatrixAMD}, CAF is slightly better than \uC and \CFA, but not on the Intel.
+Hence, it is difficult to attribute the AMD gain to the aggressive work stealing in CAF.
 
 \begin{figure}
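The long-running, per-message workload behind this hunk can be sketched as the work one row actor performs. This is a simplified C model of the benchmark described earlier in the changeset (one actor per row of $X$, each computing one row of $Z = XY$); the tiny `N` and function name are illustrative, as the actual benchmark uses $3072 \times 3072$ matrices inside \CFA actors.

```c
#define N 3   // illustrative size; the benchmark uses 3072 x 3072 matrices

// Work performed by one row actor: compute row `r` of Z = X * Y.
// Each actor writes only its own row of Z, so cache write-contention
// is confined to the boundaries between adjacent rows of the
// contiguous result matrix, as noted in the benchmark description.
void compute_row(int r, double X[N][N], double Y[N][N], double Z[N][N]) {
    for (int c = 0; c < N; c += 1) {
        double sum = 0.0;
        for (int k = 0; k < N; k += 1)
            sum += X[r][k] * Y[k][c];
        Z[r][c] = sum;
    }
}
```

Because each message triggers $O(N^2)$ floating-point work, per-message overheads (and hence the choice of stealing heuristic) are amortized away, which is consistent with the tightly clustered results across heuristics.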