Changeset f80e0f1e


Timestamp:
Jul 10, 2023, 9:20:51 PM
Author:
Peter A. Buhr <pabuhr@…>
Branches:
master
Children:
614868b, a2eb21a
Parents:
0aa77e6
Message:

more proofreading of actor chapter

File:
1 edited

  • doc/theses/colby_parsons_MMAth/text/actors.tex

    r0aa77e6 rf80e0f1e  
    11611161Tuning the garbage collection might reduce garbage-collection cost, but this exercise is beyond the scope of this work.
    11621162
     1163\subsection{Executor}\label{s:executorPerf}
     1164
     1165The microbenchmarks in this section are designed to stress the executor.
     1166The executor is the scheduler of an actor system and is responsible for organizing the interaction of executor threads to service the needs of an actor workload.
     1167Three benchmarks are run: executor, repeat, and high-memory watermark.
     1168
      1169The executor benchmark creates 40,000 actors and organizes them into adjacent groups of 100; each actor sends a message to every member of its group, including itself, in round-robin order, and repeats this sending cycle 400 times.
     1170This microbenchmark is designed to flood the executor with a large number of messages flowing among actors.
     1171Given there is no work associated with each message, other than sending more messages, the intended bottleneck of this experiment is the executor message send process.
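
To make the load concrete, the following C++ sketch (illustrative only, not the benchmark implementation) computes the total number of sends the executor must service from the parameters described above:
\begin{verbatim}
#include <cstdint>
#include <cstdio>

int main() {
    const std::uint64_t actors = 40000;  // total actors created
    const std::uint64_t group  = 100;    // each actor messages every group member, itself included
    const std::uint64_t rounds = 400;    // sending cycles per actor
    // one send per group member per round, for every actor
    const std::uint64_t sends = actors * group * rounds;
    std::printf("executor services %llu sends\n", (unsigned long long)sends); // 1,600,000,000
}
\end{verbatim}
At 1.6 billion sends with no other work attached, the per-send path through the executor dominates the measured runtime.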
     1172
     1173\begin{figure}
     1174        \centering
     1175        \subfloat[AMD Executor Benchmark]{
     1176                \resizebox{0.5\textwidth}{!}{\input{figures/nasusExecutor.pgf}}
     1177                \label{f:ExecutorAMD}
     1178        }
     1179        \subfloat[Intel Executor Benchmark]{
     1180                \resizebox{0.5\textwidth}{!}{\input{figures/pykeExecutor.pgf}}
     1181                \label{f:ExecutorIntel}
     1182        }
     1183        \caption{Executor benchmark comparing actor systems (lower is better).}
     1184\end{figure}
     1185
      1186Figures~\ref{f:ExecutorAMD} and~\ref{f:ExecutorIntel} show the results of the AMD and Intel executor benchmarks.
     1187There are three groupings of results, and the difference between AMD and Intel is small.
      1188CAF is significantly slower than the other actor systems, followed by a tight grouping of \uC, ProtoActor, and Akka, and finally \CFA with the lowest runtime relative to its peers.
     1189The difference in runtime between \uC and \CFA is largely due to the copy queue described in Section~\ref{s:copyQueue}.
     1190The copy queue both reduces and consolidates allocations, heavily reducing contention on the memory allocator.
      1191Additionally, due to the static typing in \CFA's actor system, there are no expensive dynamic (RTTI) casts, as occur in \uC, to discriminate message types.
      1192Note, while a dynamic cast is relatively inexpensive, the remaining send cost in both \uC and \CFA is small;
      1193hence, the relative cost of the RTTI in \uC is significant.
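
The following C++ sketch conveys the copy-queue idea; it is not the \CFA implementation and all names are hypothetical. Messages are copied into one growable contiguous buffer, so a single allocation amortizes over many sends and the storage is reused rather than freed after each processing cycle, unlike a per-message envelope allocation:
\begin{verbatim}
#include <cstddef>
#include <cstdio>
#include <vector>

// hypothetical envelope: a handler plus a message payload
struct Envelope {
    void (*handler)(int);
    int msg;
};

// hypothetical copy queue: contiguous storage instead of one heap node per message
struct CopyQueue {
    std::vector<Envelope> buf;   // grows geometrically; capacity persists across cycles
    std::size_t head = 0;        // index of next envelope to deliver

    void push(const Envelope& e) { buf.push_back(e); } // no allocation once capacity suffices
    bool pop(Envelope& e) {
        if (head == buf.size()) {                      // drained: reuse storage, keep capacity
            buf.clear(); head = 0; return false;
        }
        e = buf[head++];
        return true;
    }
};

void print_msg(int m) { std::printf("msg %d\n", m); }

int main() {
    CopyQueue q;
    for (int i = 0; i < 3; i += 1) q.push({print_msg, i});
    for (Envelope e; q.pop(e);) e.handler(e.msg);      // deliver all queued messages
}
\end{verbatim}
The point being illustrated is allocation consolidation: a per-envelope scheme calls the memory allocator once per message, whereas the buffer above allocates only when it grows.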
     1194
     1195\begin{figure}
     1196        \centering
     1197        \subfloat[AMD Repeat Benchmark]{
     1198                \resizebox{0.5\textwidth}{!}{\input{figures/nasusRepeat.pgf}}
     1199                \label{f:RepeatAMD}
     1200        }
     1201        \subfloat[Intel Repeat Benchmark]{
     1202                \resizebox{0.5\textwidth}{!}{\input{figures/pykeRepeat.pgf}}
     1203                \label{f:RepeatIntel}
     1204        }
     1205        \caption{The repeat benchmark comparing actor systems (lower is better).}
     1206\end{figure}
     1207
     1208The repeat benchmark also evaluates the executor.
     1209It stresses the executor's ability to withstand contention on queues.
      1210The repeat benchmark repeatedly fans out messages from a single client to 100,000 servers, which then respond to the client.
      1211The scatter and gather are repeated 200 times.
      1212The messages from the servers to the client all arrive at the same mailbox queue associated with the client, resulting in high contention among servers.
     1213As such, this benchmark does not scale with the number of processors, since more processors result in higher contention on the single mailbox queue.
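
A minimal C++ sketch of the gather side, assuming a mutex-protected mailbox and a server count scaled down from 100,000 (this is not code from any of the benchmarked systems), shows why additional processors only add contention:
\begin{verbatim}
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

int main() {
    const int servers = 64;                     // scaled down from the benchmark's 100,000
    std::mutex m;                               // guards the single client mailbox
    std::queue<int> client_mailbox;
    std::vector<std::thread> pool;
    for (int id = 0; id < servers; id += 1)
        pool.emplace_back([&m, &client_mailbox, id] {
            // every "server" replies once; all replies serialize on this lock
            std::lock_guard<std::mutex> lock(m);
            client_mailbox.push(id);
        });
    for (auto& t : pool) t.join();
    std::printf("replies gathered = %zu\n", client_mailbox.size());
}
\end{verbatim}
Every reply serializes on the one mailbox lock, so the gather phase is effectively sequential regardless of the number of processors.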
     1214
     1215Figures~\ref{f:RepeatAMD} and~\ref{f:RepeatIntel} show the results of the AMD and Intel repeat benchmark.
      1216The results are more spread out, and there is a larger difference between AMD and Intel.
     1217Again, CAF is significantly slower than the other actor systems.
      1218On the AMD, there is a tight grouping of \uC, ProtoActor, and Akka;
      1219on the Intel, \uC, ProtoActor, and Akka are spread out.
      1220Finally, \CFA runs consistently on both the AMD and Intel, and is faster than \uC on the AMD, but slightly slower on the Intel.
      1221Here, the contention on the single mailbox queue is the bottleneck, so the gains from using the copy queue are much less apparent.
     1222
     1223\begin{table}
     1224        \centering
     1225        \setlength{\extrarowheight}{2pt}
     1226        \setlength{\tabcolsep}{5pt}
     1227
     1228        \caption{Executor Program Memory High Watermark}
     1229        \label{t:ExecutorMemory}
     1230        \begin{tabular}{*{5}{r|}r}
     1231                & \multicolumn{1}{c|}{\CFA} & \multicolumn{1}{c|}{CAF} & \multicolumn{1}{c|}{Akka} & \multicolumn{1}{c|}{\uC} & \multicolumn{1}{c@{}}{ProtoActor} \\
     1232                \hline
      1233                AMD             & \input{data/nasusExecutorMem} \\
     1234                \hline
      1235                Intel   & \input{data/pykeExecutorMem}
     1236        \end{tabular}
     1237\end{table}
     1238
      1239Table~\ref{t:ExecutorMemory} shows the high memory watermark of the actor systems when running the executor benchmark on 48 cores, measured using the @time@ command.
      1240\CFA's high watermark is slightly higher than those of the other non-garbage-collected systems, \uC and CAF.
      1241This increase comes from the over-allocation in the copy-queue data structure and its lazy deallocation.
      1242In contrast, the per-envelope allocations of \uC and CAF allocate exactly the amount of storage needed and deallocate eagerly.
      1243The extra storage is the standard tradeoff of time versus space, where \CFA trades extra space for better performance.
     1244
     1245\subsection{Matrix Multiply}
     1246
     1247The matrix-multiply benchmark evaluates the actor systems in a practical application, where actors concurrently multiply two matrices.
     1248In detail, given $Z_{m,r} = X_{m,n} \cdot Y_{n,r}$, the matrix multiply is defined as:
     1249\begin{displaymath}
      1250Z_{i,k} = \sum_{c=1}^{n} X_{i,c} \cdot Y_{c,k}
     1251\end{displaymath}
      1252The majority of the computation in this benchmark involves computing the final matrix, so this benchmark stresses the actor systems' ability to run work within actors, rather than stressing the message-send system.
     1253
     1254The matrix-multiply benchmark uses input matrices $X$ and $Y$, which are both $3072$ by $3072$ in size.
      1255An actor is created for each row of $X$ and sent a message indicating its row of $X$ and the columns of $Y$ needed to calculate that row of the result matrix $Z$.
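
The per-actor work can be sketched in C++ as follows (illustrative only; the names are hypothetical and the matrix size is scaled down from 3072): the actor for row $i$ uses its row of $X$ and the columns of $Y$ to fill row $i$ of $Z$:
\begin{verbatim}
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// hypothetical per-actor handler: fill row i of Z from row i of X and all of Y
void multiply_row(std::size_t i, const Matrix& X, const Matrix& Y, Matrix& Z) {
    const std::size_t n = Y.size(), r = Y[0].size();
    for (std::size_t k = 0; k < r; k += 1) {
        double sum = 0.0;
        for (std::size_t c = 0; c < n; c += 1)
            sum += X[i][c] * Y[c][k];          // Z[i][k] = sum_c X[i][c] * Y[c][k]
        Z[i][k] = sum;
    }
}

int main() {
    const std::size_t n = 8;                   // scaled down from the benchmark's 3072
    Matrix X(n, std::vector<double>(n, 1.0));
    Matrix Y(n, std::vector<double>(n, 1.0));
    Matrix Z(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; i += 1)     // in the benchmark, each row is a separate actor
        multiply_row(i, X, Y, Z);
    return Z[0][0] == double(n) ? 0 : 1;       // with all-ones inputs, every entry equals n
}
\end{verbatim}
Because each actor owns a distinct row of $Z$, the row computations proceed without synchronization, which is why the systems cluster tightly on this benchmark.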
     1256
      1257Figures~\ref{f:MatrixAMD} and~\ref{f:MatrixIntel} show the matrix-multiply results.
     1258Given that the bottleneck of this benchmark is the computation of the result matrix, it follows that the results are tightly clustered across all actor systems.
      1259In Figure~\ref{f:MatrixAMD}, \uC and \CFA have identical performance, while in Figure~\ref{f:MatrixIntel}, \uC pulls ahead of \CFA after 24 cores, likely due to costs associated with work stealing while hyperthreading.
      1260As mentioned in Section~\ref{s:executorPerf}, it is hypothesized that CAF performs better in this benchmark than in the others due to its eager work-stealing implementation.
     1261
     1262\begin{figure}
     1263        \centering
     1264        \subfloat[AMD Matrix Benchmark]{
     1265                \resizebox{0.5\textwidth}{!}{\input{figures/nasusMatrix.pgf}}
     1266                \label{f:MatrixAMD}
     1267        }
     1268        \subfloat[Intel Matrix Benchmark]{
     1269                \resizebox{0.5\textwidth}{!}{\input{figures/pykeMatrix.pgf}}
     1270                \label{f:MatrixIntel}
     1271        }
     1272        \caption{The matrix benchmark comparing actor systems (lower is better).}
     1273\end{figure}
     1274
    11631275\subsection{Work Stealing}
    11641276
     
    12111323Note that the non-stealing variation of balance-one slows down marginally as the number of cores increases, due to having to create more dummy actors on the inactive cores during startup.
    12121324
    1213 \subsection{Executor}\label{s:executorPerf}
    1214 
    1215 The microbenchmarks in this section are designed to stress the executor.
    1216 The executor is the scheduler of an actor system and is responsible for organizing the interaction of executor threads to service the needs of a workload.
    1217 In the executor benchmark, 40,000 actors are created and each actor is placed into a group of 100, who send and receive 100 messages to/from each actors in their group.
    1218 Each time an actor completes all their sends and receives, they are done a round.
    1219 After all groups have completed 400 rounds the system terminates.
    1220 This microbenchmark is designed to flood the executor with a large number of messages flowing between actors.
    1221 Given there is no work associated with each message, other than sending more messages, the intended bottleneck of this experiment is the executor message send process.
    1222 
    1223 \begin{figure}
    1224         \centering
    1225         \subfloat[AMD Executor Benchmark]{
    1226                 \resizebox{0.5\textwidth}{!}{\input{figures/nasusExecutor.pgf}}
    1227                 \label{f:ExecutorAMD}
    1228         }
    1229         \subfloat[Intel Executor Benchmark]{
    1230                 \resizebox{0.5\textwidth}{!}{\input{figures/pykeExecutor.pgf}}
    1231                 \label{f:ExecutorIntel}
    1232         }
    1233         \caption{The executor benchmark comparing actor systems (lower is better).}
    1234 \end{figure}
    1235 
    1236 The results of the executor benchmark in Figures~\ref{f:ExecutorIntel} and \ref{f:ExecutorAMD} show \CFA with the lowest runtime relative to its peers.
    1237 The difference in runtime between \uC and \CFA is largely due to the usage of the copy queue described in Section~\ref{s:copyQueue}.
    1238 The copy queue both reduces and consolidates allocations, heavily reducing contention on the memory allocator.
    1239 Additionally, due to the static typing in \CFA's actor system, it is able to get rid of expensive dynamic casts that occur in \uC to discriminate messages by type.
    1240 Note that dynamic casts are usually not very expensive, but relative to the high performance of the rest of the implementation of the \uC actor system, the cost is significant.
    1241 
    12421325\begin{figure}
    12431326        \centering
     
    12571340\begin{figure}
    12581341        \centering
    1259         \subfloat[AMD Repeat Benchmark]{
    1260                 \resizebox{0.5\textwidth}{!}{\input{figures/nasusRepeat.pgf}}
    1261                 \label{f:RepeatAMD}
    1262         }
    1263         \subfloat[Intel Repeat Benchmark]{
    1264                 \resizebox{0.5\textwidth}{!}{\input{figures/pykeRepeat.pgf}}
    1265                 \label{f:RepeatIntel}
    1266         }
    1267         \caption{The repeat benchmark comparing actor systems (lower is better).}
    1268 \end{figure}
    1269 
    1270 The repeat microbenchmark also evaluates the executor.
    1271 It stresses the executor's ability to withstand contention on queues, as it repeatedly fans out messages from a single client to 100000 servers who then all respond to the client.
    1272 After this scatter and gather repeats 200 times the benchmark terminates.
    1273 The messages from the servers to the client will likely all come in on the same queue, resulting in high contention.
    1274 As such this benchmark will not scale with the number of processors, since more processors will result in higher contention.
    1275 In Figure~\ref{f:RepeatAMD} we can see that \CFA performs well compared to \uC, however by less of a margin than the executor benchmark.
    1276 One factor in this result is that the contention on the queues poses a significant bottleneck.
    1277 As such the gains from using the copy queue are much less apparent.
    1278 
    1279 \begin{figure}
    1280         \centering
    12811342        \subfloat[AMD \CFA Repeat Benchmark]{
    12821343                \resizebox{0.5\textwidth}{!}{\input{figures/nasusCFARepeat.pgf}}
     
    12891350        \caption{The repeat benchmark comparing \CFA stealing heuristics (lower is better).}
    12901351\end{figure}
    1291 
    1292 In Figure~\ref{f:RepeatIntel} \uC and \CFA are very comparable.
    1293 In comparison with the other systems \uC does well on the repeat benchmark since it does not have work stealing.
    1294 The client of this experiment is long running and maintains a lot of state, as it needs to know the handles of all the servers.
    1295 When stealing the client or its respective queue (in \CFA's inverted model), moving the client incurs a high cost due to cache invalidation.
    1296 As such stealing the client can result in a hit in performance.
    12971352
    12981353This result is shown in Figures~\ref{f:cfaRepeatAMD} and~\ref{f:cfaRepeatIntel}, where the no-stealing version of \CFA performs better than both stealing variations.
     
    13071362In \CFA, if the system is tuned so that it steals work much more eagerly with a randomized stealing heuristic, it is able to replicate the results that CAF achieves in the matrix benchmark; however, this tuning performs much worse on all the other microbenchmarks presented, since they all perform a small amount of work per message.
    13081363
    1309 \begin{table}[t]
    1310         \centering
    1311         \setlength{\extrarowheight}{2pt}
    1312         \setlength{\tabcolsep}{5pt}
    1313 
    1314         \caption{Executor Program Memory High Watermark}
    1315         \label{t:ExecutorMemory}
    1316         \begin{tabular}{*{5}{r|}r}
    1317                 & \multicolumn{1}{c|}{\CFA} & \multicolumn{1}{c|}{CAF} & \multicolumn{1}{c|}{Akka} & \multicolumn{1}{c|}{\uC} & \multicolumn{1}{c@{}}{ProtoActor} \\
    1318                 \hline
    1319                 AMD             & \input{data/pykeExecutorMem} \\
    1320                 \hline
    1321                 Intel   & \input{data/nasusExecutorMem}
    1322         \end{tabular}
    1323 \end{table}
    1324 
    1325 Figure~\ref{t:ExecutorMemory} shows the high memory watermark of the actor systems when running the executor benchmark on 48 cores.
    1326 \CFA has a high watermark relative to the other non-garbage collected systems \uC, and CAF.
    1327 This is a result of the copy queue data structure, as it will over-allocate storage and not clean up eagerly, whereas the per envelope allocations will always allocate exactly the amount of storage needed.
    1328 Despite having a higher watermark, the \CFA memory usage remains comparable to other non-garbage-collected systems.
    1329 
    1330 \subsection{Matrix Multiply}
    1331 The matrix benchmark evaluates the actor systems in a practical application, where actors concurrently multiplies two matrices.
    1332 The majority of the computation in this benchmark involves computing the final matrix, so this benchmark stresses the actor systems' ability to have actors run work, rather than stressing the executor or message sending system.
    1333 
    1334 Given $Z_{m,r} = X_{m,n} \cdot Y_{n,r}$, the matrix multiply is defined as:
    1335 \begin{displaymath}
    1336 X_{i,j} \cdot Y_{j,k} = \left( \sum_{c=1}^{j} X_{row,c}Y_{c,column} \right)_{i,k}
    1337 \end{displaymath}
    1338 
    1339 The benchmark uses input matrices $X$ and $Y$ that are both $3072$ by $3072$ in size.
    1340 An actor is made for each row of $X$ and is passed via message the information needed to calculate a row of the result matrix $Z$.
    1341 
    1342 
    1343 Given that the bottleneck of the benchmark is the computation of the result matrix, it follows that the results in Figures~\ref{f:MatrixAMD} and \ref{f:MatrixIntel} are clustered closer than other experiments.
    1344 In Figure~\ref{f:MatrixAMD} \uC and \CFA have identical performance and in Figure~\ref{f:MatrixIntel} \uC pulls ahead of \CFA after 24 cores likely due to costs associated with work stealing while hyperthreading.
    1345 As mentioned in \ref{s:executorPerf}, it is hypothesized that CAF performs better in this benchmark compared to others due to its eager work stealing implementation.
     1364
      1365In comparison with the other systems, \uC does well on the repeat benchmark since it does not have work stealing.
      1366The client in this experiment is long running and maintains a lot of state, as it needs to know the handles of all the servers.
      1367When stealing the client or its respective queue (in \CFA's inverted model), moving the client incurs a high cost due to cache invalidation.
      1368As such, stealing the client can result in a performance hit.
     1369
    13461370In Figures~\ref{f:cfaMatrixAMD} and~\ref{f:cfaMatrixIntel}, there is negligible performance difference across the \CFA stealing heuristics.
    1347 
    1348 \begin{figure}
    1349         \centering
    1350         \subfloat[AMD Matrix Benchmark]{
    1351                 \resizebox{0.5\textwidth}{!}{\input{figures/nasusMatrix.pgf}}
    1352                 \label{f:MatrixAMD}
    1353         }
    1354         \subfloat[Intel Matrix Benchmark]{
    1355                 \resizebox{0.5\textwidth}{!}{\input{figures/pykeMatrix.pgf}}
    1356                 \label{f:MatrixIntel}
    1357         }
    1358         \caption{The matrix benchmark comparing actor systems (lower is better).}
    1359 \end{figure}
    13601371
    13611372\begin{figure}