Changeset 614868b
- Timestamp: Jul 11, 2023, 8:54:41 AM
- Branches: master
- Children: ea1bb94
- Parents: f80e0f1e
- Files: 1 edited
Legend:
- Unmodified (no marker)
- Added (marked +)
- Removed (marked -)
doc/theses/colby_parsons_MMAth/text/actors.tex
--- rf80e0f1e
+++ r614868b

 The performance of \CFA's actor system is tested using a suite of microbenchmarks, and compared with other actor systems.
 Most of the benchmarks are the same as those presented in \cite{Buhr22}, with a few additions.
-% C_TODO cite actor paper
 This work compares with the following actor systems: \CFA 1.0, \uC 7.0.0, Akka Typed 2.7.0, CAF 0.18.6, and ProtoActor-Go v0.0.0-20220528090104-f567b547ea07.
 Akka Classic is omitted as Akka Typed is their newest version and seems to be the direction they are headed.
…
 The benchmarks are run on 1--48 cores.
-On the Intel, with 24-core sockets, there is the choice to either hopping sockets or using hyperthreads on the same socket.
+On the Intel, with 24-core sockets, there is the choice to either hop sockets or use hyperthreads on the same socket.
 Either choice causes a blip in performance, which is seen in the subsequent performance graphs.
-The choice is to use hyperthreading instead of hopping sockets for experiments with more than 24 cores.
+The choice in this work is to use hyperthreading instead of hopping sockets for experiments with more than 24 cores.

 All benchmarks are run 5 times and the median is taken.
…
 However, Akka and ProtoActor slow down by two orders of magnitude.
 This difference is likely a result of Akka and ProtoActor's garbage collection, which causes performance delays for allocation-heavy workloads, whereas \uC and \CFA have explicit allocation/deallocation.
-Tuning the garbage collection might reduce garbage-collection cost, but this exercise is beyond the scope of this work.
+Turning off the garbage collection might reduce garbage-collection cost, but this exercise is beyond the scope of this work.
…
 \subsection{Executor}\label{s:executorPerf}
…
 It stresses the executor's ability to withstand contention on queues.
 The repeat benchmark repeatedly fans out messages from a single client to 100,000 servers, who then respond back to the client.
-The scatter and gather are repeats 200 times.
+The scatter and gather repeat 200 times.
 The messages from the servers to the client all come to the same mailbox queue associated with the client, resulting in high contention among servers.
 As such, this benchmark does not scale with the number of processors, since more processors result in higher contention on the single mailbox queue.
…
 on the Intel, uC++, ProtoActor, and Akka are spread out.
 Finally, \CFA runs consistently on both the AMD and Intel, and is faster than \uC on the AMD, but slightly slower on the Intel.
-Here, gains from using the copy queue are much less apparent.
+This benchmark is a pathological case for work-stealing actor systems, as the majority of the work is performed by the single actor conducting the scatter/gather.
+The impact of work stealing on this benchmark is discussed further in Section~\ref{s:steal_perf}.
+Here, gains from using the copy queue are much less apparent, due to the costs of stealing.

 \begin{table}
…
 Given that the bottleneck of this benchmark is the computation of the result matrix, it follows that the results are tightly clustered across all actor systems.
 \uC and \CFA have identical performance, and in Figure~\ref{f:MatrixIntel} \uC pulls ahead of \CFA after 24 cores, likely due to costs associated with work stealing while hyperthreading.
-As mentioned in \ref{s:executorPerf}, it is hypothesized that CAF performs better in this benchmark compared to others due to its eager work stealing implementation.
+It is hypothesized that CAF performs better in this benchmark compared to others due to its eager work-stealing implementation, which is discussed further in Section~\ref{s:steal_perf}.
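The fan-out/fan-in structure of the repeat benchmark described above can be made concrete with a minimal Python sketch. Threads stand in for actors, and `repeat_benchmark` and every other name here are illustrative, not taken from the benchmark suite; the point is that every server reply funnels through the client's single mailbox queue, so extra processors only add contention.

```python
import queue
import threading

def repeat_benchmark(n_servers: int = 100, n_rounds: int = 5) -> int:
    """Sketch of the repeat benchmark: one client scatters a message to
    n_servers servers, every server replies into the client's single
    mailbox, and the round repeats n_rounds times."""
    client_mailbox: queue.Queue = queue.Queue()  # the single contended queue
    received = 0
    for _ in range(n_rounds):
        # scatter: one message per server; each server replies immediately
        servers = [
            threading.Thread(target=lambda: client_mailbox.put("reply"))
            for _ in range(n_servers)
        ]
        for s in servers:
            s.start()
        # gather: the client drains exactly n_servers replies before the
        # next round -- this serialized drain on one queue is why the
        # benchmark does not scale with processor count
        for _ in range(n_servers):
            client_mailbox.get()
            received += 1
        for s in servers:
            s.join()
    return received

if __name__ == "__main__":
    print(repeat_benchmark())  # 100 servers x 5 rounds = 500 replies
```

The real benchmark uses 100,000 servers and 200 rounds; the parameters here are scaled down so the sketch runs quickly.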
 \begin{figure}
…
 \end{figure}

-\subsection{Work Stealing}
+\subsection{Work Stealing}\label{s:steal_perf}

 \CFA's work stealing mechanism uses the longest-victim heuristic, introduced in Section~\ref{s:victimSelect}.
…
 This result is shown in Figures~\ref{f:cfaRepeatAMD} and \ref{f:cfaRepeatIntel}, where the no-stealing version of \CFA performs better than both stealing variations.
+As mentioned earlier, the repeat benchmark is a pathological case for work-stealing systems, since one actor has the majority of the work and there is not enough other work to go around.
+If that actor or its mail queue is stolen by the work-stealing system, a huge cost is incurred to move the work, as the single actor touches a lot of memory and must refill its local cache.
+This steal is likely to happen, since there is little other work in the system between scatter/gather rounds.
 In particular, on the Intel machine in Figure~\ref{f:cfaRepeatIntel}, the cost of stealing is higher, which can be seen in the vertical shift of the Akka, CAF, and \CFA results in Figure~\ref{f:RepeatIntel} (\uC and ProtoActor do not have work stealing).
 The shift for CAF is particularly large, which further supports the hypothesis that CAF's work stealing is particularly eager.
…
 This hypothesis stems from experimentation with \CFA.
 CAF uses a randomized work-stealing heuristic.
-In \CFA if the system is tuned so that it steals work much more eagerly with a randomized
-it was able to replicate the results that CAF achieves in the matrix benchmark, but this tuning performed much worse on all other microbenchmarks that we present, since they all perform a small amount of work per message.
+Tuning the \CFA actor system to steal work much more eagerly with randomized victim-selection heuristics provided similar results to what CAF achieved in the matrix benchmark.
+This experimental tuning performed much worse on all other microbenchmarks that we present, since they all perform a small amount of work per message.

 In comparison with the other systems, \uC does well on the repeat benchmark since it does not have work stealing.
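The contrast between the two victim-selection policies discussed above can be sketched in a few lines of Python. The function names are ours, and "longest victim" is modeled here simply as stealing from the worker with the most queued work — an illustrative simplification; the precise heuristic is the one introduced in Section~\ref{s:victimSelect} of the text.

```python
import random

def longest_victim(queue_lengths: list[int]) -> int:
    """CFA-style heuristic, modeled here as: steal from the worker with
    the most queued work (an illustrative simplification)."""
    return max(range(len(queue_lengths)), key=lambda i: queue_lengths[i])

def random_victim(queue_lengths: list[int], rng: random.Random) -> int:
    """CAF-style randomized choice: any worker may be targeted, even one
    whose only queued work is the lone scatter/gather actor, in which case
    the steal moves a cache-hot actor for little benefit."""
    return rng.randrange(len(queue_lengths))

if __name__ == "__main__":
    lengths = [0, 7, 2, 1]          # worker 1 holds the most work
    print(longest_victim(lengths))  # -> 1
    print(random_victim(lengths, random.Random(0)))
```

On a workload like the matrix benchmark, where each message carries a large amount of work, the eager randomized policy pays off; on the small-message microbenchmarks, its extra steals are pure overhead, matching the tuning results reported above.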