Changeset c54ca97 for doc/theses/colby_parsons_MMAth/text/actors.tex
- Timestamp: Jul 11, 2023, 9:59:20 AM (14 months ago)
- Branches: master
- Children: 4c8ce47
- Parents: a2eb21a (diff), 39e6309 (diff)
Note: this is a merge changeset; the changes displayed below correspond to the merge itself. Use the (diff) links above to see all the changes relative to each parent.
- Files: 1 edited
doc/theses/colby_parsons_MMAth/text/actors.tex
a2eb21a vs. c54ca97 (removed lines are marked "-", added lines "+"; unchanged context lines carry both old and new line numbers):

752 752    The values swapped are never null pointers, so a null pointer can be used as an intermediate value during the swap.
753 753    \end{enumerate}
754      - Figure~\ref{c:swap} shows the \CFA pseudocode for the \gls{dcasw}.
    754  + Figure~\ref{f:dcaswImpl} shows the \CFA pseudocode for the \gls{dcasw}.
755 755    In detail, a thief performs the following steps to swap two pointers:
756 756    \begin{enumerate}[start=0]
…
765 765    Since each worker owns a disjoint range of the queue array, it is impossible for @my_queue@ to be null.
766 766    Note, this algorithm is simplified due to each worker owning a disjoint range, allowing only the @vic_queue@ to be checked for null.
767      - This was not listed as a special case of this algorithm, since this requirement can be avoided by modifying Step 1 of Figure~\ref{c:swap} to also check @my_queue@ for null.
    767  + This was not listed as a special case of this algorithm, since this requirement can be avoided by modifying Step 1 of Figure~\ref{f:dcaswImpl} to also check @my_queue@ for null.
768 768    Further discussion of this generalization is omitted since it is not needed for the presented application.
769 769    \item
…
811 811    \end{cfa}
812 812    \caption{DCASW Concurrent}
813      - \label{c:swap}
    813  + \label{f:dcaswImpl}
814 814    \end{figure}
815 815
…
817 817    \gls{dcasw} is correct in both the success and failure cases.
818 818    \end{theorem}
819      - To verify sequential correctness, Figure~\ref{s:swap} shows a simplified \gls{dcasw}.
    819  + To verify sequential correctness, Figure~\ref{f:seqSwap} shows a simplified \gls{dcasw}.
820 820    Step 1 is missing in the sequential example since it only matters in the concurrent context.
821 821    By inspection, the sequential swap copies each pointer being swapped, and then the original values of each pointer are reset using the copy of the other pointer.
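For context on the hunks above, which rename the \gls{dcasw} figure labels, the pointer swap those figures describe can be sketched in portable C11 as below. This is a hedged illustration of the steps walked through in the text (copy victim pointer, bail out on null, publish null as the intermediate value, then race on a single CAS), not the thesis's actual \CFA code; `queue`, `try_swap`, and the slot parameters are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct queue queue;   // opaque work-queue type (illustrative)

/* Sketch of the null-intermediate pointer swap (DCASW).  my_slot is owned
 * by the calling thief, so *my_slot is never null on entry; only the
 * victim's slot must be checked for the null intermediate value. */
static bool try_swap( queue *_Atomic *my_slot, queue *_Atomic *vic_slot ) {
    queue *vic_queue = atomic_load( vic_slot );   // Step 0: copy victim pointer
    if ( vic_queue == NULL ) return false;        // Step 1: victim mid-swap, give up
    queue *my_queue = atomic_load( my_slot );     // copy own pointer (never null)
    atomic_store( my_slot, NULL );                // Step 2: null marks swap in progress
    // Step 3: race other thieves to install our queue in the victim's slot.
    if ( ! atomic_compare_exchange_strong( vic_slot, &vic_queue, my_queue ) ) {
        atomic_store( my_slot, my_queue );        // lost the race: restore and fail
        return false;
    }
    atomic_store( my_slot, vic_queue );           // won the race: complete the swap
    return true;
}
```

On the failure paths the thief's own slot is left (or restored) unchanged, matching the claim in the text that the failure case is also correct.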
…
836 836    \end{cfa}
837 837    \caption{DCASW Sequential}
838      - \label{s:swap}
    838  + \label{f:seqSwap}
839 839    \end{figure}
840 840
…
895 895    First it is important to state that a thief will not attempt to steal from themselves.
896 896    As such, the victim here is not also a thief.
897      - Stepping through the code in \ref{c:swap}, for all thieves steps 0-1 succeed since the victim is not stealing and will have no queue pointers set to be @0p@.
    897  + Stepping through the code in \ref{f:dcaswImpl}, for all thieves steps 0-1 succeed since the victim is not stealing and will have no queue pointers set to be @0p@.
898 898    Similarly for all thieves step 2 will succeed since no one is stealing from any of the thieves.
899 899    In step 3 the first thief to @CAS@ will win the race and successfully swap the queue pointer.
…
991 991
992 992    The longest-victim heuristic maintains a timestamp per executor thread that is updated every time a worker attempts to steal work.
993      - The timestamps are generated using @rdtsc@~\cite{} and are stored in a shared array, with one index per worker.
    993  + The timestamps are generated using @rdtsc@~\cite{IntelManual} and are stored in a shared array, with one index per worker.
994 994    Thieves then attempt to steal from the worker with the oldest timestamp.
995 995    The intuition behind this heuristic is that the slowest worker will receive help via work stealing until it becomes a thief, which indicates that it has caught up to the pace of the rest of the workers.
996      - This heuristic means that if two thieves look to steal at the same time, they likely attempt to steal from the same victim.
    996  + This heuristic should ideally result in lowered latency for message sends to victim workers that are overloaded with work.
    997  + However, a side effect of this heuristic is that if two thieves look to steal at the same time, they likely attempt to steal from the same victim.
997 998    This approach consequently does increase the chance of contention among thieves;
998 999    however, given that workers have multiple queues, often in the tens or hundreds of queues, it is rare for two thieves to attempt stealing from the same queue.
…
1084 1085    The performance of \CFA's actor system is tested using a suite of microbenchmarks, and compared with other actor systems.
1085 1086    Most of the benchmarks are the same as those presented in \cite{Buhr22}, with a few additions.
1086        - % C_TODO cite actor paper
1087 1087    This work compares with the following actor systems: \CFA 1.0, \uC 7.0.0, Akka Typed 2.7.0, CAF 0.18.6, and ProtoActor-Go v0.0.0-20220528090104-f567b547ea07.
1088 1088    Akka Classic is omitted as Akka Typed is their newest version and seems to be the direction they are headed.
…
1096 1096
1097 1097    The benchmarks are run on 1--48 cores.
1098        - On the Intel, with 24 core sockets, there is the choice to either hopping sockets or using hyperthreads on the same socket.
     1098  + On the Intel, with 24 core sockets, there is the choice to either hop sockets or use hyperthreads on the same socket.
1099 1099    Either choice causes a blip in performance, which is seen in the subsequent performance graphs.
1100        - The choice is to use hyperthreading instead of hopping sockets for experiments with more than 24 cores.
     1100  + The choice in this work is to use hyperthreading instead of hopping sockets for experiments with more than 24 cores.
1101 1101
1102 1102    All benchmarks are run 5 times and the median is taken.
…
1159 1159    However, Akka and ProtoActor slow down by two orders of magnitude.
1160 1160    This difference is likely a result of Akka and ProtoActor's garbage collection, which results in performance delays for allocation-heavy workloads, whereas \uC and \CFA have explicit allocation/deallocation.
1161        - Tuning the garbage collection might reduce garbage-collection cost, but this exercise is beyond the scope of this work.
     1161  + Turning off the garbage collection might reduce garbage-collection cost, but this exercise is beyond the scope of this work.
1162 1162
1163 1163    \subsection{Executor}\label{s:executorPerf}
…
1209 1209    It stresses the executor's ability to withstand contention on queues.
1210 1210    The repeat benchmark repeatedly fans out messages from a single client to 100,000 servers who then respond back to the client.
1211        - The scatter and gather are repeats 200 times.
     1211  + The scatter and gather repeats 200 times.
1212 1212    The messages from the servers to the client all come to the same mailbox queue associated with the client, resulting in high contention among servers.
1213 1213    As such, this benchmark does not scale with the number of processors, since more processors result in higher contention on the single mailbox queue.
…
1219 1219    on the Intel, uC++, ProtoActor, and Akka are spread out.
1220 1220    Finally, \CFA runs consistently on both the AMD and Intel, and is faster than \uC on the AMD, but slightly slower on the Intel.
1221        - Here, gains from using the copy queue are much less apparent.
     1221  + This benchmark is a pathological case for work-stealing actor systems, as the majority of work is being performed by the single actor conducting the scatter/gather.
     1222  + The impact of work stealing on this benchmark is discussed further in Section~\ref{s:steal_perf}.
     1223  + Here, gains from using the copy queue are much less apparent, due to the costs of stealing.
…
1258 1260    Given that the bottleneck of this benchmark is the computation of the result matrix, it follows that the results are tightly clustered across all actor systems.
1259 1261    \uC and \CFA have identical performance and in Figure~\ref{f:MatrixIntel} \uC pulls ahead of \CFA after 24 cores, likely due to costs associated with work stealing while hyperthreading.
1260        - As mentioned in \ref{s:executorPerf}, it is hypothesized that CAF performs better in this benchmark compared to others due to its eager work stealing implementation.
     1262  + It is hypothesized that CAF performs better in this benchmark compared to others due to its eager work-stealing implementation, which will be discussed further in Section~\ref{s:steal_perf}.
…
1273 1275    \end{figure}
1274 1276
1275        - \subsection{Work Stealing}
     1277  + \subsection{Work Stealing}\label{s:steal_perf}
1276 1278
1277 1279    \CFA's work stealing mechanism uses the longest-victim heuristic, introduced in Section~\ref{s:victimSelect}.
…
1352 1354
1353 1355    This result is shown in Figure~\ref{f:cfaRepeatAMD} and \ref{f:cfaRepeatIntel} where the no-stealing version of \CFA performs better than both stealing variations.
     1356  + As mentioned earlier, the repeat benchmark is a pathological case for work-stealing systems since there is one actor with the majority of the work, and not enough other work to go around.
     1357  + If that actor or its mail queue is stolen by the work-stealing system, it incurs a huge cost to move the work, as the single actor touches a lot of memory and needs to refill its local cache.
     1358  + This steal is likely to happen since there is little other work in the system between scatter/gather rounds.
1354 1359    In particular, on the Intel machine in Figure~\ref{f:cfaRepeatIntel}, the cost of stealing is higher, which can be seen in the vertical shift of Akka, CAF and \CFA results in Figure~\ref{f:RepeatIntel} (\uC and ProtoActor do not have work stealing).
1355 1360    The shift for CAF is particularly large, which further supports the hypothesis that CAF's work stealing is particularly eager.
…
1360 1365    This hypothesis stems from experimentation with \CFA.
1361 1366    CAF uses a randomized work-stealing heuristic.
1362        - In \CFA if the system is tuned so that it steals work much more eagerly with a randomized it was able to replicate the results that CAF achieves in the matrix benchmark, but this tuning performed much worse on all other microbenchmarks that we present, since they all perform a small amount of work per message.
     1367  + Tuning the \CFA actor system to steal work much more eagerly with randomized victim-selection heuristics provided similar results to what CAF achieved in the matrix benchmark.
     1368  + This experimental tuning performed much worse on all other microbenchmarks that we present, since they all perform a small amount of work per message.
1364 1369
1365 1370    In comparison with the other systems, \uC does well on the repeat benchmark since it does not have work stealing.
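Several hunks above refine the description of the longest-victim heuristic: each steal attempt stamps the thief's slot in a shared timestamp array, and the next thief targets the worker with the oldest stamp. A hedged C sketch of that selection logic follows; the thesis reads the x86 cycle counter via @rdtsc@, but a monotonic counter stands in here so the sketch stays portable, and all names (`last_steal`, `pick_victim`, `NWORKERS`) are hypothetical.

```c
#include <stdint.h>
#include <stdatomic.h>

#define NWORKERS 8

static _Atomic uint64_t last_steal[NWORKERS];   // one timestamp slot per worker

// Stand-in for the rdtsc cycle counter: a shared monotonic tick.
static uint64_t timestamp( void ) {
    static _Atomic uint64_t tick = 1000;
    return atomic_fetch_add( &tick, 1 );
}

/* Called by worker `self` when it runs out of work: record that it is now
 * stealing, then pick the worker that has gone longest without stealing. */
static int pick_victim( int self ) {
    atomic_store( &last_steal[self], timestamp() );
    int victim = -1;
    uint64_t oldest = UINT64_MAX;
    for ( int w = 0; w < NWORKERS; w += 1 ) {
        if ( w == self ) continue;              // thieves never steal from themselves
        uint64_t t = atomic_load( &last_steal[w] );
        if ( t < oldest ) { oldest = t; victim = w; }
    }
    return victim;
}
```

Because every thief scans the same array, two concurrent thieves usually compute the same oldest victim, which is exactly the contention side effect the added lines in the hunk above point out.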