Changeset 614868b for doc/theses


Timestamp: Jul 11, 2023, 8:54:41 AM
Author: caparsons <caparson@…>
Branches: master
Children: ea1bb94
Parents: f80e0f1e
Message: first pass at cleaning up per chapter reorganization

File: 1 edited
  • doc/theses/colby_parsons_MMAth/text/actors.tex

    rf80e0f1e r614868b  
    10841084The performance of \CFA's actor system is tested using a suite of microbenchmarks, and compared with other actor systems.
    10851085Most of the benchmarks are the same as those presented in \cite{Buhr22}, with a few additions.
    1086 % C_TODO cite actor paper
    10871086This work compares with the following actor systems: \CFA 1.0, \uC 7.0.0, Akka Typed 2.7.0, CAF 0.18.6, and ProtoActor-Go v0.0.0-20220528090104-f567b547ea07.
    10881087Akka Classic is omitted as Akka Typed is their newest version and seems to be the direction they are headed.
     
    10961095
    10971096The benchmarks are run on 1--48 cores.
    1098 On the Intel, with 24 core sockets, there is the choice to either hopping sockets or using hyperthreads on the same socket.
     1097On the Intel, with 24 core sockets, there is the choice to either hop sockets or use hyperthreads on the same socket.
    10991098Either choice causes a blip in performance, which is seen in the subsequent performance graphs.
    1100 The choice is to use hyperthreading instead of hopping sockets for experiments with more than 24 cores.
     1099The choice in this work is to use hyperthreading instead of hopping sockets for experiments with more than 24 cores.
    11011100
    11021101All benchmarks are run 5 times and the median is taken.
     
    11591158However, Akka and ProtoActor slow down by two orders of magnitude.
    11601159This difference is likely a result of Akka and ProtoActor's garbage collection, which results in performance delays for allocation-heavy workloads, whereas \uC and \CFA have explicit allocation/deallocation.
    1161 Tuning the garage collection might reduce garbage-collection cost, but this exercise is beyond the scope of this work.
     1160Turning off the garbage collection might reduce garbage-collection cost, but this exercise is beyond the scope of this work.
    11621161
    11631162\subsection{Executor}\label{s:executorPerf}
     
    12091208It stresses the executor's ability to withstand contention on queues.
    12101209The repeat benchmark repeatedly fans out messages from a single client to 100,000 servers, which then respond to the client.
    1211 The scatter and gather are repeats 200 times.
     1210The scatter and gather are repeated 200 times.
    12121211The messages from the servers to the client all come to the same mailbox queue associated with the client, resulting in high contention among servers.
    12131212As such, this benchmark does not scale with the number of processors, since more processors result in higher contention on the single mailbox queue.
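For illustration, the following is a minimal, self-contained C sketch that models only the message pattern described above: one client, a large fan-out to servers, and all replies funnelling into the client's single mailbox, repeated for a fixed number of rounds. The constants and names are illustrative assumptions and do not correspond to the actual benchmark code in any of the compared actor systems.

    // Sequential model of the repeat (scatter/gather) benchmark structure:
    // a single client fans a message out to NSERVERS servers, every server
    // replies, and all replies land in the client's one mailbox before the
    // next round starts.  Illustrative sketch only, not the thesis code.
    #include <stdio.h>

    #define NSERVERS 100000   // servers receiving each scatter
    #define ROUNDS   200      // scatter/gather repetitions

    int main( void ) {
        long scattered = 0, gathered = 0;
        for ( int r = 0; r < ROUNDS; r += 1 ) {
            for ( int s = 0; s < NSERVERS; s += 1 ) scattered += 1;   // client -> server
            for ( int s = 0; s < NSERVERS; s += 1 ) gathered  += 1;   // server -> client mailbox
            // every reply targets the SAME client mailbox, so extra processors
            // add contention on that queue rather than useful parallelism
        }
        printf( "scattered %ld messages, gathered %ld replies\n", scattered, gathered );
        return 0;
    }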
     
    12191218on the Intel, \uC, ProtoActor, and Akka are spread out.
    12201219Finally, \CFA runs consistently on both the AMD and the Intel, and is faster than \uC on the AMD, but slightly slower on the Intel.
    1221 Here, gains from using the copy queue are much less apparent.
     1220This benchmark is a pathological case for work stealing actor systems, as the majority of work is being performed by the single actor conducting the scatter/gather.
     1221The impact of work stealing on this benchmark is discussed further in Section~\ref{s:steal_perf}.
     1222Here, gains from using the copy queue are much less apparent, due to the costs of stealing.
    12221223
    12231224\begin{table}
     
    12581259Given that the bottleneck of this benchmark is the computation of the result matrix, it follows that the results are tightly clustered across all actor systems.
    12591260\uC and \CFA have nearly identical performance, and in Figure~\ref{f:MatrixIntel} \uC pulls ahead of \CFA after 24 cores, likely due to costs associated with work stealing while hyperthreading.
    1260 As mentioned in \ref{s:executorPerf}, it is hypothesized that CAF performs better in this benchmark compared to others due to its eager work stealing implementation.
     1261It is hypothesized that CAF performs better in this benchmark compared to others due to its eager work stealing implementation, which will be discussed further in Section~\ref{s:steal_perf}.
    12611262
    12621263\begin{figure}
     
    12731274\end{figure}
    12741275
    1275 \subsection{Work Stealing}
     1276\subsection{Work Stealing}\label{s:steal_perf}
    12761277
    12771278\CFA's work stealing mechanism uses the longest-victim heuristic, introduced in Section~\ref{s:victimSelect}.
     
    13521353
    13531354This result is shown in Figures~\ref{f:cfaRepeatAMD} and \ref{f:cfaRepeatIntel}, where the no-stealing version of \CFA performs better than both stealing variations.
     1355As mentioned earlier, the repeat benchmark is a pathological case for work stealing systems since there is one actor with the majority of the work, and not enough other work to go around.
      1356If that actor or its mail queue is stolen by the work stealing system, it incurs a huge cost to move the work, as the single actor touches a lot of memory and needs to refill its local cache.
     1357This steal is likely to happen since there is little other work in the system between scatter/gather rounds.
    13541358In particular, on the Intel machine in Figure~\ref{f:cfaRepeatIntel}, the cost of stealing is higher, which can be seen in the vertical shift of the Akka, CAF and \CFA results in Figure~\ref{f:RepeatIntel} (\uC and ProtoActor do not have work stealing).
    13551359The shift for CAF is especially large, which further supports the hypothesis that CAF's work stealing is particularly eager.
     
    13601364This hypothesis stems from experimentation with \CFA.
    13611365CAF uses a randomized work stealing heuristic.
    1362 In \CFA if the system is tuned so that it steals work much more eagerly with a randomized it was able to replicate the results that CAF achieves in the matrix benchmark, but this tuning performed much worse on all other microbenchmarks that we present, since they all perform a small amount of work per message.
    1363 
     1366Tuning the \CFA actor system to steal work much more eagerly with randomized victim selection heuristics provided similar results to what CAF achieved in the matrix benchmark.
     1367This experimental tuning performed much worse on all other microbenchmarks that we present, since they all perform a small amount of work per message.
    13641368
    13651369In comparison with the other systems, \uC does well on the repeat benchmark since it does not have work stealing.
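Returning to the victim-selection tuning discussed above, the following is a small, self-contained C sketch of a randomized victim choice of the kind that tuning experiment used. The worker array, queue lengths, and function names are hypothetical; neither CAF's nor the \CFA actor system's scheduler code is reproduced here.

    // Randomized victim selection for work stealing: the thief picks another
    // worker uniformly at random and inspects its queue.  Hypothetical sketch;
    // not the actual stealing code of either CAF or the \CFA actor system.
    #include <stdio.h>
    #include <stdlib.h>

    #define NWORKERS 48

    typedef struct { int id; int pending; } worker_t;   // stand-in worker with a queue length

    // pick any worker other than the thief with equal probability
    static int pick_random_victim( int thief ) {
        int victim;
        do { victim = rand() % NWORKERS; } while ( victim == thief );
        return victim;
    }

    int main( void ) {
        worker_t workers[NWORKERS];
        for ( int i = 0; i < NWORKERS; i += 1 ) workers[i] = (worker_t){ i, i % 7 };

        int thief = 0, victim = pick_random_victim( thief );
        printf( "worker %d tries to steal from worker %d (queue length %d)\n",
                thief, victim, workers[victim].pending );
        return 0;
    }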