Changeset c54ca97


Timestamp:
Jul 11, 2023, 9:59:20 AM
Author:
Peter A. Buhr <pabuhr@…>
Branches:
master
Children:
4c8ce47
Parents:
a2eb21a (diff), 39e6309 (diff)
Note: this is a merge changeset, the changes displayed below correspond to the merge itself.
Use the (diff) links above to see all the changes relative to each parent.
Message:

Merge branch 'master' of plg.uwaterloo.ca:software/cfa/cfa-cc

Location:
doc/theses/colby_parsons_MMAth
Files:
3 edited

  • doc/theses/colby_parsons_MMAth/local.bib

    ra2eb21a rc54ca97  
    196196  year={2004}
    197197}
     198
     199@manual{IntelManual,
     200    keywords    = {Intel},
     201    title       = {Intel 64 and IA-32 Architectures Software Developer’s Manual},
     202    version     = {Version 080},
     203    organization= {Intel},
     204    month       = mar,
     205    year        = 2023,
     206}
  • doc/theses/colby_parsons_MMAth/text/actors.tex

    ra2eb21a rc54ca97  
    752752The values swapped are never null pointers, so a null pointer can be used as an intermediate value during the swap.
    753753\end{enumerate}
    754 Figure~\ref{c:swap} shows the \CFA pseudocode for the \gls{dcasw}.
     754Figure~\ref{f:dcaswImpl} shows the \CFA pseudocode for the \gls{dcasw}.
    755755In detail, a thief performs the following steps to swap two pointers:
    756756\begin{enumerate}[start=0]
     
    765765Since each worker owns a disjoint range of the queue array, it is impossible for @my_queue@ to be null.
    766766Note, this algorithm is simplified due to each worker owning a disjoint range, allowing only the @vic_queue@ to be checked for null.
    767 This was not listed as a special case of this algorithm, since this requirement can be avoided by modifying Step 1 of Figure~\ref{c:swap} to also check @my_queue@ for null.
     767This was not listed as a special case of this algorithm, since this requirement can be avoided by modifying Step 1 of Figure~\ref{f:dcaswImpl} to also check @my_queue@ for null.
    768768Further discussion of this generalization is omitted since it is not needed for the presented application.
    769769\item
     
    811811\end{cfa}
    812812\caption{DCASW Concurrent}
    813 \label{c:swap}
     813\label{f:dcaswImpl}
    814814\end{figure}
    815815
     
    817817\gls{dcasw} is correct in both the success and failure cases.
    818818\end{theorem}
    819 To verify sequential correctness, Figure~\ref{s:swap} shows a simplified \gls{dcasw}.
     819To verify sequential correctness, Figure~\ref{f:seqSwap} shows a simplified \gls{dcasw}.
    820820Step 1 is missing in the sequential example since it only matters in the concurrent context.
    821821By inspection, the sequential swap copies each pointer being swapped, and then the original values of each pointer are reset using the copy of the other pointer.
     
    836836\end{cfa}
    837837\caption{DCASW Sequential}
    838 \label{s:swap}
     838\label{f:seqSwap}
    839839\end{figure}
    840840
     
    895895First it is important to state that a thief will not attempt to steal from themselves.
    896896As such, the victim here is not also a thief.
    897 Stepping through the code in \ref{c:swap}, for all thieves steps 0-1 succeed since the victim is not stealing and will have no queue pointers set to be @0p@.
     897Stepping through the code in Figure~\ref{f:dcaswImpl}, steps 0--1 succeed for all thieves, since the victim is not stealing and has no queue pointers set to @0p@.
    898898Similarly for all thieves step 2 will succeed since no one is stealing from any of the thieves.
    899899In step 3 the first thief to @CAS@ will win the race and successfully swap the queue pointer.
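The walkthrough above can be condensed into a standalone C sketch, using hypothetical names and a placeholder queue type; the thesis's actual implementation is the \CFA code in Figure~\ref{f:dcaswImpl}:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct queue queue;   // hypothetical stand-in for a work queue

// Null-intermediate two-pointer swap (DCASW) between a thief's queue
// slot and a victim's queue slot, built from single-word atomics.
static bool dcasw( _Atomic(queue *) * my_slot, _Atomic(queue *) * vic_slot ) {
    queue * vic = atomic_load( vic_slot );            // step 0: copy victim pointer
    if ( vic == NULL ) return false;                  // step 1: victim mid-swap, give up
    queue * mine = atomic_exchange( my_slot, NULL );  // step 2: publish null intermediate
    if ( atomic_compare_exchange_strong( vic_slot, &vic, mine ) ) {
        atomic_store( my_slot, vic );                 // step 3: won the race, finish swap
        return true;
    }
    atomic_store( my_slot, mine );                    // lost the race: restore own slot
    return false;
}
```

On failure, both slots are left with their original values, matching the requirement that a failed \gls{dcasw} has no effect.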
     
    991991
    992992The longest-victim heuristic maintains a timestamp per executor thread that is updated every time a worker attempts to steal work.
    993 The timestamps are generated using @rdtsc@~\cite{} and are stored in a shared array, with one index per worker.
     993The timestamps are generated using @rdtsc@~\cite{IntelManual} and are stored in a shared array, with one index per worker.
    994994Thieves then attempt to steal from the worker with the oldest timestamp.
    995995The intuition behind this heuristic is that the slowest worker will receive help via work stealing until it becomes a thief, which indicates that it has caught up to the pace of the rest of the workers.
    996 This heuristic means that if two thieves look to steal at the same time, they likely attempt to steal from the same victim.
     996This heuristic should ideally lower the latency of message sends to victim workers that are overloaded with work.
     997However, a side-effect of this heuristic is that if two thieves look to steal at the same time, they likely attempt to steal from the same victim.
    997998This approach consequently does increase the chance of contention among thieves;
    998999however, given that workers have multiple queues, often in the tens or hundreds of queues, it is rare for two thieves to attempt stealing from the same queue.
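As a concrete illustration, victim selection under this heuristic can be sketched in C; the names and array layout here are illustrative, not the thesis's \CFA code:

```c
#include <stdint.h>
#include <stddef.h>

// Pick the victim with the oldest steal-attempt timestamp.  Each entry
// of stamp[] holds the rdtsc value recorded the last time that worker
// attempted to steal; a thief never selects itself.  Assumes nworkers >= 2.
static size_t longest_victim( const uint64_t stamp[], size_t nworkers, size_t self ) {
    size_t victim = (self == 0) ? 1 : 0;              // any worker other than self
    for ( size_t i = 0; i < nworkers; i += 1 )
        if ( i != self && stamp[i] < stamp[victim] )
            victim = i;
    return victim;
}
```

Two thieves reading the array at the same time compute the same victim, which is the contention side-effect noted above.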
     
    10841085The performance of \CFA's actor system is tested using a suite of microbenchmarks, and compared with other actor systems.
    10851086Most of the benchmarks are the same as those presented in \cite{Buhr22}, with a few additions.
    1086 % C_TODO cite actor paper
    10871087This work compares with the following actor systems: \CFA 1.0, \uC 7.0.0, Akka Typed 2.7.0, CAF 0.18.6, and ProtoActor-Go v0.0.0-20220528090104-f567b547ea07.
    10881088Akka Classic is omitted as Akka Typed is their newest version and seems to be the direction they are headed.
     
    10961096
    10971097The benchmarks are run on 1--48 cores.
    1098 On the Intel, with 24 core sockets, there is the choice to either hopping sockets or using hyperthreads on the same socket.
     1098On the Intel, with 24 core sockets, there is the choice to either hop sockets or use hyperthreads on the same socket.
    10991099Either choice causes a blip in performance, which is seen in the subsequent performance graphs.
    1100 The choice is to use hyperthreading instead of hopping sockets for experiments with more than 24 cores.
     1100The choice in this work is to use hyperthreading instead of hopping sockets for experiments with more than 24 cores.
    11011101
    11021102All benchmarks are run 5 times and the median is taken.
     
    11591159However, Akka and ProtoActor slow down by two orders of magnitude.
    11601160This difference is likely a result of Akka and ProtoActor's garbage collection, which results in performance delays for allocation-heavy workloads, whereas \uC and \CFA have explicit allocation/deallocation.
    1161 Tuning the garage collection might reduce garbage-collection cost, but this exercise is beyond the scope of this work.
     1161Turning off the garbage collection might reduce garbage-collection cost, but this exercise is beyond the scope of this work.
    11621162
    11631163\subsection{Executor}\label{s:executorPerf}
     
    12091209It stresses the executor's ability to withstand contention on queues.
    12101210The repeat benchmark fans out messages from a single client to 100,000 servers, which then respond to the client.
    1211 The scatter and gather are repeats 200 times.
     1211This scatter and gather is repeated 200 times.
    12121212The messages from the servers to the client all come to the same mailbox queue associated with the client, resulting in high contention among servers.
    12131213As such, this benchmark does not scale with the number of processors, since more processors result in higher contention on the single mailbox queue.
     
    12191219on the Intel, uC++, ProtoActor, and Akka are spread out.
    12201220Finally, \CFA runs consistently on both of the AMD and Intel, and is faster than \uC on the AMD, but slightly slower on the Intel.
    1221 Here, gains from using the copy queue are much less apparent.
     1221This benchmark is a pathological case for work stealing actor systems, as the majority of work is being performed by the single actor conducting the scatter/gather.
     1222The impact of work stealing on this benchmark is discussed further in Section~\ref{s:steal_perf}.
     1223Here, gains from using the copy queue are much less apparent, due to the costs of stealing.
    12221224
    12231225\begin{table}
     
    12581260Given that the bottleneck of this benchmark is the computation of the result matrix, it follows that the results are tightly clustered across all actor systems.
    12591261\uC and \CFA have identical performance, and in Figure~\ref{f:MatrixIntel} \uC pulls ahead of \CFA after 24 cores, likely due to costs associated with work stealing while hyperthreading.
    1260 As mentioned in \ref{s:executorPerf}, it is hypothesized that CAF performs better in this benchmark compared to others due to its eager work stealing implementation.
     1262It is hypothesized that CAF performs better in this benchmark compared to others due to its eager work stealing implementation, which will be discussed further in Section~\ref{s:steal_perf}.
    12611263
    12621264\begin{figure}
     
    12731275\end{figure}
    12741276
    1275 \subsection{Work Stealing}
     1277\subsection{Work Stealing}\label{s:steal_perf}
    12761278
    12771279\CFA's work stealing mechanism uses the longest-victim heuristic, introduced in Section~\ref{s:victimSelect}.
     
    13521354
    13531355This result is shown in Figure~\ref{f:cfaRepeatAMD} and \ref{f:cfaRepeatIntel} where the no-stealing version of \CFA performs better than both stealing variations.
     1356As mentioned earlier, the repeat benchmark is a pathological case for work stealing systems since there is one actor with the majority of the work, and not enough other work to go around.
     1357If that actor or its mail queue is stolen by the work stealing system, it incurs a huge cost to move the work, as the single actor touches a lot of memory and needs to refill its local cache.
     1358This steal is likely to happen since there is little other work in the system between scatter/gather rounds.
    13541359In particular on the Intel machine in Figure~\ref{f:cfaRepeatIntel}, the cost of stealing is higher, which can be seen in the vertical shift of Akka, CAF and \CFA results in Figure~\ref{f:RepeatIntel} (\uC and ProtoActor do not have work stealing).
    13551360The shift for CAF is particularly large, which further supports the hypothesis that CAF's work stealing is particularly eager.
     
    13601365This hypothesis stems from experimentation with \CFA.
    13611366CAF uses a randomized work stealing heuristic.
    1362 In \CFA if the system is tuned so that it steals work much more eagerly with a randomized it was able to replicate the results that CAF achieves in the matrix benchmark, but this tuning performed much worse on all other microbenchmarks that we present, since they all perform a small amount of work per message.
    1363 
     1367Tuning the \CFA actor system to steal work much more eagerly with randomized victim selection heuristics provided similar results to what CAF achieved in the matrix benchmark.
     1368This experimental tuning performed much worse on all other microbenchmarks that we present, since they all perform a small amount of work per message.
    13641369
    13651370In comparison with the other systems, \uC does well on the repeat benchmark since it does not have work stealing.
  • doc/theses/colby_parsons_MMAth/text/waituntil.tex

    ra2eb21a rc54ca97  
    501501Another difference between Go and \CFA is the order of clause selection when multiple clauses are available.
    502502Go "randomly" selects a clause, but \CFA chooses the clause in the order they are listed~\cite{go:select}.
    503 This \CFA design decision allows users to set implicit priorities, which can result in more predictable behaviour, and even better performance in certain cases, such as the case shown in  Table~\ref{}.
    504 If \CFA didn't have priorities, the performance difference in Table~\ref{} would be less significant since @P1@ and @C1@ would try to compete to operate on @B@ more often with random selection.
     503This \CFA design decision allows users to set implicit priorities, which can result in more predictable behaviour, and even better performance in certain cases, such as the case shown in Table~\ref{t:pathGo}.
     504If \CFA did not have priorities, the performance difference in Table~\ref{t:pathGo} would be less significant, since @P1@ and @C1@ would compete to operate on @B@ more often with random selection.
    505505
    506506\subsection{Future Benchmark}