Changeset c54ca97


Timestamp:
Jul 11, 2023, 9:59:20 AM
Author:
Peter A. Buhr <pabuhr@…>
Branches:
master
Children:
4c8ce47
Parents:
a2eb21a (diff), 39e6309 (diff)
Note: this is a merge changeset, the changes displayed below correspond to the merge itself.
Use the (diff) links above to see all the changes relative to each parent.
Message:

Merge branch 'master' of plg.uwaterloo.ca:software/cfa/cfa-cc

Location:
doc/theses/colby_parsons_MMAth
Files:
3 edited

  • doc/theses/colby_parsons_MMAth/local.bib

    ra2eb21a rc54ca97  
    196196  year={2004}
    197197}
     198
     199@manual{IntelManual,
     200    keywords    = {Intel},
     201    title       = {Intel 64 and IA-32 Architectures Software Developer’s Manual},
     202    version     = {Version 080},
     203    organization= {Intel},
     204    month       = mar,
     205    year        = 2023,
     206}
  • doc/theses/colby_parsons_MMAth/text/actors.tex

    ra2eb21a rc54ca97  
    752752The values swapped are never null pointers, so a null pointer can be used as an intermediate value during the swap.
    753753\end{enumerate}
    754 Figure~\ref{c:swap} shows the \CFA pseudocode for the \gls{dcasw}.
     754Figure~\ref{f:dcaswImpl} shows the \CFA pseudocode for the \gls{dcasw}.
    755755In detail, a thief performs the following steps to swap two pointers:
    756756\begin{enumerate}[start=0]
     
    765765Since each worker owns a disjoint range of the queue array, it is impossible for @my_queue@ to be null.
    766766Note, this algorithm is simplified due to each worker owning a disjoint range, allowing only the @vic_queue@ to be checked for null.
    767 This was not listed as a special case of this algorithm, since this requirement can be avoided by modifying Step 1 of Figure~\ref{c:swap} to also check @my_queue@ for null.
     767This was not listed as a special case of this algorithm, since this requirement can be avoided by modifying Step 1 of Figure~\ref{f:dcaswImpl} to also check @my_queue@ for null.
    768768Further discussion of this generalization is omitted since it is not needed for the presented application.
    769769\item
     
    811811\end{cfa}
    812812\caption{DCASW Concurrent}
    813 \label{c:swap}
     813\label{f:dcaswImpl}
    814814\end{figure}
    815815
     
    817817\gls{dcasw} is correct in both the success and failure cases.
    818818\end{theorem}
    819 To verify sequential correctness, Figure~\ref{s:swap} shows a simplified \gls{dcasw}.
     819To verify sequential correctness, Figure~\ref{f:seqSwap} shows a simplified \gls{dcasw}.
    820820Step 1 is missing in the sequential example since it only matters in the concurrent context.
    821821By inspection, the sequential swap copies each pointer being swapped, and then the original values of each pointer are reset using the copy of the other pointer.
     
    836836\end{cfa}
    837837\caption{DCASW Sequential}
    838 \label{s:swap}
     838\label{f:seqSwap}
    839839\end{figure}
    840840
     
    895895First it is important to state that a thief will not attempt to steal from themselves.
    896896As such, the victim here is not also a thief.
    897 Stepping through the code in \ref{c:swap}, for all thieves steps 0-1 succeed since the victim is not stealing and will have no queue pointers set to be @0p@.
     897Stepping through the code in Figure~\ref{f:dcaswImpl}, steps 0--1 succeed for all thieves, since the victim is not stealing and has no queue pointers set to @0p@.
    898898Similarly for all thieves step 2 will succeed since no one is stealing from any of the thieves.
    899899In step 3 the first thief to @CAS@ will win the race and successfully swap the queue pointer.
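The walkthrough above can be condensed into a standalone C sketch, using hypothetical names and a placeholder queue type; the thesis's actual implementation is the \CFA code in Figure~\ref{f:dcaswImpl}:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct queue queue;   // hypothetical stand-in for a work queue

// Null-intermediate two-pointer swap (DCASW) between a thief's queue
// slot and a victim's queue slot, built from single-word atomics.
static bool dcasw( _Atomic(queue *) * my_slot, _Atomic(queue *) * vic_slot ) {
    queue * vic = atomic_load( vic_slot );            // step 0: copy victim pointer
    if ( vic == NULL ) return false;                  // step 1: victim mid-swap, give up
    queue * mine = atomic_exchange( my_slot, NULL );  // step 2: publish null intermediate
    if ( atomic_compare_exchange_strong( vic_slot, &vic, mine ) ) {
        atomic_store( my_slot, vic );                 // step 3: won the race, finish swap
        return true;
    }
    atomic_store( my_slot, mine );                    // lost the race: restore own slot
    return false;
}
```

On failure, both slots are left with their original values, matching the requirement that a failed \gls{dcasw} has no effect.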
     
    991991
    992992The longest-victim heuristic maintains a timestamp per executor thread that is updated every time a worker attempts to steal work.
    993 The timestamps are generated using @rdtsc@~\cite{} and are stored in a shared array, with one index per worker.
     993The timestamps are generated using @rdtsc@~\cite{IntelManual} and are stored in a shared array, with one index per worker.
    994994Thieves then attempt to steal from the worker with the oldest timestamp.
    995995The intuition behind this heuristic is that the slowest worker will receive help via work stealing until it becomes a thief, which indicates that it has caught up to the pace of the rest of the workers.
    996 This heuristic means that if two thieves look to steal at the same time, they likely attempt to steal from the same victim.
     996This heuristic should ideally lower the latency of message sends to victim workers that are overloaded with work.
     997However, a side-effect of this heuristic is that if two thieves look to steal at the same time, they likely attempt to steal from the same victim.
    997998This approach consequently does increase the chance of contention among thieves;
    998999however, given that workers have multiple queues, often in the tens or hundreds of queues, it is rare for two thieves to attempt stealing from the same queue.
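As a concrete illustration, victim selection under this heuristic can be sketched in C; the names and array layout here are illustrative, not the thesis's \CFA code:

```c
#include <stdint.h>
#include <stddef.h>

// Pick the victim with the oldest steal-attempt timestamp.  Each entry
// of stamp[] holds the rdtsc value recorded the last time that worker
// attempted to steal; a thief never selects itself.  Assumes nworkers >= 2.
static size_t longest_victim( const uint64_t stamp[], size_t nworkers, size_t self ) {
    size_t victim = (self == 0) ? 1 : 0;              // any worker other than self
    for ( size_t i = 0; i < nworkers; i += 1 )
        if ( i != self && stamp[i] < stamp[victim] )
            victim = i;
    return victim;
}
```

Two thieves reading the array at the same time compute the same victim, which is the contention side-effect noted above.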
     
    10841085The performance of \CFA's actor system is tested using a suite of microbenchmarks, and compared with other actor systems.
    10851086Most of the benchmarks are the same as those presented in \cite{Buhr22}, with a few additions.
    1086 % C_TODO cite actor paper
    10871087This work compares with the following actor systems: \CFA 1.0, \uC 7.0.0, Akka Typed 2.7.0, CAF 0.18.6, and ProtoActor-Go v0.0.0-20220528090104-f567b547ea07.
    10881088Akka Classic is omitted as Akka Typed is their newest version and seems to be the direction they are headed.
     
    10961096
    10971097The benchmarks are run on 1--48 cores.
    1098 On the Intel, with 24 core sockets, there is the choice to either hopping sockets or using hyperthreads on the same socket.
     1098On the Intel, with 24 core sockets, there is the choice to either hop sockets or use hyperthreads on the same socket.
    10991099Either choice causes a blip in performance, which is seen in the subsequent performance graphs.
    1100 The choice is to use hyperthreading instead of hopping sockets for experiments with more than 24 cores.
     1100The choice in this work is to use hyperthreading instead of hopping sockets for experiments with more than 24 cores.
    11011101
    11021102All benchmarks are run 5 times and the median is taken.
     
    11591159However, Akka and ProtoActor slow down by two orders of magnitude.
    11601160This difference is likely a result of Akka and ProtoActor's garbage collection, which results in performance delays for allocation-heavy workloads, whereas \uC and \CFA have explicit allocation/deallocation.
    1161 Tuning the garage collection might reduce garbage-collection cost, but this exercise is beyond the scope of this work.
     1161Turning off the garbage collection might reduce garbage-collection cost, but this exercise is beyond the scope of this work.
    11621162
    11631163\subsection{Executor}\label{s:executorPerf}
     
    12091209It stresses the executor's ability to withstand contention on queues.
    12101210The repeat benchmark fans out messages from a single client to 100,000 servers, which then respond to the client.
    1211 The scatter and gather are repeats 200 times.
     1211This scatter and gather is repeated 200 times.
    12121212The messages from the servers to the client all come to the same mailbox queue associated with the client, resulting in high contention among servers.
    12131213As such, this benchmark does not scale with the number of processors, since more processors result in higher contention on the single mailbox queue.
     
    12191219on the Intel, uC++, ProtoActor, and Akka are spread out.
    12201220Finally, \CFA runs consistently on both of the AMD and Intel, and is faster than \uC on the AMD, but slightly slower on the Intel.
    1221 Here, gains from using the copy queue are much less apparent.
     1221This benchmark is a pathological case for work stealing actor systems, as the majority of work is being performed by the single actor conducting the scatter/gather.
     1222The impact of work stealing on this benchmark is discussed further in Section~\ref{s:steal_perf}.
     1223Here, gains from using the copy queue are much less apparent, due to the costs of stealing.
    12221224
    12231225\begin{table}
     
    12581260Given that the bottleneck of this benchmark is the computation of the result matrix, it follows that the results are tightly clustered across all actor systems.
    12591261\uC and \CFA have identical performance, and in Figure~\ref{f:MatrixIntel} \uC pulls ahead of \CFA after 24 cores, likely due to costs associated with work stealing while hyperthreading.
    1260 As mentioned in \ref{s:executorPerf}, it is hypothesized that CAF performs better in this benchmark compared to others due to its eager work stealing implementation.
     1262It is hypothesized that CAF performs better in this benchmark compared to others due to its eager work stealing implementation, which will be discussed further in Section~\ref{s:steal_perf}.
    12611263
    12621264\begin{figure}
     
    12731275\end{figure}
    12741276
    1275 \subsection{Work Stealing}
     1277\subsection{Work Stealing}\label{s:steal_perf}
    12761278
    12771279\CFA's work stealing mechanism uses the longest-victim heuristic, introduced in Section~\ref{s:victimSelect}.
     
    13521354
    13531355This result is shown in Figure~\ref{f:cfaRepeatAMD} and \ref{f:cfaRepeatIntel} where the no-stealing version of \CFA performs better than both stealing variations.
     1356As mentioned earlier, the repeat benchmark is a pathological case for work stealing systems since there is one actor with the majority of the work, and not enough other work to go around.
     1357If that actor or its mail queue is stolen by the work stealing system, it incurs a huge cost to move the work, as the single actor touches a lot of memory and needs to refill its local cache.
     1358This steal is likely to happen since there is little other work in the system between scatter/gather rounds.
    13541359In particular on the Intel machine in Figure~\ref{f:cfaRepeatIntel}, the cost of stealing is higher, which can be seen in the vertical shift of Akka, CAF and \CFA results in Figure~\ref{f:RepeatIntel} (\uC and ProtoActor do not have work stealing).
    13551360The shift for CAF is particularly large, which further supports the hypothesis that CAF's work stealing is particularly eager.
     
    13601365This hypothesis stems from experimentation with \CFA.
    13611366CAF uses a randomized work stealing heuristic.
    1362 In \CFA if the system is tuned so that it steals work much more eagerly with a randomized it was able to replicate the results that CAF achieves in the matrix benchmark, but this tuning performed much worse on all other microbenchmarks that we present, since they all perform a small amount of work per message.
    1363 
     1367Tuning the \CFA actor system to steal work much more eagerly with randomized victim selection heuristics provided similar results to what CAF achieved in the matrix benchmark.
     1368This experimental tuning performed much worse on all other microbenchmarks that we present, since they all perform a small amount of work per message.
    13641369
    13651370In comparison with the other systems, \uC does well on the repeat benchmark since it does not have work stealing.
  • doc/theses/colby_parsons_MMAth/text/waituntil.tex

    ra2eb21a rc54ca97  
    501501Another difference between Go and \CFA is the order of clause selection when multiple clauses are available.
    502502Go "randomly" selects a clause, but \CFA chooses the clause in the order they are listed~\cite{go:select}.
    503 This \CFA design decision allows users to set implicit priorities, which can result in more predictable behaviour, and even better performance in certain cases, such as the case shown in  Table~\ref{}.
    504 If \CFA didn't have priorities, the performance difference in Table~\ref{} would be less significant since @P1@ and @C1@ would try to compete to operate on @B@ more often with random selection.
     503This \CFA design decision allows users to set implicit priorities, which can result in more predictable behaviour, and even better performance in certain cases, such as the case shown in Table~\ref{t:pathGo}.
     504If \CFA did not have priorities, the performance difference in Table~\ref{t:pathGo} would be less significant, since @P1@ and @C1@ would compete to operate on @B@ more often with random selection.
    505505
    506506\subsection{Future Benchmark}