Timestamp: Sep 6, 2022, 4:05:00 PM
Author: Thierry Delisle <tdelisle@…>
Branches: ADT, ast-experimental, master, pthread-emulation
Children: a44514e
Parents: 9f99799
Message: Merged Peter's last changes and filled in most of the TODOs

File:
1 edited

  • doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex

    r9f99799 r7a0f798b

    101 101 \caption[Cycle Benchmark : Pseudo Code]{Cycle Benchmark : Pseudo Code}
    102 102 \label{fig:cycle:code}
    103     %\end{figure}ll have a physical key so it's not urgent.
    104 103 \bigskip
    105     %\begin{figure}
    106 104 	\subfloat[][Throughput, 100 cycles per \proc]{
    107 105 		\resizebox{0.5\linewidth}{!}{
     
    401 399
    402 400 Figures~\ref{fig:churn:jax} and \ref{fig:churn:nasus} show the results for the churn experiment on Intel and AMD, respectively.
    403     Looking at the left column on Intel, Figures~\ref{fig:churn:jax:ops} and \ref{fig:churn:jax:ns} show the results for 100 \ats for each \proc have, and all runtimes obtain fairly similar throughput for most \proc counts.
        401 Looking at the left column on Intel, Figures~\ref{fig:churn:jax:ops} and \ref{fig:churn:jax:ns} show the results for 100 \ats for each \proc, and all runtimes obtain fairly similar throughput for most \proc counts.
    404 402 \CFA does very well on a single \proc but quickly loses its advantage over the other runtimes.
    405 403 As expected, it scales decently up to 48 \procs, drops from 48 to 72 \procs, and then plateaus.
     
    425 423 Libfibre follows very closely behind with basically the same performance and scaling.
    426 424 Tokio maintains effectively the same curve shapes as \CFA and libfibre, but it incurs extra costs for all \proc counts.
    427     % As a result it is slightly outperformed by \CFA and libfibre.
    428 425 While Go maintains overall similar results to the others, it again encounters significant variation at high \proc counts,
    429 426 inexplicably resulting in super-linear scaling for some runs, \ie the scalability curve displays a negative slope.
     
    497 494 It is also possible to unpark to a third unrelated ready-queue, but without additional knowledge about the situation, it is likely to degrade performance.}
    498 495 The locality experiment includes two variations of the churn benchmark, where a data array is added.
    499     In both variations, before @V@ing the semaphore, each \at increments random cells inside the data array by calling a @work@ function.
        496 In both variations, before @V@ing the semaphore, each \at calls a @work@ function which increments random cells inside the data array.
    500 497 In the noshare variation, the array is not passed on and each thread continuously accesses its private array.
    501 498 In the share variation, the array is passed to another thread via the semaphore's shadow-queue (each blocking thread can save a word of user data in its blocking node), transferring ownership of the array to the woken thread.
     
    506 503 In the noshare variation, unparking the \at on the local \proc is an appropriate choice since the data was last modified on that \proc.
    507 504 In the share variation, unparking the \at on a remote \proc is an appropriate choice.
    508     \todo{PAB: I changed these sentences around.}
    509 505
    510 506 The expectation for this benchmark is to see a performance inversion, where runtimes fare notably better in the variation which matches their unparking policy.
     
    720 716 This scenario is a harder case to handle because corrective measures must be taken even when work is available.
    721 717 Note, runtimes with preemption circumvent this problem by forcing the spinner to yield.
        718 In \CFA, preemption was disabled because it only obfuscates the results.
        719 I am not aware of a method to disable preemption in Go.
    722 720
    723 721 In both variations, the experiment effectively measures how long it takes for all \ats to run once after a given synchronization point.
     
    763 761 The semaphore variation is denoted ``Park'', where the number of \ats dwindles as the new leader is acknowledged.
    764 762 The yielding variation is denoted ``Yield''.
    765     The experiment is only run for many \procs, since scaling is not the focus of this experiment.
        763 The experiment is only run for few and many \procs, since scaling is not the focus of this experiment.
    766 764
    767 765 The first two columns show the results for the semaphore variation on Intel.
     
    771 769 Looking at the next two columns, the results for the yield variation on Intel, the story is very different.
    772 770 \CFA achieves better latencies, presumably due to no synchronization with the yield.
    773     \todo{PAB: what about \CFA preemption? How does that come into play for your scheduler?}
    774 771 Go does complete the experiment, but with drastically higher latency:
    775 772 latency at 2 \procs is $350\times$ higher than \CFA and $70\times$ higher at 192 \procs.
    776     This difference is because Go has a classic work-stealing scheduler, but it adds coarse-grain preemption\footnote{
    777     Preemption is done at the function prolog when the goroutine's stack is increasing;
    778     whereas \CFA uses fine-grain preemption between any two instructions.}
        773 This difference is because Go has a classic work-stealing scheduler, but it adds coarse-grain preemption
    779 774 , which interrupts the spinning leader after a period.
    780 775 Neither Libfibre nor Tokio completes the experiment.