- Timestamp: Nov 24, 2022, 3:41:44 PM
- Branches: ADT, ast-experimental, master
- Children: dacd8e6e
- Parents: 82a90d4
- File: 1 edited
Legend:
- Unmodified lines are shown with a leading space
- Removed lines are prefixed with -
- Added lines are prefixed with +
doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex
--- r82a90d4
+++ rddcaff6
@@ -4,7 +4,7 @@
 This chapter presents five different experimental setups for evaluating the basic features of the \CFA, libfibre~\cite{libfibre}, Go, and Tokio~\cite{Tokio} schedulers.
 All of these systems have a \gls{uthrding} model.
-The goal of this chapter is to show that the \CFA scheduler obtains equivalent performance to other, less fair, schedulers through the different experiments.
+The goal of this chapter is to show, through the different experiments, that the \CFA scheduler obtains equivalent performance to other schedulers with lesser fairness guarantees.
 Note that only the code of the \CFA tests is shown;
-all tests in the other systems are functionally identical and available online~\cite{GITHUB:SchedulingBenchmarks}.
+all tests in the other systems are functionally identical and available both online~\cite{GITHUB:SchedulingBenchmarks} and submitted to UWSpace with the thesis itself.

 \section{Benchmark Environment}\label{microenv}
@@ -129,5 +129,6 @@
 \caption[Cycle Benchmark on Intel]{Cycle Benchmark on Intel\smallskip\newline Throughput and scalability as a function of \proc count, 5 \ats per cycle, and different cycle counts.
 For throughput, higher is better, for scalability, lower is better.
-Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
+Each series represents 15 independent runs.
+The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
 \label{fig:cycle:jax}
 \end{figure}
@@ -161,5 +162,6 @@
 \caption[Cycle Benchmark on AMD]{Cycle Benchmark on AMD\smallskip\newline Throughput and scalability as a function of \proc count, 5 \ats per cycle, and different cycle counts.
 For throughput, higher is better, for scalability, lower is better.
-Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
+Each series represents 15 independent runs.
+The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
 \label{fig:cycle:nasus}
 \end{figure}
@@ -177,5 +179,5 @@
 Looking next at the right column on Intel, Figures~\ref{fig:cycle:jax:low:ops} and \ref{fig:cycle:jax:low:ns} show the results for 1 cycle of 5 \ats for each \proc.
 \CFA and Tokio obtain very similar results overall, but Tokio shows more variations in the results.
-Go achieves slightly better performance than \CFA and Tokio, but all three display significantly worst performance compared to the left column.
+Go achieves slightly better performance than \CFA and Tokio, but all three display significantly worse performance compared to the left column.
 This decrease in performance is likely due to the additional overhead of the idle-sleep mechanism.
 This can either be the result of \procs actually running out of work or simply additional overhead from tracking whether or not there is work available.
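For context on the hunks above: the cycle benchmark arranges several \ats in a ring where each \at unblocks its successor and then blocks itself, so roughly one \at per ring is ready at any time. The thesis's tests are written in \CFA and live in the benchmark suite cited in the first hunk; the sketch below is only a hypothetical Go analogue of that shape, and the names (oneRing, ats, laps) are inventions of the sketch, not taken from the suite.

{{{#!go
// Hypothetical Go analogue of the cycle benchmark; the real tests are
// written in \CFA. Each goroutine blocks on its own channel and
// unblocks its successor, so a single token circulates per ring.
package main

import (
	"fmt"
	"sync"
	"time"
)

// oneRing passes a token `laps` full trips around a ring of `ats`
// goroutines, then circulates it one extra lap so everyone can exit.
func oneRing(ats, laps int) {
	chans := make([]chan int, ats)
	for i := range chans {
		chans[i] = make(chan int, 1)
	}
	limit := ats * laps
	var wg sync.WaitGroup
	for i := 0; i < ats; i++ {
		wg.Add(1)
		go func(in, out chan int) {
			defer wg.Done()
			for tok := range in {
				if tok >= limit { // shutdown lap: forward once, exit
					out <- tok
					return
				}
				out <- tok + 1 // unblock the successor, then block again
			}
		}(chans[i], chans[(i+1)%ats])
	}
	chans[0] <- 0 // seed a single token
	wg.Wait()
}

func main() {
	const ats, laps = 5, 100_000 // 5 \ats per cycle, as in the figures
	start := time.Now()
	oneRing(ats, laps)
	fmt.Printf("~%v per hand-off\n",
		time.Since(start)/time.Duration(ats*laps))
}
}}}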
@@ -185,5 +187,5 @@
 Looking now at the results for the AMD architecture, Figure~\ref{fig:cycle:nasus}, the results are overall similar to the Intel results, but with close to double the performance, slightly increased variation, and some differences in the details.
 Note the maximum of the Y-axis on Intel and AMD differ significantly.
-Looking at the left column on AMD, Figures~\ref{fig:cycle:nasus:ops} and \ref{fig:cycle:nasus:ns} all 4 runtimes achieve very similar throughput and scalability.
+Looking at the left column on AMD, Figures~\ref{fig:cycle:nasus:ops} and \ref{fig:cycle:nasus:ns}, all 4 runtimes achieve very similar throughput and scalability.
 However, as the number of \procs grows higher, the results on AMD show notably more variability than on Intel.
 The different performance improvements and plateaus are due to cache topology and appear at the expected \proc counts of 64, 128 and 192, for the same reasons as on Intel.
@@ -191,10 +193,10 @@
 This result is different than on Intel, where Tokio behaved like \CFA rather than behaving like Go.
 Again, the same performance increase for libfibre is visible when running fewer \ats.
-Note, I did not investigate the libfibre performance boost for 1 cycle in this experiment.
+I did not investigate the libfibre performance boost for 1 cycle in this experiment.

 The conclusion from both architectures is that all of the compared runtimes have fairly equivalent performance for this micro-benchmark.
 Clearly, the pathological case with 1 cycle per \proc can affect fairness algorithms managing mostly idle processors, \eg \CFA, but only at high core counts.
 In this case, \emph{any} helping is likely to cause a cascade of \procs running out of work and attempting to steal.
-For this experiment, the \CFA scheduler has achieved the goal of obtaining equivalent performance to other, less fair, schedulers.
+For this experiment, the \CFA scheduler has achieved the goal of obtaining equivalent performance to other schedulers with lesser fairness guarantees.

 \section{Yield}
@@ -250,5 +252,6 @@
 \caption[Yield Benchmark on Intel]{Yield Benchmark on Intel\smallskip\newline Throughput and scalability as a function of \proc count.
 For throughput, higher is better, for scalability, lower is better.
-Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
+Each series represents 15 independent runs.
+The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
 \label{fig:yield:jax}
 \end{figure}
@@ -309,5 +312,6 @@
 \caption[Yield Benchmark on AMD]{Yield Benchmark on AMD\smallskip\newline Throughput and scalability as a function of \proc count.
 For throughput, higher is better, for scalability, lower is better.
-Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
+Each series represents 15 independent runs.
+The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
 \label{fig:yield:nasus}
 \end{figure}
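The hunks above and below edit the Yield section, whose benchmark replaces the blocking hand-off of the cycle benchmark with a bare @yield@. A minimal, hypothetical Go analogue follows (again, not the thesis's \CFA code); note that Go's runtime.Gosched() is merely a rescheduling hint, which is precisely the kind of divergence in @yield@ semantics the edited text cautions about.

{{{#!go
// Hypothetical Go analogue of the yield benchmark: several goroutines
// per proc spin on runtime.Gosched() and count completed yields.
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	const procs = 4
	const ats = 5 * procs // several yielding goroutines per proc
	runtime.GOMAXPROCS(procs)

	var ops atomic.Uint64
	stop := make(chan struct{})
	var wg sync.WaitGroup
	for i := 0; i < ats; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				select {
				case <-stop:
					return
				default:
					runtime.Gosched() // the yield under test
					ops.Add(1)
				}
			}
		}()
	}
	time.Sleep(time.Second)
	close(stop)
	wg.Wait()
	fmt.Printf("%d yields per second\n", ops.Load())
}
}}}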
@@ -317,5 +321,5 @@
 Looking at the left column first, Figures~\ref{fig:yield:nasus:ops} and \ref{fig:yield:nasus:ns}, \CFA achieves very similar throughput and scaling.
 Libfibre still outpaces all other runtimes, but it encounters a performance hit at 64 \procs.
-This anomaly suggests some amount of communication between the \procs that the Intel machine is able to mask where the AMD is not once hyperthreading is needed.
+This anomaly suggests some amount of communication between the \procs that the Intel machine is able to mask where the AMD is not, once hyperthreading is needed.
 Go and Tokio still display the same performance collapse as on Intel.
 Looking next at the right column on AMD, Figures~\ref{fig:yield:nasus:low:ops} and \ref{fig:yield:nasus:low:ns}, all runtime systems effectively behave the same as they did on the Intel machine.
@@ -324,5 +328,5 @@

 It is difficult to draw conclusions for this benchmark when runtime systems treat @yield@ so differently.
-The win for \CFA is its consistency between the cycle and yield benchmarks making it simpler for programmers to use and understand, \ie the \CFA semantics match with programmer intuition.
+The win for \CFA is its consistency between the cycle and yield benchmarks, making it simpler for programmers to use and understand, \ie the \CFA semantics match with programmer intuition.

@@ -333,6 +337,6 @@

 The Churn benchmark represents more chaotic executions, where there is more communication among \ats but no relationship between the last \proc on which a \at ran and blocked, and the \proc that subsequently unblocks it.
-With processor-specific ready-queues, when a \at is unblocked by a different \proc that means the unblocking \proc must either ``steal'' the \at from another processor or find it on a remote queue.
+With processor-specific ready-queues, when a \at is unblocked by a different \proc, that means the unblocking \proc must either ``steal'' the \at from another processor or find it on a remote queue.
 This dequeuing results in either contention on the remote queue and/or \glspl{rmr} on the \at data structure.
-Hence, this benchmark has performance dominated by the cache traffic as \procs are constantly accessing each other's data.
+Hence, this benchmark has performance dominated by the cache traffic as \procs are constantly accessing each others' data.
 In either case, this benchmark aims to measure how well a scheduler handles these cases since both cases can lead to performance degradation if not handled correctly.
@@ -392,5 +396,6 @@
 \caption[Churn Benchmark on Intel]{Churn Benchmark on Intel\smallskip\newline Throughput and scalability as a function of \proc count.
 For throughput, higher is better, for scalability, lower is better.
-Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
+Each series represents 15 independent runs.
+The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
 \label{fig:churn:jax}
 \end{figure}
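One plausible shape for the churn pattern just described, where the \proc that unblocks a \at has no relation to the \proc the \at last ran on, is sketched below in Go. It is an illustration only, not the benchmark's actual structure: buffered channels stand in for counting semaphores, and the constants and shutdown channel are inventions of this sketch.

{{{#!go
// A plausible Go shape for the churn pattern: V a random semaphore,
// waking an arbitrary blocked goroutine, then P a random semaphore.
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	const ats, nsems = 8, 4
	sems := make([]chan struct{}, nsems)
	for i := range sems {
		// capacity 2*ats bounds the tokens ever in flight, so a V
		// (send) can never block; only a P (receive) can
		sems[i] = make(chan struct{}, 2*ats)
	}
	for i := 0; i < ats; i++ { // one token per goroutine to start
		sems[i%nsems] <- struct{}{}
	}

	var ops atomic.Uint64
	stop := make(chan struct{})
	var wg sync.WaitGroup
	for i := 0; i < ats; i++ {
		wg.Add(1)
		go func(seed int64) {
			defer wg.Done()
			r := rand.New(rand.NewSource(seed)) // per-goroutine RNG
			for {
				sems[r.Intn(nsems)] <- struct{}{} // V: unblock someone
				select {
				case <-sems[r.Intn(nsems)]: // P: the waker may run on
					ops.Add(1) // any \proc, unrelated to ours
				case <-stop:
					return
				}
			}
		}(int64(i))
	}
	time.Sleep(time.Second)
	close(stop)
	wg.Wait()
	fmt.Printf("%d V/P pairs per second\n", ops.Load())
}
}}}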
@@ -404,11 +409,11 @@
 Tokio achieves very similar performance to \CFA, with the starting boost, scaling decently until 48 \procs, drops from 48 to 72 \procs, and starts increasing again to 192 \procs.
 Libfibre obtains effectively the same results as Tokio with slightly less scaling, \ie the scaling curve is the same but with slightly lower values.
-Finally, Go gets the most peculiar results, scaling worst than other runtimes until 48 \procs.
+Finally, Go gets the most peculiar results, scaling worse than other runtimes until 48 \procs.
 At 72 \procs, the results of the Go runtime vary significantly, sometimes scaling sometimes plateauing.
 However, beyond this point Go keeps this level of variation but does not scale further in any of the runs.

-Throughput and scalability are notably worst for all runtimes than the previous benchmarks since there is inherently more communication between processors.
+Throughput and scalability are notably worse for all runtimes than the previous benchmarks since there is inherently more communication between processors.
 Indeed, none of the runtimes reach 40 million operations per second while in the cycle benchmark all but libfibre reached 400 million operations per second.
 Figures~\ref{fig:churn:jax:ns} and \ref{fig:churn:jax:low:ns} show that for all \proc counts, all runtimes produce poor scaling.
-However, once the number of \glspl{hthrd} goes beyond a single socket, at 48 \procs, scaling goes from bad to worst and performance completely ceases to improve.
+However, once the number of \glspl{hthrd} goes beyond a single socket, at 48 \procs, scaling goes from bad to worse and performance completely ceases to improve.
 At this point, the benchmark is dominated by inter-socket communication costs for all runtimes.
@@ -457,5 +462,6 @@
 \caption[Churn Benchmark on AMD]{Churn Benchmark on AMD\smallskip\newline Throughput and scalability as a function of \proc count.
 For throughput, higher is better, for scalability, lower is better.
-Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
+Each series represents 15 independent runs.
+The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
 \label{fig:churn:nasus}
 \end{figure}
@@ -600,5 +606,6 @@
 \caption[Locality Benchmark on Intel]{Locality Benchmark on Intel\smallskip\newline Throughput and scalability as a function of \proc count.
 For throughput, higher is better, for scalability, lower is better.
-Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
+Each series represents 15 independent runs.
+The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
 \label{fig:locality:jax}
 \end{figure}
@@ -632,5 +639,6 @@
 \caption[Locality Benchmark on AMD]{Locality Benchmark on AMD\smallskip\newline Throughput and scalability as a function of \proc count.
 For throughput, higher is better, for scalability, lower is better.
-Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
+Each series represents 15 independent runs.
+The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
 \label{fig:locality:nasus}
 \end{figure}
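All of the edited captions share one plotting convention: each series is 15 independent runs, drawn as dashed (maximum), solid (median) and dotted (minimum) lines. A small helper shows how one series reduces to those three plotted values; the input numbers are placeholders for illustration, not measurements from the thesis.

{{{#!go
// How each plotted series reduces to the three lines described in
// the captions: minimum (dotted), median (solid), maximum (dashed).
package main

import (
	"fmt"
	"sort"
)

func seriesStats(runs []float64) (min, median, max float64) {
	s := append([]float64(nil), runs...) // sort a copy
	sort.Float64s(s)
	min, max = s[0], s[len(s)-1]
	if n := len(s); n%2 == 1 {
		median = s[n/2]
	} else {
		median = (s[n/2-1] + s[n/2]) / 2
	}
	return
}

func main() {
	// 15 independent runs at one \proc count (placeholder values)
	runs := []float64{391, 402, 399, 404, 410, 395, 401, 403,
		398, 400, 405, 397, 396, 408, 394}
	lo, mid, hi := seriesStats(runs)
	fmt.Printf("min %.0f, median %.0f, max %.0f\n", lo, mid, hi)
}
}}}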
@@ -648,4 +656,4 @@
 Go still has the same poor performance as on Intel.

-Finally looking at the right column, Figures~\ref{fig:locality:nasus:noshare:ops} and \ref{fig:locality:nasus:noshare:ns}, like on Intel, the same performance inversion is present between libfibre and \CFA/Tokio.
+Finally, looking at the right column, Figures~\ref{fig:locality:nasus:noshare:ops} and \ref{fig:locality:nasus:noshare:ns}, like on Intel, the same performance inversion is present between libfibre and \CFA/Tokio.
 Go still has the same poor performance.
@@ -751,4 +759,5 @@
 \end{centering}
 \caption[Transfer Benchmark on Intel and AMD]{Transfer Benchmark on Intel and AMD\smallskip\newline Average measurement of how long it takes for all \ats to acknowledge the leader \at.
+For each runtime, the average is calculated over 100'000 transfers, except for Go which only has 1000 transfer (due to the difference in transfer time).
 DNC stands for ``did not complete'', meaning that after 5 seconds of a new leader being decided, some \ats still had not acknowledged the new leader.}
 \label{fig:transfer:res}
@@ -764,5 +773,5 @@

 The first two columns show the results for the semaphore variation on Intel.
-While there are some differences in latencies, \CFA is consistently the fastest and Tokio the slowest, all runtimes achieve fairly close results.
-Again, this experiment is meant to highlight major differences so latencies within $10\times$ of each other are considered equal.
+While there are some differences in latencies, with \CFA consistently the fastest and Tokio the slowest, all runtimes achieve fairly close results.
+Again, this experiment is meant to highlight major differences, so latencies within $10\times$ of each other are considered equal.
 Looking at the next two columns, the results for the yield variation on Intel, the story is very different.
@@ -780,5 +789,5 @@
 Neither Libfibre nor Tokio complete the experiment.

-This experiment clearly demonstrates that \CFA achieves significantly better fairness.
+This experiment clearly demonstrates that \CFA achieves a stronger fairness guarantee.
 The semaphore variation serves as a control, where all runtimes are expected to transfer leadership fairly quickly.
 Since \ats block after acknowledging the leader, this experiment effectively measures how quickly \procs can steal \ats from the \proc running the leader.
@@ -790,5 +799,5 @@
 Without \procs stealing from the \proc running the leader, the experiment cannot terminate.
 Go manages to complete the experiment because it adds preemption on top of classic work-stealing.
-However, since preemption is fairly infrequent, it achieves significantly worst performance.
+However, since preemption is fairly infrequent, it achieves significantly worse performance.
 In contrast, \CFA achieves equivalent performance in both variations, demonstrating very good fairness.
 Interestingly \CFA achieves better delays in the yielding version than the semaphore version, however, that is likely due to fairness being equivalent but removing the cost of the semaphores and idle sleep.
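The final hunks concern the transfer benchmark, in which a leader \at is repeatedly chosen and the time until all other \ats acknowledge it is measured, with the non-leaders either blocking on a semaphore or spinning on @yield@. The sketch below is a hypothetical Go rendering of the yield variation with a fixed leader (the real benchmark rotates leadership and also has the semaphore variation); it illustrates why, without preemption, spinning workers can starve and produce the DNC outcome described in the caption.

{{{#!go
// Hypothetical, reduced rendering of the transfer benchmark's yield
// variation: a fixed leader publishes a "generation" and waits until
// every worker acknowledges it. Workers never block, so a
// work-stealing scheduler without preemption may never run them all.
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
	"time"
)

func main() {
	const workers = 7
	var gen atomic.Uint64                  // generation the leader publishes
	acks := make([]atomic.Uint64, workers) // last generation each worker saw

	for i := range acks {
		go func(me int) {
			for {
				acks[me].Store(gen.Load()) // acknowledge the leader
				runtime.Gosched()          // yield variation: spin, never block
			}
		}(i)
	}

	for round := uint64(1); round <= 3; round++ {
		start := time.Now()
		gen.Store(round)      // publish a new leader round
		for i := range acks { // wait for every acknowledgement
			for acks[i].Load() != round {
				runtime.Gosched()
			}
		}
		fmt.Printf("round %d acknowledged in %v\n", round, time.Since(start))
	}
}
}}}

Go completes this experiment only because its runtime adds preemption on top of classic work stealing, matching the explanation in the edited text above.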