Index: doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex
===================================================================
--- doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex	(revision 36cc24a9ce0e8d7cf82385365ff579f4bd309003)
+++ doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex	(revision ff370d819bd9bbaa7302643a56d716c7f6b3105e)
@@ -2,9 +2,12 @@
 
 The first step in evaluating this work is to test small controlled cases to ensure the basics work properly.
-This chapter presents five different experimental setups, evaluating the basic features of the \CFA, libfibre~\cite{libfibre}, Go, and Tokio~\cite{Tokio} schedulers.
+This chapter presents five different experimental setups for evaluating the basic features of the \CFA, libfibre~\cite{libfibre}, Go, and Tokio~\cite{Tokio} schedulers.
 All of these systems have a \gls{uthrding} model.
-Note, all tests in each system are functionally identical and available online~\cite{SchedulingBenchmarks}.
+The goal in this chapter is show the \CFA scheduler obtains equivalent performance to other less fair schedulers through the different experiments.
+Note, only the code of the \CFA tests is shown;
+all tests in the other systems are functionally identical and available online~\cite{SchedulingBenchmarks}.
 
 \section{Benchmark Environment}\label{microenv}
+
 All benchmarks are run on two distinct hardware platforms.
 \begin{description}
@@ -25,8 +28,30 @@
 If more \glspl{hthrd} are needed, then 1 NUMA node with hyperthreading is used.
 If still more \glspl{hthrd} are needed, then the experiment is limited to as few NUMA nodes as needed.
+For the Intel machine, this means that from 1 to 24 \procs one socket and \emph{no} hyperthreading is used, and from 25 to 48 \procs still only one socket is used but \emph{with} hyperthreading.
+This pattern is repeated between 49 and 96, between 97 and 144, and between 145 and 192.
+On AMD, the same algorithm is used, but the machine only has 2 sockets.
+So hyperthreading\footnote{
+Hyperthreading normally refers specifically to the technique used by Intel, however it is often used generically to refer to any equivalent feature.}
+is used when the \proc count reach 65 and 193.
 
 The limited sharing of the last-level cache on the AMD machine is markedly different than the Intel machine.
 Indeed, while on both architectures L2 cache misses that are served by L3 caches on a different CPU incur a significant latency, on the AMD it is also the case that cache misses served by a different L3 instance on the same CPU also incur high latency.
 
+\section{Experimental setup}
+
+Each experiment is run 15 times varying the number of processors depending on the two different computers.
+All experiments gather throughput data and secondary data for scalability or latency.
+The data is graphed using a solid and two dashed lines representing the median, maximum and minimum result respectively, where the minimum/maximum lines are referred to as the \emph{extremes}.\footnote{
+An alternative display is to use error bars with min/max as the bottom/top for the bar.
+However, this approach is not truly an error bar around a mean value and I felt the connected lines are easier to read.}
+This graph presentation offers an overview of the distribution of the results for each experiment.
+
+For each experiment, four graphs are generated showing traditional throughput on the top row and \newterm{scalability} or \newterm{latency} on the bottom row (peek ahead to Figure~\ref{fig:cycle:jax}).
+Scalability uses the same data as throughput but the Y axis is calculated as the number of \procs over the throughput.
+In this representation, perfect scalability should appear as a horizontal line, \eg, if doubling the number of \procs doubles the throughput, then the relation stays the same.
+
+The left column shows results for 100 cycles per \proc, enough cycles to always keep every \proc busy.
+The right column shows results for 1 cycle per \proc, where the ready queues are expected to be near empty most of the time.
+The distinction between 100 and 1 cycles is meaningful because the idle sleep subsystem is expected to matter only in the right column, where spurious effects can cause a \proc to run out of work temporarily.
 
 \section{Cycling latency}
@@ -36,5 +61,5 @@
 However, yielding can be treated as a special case by optimizing it away since the number of ready \ats does not change.
 Hence, systems that perform this optimization have an artificial performance benefit because the yield becomes a \emph{nop}.
-For this reason, I chose a different first benchmark, called \newterm{Cycle Benchmark}.
+For this reason, I designed a different push/pop benchmark, called \newterm{Cycle Benchmark}.
 This benchmark arranges a number of \ats into a ring, as seen in Figure~\ref{fig:cycle}, where the ring is a circular singly-linked list.
 At runtime, each \at unparks the next \at before parking itself.
@@ -58,5 +83,5 @@
 Finally, to further mitigate any underlying push/pop optimizations, especially on SMP machines, multiple rings are created in the experiment.
 
-Figure~\ref{fig:cycle:code} shows the pseudo code for this benchmark.
+Figure~\ref{fig:cycle:code} shows the pseudo code for this benchmark, where each cycle has 5 \ats.
 There is additional complexity to handle termination (not shown), which requires a binary semaphore or a channel instead of raw @park@/@unpark@ and carefully picking the order of the @P@ and @V@ with respect to the loop condition.
 
@@ -110,14 +135,4 @@
 \end{figure}
 
-\subsection{Results}
-
-Figures~\ref{fig:cycle:jax} and~\ref{fig:cycle:nasus} show the throughput as a function of \proc count on Intel and AMD respectively, where each cycle has 5 \ats.
-The graphs show traditional throughput on the top row and \newterm{scalability} on the bottom row.
-Scalability uses the same data as throughput but the Y axis is calculated as the number of \procs over the throughput.
-In this representation, perfect scalability should appear as a horizontal line, \eg, if doubling the number of \procs doubles the throughput, then the relation stays the same.
-The left column shows results for 100 cycles per \proc, enough cycles to always keep every \proc busy.
-The right column shows results for 1 cycle per \proc, where the ready queues are expected to be near empty most of the time.
-The distinction between 100 and 1 cycles is meaningful because the idle sleep subsystem is expected to matter only in the right column, where spurious effects can cause a \proc to run out of work temporarily.
-
 \begin{figure}
 	\subfloat[][Throughput, 100 cycles per \proc]{
@@ -150,51 +165,37 @@
 \end{figure}
 
-The experiment ran 15 times for each series and processor count.
-Each series has a solid and two dashed lines representing the median, maximum and minimum result respectively, where the minimum/maximum lines are referred to as the \emph{extremes}.\footnote{
-An alternative display is to use error bars with min/max as the bottom/top for the bar.
-However, this approach is not truly an error bar around a mean value and I felt the connected lines are easier to read.}
-This graph presentation offers an overview of the distribution of the results for each series.
-
-The experimental setup uses taskset to limit the placement of \glspl{kthrd} by the operating system.
-As mentioned in Section~\ref{microenv}, the experiment is setup to prioritize running on two \glspl{hthrd} per core before running on multiple sockets.
-For the Intel machine, this means that from 1 to 24 \procs one socket and \emph{no} hyperthreading is used, and from 25 to 48 \procs still only one socket is used but \emph{with} hyperthreading.
-This pattern is repeated between 49 and 96, between 97 and 144, and between 145 and 192.
-On AMD, the same algorithm is used, but the machine only has 2 sockets.
-So hyperthreading\footnote{
-Hyperthreading normally refers specifically to the technique used by Intel, however it is often used generically to refer to any equivalent feature.}
-is used when the \proc count reach 65 and 193.
-
-The performance goal of \CFA is to obtain equivalent performance to other less fair schedulers.
-Figure~\ref{fig:cycle:jax:ops} and Figure~\ref{fig:cycle:jax:ns} show that for 100 cycles per \proc, \CFA, Go and Tokio all obtain effectively the same performance.
+\subsection{Results}
+
+For the Intel architecture, Figure~\ref{fig:cycle:jax}:
+\begin{itemize}
+\item
+For 100 cycles per \proc (first column), \CFA, Go and Tokio all obtain effectively the same throughput performance.
 Libfibre is slightly behind in this case but still scales decently.
-As a result of the \gls{kthrd} placement, additional \procs from 25 to 48 offer less performance improvements for all runtimes.
-As expected, this pattern repeats between \proc count 72 and 96.
-Hence, Figures~\ref{fig:cycle:jax:ops} and \ref{fig:cycle:jax:ns} show very good throughput and scalability for all runtimes.
-
-When running only a single cycle, the story is slightly different.
-\CFA and Tokio obtain very similar results overall, but Tokio shows notably more variations in the results.
-While \CFA, Go and Tokio achieve equivalent performance with 100 cycles per \proc, with only 1 cycle per \proc, Go achieves slightly better performance.
-This difference in throughput and scalability is due to the idle-sleep mechanism.
-With very few cycles, stealing or helping can cause a cascade of tasks migration and trick a \proc into very short idle sleeps, which negatively affect performance.
-
-An interesting and unusual result is that libfibre achieves better performance with 1 cycle.
-This suggest that the cascade effect is never present in libfibre and that some bottleneck disappears in this context.
-However, I did not investigate this result any deeper.
-
-Figure~\ref{fig:cycle:nasus} show a similar story happening on AMD as it does on Intel.
+As a result of the \gls{kthrd} placement, additional \procs from 25 to 48 offer less performance improvement (flatting of the line) for all runtimes.
+As expected, this pattern repeats again between \proc count 72 and 96.
+\item
+For 1 cycle per \proc, \CFA and Tokio obtain very similar results overall, but Tokio shows more variations in the results.
+Go achieves slightly better performance.
+Interestingly, libfibre achieves better performance with 1 cycle.
+\end{itemize}
+
+For the AMD architecture, Figure~\ref{fig:cycle:nasus}, the results show the same story as on the Intel, with close to double the performance overall but with slightly increased variation.
 The different performance improvements and plateaus are due to cache topology and appear at the expected \proc counts of 64, 128 and 192, for the same reasons as on Intel.
-Unlike Intel, on AMD all 4 runtimes achieve very similar throughput and scalability for 100 cycles per \proc, with some variations in the results.
-
-In the 1 cycle per \proc experiment, the same performance increase for libfibre is visible.
-However, unlike on Intel, Tokio achieves the same performance as Go rather than \CFA.
-This leaves \CFA trailing behind in this particular case, but only at hight core counts.
+\begin{itemize}
+\item
+For 100 cycles per \proc, unlike Intel, all 4 runtimes achieve very similar throughput and scalability.
+\item
+For 1 cycle per \proc, unlike on Intel, Tokio and Go have the same throughput performance, while \CFA is slightly slower.
+Again, the same performance increase for libfibre is visible.
+\end{itemize}
+Note, I did not investigate the libfibre performance boost for 1 cycle in this experiment.
+
+The conclusion from both architectures is that all of the compared runtime have fairly equivalent performance for this micro-benchmark.
+Clearly, the pathological case with 1 \at per \proc, can affect fairness algorithms managing mostly idle processors, \eg \CFA, but only at high core counts.
 For this case, \emph{any} helping is likely to cause a cascade of \procs running out of work and attempting to steal.
-Essentially, \CFA's fairness must result in slower performance in some workload.
-Fortunately, this effect is only problematic in pathological cases, \eg with 1 \at per \proc, which seldom occurs in most workloads.
-
-The conclusion from both architectures is that all of the compared runtime have fairly equivalent performance for this micro-benchmark.
-This result shows that the \CFA scheduler has achieved the goal of obtaining equivalent performance to other less fair schedulers.
+For this experiment, the \CFA scheduler has achieved the goal of obtaining equivalent performance to other less fair schedulers, except for very unusual workloads.
 
 \section{Yield}
+
 For completion, the classic yield benchmark is included.
 Here, the throughput is dominated by the mechanism used to handle the @yield@ function.
@@ -305,5 +306,5 @@
 \end{figure}
 
-For the AMD, Figure~\ref{fig:yield:nasus}, the results show the same story as on the Intel, with slightly increased jitter.
+For the AMD architecture, Figure~\ref{fig:yield:nasus}, the results show the same story as on the Intel, with slightly increased variations.
 Also, some transition points on the X-axis differ because of the architectures, like at 16 versus 24 processors.