Changeset 71cf630 for doc/theses/thierry_delisle_PhD/thesis/text

Timestamp:

Aug 16, 2022, 4:04:47 PM

Author:

Thierry Delisle <tdelisle@…>

Branches:

ADT, ast-experimental, master, pthread-emulation, stuck-waitfor-destruct

Children:

Parents:

741e22c (diff), 17c6edeb (diff)
Note: this is a merge changeset, the changes displayed below correspond to the merge itself.
Use the (diff) links above to see all the changes relative to each parent.

Message:

Merge branch 'master' of plg.uwaterloo.ca:software/cfa/cfa-cc

Location:

doc/theses/thierry_delisle_PhD/thesis/text

Files:

1 added
9 edited

conclusion.tex (added)
core.tex (modified) (6 diffs)
eval_macro.tex (modified) (6 diffs)
eval_micro.tex (modified) (20 diffs)
existing.tex (modified) (6 diffs)
front.tex (modified) (3 diffs)
intro.tex (modified) (5 diffs)
io.tex (modified) (4 diffs)
practice.tex (modified) (4 diffs)
runtime.tex (modified) (2 diffs)

doc/theses/thierry_delisle_PhD/thesis/text/core.tex

-              r741e22c
+              r71cf630
 For threading, a simple and common execution mental-model is the ``Ideal multi-tasking CPU'' :
 \begin{displayquote}[Linux CFS\cit{https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt}]
+\begin{displayquote}[Linux CFS\cite{MAN:linux/cfs}]
         {[The]} ``Ideal multi-tasking CPU'' is a (non-existent  :-)) CPU that has 100\% physical power and which can run each task at precise equal speed, in parallel, each at [an equal fraction of the] speed.  For example: if there are 2 tasks running, then it runs each at 50\% physical power --- i.e., actually in parallel.
         \label{q:LinuxCFS}
 …
 This suggests the following approach:
 \subsection{Dynamic Entropy}\cit{https://xkcd.com/2318/}
+\subsection{Dynamic Entropy}\cite{xkcd:dynamicentropy}
 The Relaxed-FIFO approach can be made to handle the case of mostly empty subqueues by tweaking the \glsxtrlong{prng}.
 The \glsxtrshort{prng} state can be seen as containing a list of all the future subqueues that will be accessed.
 While this concept is not particularly useful on its own, the consequence is that if the \glsxtrshort{prng} algorithm can be run \emph{backwards}, then the state also contains a list of all the subqueues that were accessed.
 Luckily, bidirectional \glsxtrshort{prng} algorithms do exist, \eg some Linear Congruential Generators\cit{https://en.wikipedia.org/wiki/Linear\_congruential\_generator} support running the algorithm backwards while offering good quality and performance.
+Luckily, bidirectional \glsxtrshort{prng} algorithms do exist, \eg some Linear Congruential Generators\cite{wiki:lcg} support running the algorithm backwards while offering good quality and performance.
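As a rough illustration of the bidirectional property, the sketch below steps a 64-bit Linear Congruential Generator forwards and backwards. The constants are the widely published MMIX ones, and the helper names (`lcg_forward`, `lcg_backward`, `subqueue_index`) are assumptions made for this example; none of this is taken from the \CFA sources.

```python
# Hypothetical sketch of a reversible LCG; not the generator used by the runtime.
MODULUS = 1 << 64
MASK = MODULUS - 1
A = 6364136223846793005          # MMIX multiplier (odd, so invertible mod 2^64)
C = 1442695040888963407          # MMIX increment
A_INV = pow(A, -1, MODULUS)      # modular inverse of A (Python 3.8+)


def lcg_forward(state):
    """Advance the generator by one step."""
    return (A * state + C) & MASK


def lcg_backward(state):
    """Undo one step: recover the state that produced `state`."""
    return (A_INV * (state - C)) & MASK


def subqueue_index(state, n_subqueues):
    """Map a state to a subqueue, using the higher-quality upper bits."""
    return (state >> 32) % n_subqueues
```

Because the multiplier is odd, it has an inverse modulo $2^{64}$, which is what makes the backward step possible: walking forwards lists the subqueues that will be visited, and walking backwards from the same state lists the ones that were visited.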
 This particular \glsxtrshort{prng} can be used as follows:
 \begin{itemize}
 …
 The alternative is to do it the other way around.
 \section{Work Stealing++}
+\section{Work Stealing++}\label{helping}
 To add stronger fairness guarantees to work stealing a few changes are needed.
 First, the relaxed-FIFO algorithm has fundamentally better fairness because each \proc always monitors all subqueues.
 …
         \input{base.pstex_t}
         \caption[Base \CFA design]{Base \CFA design \smallskip\newline A pool of subqueues offers the sharding, two per \glspl{proc}.
         Each \gls{proc} can access all of the subqueues.
+        Each \gls{proc} can access all of the subqueues.
         Each \at is timestamped when enqueued.}
         \label{fig:base}
 …
 \end{figure}
 A simple solution to this problem is to use an exponential moving average\cit{https://en.wikipedia.org/wiki/Moving\_average\#Exponential\_moving\_average} (MA) instead of raw timestamps, shown in Figure~\ref{fig:base-ma}.
+A simple solution to this problem is to use an exponential moving average\cite{wiki:ma} (MA) instead of raw timestamps, shown in Figure~\ref{fig:base-ma}.
 Note, this is more complex because the \at at the head of a subqueue is still waiting, so its wait time has not ended.
 Therefore, the exponential moving average is actually an exponential moving average of how long each dequeued \at has waited.
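A minimal sketch of this bookkeeping follows, assuming the average is updated each time an \at is dequeued. The class name, the `alpha` weight and the zero initial value are all assumptions made for illustration; the text does not specify them.

```python
class SubqueueWaitAverage:
    """Exponential moving average of how long dequeued threads waited."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha      # weight of the newest sample (assumed value)
        self.average = 0.0      # assumed starting point

    def on_dequeue(self, enqueue_ts, now):
        """Fold the wait time of the thread just dequeued into the average."""
        waited = now - enqueue_ts
        self.average = self.alpha * waited + (1.0 - self.alpha) * self.average
        return self.average
```

The average only changes on dequeue, which reflects the point made above: the wait of the \at still at the head of the subqueue has not ended, so it cannot yet contribute a sample.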
 …
 The good news is that this problem can be mitigated
 \subsection{Redundant Timestamps}
+\subsection{Redundant Timestamps}\label{relaxedtimes}
 The problem with polling remote subqueues is that correctness is critical.
 There must be a consensus among \procs on which subqueues hold which \ats, as the \ats are in constant motion.

doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex

-              r741e22c
+              r71cf630
 \section{Memcached}
+Memcached~\cit{memcached} is an in-memory key-value store that is used in many production environments, \eg \cit{Berk Atikoglu et al., Workload Analysis of a Large-Scale Key-Value Store,
+SIGMETRICS 2012}.
+This server also has the notable added benefit that there exists a full-featured front-end for performance testing called @mutilate@~\cit{mutilate}.
+Memcached~\cite{memcached} is an in-memory key-value store that is used in many production environments, \eg \cite{atikoglu2012workload}.
+This server also has the notable added benefit that there exists a full-featured front-end for performance testing called @mutilate@~\cite{GITHUB:mutilate}.
 Experimenting on memcached allows for a simple test of the \CFA runtime as a whole: it exercises the scheduler, the idle-sleep mechanism, as well as the \io subsystem for sockets.
 This experiment does not exercise the \io subsystem with regard to disk operations.
-The experiments compare 3 different varitions of memcached:
-\begin{itemize}
- \item \emph{vanilla}: the official release of memcached, version~1.6.9.
- \item \emph{fibre}: a modification of vanilla which uses the thread per connection model on top of the libfibre runtime~\cite{DBLP:journals/pomacs/KarstenB20}.
- \item \emph{cfa}: a modification of the fibre webserver that replaces the libfibre runtime with \CFA.
-\end{itemize}
 \subsection{Benchmark Environment}
 …
 The network route uses 1 Mellanox SX1012 10/40 Gigabit Ethernet cluster switch.
+\subsection{Throughput}
+\subsection{Memcached with threads per connection}
+Comparing against memcached using a user-level runtime only really makes sense if the server actually uses this threading model.
+Indeed, evaluating a user-level runtime with 1 \at per \proc is not meaningful since it does not exercise the runtime; it simply adds some overhead to the underlying OS scheduler.
+One approach is to use a webserver that uses a thread-per-connection model, where each incoming connection is served by a single \at in a strict 1-to-1 pairing.
+This model adds flexibility to the implementation, as the serving logic can now block on user-level primitives without affecting other connections.
+Memcached is not built according to a thread-per-connection model, but there exists a port of it that is, which was built for libfibre in \cite{DBLP:journals/pomacs/KarstenB20}.
+Therefore, this version can be compared both to the original version and to a port to the \CFA runtime.
+As such, this memcached experiment compares 3 different variations of memcached:
+\begin{itemize}
+ \item \emph{vanilla}: the official release of memcached, version~1.6.9.
+ \item \emph{fibre}: a modification of vanilla which uses the thread per connection model on top of the libfibre runtime~\cite{DBLP:journals/pomacs/KarstenB20}.
+ \item \emph{cfa}: a modification of the fibre webserver that replaces the libfibre runtime with \CFA.
+\end{itemize}
+\subsection{Throughput} \label{memcd:tput}
 \begin{figure}
         \centering
 …
 \begin{figure}
         \centering
+        \input{result.memcd.updt.qps.pstex_t}
+        \caption[Churn Benchmark : Throughput on Intel]{Churn Benchmark : Throughput on Intel\smallskip\newline Description}
+        \label{fig:memcd:updt:qps}
+\end{figure}
+\begin{figure}
+        \centering
+        \input{result.memcd.updt.lat.pstex_t}
+        \caption[Churn Benchmark : Throughput on Intel]{Churn Benchmark : Throughput on Intel\smallskip\newline Description}
+        \label{fig:memcd:updt:lat}
+        \subfloat[][Throughput]{
+                \input{result.memcd.forall.qps.pstex_t}
+        }
+        \subfloat[][Latency]{
+                \input{result.memcd.forall.lat.pstex_t}
+        }
+        \caption[forall Latency results at different update rates]{forall Latency results at different update rates\smallskip\newline Description}
+        \label{fig:memcd:updt:forall}
+\end{figure}
+\begin{figure}
+        \centering
+        \subfloat[][Throughput]{
+                \input{result.memcd.fibre.qps.pstex_t}
+        }
+        \subfloat[][Latency]{
+                \input{result.memcd.fibre.lat.pstex_t}
+        }
+        \caption[fibre Latency results at different update rates]{fibre Latency results at different update rates\smallskip\newline Description}
+        \label{fig:memcd:updt:fibre}
+\end{figure}
+\begin{figure}
+        \centering
+        \subfloat[][Throughput]{
+                \input{result.memcd.vanilla.qps.pstex_t}
+        }
+        \subfloat[][Latency]{
+                \input{result.memcd.vanilla.lat.pstex_t}
+        }
+        \caption[vanilla Latency results at different update rates]{vanilla Latency results at different update rates\smallskip\newline Description}
+        \label{fig:memcd:updt:vanilla}
 \end{figure}
 …
 The memcached experiment has two aspects of the \io subsystem it does not exercise, accepting new connections and interacting with disks.
 On the other hand, static webservers, servers that offer static webpages, do stress disk \io since they serve files from disk\footnote{Dynamic webservers, which construct pages as they are sent, are not as interesting since the construction of the pages do not exercise the runtime in a meaningfully different way.}.
+The static webserver experiments will compare NGINX with a custom webserver developed for this experiment.
+The static webserver experiments will compare NGINX~\cit{nginx} with a custom webserver developed for this experiment.
+\subsection{\CFA webserver}
+Unlike the memcached experiment, the webserver experiment relies on a custom designed webserver.
+It is a simple thread-per-connection webserver where a fixed number of \ats are created upfront.
+Each \at calls @accept@, through @io_uring@, on the listening port and handles the incoming connection once accepted.
+Most of the implementation is fairly straightforward; however, the inclusion of file \io introduces a new challenge that had to be hacked around.
+Normally, webservers use @sendfile@\cite{MAN:sendfile} to send files over the socket.
+@io_uring@ does not support @sendfile@; it supports @splice@\cite{MAN:splice} instead, which is strictly more powerful.
+However, because of how Linux implements file \io, see Subsection~\ref{ononblock}, @io_uring@'s implementation must delegate calls to splice to worker threads inside the kernel.
+As of Linux 5.13, @io_uring@ caps the number of these worker threads to @RLIMIT_NPROC@ and therefore, when tens of thousands of splice requests are made, it can create tens of thousands of \glspl{kthrd}.
+Such a high number of \glspl{kthrd} is more than Linux can handle in this scenario so performance suffers significantly.
+For this reason, the \CFA webserver calls @sendfile@ directly.
+This approach works up to a certain point, but once the server approaches saturation, it leads to a new problem.
+When the saturation point of the server is attained, latency will increase and inevitably some client connections will timeout.
+As these clients close their connections, the server must close these sockets without delay so the OS can reclaim the resources used by these connections.
+Indeed, until it is closed on the server end, a connection lingers in the CLOSE-WAIT TCP state~\cite{rfc:tcp} and its TCP buffers are preserved.
+However, this poses a problem using blocking @sendfile@ calls.
+The calls can block if they do not have sufficient memory, which can be caused by having too many connections in the CLOSE-WAIT state.
+Since blocking in calls to @sendfile@ blocks the \proc rather than the \at, this prevents other connections from closing their sockets.
+This leads to a vicious cycle where timeouts lead to @sendfile@ calls running out of resources, which leads to more timeouts.
+Normally, this is addressed by marking the sockets as non-blocking and using @epoll@ to wait for sockets to have sufficient resources.
+However, since @io_uring@ respects non-blocking semantics, marking all sockets as non-blocking effectively circumvents the @io_uring@ subsystem entirely.
+For this reason, the \CFA webserver sets and resets the @O_NONBLOCK@ flag before and after any calls to @sendfile@.
+Normally @epoll@ would also be used when these calls to @sendfile@ return @EAGAIN@, but since this would not help in the evaluation of the \CFA runtime, the \CFA webserver simply yields and retries in these cases.
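The set, call, reset and retry pattern described above can be sketched as follows, assuming a Linux host. This is an illustration in Python, not the \CFA webserver code: `sendfile_with_retry`, `demo` and the `yield_fn` stand-in for a user-level yield are hypothetical names, while the `fcntl`, `os.sendfile` and `socket` calls are standard library functions.

```python
import fcntl
import os
import socket
import tempfile
import time


def sendfile_with_retry(sock_fd, file_fd, offset, count,
                        yield_fn=lambda: time.sleep(0)):
    """Send `count` bytes of a file, toggling O_NONBLOCK around each call."""
    flags = fcntl.fcntl(sock_fd, fcntl.F_GETFL)
    sent = 0
    while sent < count:
        # Set O_NONBLOCK just before the call...
        fcntl.fcntl(sock_fd, fcntl.F_SETFL, flags | os.O_NONBLOCK)
        try:
            n = os.sendfile(sock_fd, file_fd, offset + sent, count - sent)
        except BlockingIOError:      # EAGAIN: the socket has no room right now
            n = None
        finally:
            # ...and restore the original flags right after it.
            fcntl.fcntl(sock_fd, fcntl.F_SETFL, flags)
        if n is None:
            yield_fn()               # stand-in for yielding the user-level thread
            continue
        if n == 0:                   # end of file reached early
            break
        sent += n
    return sent


def demo():
    """Push a small temporary file through a local socket pair."""
    a, b = socket.socketpair()
    with a, b, tempfile.TemporaryFile() as f:
        f.write(b"x" * 1000)
        f.flush()
        sent = sendfile_with_retry(a.fileno(), f.fileno(), 0, 1000)
        received = b""
        while len(received) < sent:
            received += b.recv(4096)
        return sent, received
```

Restoring the flags after every call keeps the socket blocking for all other operations, so only the @sendfile@ call itself observes non-blocking semantics.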
+It is important to state that in Linux 5.15 @io_uring@ introduces the ability for users to limit the number of worker threads that are created, through the @IORING_REGISTER_IOWQ_MAX_WORKERS@ option.
+However, as of this writing, Ubuntu does not have a stable release of Linux 5.15.
+There exist versions of the kernel that are currently under testing, but these caused unrelated but nevertheless prohibitive issues in this experiment.
+Presumably, the new kernel would remove the need for the hack described above, as it would allow connections in the CLOSE-WAIT state to be closed even while the calls to @splice@/@sendfile@ are underway.
+However, since this could not be tested, this is purely a conjecture at this point.
 \subsection{Benchmark Environment}
 …
 These CPUs have only 8 \glspl{hthrd} enabled by grub, which is sufficient to achieve line rate.
 These CPUs each have 64 KB, 256 KiB and 8 MB of L1, L2 and L3 caches respectively.
+The kernel is set up to limit the memory to 25Gb.
 The client machines each have two 2.8 GHz Xeon CPUs, and four one-gigabit Ethernet cards.
 …
 \todo{switch}
 \subsection{Throughput}
+\begin{figure}
+        \centering
+        \input{result.swbsrv.25gb.pstex_t}
+        \caption[Static Webserver Benchmark : Throughput]{Static Webserver Benchmark : Throughput\smallskip\newline }
+To measure the throughput of both webservers, each server is loaded with over 30,000 files making over 4.5 Gigabytes in total.
+Each client runs httperf~\cit{httperf} which establishes a connection, does an http request for one or more files, closes the connection and repeats the process.
+The connections and requests are made according to a Zipfian distribution~\cite{zipf}.
+Throughput is measured by aggregating the results from httperf of all the clients.
+\begin{figure}
+        \subfloat[][Throughput]{
+                \input{result.swbsrv.25gb.pstex_t}
+                \label{fig:swbsrv:ops}
+        }
+        \subfloat[][Rate of Errors]{
+                \input{result.swbsrv.25gb.err.pstex_t}
+                \label{fig:swbsrv:err}
+        }
+        \caption[Static Webserver Benchmark : Throughput]{Static Webserver Benchmark : Throughput\smallskip\newline Throughput vs request rate for short-lived connections.}
         \label{fig:swbsrv}
 \end{figure}
+Networked ZIPF
+Nginx : 5Gb still good, 4Gb starts to suffer
+Cforall : 10Gb too high, 4 Gb too low
+Figure~\ref{fig:swbsrv} shows the results comparing \CFA to NGINX in terms of throughput.
+These results are fairly straightforward.
+Both servers achieve the same throughput until around 57,500 requests per second.
+Since the clients are asking for the same files, the fact that the throughput matches exactly is expected as long as both servers are able to serve the desired rate.
+Once the saturation point is reached, both servers are still very close.
+NGINX achieves slightly better throughput.
+However, Figure~\ref{fig:swbsrv:err} shows the rate of errors, a gross approximation of tail latency, where \CFA achieves notably fewer errors once the machine reaches saturation.
+This suggests that \CFA is slightly more fair and NGINX may slightly sacrifice some fairness for improved throughput.
+It demonstrates that the \CFA webserver described above is able to match the performance of NGINX up to and beyond the saturation point of the machine.
+\subsection{Disk Operations}
+The throughput experiment was run on a server with 25gb of memory, which was sufficient to hold the entire fileset in addition to all the code and data needed to run the webserver and the rest of the machine.
+Previous work like \cit{Cite Ashif's stuff} demonstrates that an interesting follow-up experiment is to rerun the same throughput experiment but allowing significantly less memory on the machine.
+If the machine is constrained enough, it will force the OS to evict files from the file cache and cause calls to @sendfile@ to have to read from disk.
+However, what these low-memory experiments demonstrate is how the memory footprint of the webserver affects the performance.
+Since what I aim to evaluate in this thesis is the runtime of \CFA, I decided to forgo experiments on low-memory servers.
+The implementation of the webserver itself is simply too impactful to be an interesting evaluation of the underlying runtime.

doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex

-              r741e22c
+              r71cf630
 This chapter presents five different experimental setups, evaluating some of the basic features of \CFA's scheduler.
 \section{Benchmark Environment}
+\section{Benchmark Environment}\label{microenv}
 All benchmarks are run on two distinct hardware platforms.
 \begin{description}
 …
         \centering
         \input{cycle.pstex_t}
         \caption[Cycle benchmark]{Cycle benchmark\smallskip\newline Each \gls{at} unparks the next \gls{at} in the cycle before parking itself.}
+        \caption[Cycle benchmark]{Cycle benchmark\smallskip\newline Each \at unparks the next \at in the cycle before parking itself.}
         \label{fig:cycle}
 \end{figure}
 The most basic evaluation of any ready queue is to evaluate the latency needed to push and pop one element from the ready queue.
 Since these two operations also describe a @yield@ operation, many systems use this operation as the most basic benchmark.
 However, yielding can be treated as a special case by optimizing it away since the number of ready \glspl{at} does not change.
 Not all systems perform this optimization, but those that do have an artificial performance benefit because the yield becomes a \emph{nop}.
+However, yielding can be treated as a special case and some aspects of the scheduler can be optimized away since the number of ready \ats does not change.
+Not all systems perform this type of optimization, but those that do have an artificial performance benefit because the yield becomes a \emph{nop}.
 For this reason, I chose a different first benchmark, called \newterm{Cycle Benchmark}.
 This benchmark arranges a number of \glspl{at} into a ring, as seen in Figure~\ref{fig:cycle}, where the ring is a circular singly-linked list.
 At runtime, each \gls{at} unparks the next \gls{at} before parking itself.
 Unparking the next \gls{at} pushes that \gls{at} onto the ready queue as does the ensuing park.
 Hence, the underlying runtime cannot rely on the number of ready \glspl{at} staying constant over the duration of the experiment.
 In fact, the total number of \glspl{at} waiting on the ready queue is expected to vary because of the race between the next \gls{at} unparking and the current \gls{at} parking.
+This benchmark arranges a number of \ats into a ring, as seen in Figure~\ref{fig:cycle}, where the ring is a circular singly-linked list.
+At runtime, each \at unparks the next \at before parking itself.
+Unparking the next \at pushes that \at onto the ready queue while the ensuing park leads to a \at being popped from the ready queue.
+Hence, the underlying runtime cannot rely on the number of ready \ats staying constant over the duration of the experiment.
+In fact, the total number of \ats waiting on the ready queue is expected to vary because of the delay between the next \at unparking and the current \at parking.
 That is, the runtime cannot anticipate that the current task will immediately park.
+As well, the size of the cycle is also decided based on this race, \eg a small cycle may see the chain of unparks go full circle before the first \gls{at} parks because of time-slicing or multiple \procs.
+Every runtime system must handle this race and cannot optimized away the ready-queue pushes and pops.
+To prevent any attempt of silently omitting ready-queue operations, the ring of \glspl{at} is made big enough so the \glspl{at} have time to fully park before being unparked again.
+(Note, an unpark is like a V on a semaphore, so the subsequent park (P) may not block.)
+As well, the size of the cycle is also decided based on this delay.
+Note that an unpark is like a V on a semaphore, so the subsequent park (P) may not block.
+If this happens, the scheduler push and pop are avoided and the results of the experiment would be skewed.
+Because of time-slicing or because cycles can be spread over multiple \procs, a small cycle may see the chain of unparks go full circle before the first \at parks.
+Every runtime system must handle this race but cannot optimize away the ready-queue pushes and pops if the cycle is long enough.
+To prevent any attempt at silently omitting ready-queue operations, the ring of \ats is made big enough so the \ats have time to fully park before being unparked again.
 Finally, to further mitigate any underlying push/pop optimizations, especially on SMP machines, multiple rings are created in the experiment.
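For readers who want to experiment with the ring structure, here is a small runnable sketch of a single cycle. It uses Python kernel threads and semaphores as stand-ins for \ats and park/unpark, so it illustrates only the wake-then-wait ordering and not the performance of any user-level scheduler; `run_cycle` and its parameters are names invented for this example.

```python
import threading


def run_cycle(n_threads, rounds):
    """Run one ring of threads; return how many rounds each one completed."""
    # One semaphore per thread: release() plays the role of unpark (V),
    # acquire() plays the role of park (P).
    parked = [threading.Semaphore(0) for _ in range(n_threads)]
    counts = [0] * n_threads

    def worker(i):
        me = parked[i]
        nxt = parked[(i + 1) % n_threads]
        for _ in range(rounds):
            nxt.release()    # wake the next thread in the ring
            me.acquire()     # then wait to be woken by the previous one
            counts[i] += 1

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counts
```

Each thread releases its successor exactly as many times as it waits on its own semaphore, so the ring always terminates. As noted above, a wait may return without blocking when the matching wake has already happened.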
-To avoid this benchmark being affected by idle-sleep handling, the number of rings is multiple times greater than the number of \glspl{proc}.
-This design avoids the case where one of the \glspl{proc} runs out of work because of the variation on the number of ready \glspl{at} mentioned above.
 Figure~\ref{fig:cycle:code} shows the pseudo code for this benchmark.
 …
         count := 0
         for {
+                @this.next.wake()@
                 @wait()@
-                @this.next.wake()@
                 count ++
                 if must_stop() { break }
 …
                 \label{fig:cycle:jax:low:ns}
+        }
         \caption[Cycle Benchmark on Intel]{Cycle Benchmark on Intel\smallskip\newline Throughput and Scalability as a function of \proc count 5 \ats per cycle and different cycle count. For Throughput higher is better, for Scalability lower is better.}
+        \caption[Cycle Benchmark on Intel]{Cycle Benchmark on Intel\smallskip\newline Throughput and Scalability as a function of \proc count, with 5 \ats per cycle and different cycle counts. For Throughput higher is better, for Scalability lower is better. Each series represents 15 independent runs; the dotted lines are the extremes while the solid line is the median.}
         \label{fig:cycle:jax}
 \end{figure}
 …
                         \input{result.cycle.nasus.ns.pstex_t}
+                }
+                \label{fig:cycle:nasus:ns}
+        }
         \subfloat[][Scalability, 1 cycle per \proc]{
 …
                 \label{fig:cycle:nasus:low:ns}
+        }
         \caption[Cycle Benchmark on AMD]{Cycle Benchmark on AMD\smallskip\newline Throughput and Scalability as a function of \proc count 5 \ats per cycle and different cycle count. For Throughput higher is better, for Scalability lower is better.}
+        \caption[Cycle Benchmark on AMD]{Cycle Benchmark on AMD\smallskip\newline Throughput and Scalability as a function of \proc count, with 5 \ats per cycle and different cycle counts. For Throughput higher is better, for Scalability lower is better. Each series represents 15 independent runs; the dotted lines are the extremes while the solid line is the median.}
         \label{fig:cycle:nasus}
 \end{figure}
 Figure~\ref{fig:cycle:jax} and Figure~\ref{fig:cycle:nasus} show the throughput as a function of \proc count on Intel and AMD respectively, where each cycle has 5 \ats.
 The graphs show traditional throughput on the top row and \newterm{scalability} on the bottom row.
 Where scalability uses the same data but the Y axis is calculated as throughput over the number of \procs.
+Where scalability uses the same data but the Y axis is calculated as the number of \procs over the throughput.
 In this representation, perfect scalability should appear as a horizontal line, \eg, if doubling the number of \procs doubles the throughput, then the relation stays the same.
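A tiny worked example of this metric, using made-up measurements, shows why perfect scaling yields a flat line. The function name and the sample numbers are hypothetical and serve only to illustrate the ratio.

```python
def scalability(procs, throughput):
    """Number of processors divided by throughput (lower is better)."""
    return procs / throughput


# Hypothetical measurements: throughput doubles whenever procs doubles.
samples = [(1, 1_000_000.0), (2, 2_000_000.0), (4, 4_000_000.0)]
values = [scalability(p, t) for p, t in samples]
```

All three entries of `values` are equal, which is the horizontal line described above. If the 4-\proc run only reached 2,000,000 operations per second, its value would be twice as large, showing up as a rise on the graph.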
 The left column shows results for 100 cycles per \proc, enough cycles to always keep every \proc busy.
 …
 The distinction is meaningful because the idle sleep subsystem is expected to matter only in the right column, where spurious effects can cause a \proc to run out of work temporarily.
+The experiment was run 15 times for each series and processor count and the \emph{$\times$}s on the graph show all of the results obtained.
+Each series also has a solid and two dashed lines highlighting the median, maximum and minimum result respectively.
+This presentation offers an overview of the distribution of the results for each series.
+The experimental setup uses taskset to limit the placement of \glspl{kthrd} by the operating system.
+As mentioned in Section~\ref{microenv}, the experiment is set up to prioritize running on 2 \glspl{hthrd} per core before running on multiple sockets.
+For the Intel machine, this means that from 1 to 24 \procs, one socket and \emph{no} hyperthreading is used and from 25 to 48 \procs, still only one socket but \emph{with} hyperthreading.
+This pattern is repeated between 49 and 96, between 97 and 144, and between 145 and 192.
+On AMD, the same algorithm is used, but the machine only has 2 sockets.
+So hyperthreading\footnote{Hyperthreading normally refers specifically to the technique used by Intel, however here it is loosely used to refer to AMD's equivalent feature.} is used when the \proc count reaches 65 and 193.
+Figure~\ref{fig:cycle:jax:ops} and Figure~\ref{fig:cycle:jax:ns} show that for 100 cycles per \proc, \CFA, Go and Tokio all obtain effectively the same performance.
+Libfibre is slightly behind in this case but still scales decently.
+As a result of the \gls{kthrd} placement, we can see that additional \procs from 25 to 48 offer less performance improvements for all runtimes.
+As expected, this pattern repeats between \proc count 72 and 96.
 The performance goal of \CFA is to obtain equivalent performance to other, less fair schedulers and that is what results show.
 Figure~\ref{fig:cycle:jax:ops} and \ref{fig:cycle:jax:ns} show very good throughput and scalability for all runtimes.
+The experimental setup prioritizes running on 2 \glspl{hthrd} per core before running on multiple sockets.
+The effect of that setup is seen from 25 to 48 \procs, running on 24 cores with 2 \glspl{hthrd} per core.
+This effect is again repeated from 73 to 96 \procs, where it happens on the second CPU.
+When running only a single cycle, most runtimes achieve lower throughput because of the idle-sleep mechanism.
+In Figure~\ref{fig:cycle:jax:ops} and \ref{fig:cycle:jax:ns}
+Figure~\ref{fig:cycle:nasus} shows effectively the same story happening on AMD as it does on Intel.
+The different performance bumps due to cache topology happen at different locations and there is a little more variability.
+However, in all cases \CFA is still competitive with other runtimes.
+When running only a single cycle, the story is slightly different.
+\CFA and tokio obtain very similar results overall, but tokio shows notably more variations in the results.
+While \CFA, Go and tokio achieve equivalent performance with 100 cycles per \proc, with only 1 cycle per \proc Go achieves slightly better performance.
+This difference in throughput and scalability is due to the idle-sleep mechanism.
+With very few cycles, stealing or helping can cause a cascade of task migrations and trick \procs into very short idle sleeps.
+Both effects will negatively affect performance.
+An interesting and unusual result is that libfibre achieves better performance with fewer cycles.
+This suggests that the cascade effect is never present in libfibre and that some bottleneck disappears in this context.
+However, I did not investigate this result any deeper.
+Figure~\ref{fig:cycle:nasus} shows a similar story happening on AMD as it does on Intel.
+The different performance improvements and plateaus due to cache topology appear at the expected \proc counts of 64, 128 and 192, for the same reasons as on Intel.
+Unlike Intel, on AMD all 4 runtimes achieve very similar throughput and scalability for 100 cycles per \proc.
+In the 1 cycle per \proc experiment, the same performance increase for libfibre is visible.
+However, unlike on Intel, tokio achieves the same performance as Go rather than \CFA.
+This leaves \CFA trailing behind in this particular case, but only at high core counts.
+Presumably this is because in this case, \emph{any} helping is likely to cause a cascade of \procs running out of work and attempting to steal.
+Since this effect is only problematic in cases with 1 \at per \proc it is not very meaningful for the general performance.
+The conclusion from both architectures is that all of the compared runtimes have fairly equivalent performance in this scenario.
+This demonstrates that in this case \CFA achieves equivalent performance.
 \section{Yield}
 For completeness, the classic yield benchmark is included.
 This benchmark is simpler than the cycle test: it creates many \glspl{at} that call @yield@.
+This benchmark is simpler than the cycle test: it creates many \ats that call @yield@.
 As mentioned, this benchmark may not be representative because of optimization shortcuts in @yield@.
 The only interesting variable in this benchmark is the number of \glspl{at} per \glspl{proc}, where ratios close to 1 means the ready queue(s) can be empty.
+The only interesting variable in this benchmark is the number of \ats per \proc, where ratios close to 1 mean the ready queue(s) can be empty.
 This scenario can put a strain on the idle-sleep handling compared to scenarios where there is plenty of work.
 Figure~\ref{fig:yield:code} shows pseudo code for this benchmark, where the @wait/next.wake@ is replaced by @yield@.
 …
                 \label{fig:yield:jax:low:ns}
+        }
         \caption[Yield Benchmark on Intel]{Yield Benchmark on Intel\smallskip\newline Throughput and Scalability as a function of \proc count, using 1 \ats per \proc. For Throughput higher is better, for Scalability lower is better.}
+        \caption[Yield Benchmark on Intel]{Yield Benchmark on Intel\smallskip\newline Throughput and Scalability as a function of \proc count, using 1 \ats per \proc. For Throughput higher is better, for Scalability lower is better. Each series represents 15 independent runs; the dotted lines are the extremes while the solid line is the median.}
         \label{fig:yield:jax}
 \end{figure}
 …
                         \input{result.yield.nasus.ns.pstex_t}
+                }
+                \label{fig:yield:nasus:ns}
+        }
         \subfloat[][Scalability, 1 \at per \proc]{
 …
                 \label{fig:yield:nasus:low:ns}
+        }
         \caption[Yield Benchmark on AMD]{Yield Benchmark on AMD\smallskip\newline Throughput and Scalability as a function of \proc count, using 1 \ats per \proc. For Throughput higher is better, for Scalability lower is better.}
+        \caption[Yield Benchmark on AMD]{Yield Benchmark on AMD\smallskip\newline Throughput and Scalability as a function of \proc count, using 1 \ats per \proc. For Throughput higher is better, for Scalability lower is better. Each series represents 15 independent runs; the dotted lines are the extremes while the solid line is the median.}
         \label{fig:yield:nasus}
 \end{figure}
+Figure~\ref{fig:yield:jax} shows the throughput as a function of \proc count, where each run uses 100 \ats per \proc.
+Figure~\ref{fig:yield:jax} shows the throughput as a function of \proc count on Intel.
 It is fairly obvious why I claim this benchmark is more artificial.
 The throughput is dominated by the mechanism used to handle the @yield@.
+\CFA does not have special handling for @yield@ and achieves very similar performance to the cycle benchmark.
+Libfibre uses the fact that @yield@ doesn't change the number of ready fibres and by-passes the idle-sleep mechanism entirely, producing significantly better throughput.
+Go puts yielding goroutines on a secondary global ready-queue, giving them lower priority.
+The result is that multiple \glspl{hthrd} contend for the global queue and performance suffers drastically.
+Based on the scalability, Tokio obtains the same poor performance and therefore it is likely it handles @yield@ in a similar fashion.
+\CFA does not have special handling for @yield@ but the experiment requires less synchronization.
+As a result, \CFA achieves better performance than in the cycle benchmark, but the results remain comparable.
 When the number of \ats is reduced to 1 per \proc, the cost of idle sleep also comes into play in a very significant way.
 If anything causes a \at migration, where two \ats end-up on the same ready-queue, work-stealing will start occuring and cause every \at to shuffle around.
+If anything causes a \at migration, where two \ats end up on the same ready-queue, work-stealing starts occurring and could cause several \ats to shuffle around.
 In the process, several \procs can go to sleep transiently if they fail to find where the \ats were shuffled to.
 In \CFA, spurious bursts of latency can trick a \proc into helping, triggering this effect.
+However, since user-level threading with equal number of \ats and \procs is a somewhat degenerate case, especially when ctxswitching very often, this result is not particularly meaningful and is only included for completeness.
+However, since user-level threading with equal number of \ats and \procs is a somewhat degenerate case, especially when context-switching very often, this result is not particularly meaningful and is only included for completeness.
+Libfibre uses the fact that @yield@ doesn't change the number of ready fibres and by-passes the idle-sleep mechanism entirely, producing significantly better throughput.
+Additionally, when only running 1 \at per \proc, libfibre optimizes further and forgoes the context-switch entirely.
+This optimization yields substantially better performance than the other runtimes.
+In stark contrast with libfibre, Go puts yielding goroutines on a secondary global ready-queue, giving them lower priority.
+The result is that multiple \glspl{hthrd} contend for the global queue and performance suffers drastically.
+Based on the scalability, Tokio obtains similarly poor performance and therefore it likely handles @yield@ in a similar fashion.
+However, it must be doing something different since it does scale at low \proc count.
 Again, Figure~\ref{fig:yield:nasus} shows effectively the same story happening on AMD as on Intel.
 …
 \section{Churn}
 The Cycle and Yield benchmarks represent an \emph{easy} scenario for a scheduler, \eg an embarrassingly parallel application.
 In these benchmarks, \glspl{at} can be easily partitioned over the different \glspl{proc} upfront and none of the \glspl{at} communicate with each other.
 The Churn benchmark represents more chaotic execution, where there is no relation between the last \gls{proc} on which a \gls{at} ran and blocked and the \gls{proc} that subsequently unblocks it.
 With processor-specific ready-queues, when a \gls{at} is unblocked by a different \gls{proc} that means the unblocking \gls{proc} must either ``steal'' the \gls{at} from another processor or find it on a global queue.
 This dequeuing results in either contention on the remote queue and/or \glspl{rmr} on \gls{at} data structure.
 In either case, this benchmark aims to highlight how each scheduler handles these cases, since both cases can lead to performance degradation if not handled correctly.
+In these benchmarks, \ats can be easily partitioned over the different \procs upfront and none of the \ats communicate with each other.
+The Churn benchmark represents more chaotic executions, where there is more communication among \ats but no apparent relation between the last \proc on which a \at ran and blocked, and the \proc that subsequently unblocks it.
+With processor-specific ready-queues, when a \at is unblocked by a different \proc that means the unblocking \proc must either ``steal'' the \at from another processor or place it on a remote queue.
+This enqueuing results in either contention on the remote queue and/or \glspl{rmr} on the \at data structure.
+In either case, this benchmark aims to measure how well each scheduler handles these cases, since both cases can lead to performance degradation if not handled correctly.
 This benchmark uses a fixed-size array of counting semaphores.
 Each \gls{at} picks a random semaphore, @V@s it to unblock any \at waiting, and then @P@s on the semaphore.
 This creates a flow where \glspl{at} push each other out of the semaphores before being pushed out themselves.
 For this benchmark to work, the number of \glspl{at} must be equal or greater than the number of semaphores plus the number of \glspl{proc}.
+Each \at picks a random semaphore, @V@s it to unblock any \at waiting, and then @P@s on the semaphore.
+This creates a flow where \ats push each other out of the semaphores before being pushed out themselves.
+For this benchmark to work, the number of \ats must be equal to or greater than the number of semaphores plus the number of \procs.
 Note, the nature of these semaphores means the counter can go beyond 1, which can lead to nonblocking calls to @P@.
 Figure~\ref{fig:churn:code} shows pseudo code for this benchmark, where the @yield@ is replaced by @V@ and @P@.
 …
 \subsection{Results}
-Figure~\ref{fig:churn:jax} shows the throughput as a function of \proc count, where each run uses 100 cycles per \proc and 5 \ats per cycle.
 \begin{figure}
         \subfloat[][Throughput, 100 \ats per \proc]{
 …
                 \label{fig:churn:jax:ops}
+        }
         \subfloat[][Throughput, 1 \ats per \proc]{
+        \subfloat[][Throughput, 2 \ats per \proc]{
                 \resizebox{0.5\linewidth}{!}{
                         \input{result.churn.low.jax.ops.pstex_t}
 …
                         \input{result.churn.jax.ns.pstex_t}
+                }
+        }
         \subfloat[][Latency, 1 \ats per \proc]{
+                \label{fig:churn:jax:ns}
+        }
+        \subfloat[][Latency, 2 \ats per \proc]{
                 \resizebox{0.5\linewidth}{!}{
                         \input{result.churn.low.jax.ns.pstex_t}
 …
                 \label{fig:churn:jax:low:ns}
+        }
+        \caption[Churn Benchmark on Intel]{\centering Churn Benchmark on Intel\smallskip\newline Throughput and latency of the Churn benchmark on the Intel machine.
+        Throughput is the total operations per second across all cores. Latency is the duration of each operation.}
+        \caption[Churn Benchmark on Intel]{\centering Churn Benchmark on Intel\smallskip\newline Throughput and latency of the Churn benchmark on the Intel machine. For Throughput higher is better, for Latency lower is better. Each series represents 15 independent runs; the dotted lines are the extremes while the solid line is the median.}
         \label{fig:churn:jax}
 \end{figure}
+\todo{results discussion}
+\begin{figure}
+        \subfloat[][Throughput, 100 \ats per \proc]{
+                \resizebox{0.5\linewidth}{!}{
+                        \input{result.churn.nasus.ops.pstex_t}
+                }
+                \label{fig:churn:nasus:ops}
+        }
+        \subfloat[][Throughput, 2 \ats per \proc]{
+                \resizebox{0.5\linewidth}{!}{
+                        \input{result.churn.low.nasus.ops.pstex_t}
+                }
+                \label{fig:churn:nasus:low:ops}
+        }
+        \subfloat[][Latency, 100 \ats per \proc]{
+                \resizebox{0.5\linewidth}{!}{
+                        \input{result.churn.nasus.ns.pstex_t}
+                }
+                \label{fig:churn:nasus:ns}
+        }
+        \subfloat[][Latency, 2 \ats per \proc]{
+                \resizebox{0.5\linewidth}{!}{
+                        \input{result.churn.low.nasus.ns.pstex_t}
+                }
+                \label{fig:churn:nasus:low:ns}
+        }
+        \caption[Churn Benchmark on AMD]{\centering Churn Benchmark on AMD\smallskip\newline Throughput and latency of the Churn benchmark on the AMD machine.
+        For Throughput higher is better, for Latency lower is better. Each series represents 15 independent runs; the dotted lines are the extremes while the solid line is the median.}
+        \label{fig:churn:nasus}
+\end{figure}
+Figure~\ref{fig:churn:jax} and Figure~\ref{fig:churn:nasus} show the throughput as a function of \proc count on Intel and AMD respectively.
+It uses the same representation as the previous benchmark: 15 runs where the dashed lines show the extremes and the solid line the median.
+The performance cost of crossing the cache boundaries is still visible at the same \proc count.
+However, the performance of this benchmark is dominated by cache traffic, as \procs are constantly accessing each other's data.
+Scalability is notably worse than in the previous benchmarks since there is inherently more communication between processors.
+Indeed, once the number of \glspl{hthrd} goes beyond a single socket, performance ceases to improve.
+An interesting aspect to note here is that the runtimes differ in how they handle this situation.
+Indeed, when a \proc unparks a \at that was last run on a different \proc, the \at could be appended to the ready-queue of the local \proc or to the ready-queue of the remote \proc, which previously ran the \at.
+\CFA, tokio and Go all use the approach of unparking to the local \proc while Libfibre unparks to the remote \proc.
+In this particular benchmark, the inherent chaos of the benchmark in addition to small memory footprint means neither approach wins over the other.
+Like for the cycle benchmark, here all runtimes achieve fairly similar performance.
+Performance improves as long as all \procs fit on a single socket.
+Beyond that performance starts to suffer from increased caching costs.
+Indeed, Figures~\ref{fig:churn:jax:ops} and \ref{fig:churn:jax:ns} show that with 1 and 100 \ats per \proc, \CFA, libfibre, Go and tokio achieve effectively equivalent performance for most \proc counts.
+However, Figure~\ref{fig:churn:nasus} again shows a somewhat different story on AMD.
+While \CFA, libfibre, and tokio achieve effectively equivalent performance for most \proc counts, Go starts with better scaling at very low \proc counts but then performance quickly plateaus, resulting in worse performance at higher \proc counts.
+This performance difference is visible at both high and low \at counts.
+One possible explanation for this difference is that since Go has very few available concurrent primitives, a channel was used instead of a semaphore.
+On paper a semaphore can be replaced by a channel and with zero-sized objects passed along equivalent performance could be expected.
+However, in practice there can be implementation differences between the two.
+This is especially true if the semaphore count can get somewhat high.
+Note that this replacement is also made in the cycle benchmark, however in that context it did not seem to have a notable impact.
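The channel-for-semaphore replacement described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the `Semaphore` type, `NewSemaphore`, `V`, `P` and `Count` are hypothetical names, and the fixed `limit` is an assumption imposed by Go's buffered channels. It also shows one way the two can differ in practice: a `V` on a full channel blocks, whereas a true counting semaphore has no such bound.

```go
package main

import "fmt"

// Semaphore is a hypothetical counting semaphore backed by a buffered
// channel of zero-sized values; each queued value represents one unit.
type Semaphore chan struct{}

// NewSemaphore creates a semaphore whose count can never exceed limit.
func NewSemaphore(limit int) Semaphore {
	return make(Semaphore, limit)
}

// V increments the count. Unlike a true semaphore, it blocks when the
// count already equals the channel capacity.
func (s Semaphore) V() { s <- struct{}{} }

// P decrements the count, blocking while the count is zero.
func (s Semaphore) P() { <-s }

// Count reports the number of pending units (for demonstration only;
// the value may be stale as soon as it is read in concurrent code).
func (s Semaphore) Count() int { return len(s) }

func main() {
	s := NewSemaphore(4)
	s.V()
	s.V()
	fmt.Println(s.Count()) // prints 2: the count went beyond 1
	s.P()                  // does not block, the count was above zero
	fmt.Println(s.Count()) // prints 1
}
```

The second `V` pushes the count beyond 1, so the following `P` returns immediately, which is the nonblocking behaviour noted for the counting semaphores of this benchmark.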
+A second possible explanation is that Go may sometimes use the heap when allocating variables based on the result of escape analysis of the code.
+It is possible that variables that should be placed on the stack are placed on the heap.
+This could cause extra pointer chasing in the benchmark, heightening locality effects.
+Depending on how the heap is structured, this could also lead to false sharing.
+The objective of this benchmark is to demonstrate that unparking \ats from remote \procs does not cause too much contention on the local queues.
+Indeed, the fact that all runtimes achieve some scaling at lower \proc counts demonstrates that migrations do not need to be serialized.
+Again, these results demonstrate that \CFA achieves satisfactory performance.
 \section{Locality}
+\todo{code, setup, results}
+\begin{figure}
+\begin{cfa}
+Thread.main() {
+        count := 0
+        for {
+                r := random() % len(spots)
+                // go through the array
+                @work( a )@
+                spots[r].V()
+                spots[r].P()
+                count ++
+                if must_stop() { break }
+        }
+        global.count += count
+}
+\end{cfa}
+\begin{cfa}
+Thread.main() {
+        count := 0
+        for {
+                r := random() % len(spots)
+                // go through the array
+                @work( a )@
+                // pass array to next thread
+                spots[r].V( @a@ )
+                @a = @spots[r].P()
+                count ++
+                if must_stop() { break }
+        }
+        global.count += count
+}
+\end{cfa}
+\caption[Locality Benchmark : Pseudo Code]{Locality Benchmark : Pseudo Code}
+\label{fig:locality:code}
+\end{figure}
+As mentioned in the churn benchmark, when unparking a \at, it is possible to either unpark to the local or remote ready-queue.
+\footnote{It is also possible to unpark to a third unrelated ready-queue, but without additional knowledge about the situation, there is little to suggest this would not degrade performance.}
+The locality experiment includes two variations of the churn benchmark, where an array of data is added.
+In both variations, before @V@ing the semaphore, each \at increments random cells inside the array.
+The @share@ variation then passes the array to the shadow-queue of the semaphore, transferring ownership of the array to the woken thread.
+In the @noshare@ variation the array is not passed on and each thread continuously accesses its private array.
+The objective here is to highlight the different decisions made by the runtime when unparking.
+Since each thread unparks a random semaphore, it means that it is unlikely that a \at will be unparked from the last \proc it ran on.
+In the @share@ version, this means that unparking the \at on the local \proc is appropriate since the data was last modified on that \proc.
+In the @noshare@ version, unparking the \at on the remote \proc is the appropriate approach.
+The expectation for this benchmark is to see a performance inversion, where runtimes will fare notably better in the variation which matches their unparking policy.
+This should lead to \CFA, Go and Tokio achieving better performance in @share@ while libfibre achieves better performance in @noshare@.
+Indeed, \CFA, Go and Tokio have the default policy of unparking \ats on the local \proc, whereas libfibre has the default policy of unparking \ats wherever they last ran.
+\subsection{Results}
+\begin{figure}
+        \subfloat[][Throughput share]{
+                \resizebox{0.5\linewidth}{!}{
+                        \input{result.locality.share.jax.ops.pstex_t}
+                }
+                \label{fig:locality:jax:share:ops}
+        }
+        \subfloat[][Throughput noshare]{
+                \resizebox{0.5\linewidth}{!}{
+                        \input{result.locality.noshare.jax.ops.pstex_t}
+                }
+                \label{fig:locality:jax:noshare:ops}
+        }
+        \subfloat[][Scalability share]{
+                \resizebox{0.5\linewidth}{!}{
+                        \input{result.locality.share.jax.ns.pstex_t}
+                }
+                \label{fig:locality:jax:share:ns}
+        }
+        \subfloat[][Scalability noshare]{
+                \resizebox{0.5\linewidth}{!}{
+                        \input{result.locality.noshare.jax.ns.pstex_t}
+                }
+                \label{fig:locality:jax:noshare:ns}
+        }
+        \caption[Locality Benchmark on Intel]{Locality Benchmark on Intel\smallskip\newline Throughput and Scalability as a function of \proc count. For Throughput higher is better, for Scalability lower is better. Each series represent 15 independent runs, the dotted lines are extremums while the solid line is the medium.}
+        \label{fig:locality:jax}
+\end{figure}
+\begin{figure}
+        \subfloat[][Throughput share]{
+                \resizebox{0.5\linewidth}{!}{
+                        \input{result.locality.share.nasus.ops.pstex_t}
+                }
+                \label{fig:locality:nasus:share:ops}
+        }
+        \subfloat[][Throughput noshare]{
+                \resizebox{0.5\linewidth}{!}{
+                        \input{result.locality.noshare.nasus.ops.pstex_t}
+                }
+                \label{fig:locality:nasus:noshare:ops}
+        }
+        \subfloat[][Scalability share]{
+                \resizebox{0.5\linewidth}{!}{
+                        \input{result.locality.share.nasus.ns.pstex_t}
+                }
+                \label{fig:locality:nasus:share:ns}
+        }
+        \subfloat[][Scalability noshare]{
+                \resizebox{0.5\linewidth}{!}{
+                        \input{result.locality.noshare.nasus.ns.pstex_t}
+                }
+                \label{fig:locality:nasus:noshare:ns}
+        }
+        \caption[Locality Benchmark on AMD]{Locality Benchmark on AMD\smallskip\newline Throughput and Scalability as a function of \proc count. For Throughput higher is better, for Scalability lower is better. Each series represent 15 independent runs, the dotted lines are extremums while the solid line is the medium.}
+        \label{fig:locality:nasus}
+\end{figure}
+Figure~\ref{fig:locality:jax} and \ref{fig:locality:nasus} show the results on Intel and AMD respectively.
+In both cases, the graphs on the left column show the results for the @share@ variation and the graphs on the right column show the results for the @noshare@.
+On Intel, Figure~\ref{fig:locality:jax} shows Go trailing behind the 3 other runtimes.
+The left of the figure shows the results for the @share@ variation, where \CFA and tokio slightly outperform libfibre as expected.
+And correspondingly on the right, we see the expected performance inversion where libfibre now outperforms \CFA and tokio.
+Otherwise the results are similar to the churn benchmark, with lower throughput due to the array processing.
+Presumably the reasons why Go trails behind are the same as in Figure~\ref{fig:churn:nasus}.
+Figure~\ref{fig:locality:nasus} shows the same experiment on AMD.
+\todo{why is cfa slower?}
+Again, we see the same story, where tokio and libfibre swap places and Go trails behind.
 \section{Transfer}
 The last benchmark is more of an experiment than a benchmark.
 It tests the behaviour of the schedulers for a misbehaved workload.
 In this workload, one of the \gls{at} is selected at random to be the leader.
 The leader then spins in a tight loop until it has observed that all other \glspl{at} have acknowledged its leadership.
 The leader \gls{at} then picks a new \gls{at} to be the ``spinner'' and the cycle repeats.
 The benchmark comes in two flavours for the non-leader \glspl{at}:
+In this workload, one of the \ats is selected at random to be the leader.
+The leader then spins in a tight loop until it has observed that all other \ats have acknowledged its leadership.
+The leader \at then picks a new \at to be the next leader and the cycle repeats.
+The benchmark comes in two flavours for the non-leader \ats:
 once they acknowledged the leader, they either block on a semaphore or spin yielding.
 The experiment is designed to evaluate the short-term load-balancing of a scheduler.
 Indeed, schedulers where the runnable \glspl{at} are partitioned on the \glspl{proc} may need to balance the \glspl{at} for this experiment to terminate.
 This problem occurs because the spinning \gls{at} is effectively preventing the \gls{proc} from running any other \glspl{thrd}.
 In the semaphore flavour, the number of runnable \glspl{at} eventually dwindles down to only the leader.
 This scenario is a simpler case to handle for schedulers since \glspl{proc} eventually run out of work.
 In the yielding flavour, the number of runnable \glspl{at} stays constant.
+Indeed, schedulers where the runnable \ats are partitioned on the \procs may need to balance the \ats for this experiment to terminate.
+This problem occurs because the spinning \at is effectively preventing the \proc from running any other \at.
+In the semaphore flavour, the number of runnable \ats eventually dwindles down to only the leader.
+This scenario is a simpler case to handle for schedulers since \procs eventually run out of work.
+In the yielding flavour, the number of runnable \ats stays constant.
 This scenario is a harder case to handle because corrective measures must be taken even when work is available.
 Note, runtime systems with preemption circumvent this problem by forcing the spinner to yield.
+\todo{code, setup, results}
+In both flavours, the experiment effectively measures how long it takes for all \ats to run once after a given synchronization point.
+In an ideal scenario where the scheduler is strictly FIFO, every thread would run once after the synchronization and therefore the delay between leaders would be given by:
+$ \frac{CSL + SL}{NP - 1}$, where $CSL$ is the context switch latency, $SL$ is the cost for enqueuing and dequeuing a \at and $NP$ is the number of \procs.
+However, if the scheduler allows \ats to run many times before other \ats are able to run once, this delay will increase.
+The semaphore version is an approximation of the strictly FIFO scheduling, where none of the \ats \emph{attempt} to run more than once.
+The benchmark effectively provides the fairness guarantee in this case.
+In the yielding version however, the benchmark provides no such guarantee, which means the scheduler has full responsibility and any unfairness will be measurable.
+While this is a fairly artificial scenario, it requires only a few simple pieces.
+The yielding version of this simply creates a scenario where a \at runs uninterrupted in a saturated system, and starvation has an easily measured impact.
+However, \emph{any} \at that runs uninterrupted for a significant period of time in a saturated system could lead to this kind of starvation.
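As a purely illustrative instance of the ideal delay formula above, assume the hypothetical values $CSL = 100$~ns, $SL = 50$~ns and $NP = 4$; these numbers are chosen only to show the arithmetic and are not measurements from any of the runtimes:
\[
	\frac{CSL + SL}{NP - 1} = \frac{100~\mathrm{ns} + 50~\mathrm{ns}}{4 - 1} = 50~\mathrm{ns}
\]
Under these assumed values, a measured delay between leaders that is orders of magnitude above 50~ns would point to \ats running many times before others run once, rather than to context-switch or queueing costs.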
 \begin{figure}
 …
                 return
+        }
         // Wait for everyone to acknowledge my leadership
         start := timeNow()
 …
+                }
+        }
         // pick next leader
         leader := threads[ prng() % len(threads) ]
         // wake every one
         if ! exhaust {
 …
+        }
+}
 Thread.wait() {
         this.idx_seen := lead_idx
 …
         else { yield() }
+}
 Thread.main() {
         while !done  {
 …
 \subsection{Results}
+Figure~\ref{fig:transfer:jax} shows the throughput as a function of \proc count, where each run uses 100 cycles per \proc and 5 \ats per cycle.
+\todo{results discussion}
+\begin{figure}
+\begin{centering}
+\begin{tabular}{r | c c c c | c c c c }
+Machine   &                     \multicolumn{4}{c |}{Intel}                &          \multicolumn{4}{c}{AMD}                    \\
+Variation & \multicolumn{2}{c}{Park} & \multicolumn{2}{c |}{Yield} & \multicolumn{2}{c}{Park} & \multicolumn{2}{c}{Yield} \\
+\procs    &      2      &      192   &      2      &      192      &      2      &      256   &      2      &      256    \\
+\hline
+\CFA      & 106 $\mu$s  & ~19.9 ms   & 68.4 $\mu$s & ~1.2 ms       & 174 $\mu$s  & ~28.4 ms   & 78.8~~$\mu$s& ~~1.21 ms   \\
+libfibre  & 127 $\mu$s  & ~33.5 ms   & DNC         & DNC           & 156 $\mu$s  & ~36.7 ms   & DNC         & DNC         \\
+Go        & 106 $\mu$s  & ~64.0 ms   & 24.6 ms     & 74.3 ms       & 271 $\mu$s  & 121.6 ms   & ~~1.21~ms   & 117.4 ms    \\
+tokio     & 289 $\mu$s  & 180.6 ms   & DNC         & DNC           & 157 $\mu$s  & 111.0 ms   & DNC         & DNC
+\end{tabular}
+\end{centering}
+\caption[Transfer Benchmark on Intel and AMD]{Transfer Benchmark on Intel and AMD\smallskip\newline Average measurement of how long it takes for all \ats to acknowledge the leader \at. DNC stands for ``did not complete'', meaning that after 5 seconds of a new leader being decided, some \ats still had not acknowledged the new leader. }
+\label{fig:transfer:res}
+\end{figure}
+Figure~\ref{fig:transfer:res} shows the results for the transfer benchmark with 2 \procs and all \procs, where each experiment runs 100 \ats per \proc.
+Note that the results here are only meaningful as a coarse measurement of fairness, beyond which small cost differences in the runtime and concurrent primitives begin to matter.
+As such, data points that are on the same order of magnitude as each other should be considered essentially equal.
+The takeaway of this experiment is the presence of very large differences.
+The semaphore variation is denoted ``Park'', where the number of \ats dwindles down as the new leader is acknowledged.
+The yielding variation is denoted ``Yield''.
+The experiment was only run for the extremes of the number of cores since the scaling per core behaves like previous experiments.
+This experiment clearly demonstrates that while the other runtimes achieve similar performance in previous benchmarks, here \CFA achieves significantly better fairness.
+The semaphore variation serves as a control group, where all runtimes are expected to transfer leadership fairly quickly.
+Since \ats block after acknowledging the leader, this experiment effectively measures how quickly \procs can steal \ats from the \proc running leader.
+Figure~\ref{fig:transfer:res} shows that while Go and Tokio are slower, all runtimes achieve decent latency.
+However, the yielding variation shows an entirely different picture.
+Since libfibre and tokio have a traditional work-stealing scheduler, \procs that have \ats on their local queues will never steal from other \procs.
+The result is that the experiment simply does not complete for these runtimes.
+Without \procs stealing from the \proc running the leader, the experiment will simply never terminate.
+Go manages to complete the experiment because it adds preemption on top of classic work-stealing.
+However, since preemption is fairly costly it achieves significantly worse performance.
+In contrast, \CFA achieves equivalent performance in both variations, demonstrating very good fairness.
+Interestingly, \CFA achieves better delays in the yielding version than in the semaphore version; this is likely because fairness is equivalent in both while the yielding version avoids the cost of the semaphores and idle-sleep.

doc/theses/thierry_delisle_PhD/thesis/text/existing.tex

-              r741e22c
+              r71cf630
 \section{Naming Convention}
 Scheduling has been studied by various communities concentrating on different incarnation of the same problems.
 As a result, there are no standard naming conventions for scheduling that is respected across these communities.
+Scheduling has been studied by various communities concentrating on different incarnations of the same problems.
+As a result, there are no standard naming conventions for scheduling that are respected across these communities.
 This document uses the term \newterm{\Gls{at}} to refer to the abstract objects being scheduled and the term \newterm{\Gls{proc}} to refer to the concrete objects executing these \ats.
 …
 \section{Dynamic Scheduling}
 \newterm{Dynamic schedulers} determine \ats dependencies and costs during scheduling, if at all.
 Hence, unlike static scheduling, \ats dependencies are conditional and detected at runtime.
+Hence, unlike static scheduling, \ats dependencies are conditional and detected at runtime.
 This detection takes the form of observing new \ats(s) in the system and determining dependencies from their behaviour, including suspending or halting a \ats that dynamically detects unfulfilled dependencies.
 Furthermore, each \ats has the responsibility of adding dependent \ats back into the system once dependencies are fulfilled.
 …
 Most common operating systems use some variant on priorities with overlaps and dynamic priority adjustments.
 For example, Microsoft Windows uses a pair of priorities
 \cit{https://docs.microsoft.com/en-us/windows/win32/procthread/scheduling-priorities,https://docs.microsoft.com/en-us/windows/win32/taskschd/taskschedulerschema-priority-settingstype-element}, one specified by users out of ten possible options and one adjusted by the system.
+\cite{win:priority}, one specified by users out of ten possible options and one adjusted by the system.
 \subsection{Uninformed and Self-Informed Dynamic Schedulers}
 …
 The scheduler may also temporarily adjust priorities after certain effects like the completion of I/O requests.
+\todo{load balancing}
+In~\cite{russinovich2009windows}, Chapter 1 section ``Processes, Threads, and Jobs'' discusses the scheduling policy more in depth.
+Multicore scheduling is based on a combination of priorities and preferred \procs.
+Each \at is assigned an \newterm{ideal} \proc using a round-robin policy.
+\Gls{at} are distributed among the \procs according to their priority, preferring to match \ats to their ideal \proc and then to the last \proc they ran on.
+This is similar to a variation of work stealing, where the stealing \proc restores the \at to its original \proc after running it, but with priorities added into the mix.
 \paragraph{Apple OS X}
 …
 \paragraph{Go}\label{GoSafePoint}
 Go's scheduler uses a randomized work-stealing algorithm that has a global run-queue (\emph{GRQ}) and each processor (\emph{P}) has both a fixed-size run-queue (\emph{LRQ}) and a high-priority next ``chair'' holding a single element~\cite{GITHUB:go,YTUBE:go}.
 Preemption is present, but only at safe-points,~\cit{https://go.dev/src/runtime/preempt.go} which are inserted detection code at various frequent access boundaries.
+Preemption is present, but only at safe-points,~\cite{go:safepoints} which are inserted detection code at various frequent access boundaries.
 The algorithm is as follows :
 …
 \paragraph{Grand Central Dispatch}
 An Apple\cit{Official GCD source} API that offers task parallelism~\cite{wiki:taskparallel}.
+An Apple\cite{apple:gcd} API that offers task parallelism~\cite{wiki:taskparallel}.
 Its distinctive aspect is multiple ``Dispatch Queues'', some of which are created by programmers.
 Each queue has its own local ordering guarantees, \eg \ats on queue $A$ are executed in \emph{FIFO} order.
+\todo{load balancing and scheduling}
+% http://web.archive.org/web/20090920043909/http://images.apple.com/macosx/technology/docs/GrandCentral_TB_brief_20090903.pdf
+In terms of semantics, the Dispatch Queues seem to be very similar to Intel\textregistered ~TBB @execute()@ and predecessor semantics.
+While the documentation only gives limited insight into the scheduling and load balancing approach, \cite{apple:gcd2} suggests a fairly classic approach,
+where each \proc has a queue of \newterm{blocks} to run, effectively \ats, and they drain their respective queues in \glsxtrshort{fifo}.
+They seem to add the concept of dependent queues with clear ordering, where executing a block ends up scheduling more blocks.
+In terms of semantics, these Dispatch Queues seem to be very similar to Intel\textregistered ~TBB @execute()@ and predecessor semantics.
 \paragraph{LibFibre}

doc/theses/thierry_delisle_PhD/thesis/text/front.tex

-              r741e22c
+              r71cf630
 % D E C L A R A T I O N   P A G E
 % -------------------------------
 % The following is a sample Delaration Page as provided by the GSO
+% The following is a sample Declaration Page as provided by the GSO
 % December 13th, 2006.  It is designed for an electronic thesis.
 \noindent
 …
 User-Level threading (M:N) is gaining popularity over kernel-level threading (1:1) in many programming languages.
 The user-level approach is often a better mechanism to express complex concurrent applications by efficiently running 10,000+ threads on multi-core systems.
 Indeed, over-partitioning into small work-units significantly eases load balancing while providing user threads for each unit of work offers greater freedom to the programmer.
+The user threading approach is often a better mechanism to express complex concurrent applications by efficiently running 10,000+ threads on multi-core systems.
+Indeed, over-partitioning into small work-units with user threading significantly eases load bal\-ancing, while simultaneously providing advanced synchronization and mutual exclusion capabilities.
 To manage these high levels of concurrency, the underlying runtime must efficiently schedule many user threads across a few kernel threads;
+which raises the question of how many kernel threads are needed and when the need should be re-evaluated.
+Furthermore, the scheduler must prevent kernel threads from blocking, otherwise user-thread parallelism drops, and put idle kernel-threads to sleep to avoid wasted resources.
+which raises the question of how many kernel threads are needed and whether the number should be dynamically reevaluated.
+Furthermore, scheduling must prevent kernel threads from blocking, otherwise user-thread parallelism drops.
+When user-threading parallelism does drop, how and when should idle kernel-threads be put to sleep to avoid wasting CPU resources?
 Finally, the scheduling system must provide fairness to prevent a user thread from monopolizing a kernel thread;
 otherwise other user threads can experience short/long term starvation or kernel threads can deadlock waiting for events to occur.
+otherwise other user threads can experience short/long term starvation or kernel threads can deadlock waiting for events to occur on busy kernel threads.
 This thesis analyses multiple scheduler systems, where each system attempts to fulfill the necessary requirements for user-level threading.
+The predominant technique for manage high levels of concurrency is sharding the ready-queue with one queue per kernel-threads and using some form of work stealing/sharing to dynamically rebalance workload shifts.
+Fairness can be handled through preemption or ad-hoc solutions, which leads to coarse-grained fairness and pathological cases.
+The predominant technique for managing high levels of concurrency is sharding the ready-queue with one queue per kernel-thread and using some form of work stealing/sharing to dynamically rebalance workload shifts.
 Preventing kernel blocking is accomplished by transforming kernel locks and I/O operations into user-level operations that do not block the kernel thread or spin up new kernel threads to manage the blocking.
+After selecting specific approaches to these scheduling issues, a complete implementation was created and tested in the \CFA (C-for-all) runtime system.
+Fairness is handled through preemption and/or ad-hoc solutions, which leads to coarse-grained fairness with some pathological cases.
+After examining, selecting and testing specific approaches to these scheduling issues, a complete implementation was created and tested in the \CFA (C-for-all) runtime system.
 \CFA is a modern extension of C using user-level threading as its fundamental threading model.
 As one of its primary goals, \CFA aims to offer increased safety and productivity without sacrificing performance.
 The new scheduler achieves this goal by demonstrating equivalent performance to work-stealing schedulers while offering better fairness.
+This is achieved through several optimization that successfully eliminate the cost of the additional fairness, some of these optimization relying on interesting hardware optimizations present on most modern cpus.
+This work also includes support for user-level \io, allowing programmers to have many more user-threads blocking on \io operations than there are \glspl{kthrd}.
+The implementation is based on @io_uring@, a recent addition to the Linux kernel, and achieves the same performance and fairness.
+To complete the picture, the idle sleep mechanism that goes along is presented.
+The implementation uses several optimizations that successfully balance the cost of fairness against performance;
+some of these optimizations rely on interesting hardware optimizations present on modern CPUs.
+The new scheduler also includes support for implicit nonblocking \io, allowing applications to have more user-threads blocking on \io operations than there are \glspl{kthrd}.
+The implementation is based on @io_uring@, a recent addition to the Linux kernel, and achieves the same performance and fairness as systems using @select@, @epoll@, \etc.
+To complete the scheduler, an idle sleep mechanism is implemented that significantly reduces wasted CPU cycles, which are then available outside of the application.
 \cleardoublepage
 …
 \begin{center}\textbf{Acknowledgements}\end{center}
+\todo{Acknowledgements}
+I would like to thank my supervisor, Professor Peter Buhr, for his guidance through my degree as well as the editing of this document.
+I would like to thank Professors Martin Karsten and Trevor Brown, for reading my thesis and providing helpful feedback.
+Thanks to Andrew Beach, Michael Brooks, Colby Parsons, Mubeen Zulfiqar, Fangren Yu and Jiada Liang for their work on the \CFA project as well as all the discussions which have helped me concretize the ideas in this thesis.
+Finally, I acknowledge that this has been possible thanks to the financial help offered by the David R. Cheriton School of Computer Science and the corporate partnership with Huawei Ltd.
 \cleardoublepage

doc/theses/thierry_delisle_PhD/thesis/text/intro.tex

-              r741e22c
+              r71cf630
 \chapter{Introduction}\label{intro}
-\section{\CFA programming language}
+The \CFA programming language~\cite{cfa:frontpage,cfa:typesystem} extends the C programming language by adding modern safety and productivity features, while maintaining backwards compatibility.
+Among its productivity features, \CFA supports user-level threading~\cite{Delisle21} allowing programmers to write modern concurrent and parallel programs.
+My previous master's thesis on concurrency in \CFA focused on features and interfaces.
+This Ph.D.\ thesis focuses on performance, introducing \glsxtrshort{api} changes only when required by performance considerations.
+Specifically, this work concentrates on scheduling and \glsxtrshort{io}.
+Prior to this work, the \CFA runtime used a strict \glsxtrshort{fifo} \gls{rQ} and no \glsxtrshort{io} capabilities at the user-thread level\footnote{C supports \glsxtrshort{io} capabilities at the kernel level, which means blocking operations block kernel threads where blocking user-level threads would be more appropriate for \CFA.}.
+\Gls{uthrding} (M:N) is gaining popularity over kernel-level threading (1:1) in many programming languages.
+The user threading approach is often a better mechanism to express complex concurrent applications by efficiently running 10,000+ threads on multi-core systems.
+Indeed, over-partitioning into small work-units with user threading significantly eases load bal\-ancing, while simultaneously providing advanced synchronization and mutual exclusion capabilities.
+To manage these high levels of concurrency, the underlying runtime must efficiently schedule many user threads across a few kernel threads;
+which raises the question of how many kernel threads are needed and whether the number should be dynamically reevaluated.
+Furthermore, scheduling must prevent kernel threads from blocking, otherwise user-thread parallelism drops.
+When user-threading parallelism does drop, how and when should idle kernel-threads be put to sleep to avoid wasting CPU resources?
+Finally, the scheduling system must provide fairness to prevent a user thread from monopolizing a kernel thread;
+otherwise other user threads can experience short/long term starvation or kernel threads can deadlock waiting for events to occur on busy kernel threads.
+As a research project, this work builds exclusively on newer versions of the Linux operating-system and gcc/clang compilers.
+While \CFA is released, supporting older versions of Linux ($<$~Ubuntu 16.04) and gcc/clang compilers ($<$~gcc 6.0) is not a goal of this work.
+This thesis analyses multiple scheduler systems, where each system attempts to fulfill the necessary requirements for \gls{uthrding}.
+The predominant technique for managing high levels of concurrency is sharding the ready-queue with one queue per kernel-thread and using some form of work stealing/sharing to dynamically rebalance workload shifts.
+Preventing kernel blocking is accomplished by transforming kernel locks and I/O operations into user-level operations that do not block the kernel thread or spin up new kernel threads to manage the blocking.
+Fairness is handled through preemption and/or ad-hoc solutions, which leads to coarse-grained fairness with some pathological cases.
+After examining, testing and selecting specific approaches to these scheduling issues, a completely new scheduler was created and tested in the \CFA (C-for-all) user-threading runtime-system.
+The goal of the new scheduler is to offer increased safety and productivity without sacrificing performance.
+The quality of the new scheduler is demonstrated by comparing it with other user-threading work-stealing schedulers with the aim of showing equivalent or better performance while offering better fairness.
+Chapter~\ref{intro} defines scheduling and its general goals.
+Chapter~\ref{existing} discusses how scheduler implementations attempt to achieve these goals, but all implementations optimize some workloads better than others.
+Chapter~\ref{cfaruntime} presents the relevant aspects of the \CFA runtime system that have a significant effect on the new scheduler design and implementation.
+Chapter~\ref{core} analyses different scheduler approaches, while looking for scheduler mechanisms that provide both performance and fairness.
+Chapter~\ref{userio} covers the complex mechanisms that must be used to achieve nonblocking I/O to prevent the blocking of \glspl{kthrd}.
+Chapter~\ref{practice} presents the mechanisms needed to adjust the amount of parallelism, both manually and automatically.
+Chapters~\ref{microbench} and~\ref{macrobench} present micro and macro benchmarks used to evaluate and compare the new scheduler with similar schedulers.
 \section{Scheduling}
 Computer systems share multiple resources across many threads of execution, even on single user computers like laptops or smartphones.
 On a computer system with multiple processors and work units, there exists the problem of mapping work onto processors in an efficient manner, called \newterm{scheduling}.
 These systems are normally \newterm{open}, meaning new work arrives from an external source or is spawned from an existing work unit.
 On a computer system, the scheduler takes a sequence of work requests in the form of threads and attempts to complete the work, subject to performance objectives, such as resource utilization.
 A general-purpose dynamic-scheduler for an open system cannot anticipate future work requests, so its performance is rarely optimal.
 With complete knowledge of arrive order and work, creating an optimal solution still effectively needs solving the bin packing problem\cite{wiki:binpak}.
 However, optimal solutions are often not required.
 Schedulers do produce excellent solutions, whitout needing optimality, by taking advantage of regularities in work patterns.
+Computer systems share multiple resources across many threads of execution, even on single-user computers like laptops or smartphones.
+On a computer system with multiple processors and work units (routines, coroutines, threads, programs, \etc), there exists the problem of mapping many different kinds of work units onto many different kinds of processors in an efficient manner, called \newterm{scheduling}.
+Scheduling systems are normally \newterm{open}, meaning new work arrives from an external source or is randomly spawned from an existing work unit.
+In general, work units without threads, like routines and coroutines, are self-scheduling, while work units with threads, like tasks and programs, are scheduled.
+For scheduled work-units, a scheduler takes a sequence of threads and attempts to run them to completion, subject to shared resource restrictions and utilization.
+A general-purpose dynamic-scheduler for an open system cannot anticipate work requests, so its performance is rarely optimal.
+Even with complete knowledge of arrival order and work, creating an optimal solution is a bin packing problem~\cite{wiki:binpak}.
+However, optimal solutions are often not required: schedulers often produce excellent solutions, without needing optimality, by taking advantage of regularities in work patterns.
 Scheduling occurs at discrete points when there are transitions in a system.
 …
 \input{executionStates.pstex_t}
 \end{center}
 These \newterm{state transition}s are initiated in response to events (\Index{interrupt}s):
+These \newterm{state transition}s are initiated in response to events, \eg blocking, interrupts, errors:
 \begin{itemize}
 \item
 entering the system (new $\rightarrow$ ready)
+\item
+scheduler assigns a thread to a computing resource, \eg CPU (ready $\rightarrow$ running)
 \item
 timer alarm for preemption (running $\rightarrow$ ready)
 …
 long term delay versus spinning (running $\rightarrow$ blocked)
 \item
 blocking ends, \ie network or I/O completion (blocked $\rightarrow$ ready)
+completion of delay, \eg network or I/O completion (blocked $\rightarrow$ ready)
 \item
+normal completion or error, \ie segment fault (running $\rightarrow$ halted)
+\item
+scheduler assigns a thread to a resource (ready $\rightarrow$ running)
+normal completion or error, \eg segmentation fault (running $\rightarrow$ halted)
 \end{itemize}
 Key to scheduling is that a thread cannot bypass the ``ready'' state during a transition so the scheduler maintains complete control of the system.
+Key to scheduling is that a thread cannot bypass the ``ready'' state during a transition so the scheduler maintains complete control of the system, \ie no self-scheduling among threads.
 When the workload exceeds the capacity of the processors, \ie work cannot be executed immediately, it is placed on a queue for subsequent service, called a \newterm{ready queue}.
 Ready queues organize threads for scheduling, which indirectly organizes the work to be performed.
+The structure of ready queues can take many different forms.
+Where simple examples include single-queue multi-server (SQMS) and the multi-queue multi-server (MQMS).
+The structure of ready queues can take many different forms, where the basic two are the single-queue multi-server (SQMS) and the multi-queue multi-server (MQMS).
 \begin{center}
 \begin{tabular}{l|l}
 …
 \end{tabular}
 \end{center}
 Beyond these two schedulers are a host of options, \ie adding an optional global, shared queue to MQMS.
+Beyond these two schedulers are a host of options, \eg adding a global shared queue to MQMS or adding multiple private queues with distinct characteristics.
 The three major optimization criteria for a scheduler are:
+Once there are multiple resources and ready queues, a scheduler is faced with three major optimization criteria:
 \begin{enumerate}[leftmargin=*]
 \item
 …
 Essentially, all multi-processor computers have non-uniform memory access (NUMA), with one or more quantized steps to access data at different levels in the memory hierarchy.
 When a system has a large number of independently executing threads, affinity becomes difficult because of \newterm{thread churn}.
 That is, threads must be scheduled on multiple processors to obtain high processors utilization because the number of threads $\ggg$ processors.
+That is, threads must be scheduled on different processors to obtain high processor utilization because the number of threads $\ggg$ processors.
 \item
+\newterm{contention}: safe access of shared objects by multiple processors requires mutual exclusion in some form, generally locking\footnote{
+Lock-free data-structures do not involve locking but incurr similar costs to achieve mutual exclusion.}
+\noindent
+Mutual exclusion cost and latency increases significantly with the number of processors accessing a shared object.
+\newterm{contention}: safe access of shared objects by multiple processors requires mutual exclusion in some form, generally locking.\footnote{
+Lock-free data-structures do not involve locking but incur similar costs to achieve mutual exclusion.}
+Mutual exclusion cost and latency increase significantly with the number of processors access\-ing a shared object.
 \end{enumerate}
+Nevertheless, schedulers are a series of compromises, occasionally with some static or dynamic tuning parameters to enhance specific patterns.
+Scheduling is a zero-sum game as computer processors normally have a fixed, maximum number of cycles per unit time\footnote{Frequency scaling and turbot boost add a degree of complexity that can be ignored in this discussion without loss of generality.}.
+SQMS has perfect load-balancing but poor affinity and high contention by the processors, because of the single queue.
+MQMS has poor load-balancing but perfect affinity and no contention, because each processor has its own queue.
+Scheduling is a zero-sum game as computer processors normally have a fixed, maximum number of cycles per unit time.\footnote{
+Frequency scaling and turbo-boost add a degree of complexity that can be ignored in this discussion without loss of generality.}
+Hence, schedulers are a series of compromises, occasionally with some static or dynamic tuning parameters to enhance specific workload patterns.
+For example, SQMS has perfect load-balancing but poor affinity and high contention by the processors, because of the single queue.
+In contrast, MQMS has poor load-balancing but perfect affinity and no contention, because each processor has its own queue.
 Significant research effort has also looked at load sharing/stealing among queues, when a ready queue is too long or short, respectively.
+Significant research effort has looked at load balancing by stealing/sharing work units among queues: when a ready queue is too short or long, respectively, load stealing/sharing schedulers attempt to push/pull work units to/from other ready queues.
 These approaches attempt to perform better load-balancing at the cost of affinity and contention.
+Load sharing/stealing schedulers attempt to push/pull work units to/from other ready queues
+However, \emph{all} approaches come at a cost, but not all compromises are necessarily equivalent, especially across workloads.
+Hence, some schedulers perform very well for specific workloads, while others offer acceptable performance over a wider range of workloads.
+Note however that while any change comes at a cost, hence the zero-sum game, not all compromises are necessarily equivalent.
+Some schedulers can perform very well only in very specific workload scenarios, others might offer acceptable performance but be applicable to a wider range of workloads.
+Since \CFA attempts to improve the safety and productivity of C, the scheduler presented in this thesis attempts to achieve the same goals.
+\section{\CFA programming language}
+The \CFA programming language~\cite{Cforall,Moss18} extends the C programming language by adding modern safety and productivity features, while maintaining backwards compatibility.
+Among its productivity features, \CFA supports \gls{uthrding}~\cite{Delisle21} as its fundamental threading model allowing programmers to easily write modern concurrent and parallel programs.
+My previous master's thesis on concurrency in \CFA focused on features and interfaces~\cite{Delisle18}.
+This Ph.D.\ thesis focuses on performance, introducing \glsxtrshort{api} changes only when required by performance considerations.
+Specifically, this work concentrates on advanced thread and \glsxtrshort{io} scheduling.
+Prior to this work, the \CFA runtime used a strict SQMS \gls{rQ} and provided no nonblocking \glsxtrshort{io} capabilities at the user-thread level.\footnote{
+C/\CC only support \glsxtrshort{io} capabilities at the kernel level, which means many \io operations block \glspl{kthrd} reducing parallelism at the user level.}
+Since \CFA attempts to improve the safety and productivity of C, the new scheduler presented in this thesis attempts to achieve the same goals.
 More specifically, safety and productivity for scheduling means supporting a wide range of workloads so that programmers can rely on progress guarantees (safety) and more easily achieve acceptable performance (productivity).
+The new scheduler also includes support for implicit nonblocking \io, allowing applications to have more user-threads blocking on \io operations than there are \glspl{kthrd}.
+To complete the scheduler, an idle sleep mechanism is implemented that significantly reduces wasted CPU cycles, which are then available outside of the application.
+As a research project, this work builds exclusively on newer versions of the Linux operating-system and gcc/clang compilers.
+The new scheduler implementation uses several optimizations to successfully balance the cost of fairness against performance;
+some of these optimizations rely on interesting hardware optimizations only present on modern CPUs.
+The \io implementation is based on the @io_uring@ kernel-interface, a recent addition to the Linux kernel, because it purports to handle nonblocking \emph{file} and network \io.
+This decision allowed an interesting performance and fairness comparison with other threading systems using @select@, @epoll@, \etc.
+While the current \CFA release supports older versions of Linux ($\ge$~Ubuntu 16.04) and gcc/clang compilers ($\ge$~gcc 6.0), it is not the purpose of this project to find workarounds in these older systems to provide backwards compatibility.
+The hope is that these new features will soon become mainstream features.
 \section{Contributions}\label{s:Contributions}
 This work provides the following contributions in the area of user-level scheduling in an advanced programming-language runtime-system:
+This work provides the following scheduling contributions for advanced \gls{uthrding} runtime-systems:
 \begin{enumerate}[leftmargin=*]
 \item
 A scalable scheduling algorithm that offers progress guarantees.
 \item
+Support for user-level \glsxtrshort{io} capabilities based on Linux's @io_uring@.
+\item
 An algorithm for load-balancing and idle sleep of processors, including NUMA awareness.
 \item
+Support for user-level \glsxtrshort{io} capabilities based on Linux's @io_uring@.
+A mechanism for adding fairness on top of the MQMS algorithm through helping, used both for the scalable scheduling algorithm and the user-level \glsxtrshort{io}.
+\item
+An optimization of the helping-mechanism for load balancing to reduce scheduling costs.
+\item
+An optimization for the alternative relaxed-list for load balancing to reduce scheduling costs in embarrassingly parallel cases.
 \end{enumerate}

doc/theses/thierry_delisle_PhD/thesis/text/io.tex

-              r741e22c
+              r71cf630
 \chapter{User Level \io}
+\chapter{User Level \io}\label{userio}
 As mentioned in Section~\ref{prev:io}, user-level \io requires multiplexing the \io operations of many \glspl{thrd} onto fewer \glspl{proc} using asynchronous \io operations.
 Different operating systems offer various forms of asynchronous operations and, as mentioned in Chapter~\ref{intro}, this work is exclusively focused on the Linux operating-system.
 …
 Since this work fundamentally depends on operating-system support, the first step of this design is to discuss the available interfaces and pick one (or more) as the foundation for the non-blocking \io subsystem in this work.
 \subsection{\lstinline{O_NONBLOCK}}
+\subsection{\lstinline{O_NONBLOCK}}\label{ononblock}
 In Linux, files can be opened with the flag @O_NONBLOCK@~\cite{MAN:open} (or @SOCK_NONBLOCK@~\cite{MAN:accept}, the equivalent for sockets) to use the file descriptors in ``nonblocking mode''.
 In this mode, ``Neither the @open()@ nor any subsequent \io operations on the [opened file descriptor] will cause the calling process to wait''~\cite{MAN:open}.
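The nonblocking behaviour quoted above can be illustrated with a minimal C sketch. It assumes a pipe rather than a regular file, and the helper names @make_nonblocking_pipe@ and @try_read@ are hypothetical, introduced only for this illustration; only standard POSIX calls (@pipe@, @fcntl@, @read@) are used.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: create a pipe and switch both ends to nonblocking mode.
 * Returns 0 on success, -1 on failure. */
static int make_nonblocking_pipe(int fds[2]) {
    if (pipe(fds) != 0) return -1;
    for (int i = 0; i < 2; i++) {
        int flags = fcntl(fds[i], F_GETFL);
        if (flags < 0 || fcntl(fds[i], F_SETFL, flags | O_NONBLOCK) < 0) return -1;
    }
    return 0;
}

/* Hypothetical helper: attempt a read that never waits.
 * Returns the number of bytes read, or -1 with *would_block set to 1
 * when the descriptor simply has no data ready yet. */
static ssize_t try_read(int fd, void *buf, size_t len, int *would_block) {
    assert(would_block != NULL);
    ssize_t n = read(fd, buf, len);
    *would_block = (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK));
    return n;
}
```

In such a sketch, a user-level runtime would treat the would-block result as the signal to park the \at and run another one, instead of letting the \gls{kthrd} wait in the kernel.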
 …
 In the worst case, where all \glspl{thrd} are consistently blocking on \io, it devolves into 1-to-1 threading.
 However, regardless of the frequency of \io operations, it achieves the fundamental goal of not blocking \glspl{proc} when \glspl{thrd} are ready to run.
 This approach is used by languages like Go\cit{Go}, frameworks like libuv\cit{libuv}, and web servers like Apache~\cite{apache} and Nginx~\cite{nginx}, since it has the advantage that it can easily be used across multiple operating systems.
+This approach is used by languages like Go\cite{GITHUB:go}, frameworks like libuv\cite{libuv}, and web servers like Apache~\cite{apache} and Nginx~\cite{nginx}, since it has the advantage that it can easily be used across multiple operating systems.
 This advantage is especially relevant for languages like Go, which offer a homogeneous \glsxtrshort{api} across all platforms.
 In contrast, C has a very limited standard \glsxtrshort{api} for \io, \eg the C standard library has no networking.
 …
 These options effectively fall into two broad camps: waiting for \io to be ready versus waiting for \io to complete.
 All operating systems that support asynchronous \io must offer an interface along one of these lines, but the details vary drastically.
 For example, Free BSD offers @kqueue@~\cite{MAN:bsd/kqueue}, which behaves similarly to @epoll@, but with some small quality of use improvements, while Windows (Win32)~\cit{https://docs.microsoft.com/en-us/windows/win32/fileio/synchronous-and-asynchronous-i-o} offers ``overlapped I/O'', which handles submissions similarly to @O_NONBLOCK@ with extra flags on the synchronous system call, but waits for completion events, similarly to @io_uring@.
+For example, FreeBSD offers @kqueue@~\cite{MAN:bsd/kqueue}, which behaves similarly to @epoll@, but with some small quality of use improvements, while Windows (Win32)~\cite{win:overlap} offers ``overlapped I/O'', which handles submissions similarly to @O_NONBLOCK@ with extra flags on the synchronous system call, but waits for completion events, similarly to @io_uring@.
 For this project, I selected @io_uring@, in large part because of its generality.

doc/theses/thierry_delisle_PhD/thesis/text/practice.tex

-              r741e22c
+              r71cf630
 To achieve this goal requires each reader to have its own memory to mark as locked and unlocked.
 The read acquire possibly waits for a writer to finish the critical section and then acquires a reader's local spinlock.
 The write acquire acquires the global lock, guaranteeing mutual exclusion among writers, and then acquires each of the local reader locks.
+The write acquires the global lock, guaranteeing mutual exclusion among writers, and then acquires each of the local reader locks.
 Acquiring all the local read locks guarantees mutual exclusion among the readers and the writer, while the wait on the read side prevents readers from continuously starving the writer.
 Figure~\ref{f:SpecializedReadersWriterLock} shows the outline for this specialized readers-writer lock.
 The lock is nonblocking, so both readers and writers spin while the lock is held.
+\todo{finish explanation}
+This very wide sharding strategy means that readers have very good locality, since they only ever need to access two memory locations.
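The acquire and release steps described above can be sketched in C11 atomics as follows. This is only an illustration under stated assumptions, not the \CFA implementation: the type @sharded_rwlock@, the fixed @MAX_READERS@ slot count, the 64-byte alignment, and all function names are hypothetical, and default sequentially-consistent atomics are used for simplicity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_READERS 8 /* hypothetical fixed number of reader slots */

typedef struct {
    atomic_bool write_lock; /* global lock: serializes writers */
    /* one spinlock per reader, padded (assumed 64-byte lines) to avoid false sharing */
    struct { _Alignas(64) atomic_bool held; } reader[MAX_READERS];
} sharded_rwlock;

static void rw_init(sharded_rwlock *l) {
    atomic_init(&l->write_lock, false);
    for (int i = 0; i < MAX_READERS; i++) atomic_init(&l->reader[i].held, false);
}

static void read_lock(sharded_rwlock *l, int id) {
    assert(id >= 0 && id < MAX_READERS);
    for (;;) {
        /* wait for a writer to finish, so readers cannot starve it */
        while (atomic_load(&l->write_lock)) { }
        /* then take this reader's own spinlock; retry if a writer holds it */
        if (!atomic_exchange(&l->reader[id].held, true)) return;
    }
}

static void read_unlock(sharded_rwlock *l, int id) {
    atomic_store(&l->reader[id].held, false);
}

static void write_lock(sharded_rwlock *l) {
    /* mutual exclusion among writers */
    while (atomic_exchange(&l->write_lock, true)) { }
    /* then exclude every reader by taking each local lock */
    for (int i = 0; i < MAX_READERS; i++)
        while (atomic_exchange(&l->reader[i].held, true)) { }
}

static void write_unlock(sharded_rwlock *l) {
    for (int i = 0; i < MAX_READERS; i++) atomic_store(&l->reader[i].held, false);
    atomic_store(&l->write_lock, false);
}
```

On the read side, only @write_lock@ and the reader's own slot are touched, which is the two-location access pattern noted above; the writer pays for this by visiting every slot.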
 \begin{figure}
 …
 However, this third challenge is outside the scope of this thesis because developing a general heuristic is complex enough to justify its own work.
 Therefore, the \CFA scheduler simply follows the ``Race-to-Idle''~\cite{Albers12} approach where a sleeping \proc is woken any time a \at becomes ready and \procs go to idle sleep anytime they run out of work.
 An interesting sub-part of this heuristic is what to do with bursts of \ats that become ready.
 Since waking up a sleeping \proc can have notable latency, it is possible multiple \ats become ready while a single \proc is waking up.
 …
 \subsection{Event FDs}
 Another interesting approach is to use an event file descriptor\cit{eventfd}.
+Another interesting approach is to use an event file descriptor\cite{eventfd}.
 This Linux feature is a file descriptor that behaves like \io, \ie, uses @read@ and @write@, but also behaves like a semaphore.
 Indeed, all reads and writes must use word-sized values, \ie 64 or 32 bits.
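The semaphore-like behaviour of an event @fd@ can be sketched in C as follows. The helper names @efd_post@ and @efd_wait@ are hypothetical; the sketch assumes the descriptor was created with the @EFD_SEMAPHORE@ flag and uses the 8-byte @uint64_t@ counter value documented for @eventfd(2)@.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical helper: add n to the counter, like n V operations on a semaphore.
 * Returns 0 on success, -1 on failure. */
static int efd_post(int fd, uint64_t n) {
    assert(fd >= 0);
    return write(fd, &n, sizeof n) == (ssize_t)sizeof n ? 0 : -1;
}

/* Hypothetical helper: like a P operation. With EFD_SEMAPHORE, read() blocks
 * while the counter is zero, then decrements it by one and yields the value 1.
 * Returns 0 on success, -1 on failure. */
static int efd_wait(int fd) {
    assert(fd >= 0);
    uint64_t v;
    if (read(fd, &v, sizeof v) != (ssize_t)sizeof v) return -1;
    return v == 1 ? 0 : -1;
}
```

For instance, after @int fd = eventfd(0, EFD_SEMAPHORE);@ and @efd_post(fd, 2)@, two calls to @efd_wait(fd)@ return without blocking, and a third would block until another post.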
 …
 \end{figure}
 The next optimization is to avoid the latency of the event @fd@, which can be done by adding what is effectively a binary benaphore\cit{benaphore} in front of the event @fd@.
+The next optimization is to avoid the latency of the event @fd@, which can be done by adding what is effectively a binary benaphore\cite{schillings1996engineering} in front of the event @fd@.
 The benaphore over the event @fd@ logically provides a three state flag to avoid unnecessary system calls, where the states are expressed explicitly in Figure~\ref{fig:idle:state}.
 A \proc begins its idle sleep by adding itself to the idle list before searching for an \at.

doc/theses/thierry_delisle_PhD/thesis/text/runtime.tex

-              r741e22c
+              r71cf630
 \chapter{\CFA Runtime}
+\chapter{\CFA Runtime}\label{cfaruntime}
 This chapter presents an overview of the capabilities of the \CFA runtime prior to this thesis work.
 …
 Only UNIX @man@ pages identify whether or not a library function is thread safe, and hence, may block on a pthreads lock or system call; hence interoperability with UNIX library functions is a challenge for an M:N threading model.
 Languages like Go and Java, which have strict interoperability with C\cit{JNI, GoLang with C}, can control operations in C by ``sandboxing'' them, \eg a blocking function may be delegated to a \gls{kthrd}. Sandboxing may help towards guaranteeing that the kind of deadlock mentioned above does not occur.
+Languages like Go and Java, which have strict interoperability with C\cite{wiki:jni,go:cgo}, can control operations in C by ``sandboxing'' them, \eg a blocking function may be delegated to a \gls{kthrd}. Sandboxing may help towards guaranteeing that the kind of deadlock mentioned above does not occur.
 As mentioned in Section~\ref{intro}, \CFA is binary compatible with C and, as such, must support all C library functions. Furthermore, interoperability can happen at the function-call level, inline code, or C and \CFA translation units linked together. This fine-grained interoperability between C and \CFA has two consequences:
