Index: doc/theses/thierry_delisle_PhD/thesis/text_back/core.tex
===================================================================
--- doc/theses/thierry_delisle_PhD/thesis/text_back/core.tex	(revision aaa1c4ccd9df86bf8fb15ffd83be94da071dc8dd)
+++ doc/theses/thierry_delisle_PhD/thesis/text_back/core.tex	(revision aaa1c4ccd9df86bf8fb15ffd83be94da071dc8dd)
@@ -0,0 +1,130 @@
+\chapter{Scheduling Core}\label{core}
+
+Before discussing scheduling in general, where it is important to address systems that are changing state, this document discusses scheduling in a somewhat ideal scenario, where the system has reached a steady state. For this purpose, a steady state is loosely defined as a state where there are always \glspl{thrd} ready to run and the system has the resources necessary to accomplish the work, \eg, enough workers. In short, the system is neither overloaded nor underloaded.
+
+I believe it is important to discuss the steady state first because it is the easiest case to handle and, relatedly, the case in which the best performance is to be expected. Indeed, when the system is either overloaded or underloaded, a common approach is to adapt the system to the new load and return to the steady state, \eg, by adding or removing workers. Flaws in the scheduling at the steady state can therefore be pervasive in all states.
+
+\section{Design Goals}
+As with most of the design decisions behind \CFA, an important goal is to match the expectations of programmers, according to their probable mental model. To match these expectations, the design must offer programmers sufficient guarantees so that, as long as they respect the mental model, the system also respects this model.
+
+For threading, a simple and common mental model is the ``Ideal multi-tasking CPU'':
+
+\begin{displayquote}[Linux CFS\cit{https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt}]
+	{[The]} ``Ideal multi-tasking CPU'' is a (non-existent  :-)) CPU that has 100\% physical power and which can run each task at precise equal speed, in parallel, each at [an equal fraction of the] speed.  For example: if there are 2 tasks running, then it runs each at 50\% physical power --- i.e., actually in parallel.
+\end{displayquote}
+
+Applied to threads, this model states that every ready \gls{thrd} immediately runs in parallel with all other ready \glspl{thrd}. While a strict implementation of this model is not feasible, programmers still have expectations about scheduling that come from this model.
+
+In general, the expectation at the center of this model is that ready \glspl{thrd} do not interfere with each other but simply share the hardware. This makes it easier to reason about threading because ready \glspl{thrd} can be taken in isolation and the effect of the scheduler can be virtually ignored. This expectation of \gls{thrd} independence means the scheduler is expected to offer two guarantees:
+\begin{enumerate}
+	\item A fairness guarantee: a \gls{thrd} that is ready to run is not prevented from doing so by another \gls{thrd}.
+	\item A performance guarantee: a \gls{thrd} that wants to start or stop running is not slowed down by other \glspl{thrd} wanting to do the same.
+\end{enumerate}
+
+It is important to note that these guarantees are expected only up to a point. \Glspl{thrd} that are ready to run should not be prevented from doing so, but they still need to share a limited amount of hardware. Therefore, the guarantee is considered respected if a \gls{thrd} gets access to a \emph{fair share} of the hardware, even if that share is very small.
+
+Similarly, the performance guarantee, the lack of interference between \glspl{thrd}, is only relevant up to a point. Ideally, the cost of running and blocking would be constant regardless of contention, but the guarantee is considered satisfied if the cost is not \emph{too high} with or without contention. What constitutes an acceptable cost is obviously highly variable. For this document, the performance experimentation attempts to show that the cost of scheduling is at worst equivalent to that of existing algorithms used in popular languages. This demonstration can be made by comparing applications built in \CFA to applications built with other languages or other models. Recall from a few paragraphs ago that the expectation of programmers is that the impact of the scheduler can be ignored. Therefore, if the cost of scheduling is equivalent to or lower than that of other popular languages, I consider the guarantee achieved.
+
+More precisely, the scheduler should be:
+\begin{itemize}
+	\item As fast as other schedulers that are less fair.
+	\item Faster than other schedulers that have equal or better fairness.
+\end{itemize}
+
+\subsection{Fairness vs Scheduler Locality}
+An important performance factor in modern architectures is cache locality. Waiting for data not present in the cache can have a major impact on performance, and having multiple \glspl{hthrd} writing to the same cache lines leads to repeated invalidations and stalls. It is therefore preferable to divide the data among the \glspl{hthrd}\footnote{This can be an explicit division up front or using data structures where different \glspl{hthrd} are naturally routed to different cache lines.}.
+
+For a scheduler, having good locality\footnote{This section discusses \emph{internal} locality, \ie, the locality of the data used by the scheduler. \emph{External locality}, \ie, how the data used by the application is affected by scheduling, is a much more complicated subject and is discussed in the evaluation chapters.}, \ie, having the data be local to each \gls{hthrd}, generally conflicts with fairness. Indeed, good locality often requires avoiding the movement of cache lines, while fairness requires dynamically moving \glspl{thrd}, and as a consequence cache lines, to \glspl{hthrd} that are currently more appropriate.
+
+However, I claim that in practice it is possible to strike a balance between fairness and performance because the needs for the two do not necessarily overlap temporally. Figure~\ref{fig:fair} shows a visual representation of this behaviour. As mentioned, a little unfairness can be acceptable; therefore, it can be desirable to have an algorithm that prioritizes cache locality as long as no \gls{thrd} is left behind for too long.
+
+\begin{figure}
+	\begin{center}
+		\input{fairness.pstex_t}
+	\end{center}
+	\caption{Fairness vs Locality}
+	\label{fig:fair}
+	Rule of thumb graph: importance of Fairness and Locality while a ready \gls{thrd} waits to run.
+	As the time a ready \gls{thrd} waits increases, ``Ready Time'', the chance that its data is still in cache decreases. At the same time, the need for fairness increases since other \glspl{thrd} may have had the chance to run many times, breaking the fairness model mentioned above. Since the actual values and curves of this graph can be highly variable, the graph is left intentionally fuzzy and inaccurate.
+\end{figure}
+
+\section{Design}
+A naive strictly \glsxtrshort{fifo} ready-queue does not offer sufficient performance. As shown in the evaluation sections, most production schedulers scale when adding multiple \glspl{hthrd}, and that is not possible with a single point of contention. Therefore, it is vital to shard the ready-queue so that multiple \glspl{hthrd} can access it without performance degradation.
+
+\subsection{Sharding} \label{sec:sharding}
+An interesting approach to sharding a queue is presented in \cit{Trevors paper}. This algorithm represents a queue with a relaxed \glsxtrshort{fifo} guarantee using an array of strictly \glsxtrshort{fifo} sublists, as shown in Figure~\ref{fig:base}. Each cell of the array contains a linked list with a lock, and each node in these lists is marked with a timestamp indicating when it was added to the list. Push operations are done by picking a random cell and attempting to push to its list. If the cell is already locked, the operation is simply retried on a new cell until a lock is acquired. Pop operations are done in a similar fashion, except two random cells are picked. If both cells are unlocked and contain non-empty lists, the operation pops the node with the oldest timestamp. If only one of the cells is unlocked and non-empty, the operation pops from that cell. If both cells are either locked or empty, the operation picks two new cells and tries again.
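These push/pop mechanics can be sketched as follows in C (a deliberately simplified, illustrative sketch: all names are invented, timestamps are supplied by the caller, and concurrency details such as memory reclamation and optimistic timestamp reads are elided; the actual \CFA implementation differs):

```c
#include <stdlib.h>
#include <stdatomic.h>
#include <stddef.h>

#define NCELLS 8

typedef struct node {
	struct node * next;
	unsigned long timestamp;          // set when the node is pushed
} node;

typedef struct {
	atomic_flag lock;                 // try-lock protecting this sublist
	node * head, * tail;              // strictly FIFO sublist
} cell;

static cell ready[NCELLS];

static int  trylock(cell * c) { return !atomic_flag_test_and_set(&c->lock); }
static void unlock (cell * c) { atomic_flag_clear(&c->lock); }

static void push(node * n, unsigned long now) {
	for (;;) {
		cell * c = &ready[rand() % NCELLS];        // pick a random cell
		if (!trylock(c)) continue;                 // locked: retry elsewhere
		n->timestamp = now;
		n->next = NULL;
		if (c->tail) c->tail->next = n; else c->head = n;
		c->tail = n;
		unlock(c);
		return;
	}
}

static node * take(cell * c) {                     // caller holds c's lock, list non-empty
	node * n = c->head;
	c->head = n->next;
	if (!c->head) c->tail = NULL;
	unlock(c);
	return n;
}

static node * pop(void) {
	for (;;) {
		cell * a = &ready[rand() % NCELLS];        // pick two random cells
		cell * b = &ready[rand() % NCELLS];
		int la = trylock(a);
		int lb = (b != a) && trylock(b);
		int ua = la && a->head != NULL;            // a usable?
		int ub = lb && b->head != NULL;            // b usable?
		if (ua && ub) {                            // both usable: oldest timestamp wins
			cell * oldest = a->head->timestamp <= b->head->timestamp ? a : b;
			unlock(oldest == a ? b : a);
			return take(oldest);
		}
		if (ua) { if (lb) unlock(b); return take(a); }
		if (ub) { if (la) unlock(a); return take(b); }
		if (la) unlock(a);                         // both locked or empty: try again
		if (lb) unlock(b);
	}
}
```

Note how a pop that loses the try-lock race simply re-randomizes rather than waiting, which is what keeps the single-cell locks from becoming a central point of contention.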
+
+\begin{figure}
+	\begin{center}
+		\input{base.pstex_t}
+	\end{center}
+	\caption{Relaxed FIFO list}
+	\label{fig:base}
+	List at the base of the scheduler: an array of strictly FIFO lists.
+	Timestamps are present in all nodes and array cells.
+\end{figure}
+
+\subsection{Finding threads}
+Once \glspl{thrd} have been distributed onto multiple queues, identifying which queues are empty and which are not can become a problem. Indeed, if the number of \glspl{thrd} does not far exceed the number of queues, it is probable that several of these queues are empty. Figure~\ref{fig:empty} shows an example with 2 \glspl{thrd} running on 8 queues, where the chance of picking an empty queue is 75\% per pick, meaning two random picks yield a \gls{thrd} less than half the time ($1 - 0.75^2 = 43.75\%$).
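The arithmetic behind this example can be checked by enumerating every ordered pair of picks (a small C sketch; the helper name is illustrative):

```c
// Count ordered pairs of picks over nq queues that find at least one
// non-empty queue, assuming the first `occupied` queues hold threads.
static int hits(int nq, int occupied) {
	int h = 0;
	for (int a = 0; a < nq; a++)
		for (int b = 0; b < nq; b++)
			if (a < occupied || b < occupied) h++;
	return h;
}
```

With 2 occupied queues out of 8, 28 of the 64 possible pick pairs succeed, \ie 43.75\%.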
+
+
+\begin{figure}
+	\begin{center}
+		\input{empty.pstex_t}
+	\end{center}
+	\caption{``More empty'' Relaxed FIFO list}
+	\label{fig:empty}
+	Emptier state of the queue: the array contains many empty cells, \ie, strictly FIFO lists containing no elements.
+\end{figure}
+
+This can lead to performance problems since picks that do not yield a \gls{thrd} are not useful and do not necessarily help make more informed guesses.
+
+Solutions to this problem can take many forms, but they ultimately all have to encode where the \glspl{thrd} are in some form. My results show that the density and locality of this encoding are generally the dominating factors in these schemes. Classic solutions use one of three techniques to encode the information:
+
+\begin{figure}
+	\begin{center}
+		{\resizebox{0.73\textwidth}{!}{\input{emptybit.pstex_t}}}
+	\end{center}
+	\vspace*{-5pt}
+	\caption{Underloaded queue with added bitmask to indicate which array cells have items.}
+	\label{fig:emptybit}
+	\begin{center}
+		{\resizebox{0.73\textwidth}{!}{\input{emptytree.pstex_t}}}
+	\end{center}
+	\vspace*{-5pt}
+	\caption{Underloaded queue with added binary search tree to indicate which array cells have items.}
+	\label{fig:emptytree}
+	\begin{center}
+		{\resizebox{0.9\textwidth}{!}{\input{emptytls.pstex_t}}}
+	\end{center}
+	\vspace*{-5pt}
+	\caption{Underloaded queue with added per processor bitmask to indicate which array cells have items.}
+	\label{fig:emptytls}
+\end{figure}
+
+\paragraph{Dense Information} Figure~\ref{fig:emptybit} shows a dense bitmask to identify which inner queues are currently in use. This approach means processors can often find \glspl{thrd} in constant time, regardless of how many underlying queues are empty. Furthermore, modern x86 CPUs have extended bit manipulation instructions (BMI2) that allow using the bitmask with very little overhead compared to the randomized selection approach for a filled ready queue, offering good performance even in cases with many empty inner queues. However, this technique has its limits: with a single word\footnote{Word refers here to however many bits can be written atomically.} bitmask, the total number of underlying queues in the ready queue is limited to the number of bits in the word. With a multi-word bitmask, this maximum limit can be increased arbitrarily, but the look-up is no longer constant time. Finally, a dense bitmap, either single- or multi-word, causes additional contention problems that reduce performance because of cache misses after updates. This central update bottleneck also means the information in the bitmask is more often stale before a processor can use it to find an item, \ie, a mask read says there are available \glspl{thrd} but none are found on the queues.
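As an illustration, locating the $r$-th set bit of a one-word bitmask can be sketched as follows (the function name is invented; the comment notes the BMI2 shortcut alluded to above):

```c
#include <stdint.h>

// Return the index of the r-th set bit in mask (r is 0-based; the caller
// guarantees mask has more than r set bits). On BMI2 hardware, the loop
// collapses to a single _pdep_u64 followed by a trailing-zero count.
static int select_nonempty(uint64_t mask, unsigned r) {
	for (;;) {
		int i = __builtin_ctzll(mask);   // index of lowest set bit
		if (r-- == 0) return i;
		mask &= mask - 1;                // clear lowest set bit
	}
}
```

Picking $r$ at random then spreads the look-ups across the non-empty queues, mimicking the randomized selection of the base algorithm while skipping empty cells.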
+
+\paragraph{Sparse Information} Figure~\ref{fig:emptytree} shows an approach using a hierarchical tree data structure to reduce contention, which has been shown to work in similar cases~\cite{ellen2007snzi}. However, this approach may lead to poorer performance due to the inherent cost of pointer chasing, while still allowing contention on the nodes of the tree if the tree is not deep enough.
+
+\paragraph{Local Information} Figure~\ref{fig:emptytls} shows an approach using dense information, similar to the bitmap, but where each processor keeps its own independent copy. While this approach can offer good scalability \emph{and} low latency, the freshness and discovery of the information can become a problem. This case is made worse in systems with few processors, where even blind random picks can find \glspl{thrd} in a few tries.
+
+I built prototypes of these approaches, and none of these techniques offers satisfying performance when few \glspl{thrd} are present. All of them hit the same two problems. First, blindly picking two sub-queues is very fast, which means any improvement to the hit rate can easily be countered by a slow-down in look-up speed. Second, the array is already as sharded as needed to avoid contention bottlenecks, so any denser data structure tends to become a bottleneck. In all cases, these factors meant that the best-case scenario, many \glspl{thrd}, got worse throughput, while the worst-case scenario, few \glspl{thrd}, got a better hit rate but equivalent throughput. As a result, I tried an entirely different approach.
+
+\subsection{Dynamic Entropy}\cit{https://xkcd.com/2318/}
+In the worst-case scenario there are few \glspl{thrd} ready to run, or more accurately, given $P$ \glspl{proc}, $T$ \glspl{thrd} and $\epsilon$, as usual, a very small number, in this case $\epsilon \ll P$, we have $T = P + \epsilon$. An important thing to note is that in this case, fairness is effectively irrelevant. Indeed, this case is close to \emph{actually matching} the model of the ``Ideal multi-tasking CPU'' presented in this chapter\footnote{For simplicity, this assumes there is a one-to-one match between \glspl{proc} and \glspl{hthrd}.}. Therefore, in this context it is possible to use a purely internal-locality-based approach and still meet the fairness requirements. This approach simply has each \gls{proc} run a single \gls{thrd} repeatedly. Or, from the shared ready-queue viewpoint, each \gls{proc} pushes to a given sub-queue and then pops from the \emph{same} sub-queue. Ideally, the scheduler achieves this without affecting the fairness guarantees in cases where $T \gg P$.
+
+To achieve this, I use a communication channel I have not mentioned yet and which I believe I use in a novel way: the \glsxtrshort{prng}. If the scheduler has a \glsxtrshort{prng} instance per \gls{proc} exclusively used for scheduling, its seed effectively encodes a list of all the accessed sub-queues, from the latest to the oldest. The only requirement is to be able to ``replay'' the \glsxtrshort{prng} backwards. As it turns out, this is an entirely reasonable requirement and there already exist \glsxtrshort{prng}s that are fast, compact \emph{and} can be run forwards and backwards. Linear congruential generators~\cite{wiki:lcg} are an example of \glsxtrshort{prng}s that match these requirements.
+
+The algorithm works as follows:
+\begin{itemize}
+	\item Each \gls{proc} has two \glsxtrshort{prng} instances, $F$ and $B$.
+	\item Push and Pop operations happen as mentioned in Section~\ref{sec:sharding} with the following exceptions:
+	\begin{itemize}
+		\item Push operations use $F$ going forward on each try and on success $F$ is copied into $B$.
+		\item Pop operations use $B$ going backwards on each try.
+	\end{itemize}
+\end{itemize}
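A reversible \glsxtrshort{prng} of the required kind is easy to sketch in C. The constants below are Knuth's MMIX LCG parameters, an illustrative choice (any full-period LCG modulo $2^{64}$ works); the multiplicative inverse of the multiplier modulo $2^{64}$ is obtained by Newton iteration:

```c
#include <stdint.h>

// Knuth's MMIX LCG constants (an illustrative choice of full-period LCG).
static const uint64_t A = 6364136223846793005ULL;
static const uint64_t C = 1442695040888963407ULL;

static uint64_t lcg_forward(uint64_t x) { return A * x + C; }   // used by F on push

// Multiplicative inverse of A modulo 2^64: each Newton step doubles the
// number of correct low-order bits, so five steps suffice for 64 bits.
static uint64_t inverse_A(void) {
	uint64_t inv = A;                      // correct to 3 bits for odd A
	for (int i = 0; i < 5; i++) inv *= 2 - A * inv;
	return inv;
}

static uint64_t lcg_backward(uint64_t x) { return inverse_A() * (x - C); } // used by B on pop
```

Running \texttt{lcg\_backward} repeatedly on a seed visits exactly the cell indices that \texttt{lcg\_forward} produced, in reverse order, which is what lets a \gls{proc} revisit its most recently used sub-queues first.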
+
+The main benefit of this technique is that it essentially respects the desired properties of Figure~\ref{fig:fair}. When looking for work, a \gls{proc} first looks at the last cells it pushed to, if any, and then moves backwards through the cells. As the \gls{proc} continues looking for work, $B$ moves backwards while $F$ stays in place. As a result, the relation between the two becomes weaker, which means that the probabilistic fairness of the algorithm reverts to normal. Chapter~\ref{proofs} discusses the fairness guarantees of this algorithm more formally.
+
+\section{Details}
Index: doc/theses/thierry_delisle_PhD/thesis/text_back/existing.tex
===================================================================
--- doc/theses/thierry_delisle_PhD/thesis/text_back/existing.tex	(revision aaa1c4ccd9df86bf8fb15ffd83be94da071dc8dd)
+++ doc/theses/thierry_delisle_PhD/thesis/text_back/existing.tex	(revision aaa1c4ccd9df86bf8fb15ffd83be94da071dc8dd)
@@ -0,0 +1,137 @@
+\chapter{Previous Work}\label{existing}
+Scheduling is a topic with a very long history, predating its use in computer science. As such, early work in computer science was inspired by other fields and focused principally on solving scheduling upfront rather than while the system is running.
+
+\section{Naming Convention}
+Scheduling has been studied by various communities concentrating on different incarnations of the same problems. As a result, there is no real naming convention for scheduling that is respected across these communities. For this document, I use the term \newterm{task} to refer to the abstract objects being scheduled and the term \newterm{worker} to refer to the objects that execute these tasks.
+
+\section{Static Scheduling}
+Static schedulers require that programmers explicitly and exhaustively specify dependencies among tasks in order to schedule them. The scheduler then processes this input ahead of time and produces a \newterm{schedule} to which the system can later adhere.
+
+In general, static schedulers are less relevant to this project since they require input from programmers that \CFA does not have as part of its concurrency semantics.
+\todo{Rate-monotonic scheduling}
+
+
+\section{Dynamic Scheduling}
+It may be difficult to fulfill the requirements of a static scheduler if dependencies are conditional. In this case, it may be preferable to detect dependencies at runtime. This detection effectively takes the form of halting or suspending a task with unfulfilled dependencies and adding one or more new tasks to the system. The new tasks have the responsibility of adding the dependent task back into the system once completed. As a consequence, the scheduler may have an incomplete view of the system, seeing only tasks with no pending dependencies. Schedulers that support this detection at runtime are referred to as \newterm{Dynamic Schedulers}.
+
+\subsection{Explicitly Informed Dynamic Schedulers}
+While dynamic schedulers do not have access to an exhaustive list of dependencies for a task, they may require programmers to provide more or less information about each task, including, for example: expected duration, required resources, relative importance, etc. The scheduler can then use this information to direct the scheduling decisions. \cit{Examples of schedulers with more information} Providing this information precisely can be difficult for programmers, especially for \emph{predicted} behaviour, and the scheduler may need to support some amount of imprecision in the provided information, \eg, specifying that a task takes approximately 5 seconds to complete rather than exactly 5 seconds. User-provided information can also become a significant burden depending on how the effort to provide it scales with the number of tasks and their complexity. For example, providing an exhaustive list of files read by 5 tasks is an easier requirement than providing an exhaustive list of memory addresses accessed by 10,000 distinct tasks.
+
+Since the goal of this thesis is to provide a scheduler as a replacement for \CFA's existing \emph{uninformed} scheduler, explicitly informed schedulers are less relevant to this project. Nevertheless, some strategies are worth mentioning.
+
+\subsubsection{Priority Scheduling}
+A common piece of information schedulers use to direct their algorithm is priorities. Each task is given a priority, and higher-priority tasks are preferred to lower-priority ones. The simplest priority scheduling algorithm is to require that every task have a distinct pre-established priority and to always run the available task with the highest priority. Asking programmers to provide an exhaustive set of unique priorities can be prohibitive when the system has a large number of tasks. It can therefore be desirable for schedulers to support tasks with identical priorities and/or to automatically set and adjust priorities for tasks.
+
+\subsection{Uninformed and Self-Informed Dynamic Schedulers}
+Several scheduling algorithms do not require programmers to provide additional information on each task, and instead make scheduling decisions based solely on internal state and/or information implicitly gathered by the scheduler.
+
+
+\subsubsection{Feedback Scheduling}
+As mentioned, schedulers may also gather information about each task to direct their decisions. This design effectively moves the scheduler, to some extent, into the realm of \newterm{Control Theory}~\cite{wiki:controltheory}. This gathering does not generally involve programmers, and as such does not increase programmer burden the way explicitly provided information may. However, some feedback schedulers do offer programmers the option to provide additional information on certain tasks, in order to direct scheduling decisions. The important distinction is whether or not the scheduler can function without this additional information.
+
+
+
+\section{Work Stealing}
+One of the most popular scheduling algorithms in practice (see~\ref{existing:prod}) is work stealing. This idea, introduced by \cite{DBLP:conf/fpca/BurtonS81}, effectively has each worker work on its local tasks first, but allows the possibility for other workers to steal local tasks if they run out of work. \cite{DBLP:conf/focs/Blumofe94} introduced the more familiar incarnation, where each worker has a queue of tasks to accomplish and workers without tasks steal tasks from random workers. (The Burton and Sleep algorithm had trees of tasks and stole only among neighbours.) Blumofe and Leiserson also prove worst-case space and time requirements for well-structured computations.
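The familiar incarnation can be sketched as follows (a deliberately simplified, sequential C sketch with invented names; real implementations use concurrent deques such as the Chase-Lev deque rather than plain counters):

```c
#include <stdlib.h>

#define NWORKERS 4
#define CAP 256

typedef struct {
	int tasks[CAP];            // ring of task ids
	int top, bottom;           // top: steal end (oldest), bottom: local end (newest)
} worker;

static worker workers[NWORKERS];

static void push_local(worker * w, int task) {  // owner adds work at the bottom
	w->tasks[w->bottom++ % CAP] = task;
}

// Owner takes from its own bottom (LIFO, good locality); when its queue is
// empty, it steals the oldest task from the top of a random victim.
static int next_task(int self) {
	worker * w = &workers[self];
	if (w->bottom > w->top) return w->tasks[--w->bottom % CAP];
	for (int tries = 0; tries < 64 * NWORKERS; tries++) {
		worker * v = &workers[rand() % NWORKERS];
		if (v != w && v->bottom > v->top)
			return v->tasks[v->top++ % CAP];   // steal the oldest task
	}
	return -1;                 // found no work anywhere
}
```

The asymmetry is the point of the design: owners touch only the bottom of their own queue while thieves touch only the top, so in the common case the two do not contend.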
+
+Many variations of this algorithm have been proposed over the years~\cite{DBLP:journals/ijpp/YangH18}, both optimizations of existing implementations and approaches that account for new metrics.
+
+\paragraph{Granularity} A significant portion of early work-stealing research concentrated on \newterm{Implicit Parallelism}~\cite{wiki:implicitpar}. Since the system is responsible for splitting the work, granularity is a challenge that cannot be left to programmers (as opposed to \newterm{Explicit Parallelism}~\cite{wiki:explicitpar}, where the burden can be left to programmers). In general, fine granularity is better for load balancing, while coarse granularity reduces communication overhead. The best performance generally means finding a middle ground between the two. Several methods can be employed, but I believe these are less relevant for threads, which are generally explicit and more coarse-grained.
+
+\paragraph{Task Placement} Since modern computers rely heavily on cache hierarchies\cit{Do I need a citation for this}, migrating tasks from one core to another can be costly.  \cite{DBLP:journals/tpds/SquillanteL93}
+
+\todo{The survey is not great on this subject}
+
+\paragraph{Complex Machine Architecture} Another aspect that has been studied is how well work stealing applies to different machine architectures.
+
+\subsection{Theoretical Results}
+There is also a large body of research on the theoretical aspects of work stealing. These works evaluate, for example, the cost of migration~\cite{DBLP:conf/sigmetrics/SquillanteN91,DBLP:journals/pe/EagerLZ86}, how affinity affects performance~\cite{DBLP:journals/tpds/SquillanteL93,DBLP:journals/mst/AcarBB02,DBLP:journals/ipl/SuksompongLS16} and theoretical models for heterogeneous systems~\cite{DBLP:journals/jpdc/MirchandaneyTS90,DBLP:journals/mst/BenderR02,DBLP:conf/sigmetrics/GastG10}. \cite{DBLP:journals/jacm/BlellochGM99} examines the space bounds of work stealing and \cite{DBLP:journals/siamcomp/BerenbrinkFG03} shows that for underloaded systems, the scheduler completes computations in finite time, \ie is \newterm{stable}. Others show that work stealing is applicable to various scheduling contexts~\cite{DBLP:journals/mst/AroraBP01,DBLP:journals/anor/TchiboukdjianGT13,DBLP:conf/isaac/TchiboukdjianGTRB10,DBLP:conf/ppopp/AgrawalLS10,DBLP:conf/spaa/AgrawalFLSSU14}. \cite{DBLP:conf/ipps/ColeR13} also studies how randomized work stealing affects false sharing among tasks.
+
+However, as \cite{DBLP:journals/ijpp/YangH18} highlights, this theoretical research has mainly focused on ``fully-strict'' computations, \ie workloads that can be fully represented as a Directed Acyclic Graph. It is unclear how well such computations represent workloads in real-world scenarios.
+
+\section{Preemption}
+One last aspect of scheduling worth mentioning is preemption, since many schedulers rely on it for some of their guarantees. Preemption is the idea of interrupting tasks that have been running for too long, effectively injecting suspend points into the application. There are multiple techniques to achieve this, but they all aim to guarantee that suspend points in a task are never further apart than some fixed duration. While this helps schedulers guarantee that no task unfairly monopolizes a worker, preemption can effectively be added to any scheduler. Therefore, the only interesting aspect of preemption for the design of scheduling is whether or not to require it.
+
+\section{Schedulers in Production}\label{existing:prod}
+This section gives a quick overview of several schedulers that are generally available at the time of writing. While these schedulers do not necessarily represent the most recent advances in scheduling, they are what is generally accessible to programmers. As such, I believe these schedulers are at least as relevant as those presented in published work. I chose schedulers that operate both in kernel space and in user space, as both can offer relevant insight for this project. However, I did not list any schedulers aimed at real-time applications, as these have constraints that are much stricter than what is needed for this project.
+
+\subsection{Operating System Schedulers}
+Operating system schedulers tend to be fairly complex: they generally support some amount of real-time scheduling, aim to balance interactive and non-interactive tasks, and support multiple users sharing hardware without requiring these users to cooperate. Here are more details on a few schedulers used in common operating systems: Linux, FreeBSD, Microsoft Windows and Apple's OS X. The information is less complete for closed-source operating systems.
+
+\paragraph{Linux's CFS}
+The default scheduler used by Linux, the Completely Fair Scheduler~\cite{MAN:linux/cfs,MAN:linux/cfs2}, is a feedback scheduler based on CPU time. For each processor, it constructs a red-black tree of tasks waiting to run, ordering them by the amount of CPU time spent, and schedules the task that has spent the least. It also supports the concept of \newterm{Nice values}, which are effectively multiplicative factors on the CPU time spent. The ordering of tasks is also affected by a group-based notion of fairness, where tasks belonging to groups having spent less CPU time are preferred to tasks belonging to groups having spent more. Linux achieves load balancing by regularly monitoring the system state~\cite{MAN:linux/cfs/balancing} and using a heuristic on the load (currently, CPU time spent in the last millisecond plus a decayed version of the previous time slots~\cite{MAN:linux/cfs/pelt}).
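The core accounting can be sketched as follows (an illustrative C sketch only: CFS keeps these tasks in a red-black tree keyed on \texttt{vruntime} and uses far more elaborate weight tables; names here are invented, except that 1024 is indeed the weight of a nice-0 task):

```c
#include <stddef.h>

typedef struct {
	unsigned long vruntime;   // weighted CPU time consumed so far
	int weight;               // derived from the nice value; higher = larger share
} task;

// Pick the runnable task with the smallest vruntime. CFS keeps tasks in a
// red-black tree so this lookup is O(log n); a linear scan suffices here.
static task * pick_next(task * rq, int n) {
	task * best = NULL;
	for (int i = 0; i < n; i++)
		if (!best || rq[i].vruntime < best->vruntime) best = &rq[i];
	return best;
}

// Charge `delta` of real CPU time: vruntime advances more slowly for
// higher-weight tasks, giving them a proportionally larger share.
static void account(task * t, unsigned long delta) {
	t->vruntime += delta * 1024 / t->weight;   // 1024 = weight of a nice-0 task
}
```

Because a task's \texttt{vruntime} grows in inverse proportion to its weight, always running the minimum-\texttt{vruntime} task converges on a weighted fair share of the CPU.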
+
+\cite{DBLP:conf/eurosys/LoziLFGQF16} shows that Linux's CFS also does work stealing to balance the workload of each processor, but the paper argues this aspect can be improved significantly. The issues highlighted seem to stem from Linux's need to support fairness across tasks \emph{and} across users\footnote{Enforcing fairness across users means, for example, that given two users, one with a single task and the other with one thousand tasks, the user with a single task does not receive only one-thousandth of the CPU time.}, increasing the complexity.
+
+Linux also offers a FIFO scheduler, a real-time scheduler that runs the highest-priority task, and a round-robin scheduler, which is an extension of the FIFO scheduler that adds fixed time slices~\cite{MAN:linux/sched}.
+
+\paragraph{FreeBSD}
+The ULE scheduler used in FreeBSD~\cite{DBLP:conf/bsdcon/Roberson03} is a feedback scheduler similar to Linux's CFS. It uses different data structures and heuristics but also schedules according to some combination of CPU time spent and niceness values. It also periodically balances the load of the system (according to a different heuristic), but uses a simpler work-stealing approach.
+
+\paragraph{Windows (OS)}
+Microsoft's operating-system scheduler~\cite{MAN:windows/scheduler} is a feedback scheduler with priorities. It supports 32 levels of priorities, some of which are reserved for real-time and privileged applications. It schedules tasks based on the highest priority (lowest number) and how much CPU time each task has used. The scheduler may also temporarily adjust priorities after certain events, such as the completion of I/O requests.
+
+\todo{load balancing}
+
+\paragraph{Apple OS X}
+Apple programming documentation describes its kernel scheduler as follows:
+\begin{displayquote}
+	OS X is based on Mach and BSD. [...]. It contains an advanced scheduler based on the CMU Mach 3 scheduler. [...] Mach scheduling is based on a system of run queues at various priorities.
+
+	\begin{flushright}
+		-- Kernel Programming Guide \cite{MAN:apple/scheduler}
+	\end{flushright}
+\end{displayquote}
+
+\todo{load balancing}
+
+\subsection{User-Level Schedulers}
+By comparison, user-level schedulers tend to be simpler, gathering fewer metrics and avoiding complex notions of fairness. Part of the simplicity is due to the fact that all tasks have the same user, and therefore cooperation is both feasible and probable.
+\paragraph{Go}
+Go's scheduler uses a randomized work-stealing algorithm that has a global run-queue (\emph{GRQ}) and, for each processor (\emph{P}), both a fixed-size run-queue (\emph{LRQ}) and a high-priority next ``chair'' holding a single element~\cite{GITHUB:go,YTUBE:go}. Preemption is present, but only at function-call boundaries.
+
+The algorithm is as follows:
+\begin{enumerate}
+	\item One time out of 61, directly pick one element from the \emph{GRQ}.
+	\item If there is an item in the ``chair'', pick it.
+	\item Else pick an item from the \emph{LRQ}.
+	\item If the \emph{LRQ} was empty, steal (length of \emph{GRQ} / number of \emph{P}s) + 1 items (max 256) from the \emph{GRQ}.
+	\item If that was also empty, steal \emph{half} the \emph{LRQ} of another \emph{P} chosen randomly.
+\end{enumerate}
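The pick order above can be restated as C-style code (a hedged sketch: names such as \texttt{grq}, \texttt{lrq} and \texttt{chair} are illustrative, and this mirrors the enumerated steps, not Go's actual runtime code; the final peer-stealing step is elided):

```c
#define NPROCS 2
#define LRQ_CAP 256

typedef struct {
	int chair;                 // high-priority slot; 0 means empty
	int lrq[LRQ_CAP];          // fixed-size local run-queue
	int lrq_len;
	unsigned tick;
} proc;

static int grq[1024];          // global run-queue
static int grq_len;

static int grq_pop(void) { return grq_len ? grq[--grq_len] : 0; }

// One scheduling decision for processor p; returns 0 when no work is found.
static int next(proc * p) {
	if (++p->tick % 61 == 0) {                       // step 1: occasionally poll the GRQ
		int t = grq_pop();
		if (t) return t;
	}
	if (p->chair) { int t = p->chair; p->chair = 0; return t; } // step 2: the chair
	if (p->lrq_len) return p->lrq[--p->lrq_len];     // step 3: the LRQ
	int grab = grq_len / NPROCS + 1;                 // step 4: batch from the GRQ
	if (grab > 256) grab = 256;
	int t = grq_pop();
	while (--grab > 0 && grq_len && p->lrq_len < LRQ_CAP)
		p->lrq[p->lrq_len++] = grq_pop();            // extras refill the LRQ
	return t;                                        // step 5 (peer stealing) elided
}
```

The 1-in-61 poll of the \emph{GRQ} is what prevents globally queued tasks from starving behind a steady stream of local work.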
+
+\paragraph{Erlang}
+Erlang is a functional language that supports concurrency in the form of processes: threads that share no data. It appears to use some kind of round-robin scheduler, with a mix of work sharing and work stealing to achieve load balancing~\cite{:erlang}, where underloaded workers steal from other workers, but overloaded workers also push work to other workers. This migration logic seems to be directed by monitoring logic that evaluates the load a few times per second.
+
+\paragraph{Intel\textregistered ~Threading Building Blocks}
+\newterm{Threading Building Blocks} (TBB) is Intel's task-parallelism\cite{wiki:taskparallel} framework. It runs tasks or \newterm{jobs}, schedulable objects that must always run to completion, on a pool of worker threads. TBB's scheduler is a variation of Randomized Work Stealing that also supports higher-priority graph-like dependencies~\cite{MAN:tbb/scheduler}. It schedules tasks as follows (where \textit{t} is the last task completed):
+\begin{displayquote}
+	\begin{enumerate}
+		\item The task returned by \textit{t}\texttt{.execute()}
+		\item The successor of \textit{t} if \textit{t} was its last completed predecessor.
+		\item A task popped from the end of the thread’s own deque.
+		\item A task with affinity for the thread.
+		\item A task popped from approximately the beginning of the shared queue.
+		\item A task popped from the beginning of another randomly chosen thread’s deque.
+	\end{enumerate}
+
+	\begin{flushright}
+		-- Intel\textregistered ~TBB documentation \cite{MAN:tbb/scheduler}
+	\end{flushright}
+\end{displayquote}
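+The core structure behind items 3 and 6 above is a work-stealing deque: the owning thread pushes and pops at one end in LIFO order, while thieves take from the other end in FIFO order. The following is a minimal sketch of that structure, not TBB's implementation; synchronization between owner and thieves is deliberately elided:

```c
#include <stddef.h>

// Bounded work-stealing deque: owner operates on the tail (LIFO),
// thieves take from the head (FIFO). No synchronization shown.
#define DEQUE_CAP 1024
typedef struct {
    int    tasks[DEQUE_CAP];
    size_t head, tail;       // head <= tail, both grow monotonically
} deque_t;

static void deque_push(deque_t *d, int task) {
    d->tasks[d->tail++ % DEQUE_CAP] = task;
}

// Owner pops the most recently pushed task (item 3 in the quote).
static int deque_pop(deque_t *d) {
    if (d->head == d->tail) return -1;
    return d->tasks[--d->tail % DEQUE_CAP];
}

// A thief takes the oldest task from the other end (item 6 in the quote).
static int deque_steal(deque_t *d) {
    if (d->head == d->tail) return -1;
    return d->tasks[d->head++ % DEQUE_CAP];
}
```

+Popping LIFO keeps the owner working on recently created, cache-warm tasks, while stealing FIFO takes the oldest tasks, which in divide-and-conquer workloads tend to represent the largest remaining chunks of work.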
+
+\paragraph{Quasar/Project Loom}
+Java has two projects that are attempting to introduce lightweight threading into Java in the form of fibers: Quasar~\cite{MAN:quasar} and Project Loom~\cite{MAN:project-loom}\footnote{It is unclear to me whether these are distinct projects.}. Both projects seem to be based on Java's \texttt{ForkJoinPool}, which appears to be a simple incarnation of Randomized Work Stealing~\cite{MAN:java/fork-join}.
+
+\paragraph{Grand Central Dispatch}
+This is an API produced by Apple\cit{Official GCD source} that offers task parallelism\cite{wiki:taskparallel}. Its distinctive aspect is that it uses multiple ``Dispatch Queues'', some of which are created by programmers. Each queue has its own local ordering guarantees, \eg, tasks on queue $A$ are executed in \emph{FIFO} order.
+
+\todo{load balancing and scheduling}
+
+% http://web.archive.org/web/20090920043909/http://images.apple.com/macosx/technology/docs/GrandCentral_TB_brief_20090903.pdf
+
+In terms of semantics, the Dispatch Queues seem to be very similar to Intel\textregistered ~TBB's \texttt{execute()} and predecessor semantics, to the point where it would be possible to convert from one to the other.
+
+\paragraph{LibFibre}
+LibFibre\cite{DBLP:journals/pomacs/KarstenB20} is a light-weight user-level threading framework developed at the University of Waterloo. Similarly to Go, it uses a variation of Work Stealing with a global queue that has higher priority than stealing. Unlike Go, it does not have the high-priority next ``chair'' and does not use Randomized Work Stealing.
+
Index: doc/theses/thierry_delisle_PhD/thesis/text_back/front.tex
===================================================================
--- doc/theses/thierry_delisle_PhD/thesis/text_back/front.tex	(revision aaa1c4ccd9df86bf8fb15ffd83be94da071dc8dd)
+++ doc/theses/thierry_delisle_PhD/thesis/text_back/front.tex	(revision aaa1c4ccd9df86bf8fb15ffd83be94da071dc8dd)
@@ -0,0 +1,196 @@
+% T I T L E   P A G E
+% -------------------
+% Last updated June 14, 2017, by Stephen Carr, IST-Client Services
+% The title page is counted as page `i' but we need to suppress the
+% page number. Also, we don't want any headers or footers.
+\pagestyle{empty}
+\pagenumbering{roman}
+
+% The contents of the title page are specified in the "titlepage"
+% environment.
+\begin{titlepage}
+	\begin{center}
+		\vspace*{1.0cm}
+
+		\Huge
+		{\bf The \CFA Scheduler}
+
+		\vspace*{1.0cm}
+
+		\normalsize
+		by \\
+
+		\vspace*{1.0cm}
+
+		\Large
+		Thierry Delisle \\
+
+		\vspace*{3.0cm}
+
+		\normalsize
+		A thesis \\
+		presented to the University of Waterloo \\
+		in fulfillment of the \\
+		thesis requirement for the degree of \\
+		Doctor of Philosophy \\
+		in \\
+		Computer Science \\
+
+		\vspace*{2.0cm}
+
+		Waterloo, Ontario, Canada, 2021 \\
+
+		\vspace*{1.0cm}
+
+		\copyright\ Thierry Delisle 2021 \\
+	\end{center}
+\end{titlepage}
+
+% The rest of the front pages should contain no headers and be numbered using Roman numerals starting with `ii'
+\pagestyle{plain}
+\setcounter{page}{2}
+
+\cleardoublepage % Ends the current page and causes all figures and tables that have so far appeared in the input to be printed.
+% In a two-sided printing style, it also makes the next page a right-hand (odd-numbered) page, producing a blank page if necessary.
+
+
+% E X A M I N I N G   C O M M I T T E E (Required for Ph.D. theses only)
+% Remove or comment out the lines below to remove this page
+\begin{center}\textbf{Examining Committee Membership}\end{center}
+\noindent
+	The following served on the Examining Committee for this thesis. The decision of the Examining Committee is by majority vote.
+	\todo{External Examiners}
+\bigskip
+
+\noindent
+\begin{tabbing}
+	Internal-External Member: \=  \kill % using longest text to define tab length
+	External Examiner: \>  TBD \\
+	\> TBD \\
+\end{tabbing}
+\bigskip
+
+\noindent
+\begin{tabbing}
+	Internal-External Member: \=  \kill % using longest text to define tab length
+	Supervisor(s): \> Peter Buhr \\
+	\> Associate Professor, School of Computer Science \\
+	\> University of Waterloo \\
+\end{tabbing}
+\bigskip
+
+\noindent
+\begin{tabbing}
+	Internal-External Member: \=  \kill % using longest text to define tab length
+	Internal Member: \> Trevor Brown \\
+	\> Assistant Professor, School of Computer Science \\
+	\> University of Waterloo \\
+	\\
+	Internal Member: \> Martin Karsten \\
+	\> Associate Professor, School of Computer Science \\
+	\> University of Waterloo \\
+\end{tabbing}
+\bigskip
+
+\noindent
+\begin{tabbing}
+	Internal-External Member: \=  \kill % using longest text to define tab length
+	Internal-External Member: \> TBD \\
+	\> TBD \\
+	\> University of Waterloo \\
+\end{tabbing}
+\bigskip
+
+\cleardoublepage
+
+% D E C L A R A T I O N   P A G E
+% -------------------------------
+% The following is a sample Declaration Page as provided by the GSO
+% December 13th, 2006.  It is designed for an electronic thesis.
+\noindent
+I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis, including any required final revisions, as accepted by my examiners.
+
+\bigskip
+
+\noindent
+I understand that my thesis may be made electronically available to the public.
+
+\cleardoublepage
+
+% A B S T R A C T
+% ---------------
+
+\begin{center}\textbf{Abstract}\end{center}
+
+This is the abstract.
+
+\todo{Abstract}
+
+\cleardoublepage
+
+% A C K N O W L E D G E M E N T S
+% -------------------------------
+
+\begin{center}\textbf{Acknowledgements}\end{center}
+
+\todo{Acknowledgements}
+\cleardoublepage
+
+% D E D I C A T I O N
+% -------------------
+
+% \begin{center}\textbf{Dedication}\end{center}
+
+% This is dedicated to the one I love.
+% \cleardoublepage
+
+% T A B L E   O F   C O N T E N T S
+% ---------------------------------
+\renewcommand\contentsname{Table of Contents}
+\tableofcontents
+\cleardoublepage
+\phantomsection    % allows hyperref to link to the correct page
+
+% L I S T   O F   T A B L E S
+% ---------------------------
+\addcontentsline{toc}{chapter}{List of Tables}
+\listoftables
+\cleardoublepage
+\phantomsection		% allows hyperref to link to the correct page
+
+% L I S T   O F   F I G U R E S
+% -----------------------------
+\addcontentsline{toc}{chapter}{List of Figures}
+\listoffigures
+\cleardoublepage
+\phantomsection		% allows hyperref to link to the correct page
+
+% GLOSSARIES (Lists of definitions, abbreviations, symbols, etc. provided by the glossaries-extra package)
+% -----------------------------
+\printglossary[type=\acronymtype]
+\cleardoublepage
+\phantomsection		% allows hyperref to link to the correct page
+
+% TODOs and missing citations
+% -----------------------------
+\listofcits
+\listoftodos
+\cleardoublepage
+\phantomsection		% allows hyperref to link to the correct page
+
+
+% Change page numbering back to Arabic numerals
+\pagenumbering{arabic}
+
Index: doc/theses/thierry_delisle_PhD/thesis/text_back/intro.tex
===================================================================
--- doc/theses/thierry_delisle_PhD/thesis/text_back/intro.tex	(revision aaa1c4ccd9df86bf8fb15ffd83be94da071dc8dd)
+++ doc/theses/thierry_delisle_PhD/thesis/text_back/intro.tex	(revision aaa1c4ccd9df86bf8fb15ffd83be94da071dc8dd)
@@ -0,0 +1,9 @@
+\chapter*{Introduction}\label{intro}
+\todo{A proper intro}
+
+The C programming language\cit{C}
+
+The \CFA programming language\cite{cfa:frontpage,cfa:typesystem} extends the C programming language, adding modern safety and productivity features while maintaining backwards compatibility. Among its productivity features, \CFA introduces support for threading\cit{CFA Concurrency}, allowing programmers to write modern concurrent and parallel programs.
+While previous work on the concurrency package of \CFA focused on features and interfaces, this thesis focuses on performance, introducing \glsxtrshort{api} changes only when required by performance considerations. More specifically, this thesis concentrates on scheduling and \glsxtrshort{io}. Prior to this work, the \CFA runtime used a strictly \glsxtrshort{fifo} \gls{rQ}.
+
+This work exclusively concentrates on Linux as its operating system, since the existing \CFA runtime and compiler do not support other operating systems. Furthermore, as \CFA is yet to be released, supporting versions of Linux older than the latest is not a goal of this work.
Index: doc/theses/thierry_delisle_PhD/thesis/text_back/io.tex
===================================================================
--- doc/theses/thierry_delisle_PhD/thesis/text_back/io.tex	(revision aaa1c4ccd9df86bf8fb15ffd83be94da071dc8dd)
+++ doc/theses/thierry_delisle_PhD/thesis/text_back/io.tex	(revision aaa1c4ccd9df86bf8fb15ffd83be94da071dc8dd)
@@ -0,0 +1,46 @@
+\chapter{User Level \glsxtrshort{io}}
+As mentioned in Section~\ref{prev:io}, User-Level \glsxtrshort{io} requires multiplexing the \glsxtrshort{io} operations of many \glspl{thrd} onto fewer \glspl{proc} using asynchronous \glsxtrshort{io} operations. Various operating systems offer various forms of asynchronous operations and, as mentioned in Chapter~\ref{intro}, this work is exclusively focused on Linux.
+
+\section{Existing options}
+Since \glsxtrshort{io} operations are generally handled by the
+
+\subsection{\texttt{epoll}, \texttt{poll} and \texttt{select}}
+
+\subsection{Linux's AIO}
+
+
+
+\begin{displayquote}
+	AIO is a horrible ad-hoc design, with the main excuse being "other,
+	less gifted people, made that design, and we are implementing it for
+	compatibility because database people - who seldom have any shred of
+	taste - actually use it".
+
+	But AIO was always really really ugly.
+
+	\begin{flushright}
+		-- Linus Torvalds\cit{https://lwn.net/Articles/671657/}
+	\end{flushright}
+\end{displayquote}
+
+Interestingly, in this e-mail answer, Linus goes on to describe
+``a true \textit{asynchronous system call} interface''
+that does
+``[an] arbitrary system call X with arguments A, B, C, D asynchronously using a kernel thread''
+in
+``some kind of arbitrary \textit{queue up asynchronous system call} model''.
+This description is actually quite close to the interface described in the next section.
+
+\subsection{\texttt{io\_uring}}
+A very recent addition to Linux, \texttt{io\_uring}\cit{io\_uring} is a framework that aims to solve many of the problems listed with the above-mentioned solutions.
+
+\subsection{Extra Kernel Threads}\label{io:morethreads}
+Finally, if the operating system does not offer a satisfying form of asynchronous \glsxtrshort{io} operations, a solution is to fake it by creating a pool of \glspl{kthrd} and delegating operations to them in order to avoid blocking \glspl{proc}.
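+A minimal sketch of this delegation, using POSIX threads. This is illustrative only: a real runtime would keep a persistent pool of workers and park the submitting \gls{thrd} until completion, rather than spawning a thread per operation and joining on it:

```c
#include <pthread.h>
#include <unistd.h>

// A blocking operation is packaged and handed to a worker kernel thread,
// so the submitting processor stays free to run other user-level threads.
typedef struct {
    int     fd;
    char   *buf;
    size_t  len;
    ssize_t result;   // filled in by the worker
} io_request;

static void *io_worker(void *arg) {
    io_request *req = arg;
    req->result = read(req->fd, req->buf, req->len); // blocks only this worker
    return NULL;
}

// Submit the request; returns a handle the caller can later wait on.
static pthread_t submit_io(io_request *req) {
    pthread_t worker;
    pthread_create(&worker, NULL, io_worker, req);
    return worker;
}
```

+The cost of this approach is one extra \gls{kthrd} per in-flight blocking operation, plus the synchronization needed to hand requests and results back and forth.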
+
+\subsection{Discussion}
+
+
+\section{Event-Engine}
+
+
+\section{Interface}
Index: doc/theses/thierry_delisle_PhD/thesis/text_back/practice.tex
===================================================================
--- doc/theses/thierry_delisle_PhD/thesis/text_back/practice.tex	(revision aaa1c4ccd9df86bf8fb15ffd83be94da071dc8dd)
+++ doc/theses/thierry_delisle_PhD/thesis/text_back/practice.tex	(revision aaa1c4ccd9df86bf8fb15ffd83be94da071dc8dd)
@@ -0,0 +1,9 @@
+\chapter{Scheduling in practice}\label{practice}
+The scheduling algorithm described in Chapter~\ref{core} addresses scheduling in a stable state.
+However, it does not address problems that occur when the system changes state.
+Indeed, the \CFA runtime supports expanding and shrinking the number of \glspl{kthrd}\todo{add kthrd to glossary}, both manually and, to some extent, automatically.
+This entails that the scheduling algorithm must support these transitions.
+
+\section{Resizing}
+
+\section{Idle-Sleep}
Index: doc/theses/thierry_delisle_PhD/thesis/text_back/runtime.tex
===================================================================
--- doc/theses/thierry_delisle_PhD/thesis/text_back/runtime.tex	(revision aaa1c4ccd9df86bf8fb15ffd83be94da071dc8dd)
+++ doc/theses/thierry_delisle_PhD/thesis/text_back/runtime.tex	(revision aaa1c4ccd9df86bf8fb15ffd83be94da071dc8dd)
@@ -0,0 +1,44 @@
+\chapter{\CFA Runtime}
+This chapter offers an overview of the capabilities of the \CFA runtime prior to this work.
+
+Threading in \CFA is based on \Gls{uthrding}, where \glspl{thrd} are the representation of a unit of work. As such, \CFA programmers should expect these units to be fairly inexpensive, that is, programmers should be able to create a large number of \glspl{thrd} and switch between them liberally without much concern for performance.
+
+\section{M:N Threading}\label{prev:model}
+
+C traditionally uses a 1:1 threading model. This model uses \glspl{kthrd} to achieve parallelism and concurrency. In this model, every thread of computation maps to an object in the kernel. The kernel then has the responsibility of managing these threads, \eg, creating, scheduling and blocking them. This also entails that the kernel has a perfect view of every thread executing in the system\footnote{This is not completely true due to primitives like \texttt{futex}es, which have a significant portion of their logic in user space.}.
+
+By contrast, \CFA uses an M:N threading model, where concurrency is achieved using many user-level threads mapped onto fewer \glspl{kthrd}. The user-level threads have the same semantic meaning as \glspl{kthrd} in the 1:1 model: they represent an independent thread of execution with its own stack. The difference is that user-level threads do not have a corresponding object in the kernel; they are handled by the runtime in user space and scheduled onto \glspl{kthrd}, referred to as \glspl{proc} in this document. A \gls{proc} runs a \gls{thrd} until it context switches out, then chooses a different \gls{thrd} to run.
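+This execution model can be sketched as follows. This is a deliberately simplified model, not the \CFA runtime: real context switching (stack swapping) is elided, and each \gls{thrd} is reduced to a step function that returns nonzero while it still has work, modelling ``run until the thread context switches out'':

```c
#include <stddef.h>

// A user-level thread, reduced to a resumable step function.
typedef struct thrd {
    int (*step)(struct thrd *);  // returns 1 while the thread has more work
    int id, remaining;
    struct thrd *next;
} thrd;

// A proc with a FIFO ready queue of user-level threads.
typedef struct {
    thrd *head, *tail;
} proc;

static void ready(proc *p, thrd *t) {
    t->next = NULL;
    if (p->tail) p->tail->next = t; else p->head = t;
    p->tail = t;
}

// The proc runs one thread at a time; a thread that switches out while
// still runnable goes back on the ready queue and another is chosen.
static void proc_run(proc *p) {
    while (p->head) {
        thrd *t = p->head;
        p->head = t->next;
        if (!p->head) p->tail = NULL;
        if (t->step(t))
            ready(p, t);   // still has work: reschedule
    }
}

// Example thread body: record execution order, run for `remaining` slices.
static int run_order[16];
static int run_count;
static int demo_step(thrd *t) {
    run_order[run_count++] = t->id;
    return --t->remaining > 0;
}
```

+Two threads queued on one proc interleave in round-robin order, which is exactly the behaviour of the \glsxtrshort{fifo} ready queue described below.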
+
+\section{Clusters}
+\begin{figure}
+	\begin{center}
+		\input{system.pstex_t}
+	\end{center}
+	\caption{Overview of the \CFA runtime}
+	\label{fig:system}
+	\Glspl{thrd} are scheduled inside a particular cluster and only run on the \glspl{proc} which belong to that cluster. The discrete-event manager, which handles preemption and timeout, is a \gls{kthrd} that lives outside any cluster and does not run \glspl{thrd}.
+\end{figure}
+\CFA allows grouping user-level threading, in the form of clusters. Both \glspl{thrd} and \glspl{proc} belong to a specific cluster. \Glspl{thrd} are only scheduled onto \glspl{proc} in the same cluster and scheduling is done independently of other clusters. Figure~\ref{fig:system} shows an overview of this system. This allows programmers to control parallelism more tightly. It also opens the door to handling effects like NUMA, by pinning clusters to specific NUMA nodes\footnote{This is not currently implemented in \CFA, but the only hurdle left is creating a generic interface for CPU masks.}.
+
+\section{Scheduling}
+The \CFA runtime previously used a strictly \glsxtrshort{fifo} ready queue with a single lock. This setup offers perfect fairness in terms of opportunities to run. However, it offers poor scalability, since the performance of the ready queue can never be improved by adding more \glspl{hthrd}, while the contention can cause significant performance degradation.
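+The previous design can be sketched as follows (illustrative C, not the actual runtime code). Every push and pop, from every \gls{proc}, serializes on the one mutex, which is precisely the source of the contention:

```c
#include <pthread.h>
#include <stddef.h>

// One global FIFO protected by a single lock: perfectly fair,
// but every processor contends on the same mutex.
typedef struct node { struct node *next; } node;

typedef struct {
    pthread_mutex_t lock;
    node *head, *tail;
} ready_queue;

static void rq_push(ready_queue *q, node *n) {
    n->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->tail) q->tail->next = n; else q->head = n;
    q->tail = n;
    pthread_mutex_unlock(&q->lock);
}

static node *rq_pop(ready_queue *q) {
    pthread_mutex_lock(&q->lock);
    node *n = q->head;
    if (n) {
        q->head = n->next;
        if (!q->head) q->tail = NULL;
    }
    pthread_mutex_unlock(&q->lock);
    return n;   // NULL when the queue is empty
}
```

+With one lock, throughput is bounded by the critical section regardless of how many \glspl{hthrd} are added; the chapters that follow replace this structure.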
+
+\section{\glsxtrshort{io}}\label{prev:io}
+Prior to this work, the \CFA runtime did not add any particular support for \glsxtrshort{io} operations. \CFA being built on C, all the operations available in C are available in \CFA; however, these \glsxtrshort{io} operations are designed for the POSIX threading model\cit{pthreads}. Using these operations in an M:N threading model, when they are built for 1:1 threading, means that operations block \glspl{proc} instead of \glspl{thrd}. While this can work in certain cases, it limits the number of concurrent operations to the number of \glspl{proc} rather than \glspl{thrd}. It also means that deadlocks can occur because all \glspl{proc} are blocked even if at least one \gls{thrd} is ready to run. A simple example of this type of deadlock is as follows:
+
+Given a simple network program with 2 \glspl{thrd} and a single \gls{proc}, one \gls{thrd} sends network requests to a server and the other \gls{thrd} waits for responses from the server. If the second \gls{thrd} races ahead, it may wait for responses to requests that have not been sent yet. In theory, this should not be a problem: even if the second \gls{thrd} waits, the first \gls{thrd} is still ready to run and should be able to get CPU time and send the request. In practice with M:N threading, while the first \gls{thrd} is ready, the lone \gls{proc} in this example will \emph{not} try to run it if it is blocked in the \glsxtrshort{io} operation of the second \gls{thrd}. If this happens, the system is effectively deadlocked\footnote{In this example, the deadlock could be resolved if the server sent unprompted messages to the client. However, this solution is not general and may not be appropriate even in this simple case.}.
+
+One objective of this work is to introduce \emph{User-Level \glsxtrshort{io}}, which, as a parallel to \glslink{uthrding}{User-Level \emph{Threading}}, blocks \glspl{thrd} rather than \glspl{proc} when doing \glsxtrshort{io} operations. This entails multiplexing the \glsxtrshort{io} operations of many \glspl{thrd} onto fewer \glspl{proc}. This multiplexing requires that a single \gls{proc} be able to execute multiple operations in parallel. This cannot be done with operations that block \glspl{proc}, \ie \glspl{kthrd}, since the first operation would prevent starting new operations for its duration. Executing operations in parallel requires \emph{asynchronous} \glsxtrshort{io}, sometimes referred to as \emph{non-blocking}, since the \gls{kthrd} is not blocked.
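+As a minimal illustration of the difference, the following C helper (illustrative, not part of the \CFA runtime) makes a \texttt{read} non-blocking: with \texttt{O\_NONBLOCK} set, a read on an empty descriptor returns immediately with \texttt{EAGAIN} instead of blocking the \gls{kthrd}, leaving it free to run other user-level threads:

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

// Set O_NONBLOCK on fd, then attempt a read: returns -1 with
// errno == EAGAIN when no data is available, instead of blocking.
static ssize_t try_read(int fd, char *buf, size_t len) {
    int flags = fcntl(fd, F_GETFL);
    fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    return read(fd, buf, len);
}
```

+Of course, this only shifts the problem: the runtime must now notice, without busy-waiting, when the operation \emph{would} succeed and reschedule the waiting \gls{thrd}, which is exactly the multiplexing problem addressed in the following sections.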
+
+\section{Interoperating with \texttt{C}}
+While \glsxtrshort{io} operations are the classical example of operations that block \glspl{kthrd}, the challenges mentioned in the previous section do not require \glsxtrshort{io} to be involved. These challenges are a product of blocking system calls rather than \glsxtrshort{io}. C offers no tools to identify whether or not a library function will lead to a blocking system call. This fact means interoperability with C becomes a challenge in an M:N threading model.
+
+Languages like Go and Java, which have strict interoperability with C\cit{JNI, GoLang with C}, can control operations in C by ``sandboxing'' them. They can, for example, delegate C operations to \glspl{kthrd} that are not \glspl{proc}. Sandboxing may help towards guaranteeing that the deadlocks mentioned in the previous section do not occur.
+
+As mentioned in Section~\cit{\CFA intro}, \CFA is binary compatible with C and, as such, trivially supports calls to and from C libraries. Furthermore, interoperability can happen within a single library, through inline code or simply C and \CFA translation units archived together. The fine-grained interoperability between C and \CFA has two consequences:
+\begin{enumerate}
+	\item Precisely identifying C calls that could block is difficult.
+	\item Introducing code where interoperability occurs could have a significant impact on general performance.
+\end{enumerate}
+
+Because of these consequences, this work does not attempt to ``sandbox'' calls to C. It is possible that conflicting calls to C could lead to deadlocks on \CFA's M:N threading model where they would not in the traditional 1:1 threading model. However, I judge that solving this problem in general, in a way that is composable and flexible, is too complex and would add too much work to this thesis. Therefore, it is outside the scope of this thesis.
