\documentclass[11pt,fullpage]{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{listings}	% for code listings
\usepackage{xspace}
\usepackage{xcolor}
\usepackage{graphicx}
\usepackage[hidelinks]{hyperref}
\usepackage{glossaries}
\usepackage{textcomp}
\usepackage{geometry}

% cfa macros used in the document
\input{common}
\input{glossary}
\CFAStyle	% use default CFA format-style

\title{
	\Huge \vspace*{1in} The \CFA Scheduler\\
	\huge \vspace*{0.25in} PhD Comprehensive II Research Proposal
	\vspace*{1in}
}

\author{
	\huge Thierry Delisle \\
	\Large \vspace*{0.1in} \texttt{tdelisle@uwaterloo.ca} \\
	\Large Cheriton School of Computer Science \\
	\Large University of Waterloo
}

\date{\today}

\begin{document}
\maketitle
\cleardoublepage

\newcommand{\cit}{\textsuperscript{[Citation Needed]}\xspace}
\newcommand{\TODO}{\newline{\large\bf\color{red} TODO :}\xspace}

% ===============================================================================
% ===============================================================================
\tableofcontents
% ===============================================================================
% ===============================================================================

\section{Introduction}
\subsection{\CFA and the \CFA concurrency package}
\CFA\cit is a modern, polymorphic, non-object-oriented, backwards-compatible extension of the C programming language.
It aims to add high-productivity features while maintaining the predictable performance of C.
Concurrency in \CFA\cit aims to offer the same balance of productivity and performance.
Concurrent code is written in the synchronous programming paradigm but uses \glspl{uthrd} in order to achieve the simplicity and maintainability of synchronous programming without sacrificing the efficiency of asynchronous programming.
As such, the \CFA scheduler is a user-level scheduler that maps \glspl{uthrd} onto \glspl{kthrd}.

\subsection{Scheduling for \CFA}
While the \CFA concurrency package does not have any particular scheduling needs beyond those of any concurrency package that uses \glspl{uthrd}, it is important that the default \CFA scheduler be viable in general.
Indeed, since the \CFA scheduler does not target any specific workload, it is unrealistic to demand that it use the best scheduling strategy in all cases.
However, it should offer a viable ``out of the box'' solution for most scheduling problems, so that programmers can quickly write performant concurrent code without needing to think about which scheduling strategy is more appropriate for their workload.
Indeed, only programmers with exceptionally high performance requirements should need to write their own scheduler.

As detailed in Section~\ref{metrics}, schedulers can be evaluated according to multiple metrics.
It is therefore important to set objective goals for scheduling according to a high-level direction.
As such, the design goal for the scheduling strategy can be phrased as follows:
\begin{quote}
	The \CFA scheduling strategy should be \emph{viable} for any workload.
\end{quote}

% ===============================================================================
% ===============================================================================
\section{Scheduling terms}
Before going into details about scheduling, it is important to define the terms used in this document, especially since many of these terms are often overloaded and may have subtly different meanings.
All scheduling terms used are defined in the Glossary, but the following terms are worth explaining now.
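Before defining these terms, it is useful to see what concurrent \CFA code can look like.
The sketch below is illustrative only; the exact declarations are assumptions based on the \CFA concurrency package and may differ in detail from the actual syntax.
\begin{lstlisting}
// Assumed CFA-style declarations, shown for illustration only.
thread Worker {};                // a user-level thread: a unit of work to be scheduled
void main( Worker & this ) {     // body executed when the thread runs
	// ... do some work, possibly blocking on synchronization or I/O ...
}
int main() {
	processor procs[3];          // extra kernel threads on which user threads execute
	Worker workers[10];          // creating the threads schedules them
	// user threads run on any available processor; destruction at end of scope joins them
}
\end{lstlisting}
In the terms defined below, each \texttt{Worker} is a unit of work handed to the scheduler, while each kernel thread backing a \texttt{processor} acts as the worker on which that work eventually runs.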
\TODO fix the glossary ref

\paragraph{\Gls{proc}:} Designates the abstract worker on which the work is scheduled.
The nature of the \gls{proc} can vary depending on the level at which the mapping is done.
For example, OS scheduling maps \glspl{kthrd} onto \glspl{hthrd}, and therefore in this context the \gls{proc} refers to the \gls{hthrd}.
However, in the context of user-space scheduling, the scheduler maps \glspl{uthrd}, \glspl{fiber} or \glspl{job} onto \glspl{kthrd}; at this level, the \gls{kthrd} becomes the \gls{proc}.

\paragraph{\Gls{at}:} This document uses the term \gls{at} to designate the units of work that are to be scheduled.
Like \glspl{proc}, the nature of a \gls{at} varies depending on the level at which the scheduler operates.
The \glspl{at} in OS schedulers are \glspl{kthrd}, while the \glspl{at} in user-space scheduling can be \glspl{uthrd}, \glspl{fiber} or \glspl{job}.
The term \gls{at} was chosen specifically to avoid collisions with specific types of \glspl{at}, \eg threads, \glspl{uthrd} and \glspl{fiber}.

\paragraph{\Gls{Q}:} A \gls{Q} is a list of \glspl{at} that have been \glslink{atsched}{scheduled} but have yet to be \glslink{atrun}{run}.
The number and nature of the \glspl{Q} are scheduler specific and generally form a significant portion of the scheduler design.

\paragraph{\Gls{atsched}:} Designates the act of signalling to the scheduler that some \gls{at} is ready to be executed and should eventually be assigned to a \gls{proc}.

\paragraph{\Gls{atrun}:} Designates the act of actually running a \gls{at} that was scheduled, that is, mapping the \gls{at} onto a specific \gls{proc} and having the \gls{proc} execute its code.

\paragraph{\Gls{atblock}:} A blocked \gls{at} is an \gls{at} that exists outside of any \gls{Q}.
Unless some external entity unblocks it (by \glslink{atsched}{scheduling} it), it will never run and stays in this suspended state.
Not all schedulers support blocking, \eg \gls{job} schedulers do not: a new \gls{job} must be created if the current one needs to wait for an external event.

\paragraph{\Gls{atmig}:} A \gls{at} is generally considered to have migrated when it is \glslink{atrun}{run} on a different \gls{proc} than the previous time it \glslink{atrun}{ran}.
This concept is relevant because migration often comes at a performance cost, mostly due to caching.
However, some amount of migration is generally required to achieve load balancing.
In the context of \glspl{job}, since \glspl{job} only run once, migration is defined as being \glslink{atrun}{run} on a different \gls{proc} than the one the \gls{job} was created on.

\paragraph{\Gls{atpass}:} A \gls{at} has overtaken another \gls{at} when it was \glslink{atsched}{scheduled} later but \glslink{atrun}{run} before the other \gls{at}.
Overtaking can have performance benefits but has obvious fairness concerns.

% =================================================================================================
% =================================================================================================
\section{Previous Work}
\subsection{Publications}
\subsubsection{Work Stealing}
A popular scheduling algorithm is \emph{Work Stealing}, which was used by \cite{burton1981executing, halstead1984implementation} to divide work among threads to be done in parallel.
This specific flavor of work stealing maps units of work that are executed only once\footnote{This single execution is what fundamentally distinguishes \emph{Jobs} from \emph{Threads}.} onto kernel-threads.
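To make the basic structure concrete, the following is a minimal C sketch of per-worker work-queues with stealing; it is illustrative only, every name in it is ours, and it is not drawn from any of the systems cited in this section.
\begin{lstlisting}
// Minimal work-stealing skeleton (illustrative only; not taken from any cited system).
#include <pthread.h>
#include <stddef.h>

#define NWORKERS 4

typedef struct task { struct task * next; void (*run)(struct task *); } task;

typedef struct {
	pthread_mutex_t lock;
	task * head;                        // per-worker queue of ready tasks
} work_queue;

static work_queue queues[NWORKERS];     // one-to-one mapping: worker i owns queues[i]

static void init_queues( void ) {
	for ( int i = 0; i < NWORKERS; i++ ) {
		pthread_mutex_init( &queues[i].lock, NULL );
		queues[i].head = NULL;
	}
}

// schedule a ready task onto a given worker's queue
static void push( work_queue * q, task * t ) {
	pthread_mutex_lock( &q->lock );
	t->next = q->head;  q->head = t;
	pthread_mutex_unlock( &q->lock );
}

// take a task from a queue; returns NULL if the queue is empty
static task * pop( work_queue * q ) {
	pthread_mutex_lock( &q->lock );
	task * t = q->head;
	if ( t ) q->head = t->next;
	pthread_mutex_unlock( &q->lock );
	return t;
}

// worker 'me' services its own queue first and steals only once that queue is empty
static task * next_task( int me ) {
	task * t = pop( &queues[me] );
	for ( int victim = 0; t == NULL && victim < NWORKERS; victim++ ) {
		if ( victim != me ) t = pop( &queues[victim] );   // steal from another worker
	}
	return t;   // NULL means every queue is currently empty
}
\end{lstlisting}
The sketch leaves open which worker's queue receives newly created or unblocking tasks; as discussed next, that choice is where most variations of the algorithm differ.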
There are countless examples of work-stealing variations\cit, all of which share the same basic idea: each worker has its own set of work units, referred to as a work-queue in this document, and once it runs out, it ``steals'' from another worker.
Many variations of the algorithm exist, but they generally have the following characteristics:
\begin{itemize}
	\item There exists a one-to-one mapping between \glspl{proc} and \glspl{Q}, \eg every thread has exactly one queue of jobs to execute.
	\item Once added to a \gls{Q}, \glspl{at} can \glslink{atmig}{migrate} to another \gls{proc} only as a result of the \gls{proc}'s \gls{Q} being empty.
\end{itemize}
Where \glspl{at} are initially enqueued, and what to do with unblocking \glspl{at} (which must be requeued), is generally where the variations occur in the algorithms.
Distributing the \glspl{at} improves fairness\cit, while enqueuing new and recurring \glspl{at} on the same work-queue as their creator and previous \glspl{atrun}, respectively, improves locality\cit.
It is worth pointing out that, while both strategies can be very effective for certain workloads\cit, neither is ideal for every workload.

\subsubsection{Other Schedulers}
\subsubsection{Theoretical Bounds}

\subsection{In practice}
In practice, many implementations seem to have converged towards schedulers that use a one-to-one mapping between \glspl{proc} and \glspl{Q}, often with one or several lower-priority shared \glspl{Q}.
This is generally perceived as an improvement over schedulers with a single \gls{Q}, which can lead to contention problems.
As mentioned later in this document, this is due to the implicit assumption that the \gls{Q} data structure cannot scale to many cores.
Schedulers with multiple \glspl{Q} generally achieve better throughput but can also introduce fairness-related problems.
The underlying problem is that these \glspl{Q} lead to priorities based on placement rather than on a per-\gls{at} basis.

\paragraph{Indefinite Starvation}
A scheduler can starve a \gls{at} indefinitely.
In practice, a scheduler without support for priorities or preemption can starve \glspl{at} if it falls into a stable state where a \gls{at} never \glslink{atblock}{blocks} and other \glspl{at} get stuck behind it.
If these \glspl{at} are never stolen (taken by a \gls{proc} which is not the one mapped to that \gls{Q}), then they are indefinitely starved.

\paragraph{Poor Load-Balancing}
If the scheduler reaches a stable state where per-\gls{proc} \glspl{Q} have significantly different average lengths (and are never empty), this can lead to significant unfairness.
Indeed, if load-balancing only occurs when a \gls{proc} runs out of work, any stable state where no \gls{proc} runs out of work means that there is no longer any load-balancing while in that state.

\paragraph{Aggressive ``Nice'' \glspl{at}}
Certain workloads can have ``nice'' \glspl{at}, that is, \glspl{at} which run in the background.
These \glspl{at} generally have a small amount of work and can run when the system is not too busy.
In systems without priorities, there are several techniques to implement background \glspl{at}, but not every technique works with every scheduler.
One approach is for the background task to yield every time it makes some small progress.
If the yielding \gls{at} is put back on a \gls{Q} with higher priority than work stealing, this can cause unfairness and can even cause the background task to fully utilize a \gls{hthrd}.
A similar approach is to sleep for a short duration instead.
However, depending on the details of the sleep, this can also cause problems.
Sleeping is more likely to cause a \gls{proc} to steal because the \gls{at} probably transitions out of the ready state.
However, not all systems support fine-grained sleeping, and if the sleep granularity is too coarse, relying on sleep can cause latency issues.
Finally, this can be solved with explicit synchronization; however, not all background tasks can be implemented this way, since they do not necessarily wait for any resource.

\subsubsection*{Linux's CFS}
\subsubsection*{FreeBSD Scheduler}
\subsubsection*{Windows OS Scheduler}
Windows seems to use priority-based scheduling, where the priorities can vary dynamically within some limit.
Load-balancing across processors is handled using a work-sharing approach and has different rules based on whether or not there are idle processors.

\subsubsection*{Windows User-Mode Scheduler}
Windows seems to have M:N scheduling support but does not provide a default user-mode scheduler.

\subsubsection*{Go}
Go's scheduler uses a work-stealing algorithm without preemption that has a global runqueue (\emph{GRQ}), while each processor (\emph{P}) has a fixed-size local runqueue (\emph{LRQ}).
The algorithm is as follows:
\begin{enumerate}
	\item Once every 61 times, directly pick one element from the \emph{GRQ}.
	\item If there is a local \emph{next}, pick it.
	\item Else pick an item from the \emph{LRQ}.
	\item If it was empty, take (len(\emph{GRQ}) / number of \emph{P}s) + 1 items (max 256) from the \emph{GRQ}.
	\item If it was empty, steal \emph{half} the \emph{LRQ} of another \emph{P}.
\end{enumerate}
Ignoring concerns of power consumption, this structure can lead to odd behaviour.
Obviously, the lack of preemption means that, given a set of goroutines where $N$ of them never block (\eg CPU-bound producers) running on $P$ processors (\emph{P}s in the Go naming), if $N \geq P$ then not all goroutines are guaranteed to ever make progress.
However, this can also be the case even if $N < P$: if $P - N$ processors can find a sustained amount of work without needing to steal, then indefinite starvation can still occur.
Even excluding cases due to the lack of preemption, odd behaviour remains.
The fact that the \emph{LRQ}s have a fixed size can also affect the fairness of scheduling in drastic ways, and the separation between \emph{LRQ} and \emph{GRQ} can lead to significant unfairness, both in homogeneous and in more heterogeneous workloads.

\subsubsection*{Erlang}
Erlang uses a work-stealing scheduler with the addition that ``underloaded schedulers will also steal jobs from heavily overloaded schedulers in their migration paths''.
The impact of this addition depends on the specifics of the heuristic.

\subsubsection*{Haskell}
\subsubsection*{D}
D does not appear to have an M:N threading model.

\subsubsection*{Intel\textregistered ~Threading Building Blocks}
\url{https://software.intel.com/en-us/node/506295}

Intel's concurrency and parallelism library uses a more complicated scheduler, which has multiple \glspl{Q} with various priorities.
However, these \glspl{Q} have a strict priority ordering, meaning higher-priority tasks can indefinitely starve lower-priority tasks if programmers are not careful.
\TODO test it

\subsubsection*{Quasar}
\subsubsection*{Grand Central Dispatch}
\subsubsection*{LibFiber}
\subsubsection*{Fiber Tasking Lib}
Advertised at GDC; also uses work-stealing.
\subsection{Scheduling Needs}
Things to look at:
\begin{itemize}
	\item libuv: polling loop used for asynchronous I/O.
	\item Julia: uses libuv and has multi-threading.
\end{itemize}
Directions to look at:
\begin{itemize}
	\item Static web-servers: mostly I/O bound; either a single or a few event-driven threads, or a thread per connection.
		Examples: memcached, apache, nodejs.
	\item HPC workloads: compute bound.
\end{itemize}

% ===============================================================================
% ===============================================================================
\section{Overview of Scheduling for \CFA}
\subsection{Scheduling: Core Goals}
\subsection{Requirements of the Scheduler Context}
\subsection{Integration with I/O}
\subsection{Blocking in Style}

% ===============================================================================
% ===============================================================================
\section{Metrics of Scheduling}
\label{metrics}
Before starting to look into the design of the best scheduler for \CFA, it is important to take some time to detail what metrics to consider for scheduling.
Here are a few metrics that should be considered.

\paragraph{Throughput} is a fairly straightforward metric: it can be measured either as how much time is spent to \glslink{atcomplet}{run to completion} a fixed number of \glspl{at}, or as how many \glspl{at} are \glslink{atrun}{run} in a fixed amount of time.
These two definitions are virtually interchangeable.
However, since scheduling can affect application performance, getting valid empirical measures of throughput can be difficult.

\paragraph{Latency} measures how long an individual \gls{at} waited between when it was \glslink{atsched}{scheduled} and when it was \glslink{atrun}{run}.

\paragraph{Fairness}
\paragraph{Resource Utilization}
\paragraph{Application Performance}

\section{The Core: Multi-Lane Scheduling}
\subsection{Objectives}
While work-stealing works well for both trivial cases, \ie all new \glspl{at} distributed evenly or all on a single work-queue, it handles in-between cases poorly.
As mentioned above, the goal is therefore to create a scheduler that is \emph{viable} for any workload.
More concretely, this means a scheduler that has good scalability and guarantees eventual progress.
For the purpose of this document, eventual progress is defined as follows:
\begin{itemize}
	\item Any \gls{at} that is \glslink{atsched}{scheduled} should eventually \glslink{atrun}{run}, regardless\footnote{In the context of guaranteeing eventual progress, we consider only normal program execution.
	Depending on the chosen semantics, normal system shutdown can also prevent \glspl{at} from eventually running without the guarantee being considered violated.} of any other \glspl{at} being scheduled (prior or otherwise).
\end{itemize}
Eventual progress is not guaranteed by work-stealing or work-sharing schedulers in every context.
Indeed, when the system is \glslink{load}{loaded}, neither work-stealing nor work-sharing guarantees eventual progress\cit.
While these aberrant cases can be fixed with \gls{preemption}, they still show a fundamental fairness problem in the algorithm.
We can offer a stricter guarantee of eventual progress by limiting the amount of \gls{at} \emph{\glslink{atpass}{overtaking}}.
Indeed, the cases where eventual progress is lost are cases that show \emph{unbounded} \glslink{atpass}{overtaking} and, as such, a scheduler that limits \glslink{atpass}{overtaking} in general guarantees eventual progress.
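One way to make this precise is the following; the notation is ours and purely illustrative.
For an \gls{at} $T$, let $s(T)$ be the time at which it is \glslink{atsched}{scheduled} and $r(T)$ the time at which it is \glslink{atrun}{run}, and define the number of times $T$ is \glslink{atpass}{overtaken} as
\[
	O(T) = \left| \{\, T' \mid s(T') > s(T) \wedge r(T') < r(T) \,\} \right| .
\]
If the scheduler ensures that $\Pr[ O(T) \geq N ]$ decreases rapidly as $N$ grows (\eg geometrically), then $O(T)$ is finite with probability~1 and, assuming the \glspl{proc} keep \glslink{atrun}{running} \glspl{at}, every \glslink{atsched}{scheduled} \gls{at} eventually \glslink{atrun}{runs}.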
We can then define the first concrete goal of the \CFA scheduler as:
\begin{itemize}
	\item The odds that a \gls{at} is overtaken by $N$ other \glspl{at} decrease rapidly as $N$ increases.
\end{itemize}
As a second, fuzzier objective, the \CFA scheduler should also perform no worse than most existing schedulers for any workload.

\subsection{Ideal Description}
The ideal scheduler should behave similarly to a multi-lane highway.
If all lanes are equally fast, lane changes are to be avoided because they induce traffic jams.
However, if some lanes are faster than others, then lane changes help balance the traffic.
Similarly, a task should migrate in two cases:
\begin{itemize}
	\item When a worker has just emptied its work-queue (like work-stealing).
	\item \emph{When a work unit has been overtaken too many times by work units in a different lane.}
\end{itemize}

\subsection{Practical Terms}
In practice, the highway-lane analogy has to be adjusted slightly to be workable.
First, the \glspl{at} should always respect their order within a given lane, which means only the task at the front can migrate.
This both enforces a stronger FIFO order and means that the scheduler can ignore all \glspl{at} that are not at the front, simplifying processing.
Furthermore, migrating \glspl{at} is only useful when at least one worker is available, \ie looking for a task to run, because a migration made at any other time only takes effect once a worker becomes available.

\subsection{Existing Work}
\subsection{\CFA Today}

% ===============================================================================
% ===============================================================================
\section{The Context: Scheduling in \CFA}
\subsection{Dynamic Clusters}
\subsection{Idle Sleep}
\subsection{Locality}

% ===============================================================================
% ===============================================================================
\section{Asynchronous I/O}
\subsection{Cooperation with the OS}
\subsection{Framework Integration}

% ===============================================================================
% ===============================================================================
\section{Blocking in Style}
\subsection{Monitors and Baton-Passing}
\subsection{Futures and Promises}

% ===============================================================================
% ===============================================================================
\section{Current Work}
\section{Conclusion}

\cleardoublepage

% B I B L I O G R A P H Y
% -----------------------------
\addcontentsline{toc}{section}{Bibliography}
\bibliographystyle{plain}
\bibliography{pl,local}
\cleardoublepage
\phantomsection % allows hyperref to link to the correct page

% G L O S S A R Y
% -----------------------------
\addcontentsline{toc}{section}{Glossary}
\printglossary
\cleardoublepage
\phantomsection % allows hyperref to link to the correct page

\end{document}