source: doc/theses/thierry_delisle_PhD/comp_II/comp_II_too_big.tex @ 2ae6a99

ADTast-experimentalpthread-emulation
Last change on this file since 2ae6a99 was df75fe97, checked in by Thierry Delisle <tdelisle@…>, 5 years ago

Committing first draft of my comp-II

  • Property mode set to 100644
File size: 20.1 KB
Line 
1\documentclass[11pt,fullpage]{article}
2\usepackage[T1]{fontenc}
3\usepackage[utf8]{inputenc}
4\usepackage{listings}           % for code listings
5\usepackage{xspace}
6\usepackage{xcolor}
7\usepackage{graphicx}
8\usepackage[hidelinks]{hyperref}
9\usepackage{glossaries}
10\usepackage{textcomp}
11\usepackage{geometry}
12
13% cfa macros used in the document
14\input{common}
15\input{glossary}
16
17\CFAStyle                               % use default CFA format-style
18
19\title{
20        \Huge \vspace*{1in} The \CFA Scheduler\\
21        \huge \vspace*{0.25in} PhD Comprehensive II Research Proposal
22        \vspace*{1in}
23}
24
25\author{
26        \huge Thierry Delisle \\
27        \Large \vspace*{0.1in} \texttt{tdelisle@uwaterloo.ca} \\
28        \Large Cheriton School of Computer Science \\
29        \Large University of Waterloo
30}
31
32\date{
33        \today
34}
35
36\begin{document}
37\maketitle
38\cleardoublepage
39
40\newcommand{\cit}{\textsuperscript{[Citation Needed]}\xspace}
41\newcommand{\TODO}{\newline{\large\bf\color{red} TODO :}\xspace}
42
43% ===============================================================================
44% ===============================================================================
45
46\tableofcontents
47
48% ===============================================================================
49% ===============================================================================
50\section{Introduction}
51\subsection{\CFA and the \CFA concurrency package}
52\CFA\cit is a modern, polymorphic, non-object-oriented, backwards-compatible extension of the C programming language. It aims to add high productivity features while maintaning the predictible performance of C. As such concurrency in \CFA\cit aims to offer . Concurrent code is written in the syncrhonous programming paradigm but uses \glspl{uthrd} in order to achieve the simplicity and maintainability of synchronous programming without sacrificing the efficiency of asynchronous programing. As such the \CFA scheduler is a user-level scheduler that maps \glspl{uthrd} onto \glspl{kthrd}
53
54\subsection{Scheduling for \CFA}
55While the \CFA concurrency package doesn't have any particular scheduling needs beyond those of any concurrency package which uses \glspl{uthrd}, it is important that the default \CFA Scheduler be viable in general. Indeed, since the \CFA Scheduler does not target any specific workloads, it is unrealistic to demand that it use the best scheduling strategy in all cases. However, it should offer a viable ``out of the box'' solution for most scheduling problems so that programmers can quickly write performant concurrent without needed to think about which scheduling strategy is more appropriate for their workload. Indeed, only programmers with exceptionnaly high performance requirements should need to write their own scheduler.
56
57As detailed in Section~\ref{metrics}, schedulers can be evaluated according to multiple metrics. It is therefore important to set objective goals for scheduling according to a high-level direction. As such the design goal for the scheduling strategy can be phrased as follows :
58\begin{quote}
59The \CFA scheduling strategy should be \emph{viable} for any workload.
60\end{quote}
61
62% ===============================================================================
63% ===============================================================================
64\section{Scheduling terms}
65Before going into details about scheduling, it is important to define the terms used in this document, especially since many of these terms are often overloaded and may have subtlely different meaning. All scheduling terms used are defined in the Glossary but the following terms are worth explaining now. \TODO fix the glossary ref
66
67\paragraph{\Gls{proc}:} Designates the abstract worker on which the work will be scheduled. The nature of the \gls{proc} can vary depending on the level at which the mapping is done. For example, OS scheduling maps \glspl{kthrd} onto \glspl{hthrd} and therefore in this context the \gls{proc} refers to the \gls{hthrd}. However, in the context of user space scheduling, the scheduler maps \glspl{uthrd}, \glspl{fiber} or \glspl{job} onto \glspl{kthrd}, at this level the \gls{kthrd} becomes the \gls{proc}.
68
69\paragraph{\Gls{at}:} This document uses the term \gls{at} to designate the units of work that are to be scheduled. Like \glspl{proc}, the nature of a \gls{at} varies depending on the level at which the scheduler operates. The \glspl{at} in OS schedulers are \glspl{kthrd}, while the \glspl{at} in user-space scheduling can be \glspl{uthrd}, \glspl{fiber} or \glspl{job}. The term \gls{at} was chosen specifically to avoid collisions with specific types of \gls{at}, \eg thread, \glspl{uthrd} and \glspl{fiber}.
70
71\paragraph{\Gls{Q}:} A \gls{Q} is a list of \glspl{at} that have been \glslink{at}{scheduled} but have yet to be \glslink{atrun}{run}. The number and nature of the \gls{Q} is scheduler specific and is generally a significant portion of the scheduler design.
72
73\paragraph{\Gls{atsched}:} Designates the act of signalling to the scheduler that some \gls{at} is ready to be executed and should eventually be assigned to a \gls{proc}.
74
75\paragraph{\Gls{atrun}:} Designates the act of act of actually running a \gls{at} that was scheduled. That is mapping the \gls{at} onto a specific \gls{proc} and having the \gls{proc} execute its code.
76
77\paragraph{\Gls{atblock}:} A blocked \gls{at} is an \gls{at} that exists outside of any \gls{Q}. Unless some external entity unblocks it (by \glslink{atsched}{scheduling} it), it will never run and will stay in this suspended state. Not all schedulers support blocking. \eg \gls{job} schedulers do not, a new \gls{job} must be created if the current one would need to wait for an external event.
78
79\paragraph{\Gls{atmig}:} A \gls{at} is generally considered to have migrated when it is \glslink{atrun}{run} from a different different \gls{proc} then the previous time it \glslink{atrun}{ran}. This concept is relevant because migration can often come at a performance cost which is mostly due to caching. However, some amount of migration is generally required to achieve load balancing. In the context of \glspl{job}, since \glspl{job} only run once, migration is define as being \glslink{atrun}{run} from a different different \gls{proc} the the one it was created on.
80
81\paragraph{\Gls{atpass}:} A \gls{at} as overtaken another \gls{at} when it was \glslink{atsched}{scheduled} later but \glslink{atrun}{run} before the other \gls{at}. Overtaking can have performance benefits but has obvious fairness concerns.
82
83% =================================================================================================
84% =================================================================================================
85\section{Previous Work}
86
87\subsection{Publications}
88\subsubsection{Work Stealing}
89A populer scheduling algorithm is \emph{Work Stealing}, which was used by \cite{burton1981executing, halstead1984implementation} to divide work among threads to be done in parallel. This specific flavor of work stealing, which maps units of work to be executed once\footnote{This single execution is what fundamentally distinguishes \emph{Jobs} from \emph{Threads}} onto kernel-threads.
90
91There are countless examples of work-stealing variations \cit, all of which shared the same basic idea : each worker has its own set of work units, referred to as work-queue in this document, and once it runs out, it ``steals'' from  other worker. Many variations of the algorithm exist but they generally have the following characteristics :
92
93\begin{itemize}
94        \item There exists a one-to-one mapping between \glspl{proc} and \glspl{Q}, \eg every thread has exactly one queue of jobs to execute.
95        \item Once added to a \gls{Q}, \glspl{at} can \glslink{atmig}{migrate} to another \gls{proc} only as a result of the \gls{proc}'s \gls{Q} being empty.
96\end{itemize}
97
98Where \glspl{at} are initially enqueued and what to do with unblocking \glspl{at} (which must be requeued) is generally where the variations occur in the algorithms.
99
100Distributing the \glspl{at} improves fairness\cit, while enqueuing new and recurring \glspl{at} on the same work-queue as they creator and previous \glspl{atrun} respectively improves locality\cit.
101
102
103It is worth pointing out that while both strategy can be very effective for certain workloads\cit
104
105\subsubsection{Other-Schedulers}
106
107\subsubsection{Theoretical Bounds}
108
109
110\subsection{In practice}
111In practive meany implementations seem to have converge towards schedulers which use a one to one mapping between \glspl{proc} and \glspl{Q}, often with one or several lower-priority shared \glspl{Q}. This is generally perceived as an improvement over schedulers with a single \glspl{Q} which can lead to contention problems. As mentionned further into this document, this is due to the implicit assumption that the \gls{Q} data structure cannot scale to many cores.
112
113Schedulers with multiple \glspl{Q} generally achieve better throughput can also introduce fairness related problems. The underlying problem is that these \glspl{Q} lead to priorities based on placement rather than per \gls{at}.
114
115\paragraph{Indefinite Starvation}
116A scheduler can have starve a \gls{at} indefinetely. In practice, schedulers without support for priorities can starve \glspl{at} if it does not have preemption and can fall in a stable state where a \gls{at} never \glslink{atblock}{blocks} and other \glspl{at} can get stuck behind it. If these \glspl{at} are never stolen (taken from a \gls{proc} which isn't the one mapped to that \gls{Q}) then they will be indefinetely starved.
117
118\paragraph{Poor load-balancing}
119If the scheduler reaches a stable state where per-\gls{proc} \glspl{Q} have significantly different average lengths (and are never-empty), this can lead to significant unfairness. Indeed, if load-balancing only occurs when a \gls{proc} runs out of work, any stable state where no \gls{proc} runs out of work means that there is no longer any load-balancing while in that state.
120
121\paragraph{Aggressive ``Nice'' \glspl{at}}
122Certain workloads can have ``Nice'' \glspl{at}, that is \glspl{at} which run in the background. These \glspl{at} generally have low amount of work and can run when the system isn't too busy. In systems without priorities, There are several techniques to implement background \glspl{at}, but not every technique may work with every scheduler.
123
124One approach is for the background task to yield every time it makes some small progress. If the yielding \gls{at} is put on a higher-priority \gls{Q} than workstealing, this can cause unfairness and can even cause the background task to fully utilize \gls{hthrd}
125
126A similar approach is to sleep for a short duration instead. However, depending on the details of the sleep, this can also cause problems. It is more likely that sleeping will cause a \gls{proc} to steal because the \gls{at} probably transitions out of the ready state. However, not all system support fine grain sleeping. If the sleeping is to coarse, relying on sleep can cause latency issues.
127
128Finally, this can be solve with explicit synchronization, however not all background tasks can be implemented this way since they don't necessarily have to wait for any ressource.
129
130\subsubsection*{Linux's CFS}
131\subsubsection*{FreeBSD scheduler}
132\subsubsection*{Windows OS Scheduler}
133Windows seems to use priority based scheduling, where the priorities can vary dynamically within some limit. Load-balancing across processors is handle using a work-sharing approch and has different rules based on whether or not there are idle processors.
134
135\subsubsection*{Windows User-Mode Scheduler}
136Windows seems to have M:N scheduling support but does not provide a default user-mode scheduler.
137
138\subsubsection*{Go}
139Go's scheduler uses a work-stealing algorithm without preemption that has a global runqueue(\emph{GRQ}) and each processor(\emph{P}) has a fixed-size runqueue(\emph{LRQ}).
140
141The algorithm is as follows :
142\begin{enumerate}
143        \item Once out of 61 times, directly pick 1 element from the \emph{GRQ}.
144        \item If there is a local next pick it.
145        \item Else pick an item from the \emph{LRQ}.
146        \item If it was empty steal (len(\emph{GRQ}) / \#of\emph{P}) + 1 items (max 256) from the \emph{GRQ}.
147        \item If it was empty steal \emph{half} the \emph{LRQ} of an other \emph{P}.
148\end{enumerate}
149
150Ignoring concerns of power consumption, this structure can lead to odd behaviour. Obviously, the fact that it is non-preemptable means that given a set of goroutines where $N$ of these never block (\eg CPU-bound producers) running on $P$ Processors (\emph{P}'s using the Go naming), if $N >= P$ than not all goroutines are guaranteed to ever make progress. However, this can also be the case even if $N < P$. Indeed, if $P - N$ processors can find a sustained amount of work without needing to steal, then indefinite starvation can still occur.
151
152Excluding cases due to the lack of preemption still leads to odd behavior. The fact that the LRQs have a fixed size can also effect the fairness of scheduling in drastic ways. The separation between \emph{LRQ} and \emph{GRQ} can lead to significant unfairness both in homogenous workloads and more heterogenous workloads.
153
154\subsubsection*{Erlang}
155Erlang uses a work-stealing schedulers with the addition that ``underloaded schedulers will also steal jobs from heavily overloaded schedulers in their migration paths''. Depending on the specifics of this heuristic
156
157\subsubsection*{Haskell}
158
159\subsubsection*{D}
160D does not appear to have an M:N threading model.
161\subsubsection*{Intel\textregistered ~Threading Building Blocks}
162https://software.intel.com/en-us/node/506295
163Intel's concurrency and parallelism library uses a more complicated scheduler which has multiple \glspl{Q} with various priority. However, these \glspl{Q} have a strict priority ordering meaning it is subject to tasks indefinetely starving lower priority tasks if programmers are not careful.
164\TODO test it
165
166\subsubsection*{Quasar}
167
168\subsubsection*{Grand Central Dispatch}
169
170\subsubsection*{LibFiber}
171
172\subsubsection*{Fiber Tasking Lib}
173Advertized in GDC and also uses work-stealing.
174
175
176\subsection{Scheduling Needs}
177Things to look at
178
179libuv : polling loop used for async I/O
180
181julia : uses libuv and has multi-threading
182
183
184
185Direction to look at :
186Static web-servers : mostly-I/O bound
187        - single or few thread, event driven
188        - thread per connections
189
190        examples:
191                - memcached
192                - apache
193                - nodejs stuff
194
195
196HPC workloads : compute bound
197
198
199% ===============================================================================
200% ===============================================================================
201
202\section{Overview of Scheduling for \CFA}
203
204\subsection{Scheduling : Core goals}
205
206\subsection{Requirements of the Scheduler Context}
207
208\subsection{Integration with I/O}
209
210\subsection{Blocking in Style}
211
212
213% ===============================================================================
214% ===============================================================================
215\section{Metrics of Scheduling} \label{metrics}
216Before starting to look into the design of the best scheduler for \CFA, it is important to take some time to detail what metrics to pick for scheduling.
217
218Here are a few metrics that should be considered.
219
220\paragraph{Throughput} is a fairly straightforward metric, it can be measured either in terms of how much time was spent to \glslink{atcomplet}{run to completion} for a fix number of \gls{at} or how many \gls{at} where \glslink{atrun}{ran} for a fix number of work.
221
222These two definitions are virtually interchangeable. However, since scheduling can affect application performance getting valid empirical measures for throughput can be difficult.
223
224\paragraph{Latency} measures how long an individual \gls{at} waited between when it was \glslink{atsched}{scheduled} and it was \glslink{atrun}{run}
225
226\paragraph{Fairness}
227
228\paragraph{Ressource Utilization}
229
230\paragraph{Application Performance}
231
232\section{The Core : Multi-Lane Scheduling}
233\subsection{Objectives}
234While work-stealing works well for both trivial cases, \ie all new \glspl{at} distributed evenly or all on a single work-queue, it handles poorly inbetween cases. As mentionned above, the goal is therefore to create a scheduler that is \emph{viable} for any workloads. More concretely, this means a scheduler that has good scalability and offers guarantees eventual progress. For the purpose of this document, eventual progress is defined as follows:
235
236\begin{itemize}
237        \item Any \Gls{at} that is \glslink{atsched}{scheduled} should eventually \glslink{atrun}{run}, regardless\footnote{In the context of guaranteeing eventual progress, we consider only normal program execution. . Depending on the chosen semantics, normal system shutdown can also prevent \glspl{at} from eventually running without considering the guarantee violated. } of any other \glspl{at} being scheduled (prior or otherwise).
238\end{itemize}
239
240Eventual progress is not guaranteed by work-stealing or work-sharing schedulers in every context. Indeed, as mentionned in \cit, when the system is \glslink{load}{loaded} neither work-stealing nor work-sharing guarantee eventual progress. These aberrant cases can be fixed with \gls{preemption}, they still show a fundamental fairness problem in the algorithm. We can offer a stricter guarantee of eventual progress by limitting the amount of \gls{at} \emph{\glslink{atpass}{overtaking}}. Indeed, cases where eventual progress is lost are cases that show \emph{unbounded} \glslink{atpass}{overtaking} and a such, a scheduler that limits \glslink{atpass}{overtaking} in general guarantees eventual progress.
241
242We can then define the first concrete goal of the \CFA scheduler as :
243\begin{itemize}
244        \item The odds that a \gls{at} is overtaken by $N$ other \glspl{at} decreases rapidly when $N$ increases.
245\end{itemize}
246
247As a second, more fuzzy, objective, the Cforall scheduler should also perform no worst than most existing scheduler for any workload.
248
249\subsection{Ideal description}
250The ideal scheduler should similarly to a multi-lane highway. If all lanes are equally fast, lane changes are to be avoided because they induce traffic jam. However, if some lanes are faster than others than lane changes will help balance the traffic.
251
252Similarly, a task should migrate in two cases:
253\begin{itemize}
254        \item When a worker just emptied its work-queue. (like work-stealing)
255        \item \emph{When a work-unit was overtaken too many times by work units in a different lane}.
256\end{itemize}
257
258
259\subsection{Practical terms}
260In practice, the highway lane analogy has to be adjusted slightly to make it practical. First, the \glspl{at} should always respect the order within a given lane. This means only the task at the front can migrate. This both enforces a stronger FIFO order and means that the scheduler can ignore all \glspl{at} that aren't at the front, simplifying processing. Furthermore, migrating \glspl{at} is only usefull when at least one worker is available, \ie looking for a task to run, because any migration made at any other time will only take effect at that moment.
261
262\subsection{Existing Work}
263
264\subsection{Cforall Today}
265
266
267
268% ===============================================================================
269% ===============================================================================
270\section{The Context : Scheduling in \CFA}
271\subsection{Dynamic Clusters}
272
273\subsection{Idle Sleep}
274
275\subsection{Locality}
276
277% ===============================================================================
278% ===============================================================================
279\section{Asynchronous IO}
280\subsection{Cooperation with the OS}
281
282\subsection{Framework Integration}
283
284% ===============================================================================
285% ===============================================================================
286\section{Blocking in Style}
287\subsection{Monitors and Baton-Passing}
288
289\subsection{Futures and Promises}
290
291
292% ===============================================================================
293% ===============================================================================
294\section{Current Work}
295
296\section{Conclusion}
297
298
299\cleardoublepage
300
301% B I B L I O G R A P H Y
302% -----------------------------
303\addcontentsline{toc}{chapter}{Bibliography}
304\bibliographystyle{plain}
305\bibliography{pl,local}
306\cleardoublepage
307\phantomsection         % allows hyperref to link to the correct page
308
309% G L O S S A R Y
310% -----------------------------
311\addcontentsline{toc}{chapter}{Glossary}
312\printglossary
313\cleardoublepage
314\phantomsection         % allows hyperref to link to the correct page
315
316\end{document}
Note: See TracBrowser for help on using the repository browser.