source: doc/theses/thierry_delisle_PhD/thesis/text/existing.tex @ 0c51c8b4

Last change on this file since 0c51c8b4 was ddcaff6, checked in by Thierry Delisle <tdelisle@…>, 2 years ago

Last corrections to my thesis... hopefully

  • Property mode set to 100644
File size: 21.7 KB
Line 
1\chapter{Previous Work}\label{existing}
2As stated, scheduling is the process of assigning resources to incoming requests, where the common example is assigning available workers to work requests or vice versa.
3Common scheduling examples in Computer Science are: operating systems and hypervisors schedule available CPUs, NICs schedule available bandwidth, virtual memory and memory allocator schedule available storage, \etc.
4Scheduling is also common in most other fields; \eg in assembly lines, assigning parts to line workers is a form of scheduling.
5
6In general, \emph{selecting} a scheduling algorithm depends on how much information is available to the scheduler.
7Workloads that are well known, consistent, and homogeneous can benefit from a scheduler that is optimized to use this information, while ill-defined, inconsistent, heterogeneous workloads require general non-optimal algorithms.
8A secondary aspect is how much information can be gathered versus how much information must be given as part of the scheduler input.
9This information adds to the spectrum of scheduling algorithms, going from static schedulers that are well informed from the start, to schedulers that gather most of the information needed, to schedulers that can only rely on very limited information.
10This description includes both information about each request, \eg time to complete or resources needed, and information about the relationships among requests, \eg whether some requests must be completed before another request starts.
11
12Scheduling physical resources, \eg in an assembly line, is generally amenable to using well-informed scheduling since information can be gathered much faster than the physical resources can be assigned, and workloads are likely to stay stable for long periods.
13When a faster pace is needed and changes are much more frequent, then gathering information on workloads, up-front or live, can become much more limiting and more general schedulers are needed.
14
15\section{Naming Convention}
16Scheduling has been studied by various communities, concentrating on different incarnations of the same problems.
17As a result, there are no standard naming conventions for scheduling that are respected across these communities.
18This document uses the term \newterm{\Gls{at}} to refer to the abstract objects being scheduled and the term \newterm{\Gls{proc}} to refer to the concrete objects executing these \ats.
19
20\section{Static Scheduling}
21\newterm{Static schedulers} require \at dependencies and costs to be explicitly and exhaustively specified prior to scheduling.
22The scheduler then processes this input ahead of time and produces a \newterm{schedule} the system follows during execution.
23This approach is popular in real-time systems since the need for strong guarantees justifies the cost of determining and supplying this information.
24In general, static schedulers are less relevant to this project because they require input from the programmers that the \CFA programming language does not have as part of its concurrency semantics.
25Specifying this information explicitly adds a significant burden to the programmer and reduces flexibility.
26For this reason, the \CFA scheduler does not require this information.
27
28\section{Dynamic Scheduling}
29\newterm{Dynamic schedulers} detect \at dependencies and costs during scheduling, if at all.
30This detection takes the form of observing new \ats in the system and determining dependencies from their behaviour, where a \at suspends or halts dynamically when it detects unfulfilled dependencies.
31Furthermore, each \at has the responsibility of adding dependent \ats back into the system once dependencies are fulfilled.
32As a consequence, the scheduler often has an incomplete view of the system, seeing only \ats with no pending dependencies.
33
34\subsection{Explicitly Informed Dynamic Schedulers}
35While dynamic schedulers may not have an exhaustive list of dependencies for a \at, some information may be available about each \at, \eg expected duration, required resources, relative importance, \etc.
36When available, a scheduler can then use this information to direct scheduling decisions.
37For example, when scheduling in a cloud computing context, \ats will commonly have extra information that was manually entered, \eg caps on compute time or \io usage.
38However, in the context of user-level threading, most programmers do not determine or even \emph{predict} this information;
39at best, the scheduler has only some imprecise information provided by the programmer, \eg, indicating a \at takes approximately 3--7 seconds to complete, rather than exactly 5 seconds.
40Providing this kind of information is a significant programmer burden, especially if the information does not scale with the number of \ats and their complexity.
41For example, providing an exhaustive list of files read by 5 \ats is an easier requirement than providing an exhaustive list of memory addresses accessed by 10,000 independent \ats.
42
43Since the goal of this thesis is to provide a scheduler as a replacement for \CFA's existing \emph{uninformed} scheduler, explicitly informed schedulers are less relevant to this project. Nevertheless, some strategies are worth mentioning.
44
45\subsubsection{Priority Scheduling}
46A common approach to direct the scheduling algorithm is to add information about \at priority.
47Each \at is given a priority, and higher-priority \ats are preferred to lower-priority ones.
48The simplest priority scheduling algorithm is to require that every \at have a distinct pre-established priority and always run the available \ats with the highest priority.
49Asking programmers to provide an exhaustive set of unique priorities can be prohibitive when the system has a large number of \ats.
50It can therefore be desirable for schedulers to support \ats with identical priorities and/or automatically set and adjust priorities for \ats.
51Most common operating systems use some variant on priorities with overlaps and dynamic priority adjustments.
52For example, Microsoft Windows uses a pair of priorities~\cite{win:priority}, one specified by users out of ten possible options and one adjusted by the system.
53
54\subsection{Uninformed and Self-Informed Dynamic Schedulers}
55Several scheduling algorithms do not require programmers to provide additional information on each \at, and instead, make scheduling decisions based solely on internal state and/or information implicitly gathered by the scheduler.
56
57
58\subsubsection{Feedback Scheduling}
59As mentioned, schedulers may also gather information about each \at to direct their decisions.
60This design effectively moves the scheduler into the realm of \newterm{Control Theory}~\cite{wiki:controltheory}.
61This information gathering does not generally involve programmers, and as such, does not increase programmer burden the same way explicitly provided information may.
62However, some feedback schedulers do allow programmers to offer additional information on certain \ats, to direct scheduling decisions.
63The important distinction is whether the scheduler can function without this additional information.
64
65
66\section{Work Stealing}\label{existing:workstealing}
67One of the most popular scheduling algorithms in practice (see~\ref{existing:prod}) is work stealing.
68This idea, introduced by \cite{DBLP:conf/fpca/BurtonS81}, effectively has each worker process its local \ats first but allows the possibility for other workers to steal local \ats if they run out of \ats.
69\cite{DBLP:conf/focs/Blumofe94} introduced the more familiar incarnation of this, where each worker has a queue of \ats and workers without \ats steal \ats from random workers\footnote{The Burton and Sleep algorithm has trees of \ats and steals only among neighbours.}.
70Blumofe and Leiserson also prove worst-case space and time requirements for well-structured computations.
71
72Many variations of this algorithm have been proposed over the years~\cite{DBLP:journals/ijpp/YangH18}, both optimizations of existing implementations and approaches that account for new metrics.
73
74\paragraph{Granularity} A significant portion of early work-stealing research concentrated on \newterm{Implicit Parallelism}~\cite{wiki:implicitpar}.
75Since the system is responsible for splitting the work, granularity is a challenge that cannot be left to programmers, as opposed to \newterm{Explicit Parallelism}\cite{wiki:explicitpar} where the burden can be left to programmers.
76In general, fine granularity is better for load balancing and coarse granularity reduces communication overhead.
77The best performance generally means finding a middle ground between the two.
78Several methods can be employed, but I believe these are less relevant for threads, which are generally explicit and more coarse-grained.
79
80\paragraph{Task Placement} Another aspect of work stealing that has been studied extensively is the mapping between \at and \proc.
81In its simplest form, work stealing assumes that all \procs are interchangeable and therefore the mapping between \at and \proc is not interesting.
82However, in real-life architectures, there are contexts where different \procs can have different characteristics, which makes some mappings more interesting than others.
83A common example where this is statically true is architectures with \glsxtrshort{numa}.
84In these cases, it can be relevant to change the scheduler to be cognizant of the topology~\cite{vikranth2013topology,min2011hierarchical}.
85Another example is energy usage, where the scheduler is modified to optimize for energy efficiency in addition/instead of performance~\cite{ribic2014energy,torng2016asymmetry}.
86
87\paragraph{Complex Machine Architecture} Another aspect that has been examined is how applicable work stealing is to different machine architectures.
88This is arguably strongly related to Task Placement but extends into more heterogeneous architectures.
89As \CFA offers no particular support for heterogeneous architectures, this is also an area that is not examined in this thesis.
90However, support for concurrency across heterogeneous architectures is interesting avenue for future work, at which point the literature on this topic and how it relates to scheduling will become relevant.
91
92\subsection{Theoretical Results}
93There is also a large body of research on the theoretical aspects of work stealing. These evaluate, for example, the cost of \glslink{atmig}{migration}~\cite{DBLP:conf/sigmetrics/SquillanteN91,DBLP:journals/pe/EagerLZ86}, how affinity affects performance~\cite{DBLP:journals/tpds/SquillanteL93,DBLP:journals/mst/AcarBB02,DBLP:journals/ipl/SuksompongLS16} and theoretical models for heterogeneous systems~\cite{DBLP:journals/jpdc/MirchandaneyTS90,DBLP:journals/mst/BenderR02,DBLP:conf/sigmetrics/GastG10}.
94Blelloch et al.~\cite{DBLP:journals/jacm/BlellochGM99} examines the space bounds of work stealing and \cite{DBLP:journals/siamcomp/BerenbrinkFG03} shows that for under-loaded systems, the scheduler completes its computations in finite time, \ie is \newterm{stable}.
95Others show that work stealing applies to various scheduling contexts~\cite{DBLP:journals/mst/AroraBP01,DBLP:journals/anor/TchiboukdjianGT13,DBLP:conf/isaac/TchiboukdjianGTRB10,DBLP:conf/ppopp/AgrawalLS10,DBLP:conf/spaa/AgrawalFLSSU14}.
96\cite{DBLP:conf/ipps/ColeR13} also studied how randomized work-stealing affects false sharing among \ats.
97
98However, as \cite{DBLP:journals/ijpp/YangH18} highlights, it is worth mentioning that this theoretical research has mainly focused on ``fully strict'' computations, \ie workloads that can be fully represented with a direct acyclic graph.
99It is unclear how well these distributions represent workloads in real-world scenarios.
100
101\section{Preemption}
102One last aspect of scheduling is preemption since many schedulers rely on it for some of their guarantees.
103Preemption is the idea of interrupting \ats that have been running too long, effectively injecting suspend points into the application.
104There are multiple techniques to achieve this effect, but they all aim to guarantee that the suspend points in a \at are never further apart than some fixed duration.
105This helps schedulers guarantee that no \ats unfairly monopolize a worker.
106Preemption can effectively be added to any scheduler.
107Therefore, the only interesting aspect of preemption for the design of scheduling is whether to require it.
108
109\section{Production Schedulers}\label{existing:prod}
110This section presents a quick overview of several current schedulers.
111While these schedulers do not necessarily represent the most recent advances in scheduling, they are what is generally accessible to programmers.
112As such, I believe these schedulers are as relevant as those presented in published work.
113Both schedulers that operate in kernel space and user space are considered, as both can offer relevant insight for this project.
114However, real-time schedulers aim to guarantee bounded compute time in order to meet deadlines.
115These deadlines lead to constraints much stricter than the starvation freedom that is needed for this project.
116As such real-time schedulers are not considered for this work.
117
118\subsection{Operating System Schedulers}
119Operating System Schedulers tend to be fairly complex, as they generally support some amount of real time, aim to balance interactive and non-interactive \ats and support multiple users sharing hardware without requiring these users to cooperate.
120Here are more details on a few schedulers used in the common operating systems: Linux, FreeBSD, Microsoft Windows and Apple's OS X.
121The information is less complete for closed source operating systems.
122
123\paragraph{Linux's CFS}
124The default scheduler used by Linux, the Completely Fair Scheduler~\cite{MAN:linux/cfs,MAN:linux/cfs2}, is a feedback scheduler based on CPU time.
125For each processor, it constructs a Red-Black tree of \ats waiting to run, ordering them by the amount of CPU time used.
126The \at that has used the least CPU time is scheduled.
127It also supports the concept of \newterm{Nice values}, which are effectively multiplicative factors on the CPU time used.
128The ordering of \ats is also affected by a group-based notion of fairness, where \ats belonging to groups having used less CPU time are preferred to \ats belonging to groups having used more CPU time.
129Linux achieves load-balancing by regularly monitoring the system state~\cite{MAN:linux/cfs/balancing} and using some heuristic on the \gls{load}, currently CPU time used in the last millisecond plus a decayed version of the previous time slots~\cite{MAN:linux/cfs/pelt}.
130
131\cite{DBLP:conf/eurosys/LoziLFGQF16} shows that Linux's CFS also does work stealing to balance the workload of each \proc, but the paper argues this aspect can be improved significantly.
132The issues highlighted stem from Linux's need to support fairness across \ats \emph{and} across users\footnote{Enforcing fairness across users means that given two users, one with a single \at and the other with one thousand \ats, the user with a single \at does not receive one-thousandth of the CPU time.}, increasing the complexity.
133
134Linux also offers a FIFO scheduler, a real-time scheduler, which runs the highest-priority \ats, and a round-robin scheduler, which is an extension of the FIFO scheduler that adds fixed time slices. \cite{MAN:linux/sched}
135
136\paragraph{FreeBSD}
137The ULE scheduler used in FreeBSD\cite{DBLP:conf/bsdcon/Roberson03} is a feedback scheduler similar to Linux's CFS.
138It uses different data structures and heuristics but also schedules according to some combination of CPU time used and niceness values.
139It also periodically balances the load of the system (according to a different heuristic) but uses a simpler work stealing approach.
140
141\paragraph{Windows (OS)}
142Microsoft's Operating System's Scheduler~\cite{MAN:windows/scheduler} is a feedback scheduler with priorities.
143It supports 32 levels of priorities, some of which are reserved for real-time and privileged applications.
144It schedules \ats based on the highest priorities (lowest number) and how much CPU time each \at has used.
145The scheduler may also temporarily adjust priorities after certain effects like the completion of I/O requests.
146
147The scheduling policy is discussed more in-depth in~\cite{russinovich2009windows}, Chapter 1 section 2.3 ``Processes, Threads, and Jobs''.
148Multicore scheduling is based on a combination of priorities and \proc preference.
149Each \at is assigned an initial processor using a round-robin policy, called the \at's \newterm{ideal} \proc.
150\Glspl{at} are distributed among the \procs according to their priority, preferring to match \ats to their ideal \proc and then to the last \proc they ran on.
151This approach is a variation of work stealing, where the stealing \proc restores the \at to its original \proc after running it, but mixed with priorities.
152
153\paragraph{Apple OS X}
154Apple programming documentation describes its kernel scheduler as follows:
155\begin{displayquote}
156        OS X is based on Mach and BSD. [...]. It contains an advanced scheduler based on the CMU Mach 3 scheduler. [...] Mach scheduling is based on a system of run queues at various priorities.
157
158        \begin{flushright}
159                -- Kernel Programming Guide \cite{MAN:apple/scheduler}
160        \end{flushright}
161\end{displayquote}
162
163There is very little documentation on the internals of this scheduler.
164However, the documentation does describe a feature set that is very similar to the Windows and Linux OS schedulers.
165Presumably, this means that the internals are also fairly similar overall.
166
167\subsection{User-Level Schedulers}
168By comparison, user-level schedulers tend to be simpler, gather fewer metrics and avoid complex notions of fairness. Part of the simplicity is due to the fact that all \ats have the same user, and therefore cooperation is both feasible and probable.
169
170\paragraph{Go}\label{GoSafePoint}
171Go's scheduler uses a randomized work-stealing algorithm that has a global run-queue (\emph{GRQ}) and each processor (\emph{P}) has both a fixed-size run-queue (\emph{LRQ}) and a high-priority next ``chair'' holding a single element~\cite{GITHUB:go,YTUBE:go}.
172Preemption is present, but only at safe points~\cite{go:safepoints}, which are detection code inserted at various frequent access boundaries.
173
174The algorithm is as follows :
175\begin{enumerate}
176        \item Once out of 61 times, pick 1 element from the \emph{GRQ}.
177        \item Otherwise, if there is an item in the ``chair'' pick it.
178        \item Else pick an item from the \emph{LRQ}.
179        \begin{itemize}
180        \item If it is empty steal (len(\emph{GRQ}) / \#of\emph{P}) + 1 items (max 256) from the \emph{GRQ}
181        \item and steal \emph{half} the \emph{LRQ} of another \emph{P} chosen randomly.
182        \end{itemize}
183\end{enumerate}
184
185Chapter~\ref{microbench} uses Go as one of its comparison point in this thesis's performance evaluation.
186
187\paragraph{Erlang}
188Erlang is a functional language that supports concurrency in the form of processes: threads that share no data.
189It uses a kind of round-robin scheduler, with a mix of work sharing and stealing to achieve load balancing~\cite{:erlang}, where under-loaded workers steal from other workers, but overloaded workers also push work to other workers.
190This \glslink{atmig}{migration} logic is directed by monitoring logic that evaluates the load a few times per second.
191
192\paragraph{Intel\textregistered ~Threading Building Blocks}
193\newterm{Thread Building Blocks} (TBB) is Intel's task parallelism \cite{wiki:taskparallel} framework.
194It runs \newterm{jobs}, which are uninterruptible \ats that must always run to completion, on a pool of worker threads.
195TBB's scheduler is a variation of randomized work-stealing that also supports higher-priority graph-like dependencies~\cite{MAN:tbb/scheduler}.
196It schedules \ats as follows (where \textit{t} is the last \at completed):
197\begin{displayquote}
198        \begin{enumerate}
199                \item The task returned by \textit{t}@.execute()@
200                \item The successor of t if \textit{t} was its last completed predecessor.
201                \item A task popped from the end of the thread's own queue.
202                \item A task with an affinity for the thread.
203                \item A task popped from approximately the beginning of the shared queue.
204                \item A task popped from the beginning of another randomly chosen thread's queue.
205        \end{enumerate}
206
207        \begin{flushright}
208                -- Intel\textregistered ~TBB documentation \cite{MAN:tbb/scheduler}
209        \end{flushright}
210\end{displayquote}
211
212\paragraph{Quasar/Project Loom}
213Java has two projects, Quasar~\cite{MAN:quasar} and Project Loom~\cite{MAN:project-loom}\footnote{It is unclear if these are distinct projects.}, that are attempting to introduce lightweight thread\-ing in the form of Fibers.
214Both projects seem to be based on the @ForkJoinPool@ in Java, which appears to be a simple incarnation of randomized work-stealing~\cite{MAN:java/fork-join}.
215
216\paragraph{Grand Central Dispatch}
217An Apple~\cite{apple:gcd} API that offers task parallelism~\cite{wiki:taskparallel}.
218Its distinctive aspect is multiple ``Dispatch Queues'', some of which are created by programmers.
219Each queue has its own local ordering guarantees, \eg \ats on queue $A$ are executed in \emph{FIFO} order.
220
221While the documentation only gives limited insight into the scheduling and load balancing approach, \cite{apple:gcd2} suggests a fairly classic approach.
222Each \proc has a queue of \ats to run, called \newterm{blocks}, which are drained in \glsxtrshort{fifo}.
223GCD also has secondary queues, called \newterm{Dispatch Queues}, with clear ordering, where executing a block ends up scheduling more blocks.
224In terms of semantics, these Dispatch Queues seem to be very similar to Intel\textregistered ~TBB \lstinline{execute()} and predecessor semantics.
225
226The similarity of API and semantics between GCD and Intel\textregistered ~TBB suggest the underlying scheduling algorithms are similar.
227
228\paragraph{LibFibre}
229LibFibre~\cite{DBLP:journals/pomacs/KarstenB20} is a lightweight user-level threading framework developed at the University of Waterloo.
230It shares a very strong resemblance to Go: using a variation of work stealing with a global queue that has a higher priority than stealing.
231Unlike Go, it does not have the high-priority next ``chair'' and its work-stealing is not randomized.
232
233Chapter~\ref{microbench} uses LibFibre as one of its comparison point in this thesis's performance evaluation.
Note: See TracBrowser for help on using the repository browser.