# source:doc/theses/thierry_delisle_PhD/thesis/text/intro.tex

Last change on this file was 6e8d446, checked in by Peter A. Buhr <pabuhr@…>, 15 hours ago

add some outline material for the introduction

• Property mode set to 100644
File size: 7.9 KB
RevLine
[6e8d446]1\chapter{Introduction}\label{intro}
[b9537e6]2\todo{A proper intro}
3
[6e8d446]4Any shared system needs scheduling.
5Computer systems share multiple resources across many threads of execution, even on single user computers like a laptop.
6Scheduling occurs at discreet points when there are transitions in a system.
7For example, a thread cycles through the following transitions during its execution.
8\begin{center}
9\input{executionStates.pstex_t}
10\end{center}
11These \newterm{state transition}s are initiated in response to events (\Index{interrupt}s):
12\begin{itemize}
13\item
14entering the system (new $\rightarrow$ ready)
15\item
16timer alarm for preemption (running $\rightarrow$ ready)
17\item
18long term delay versus spinning (running $\rightarrow$ blocked)
19\item
20blocking ends, \ie network or I/O completion (blocked $\rightarrow$ ready)
21\item
22normal completion or error, \ie segment fault (running $\rightarrow$ halted)
23\end{itemize}
24Key to scheduling is that a thread cannot bypass the ready'' state during a transition so the scheduler maintains complete control of the system.
25
26In detail, in a computer system with multiple processors and work units, there exists the problem of mapping work onto processors in an efficient manner, called \newterm{scheduling}.
27These systems are normally \newterm{open}, meaning new work arrives from an external source or spawned from an existing work unit.
28Scheduling is a zero-sum game as computer processors normally have a fixed, maximum number of cycles per unit time.
29(Frequency scaling and turbot boost add a degree of complexity that can be ignored in this discussion without loss of generality.)
30
31On a computer system, the scheduler takes a sequence of work requests in the form of threads and attempts to complete the work, subject to performance objectives, such as resource utilization.
32A general-purpose dynamic-scheduler for an open system cannot anticipate future work requests, so its performance is rarely optimal.
33Even with complete knowledge of arrive order and work, an optimal solution is NP (bin packing).
34However, schedulers do take advantage of regularities in work patterns to produce excellent solutions.
35Nevertheless, schedulers are a series of compromises, occasionally with some static or dynamic tuning parameters to enhance specific patterns.
36
37When the workload exceeds the capacity of the processors, \ie work cannot be executed immediately, it is placed on a queue for subsequent service, called a \newterm{ready queue}.
38(In real-life, a work unit (person) can \newterm{balk}, and leave the system rather than wait.)
39Ready queues organize threads for scheduling, which indirectly organizes the work to be performed.
40The structure of ready queues forms a spectrum.
41At one end is the single-queue multi-server (SIMS) and at the other end is the multi-queue multi-server (MOMS).
42\begin{center}
43\begin{tabular}{l|l}
44\multicolumn{1}{c|}{\textbf{SIMS}} & \multicolumn{1}{c}{\textbf{MOMS}} \\
45\hline
46\raisebox{0.5\totalheight}{\input{SQMS.pstex_t}} & \input{MQMSG.pstex_t}
47\end{tabular}
48\end{center}
49(While \newterm{pipeline} organizations also exist, \ie chaining schedulers together, they do not provide an advantage in this context.)
50
51The three major optimization criteria for a scheduler are:
52\begin{enumerate}[leftmargin=*]
53\item
54\newterm{load balancing}: available work is distributed so no processor is idle when work is available.
55
56\noindent
57Eventual progress for each work unit is often an important consideration, \ie no starvation.
58\item
59\newterm{affinity}: processors access state through a complex memory hierarchy, so it is advantageous to keep a work unit's state on a single or closely bound set of processors.
60
61\noindent
62Essentially, all multi-processor computers have non-uniform memory access (NUMB), with one or more quantized steps to access data at different levels (steps) in the memory hierarchy.
63When a system has a large number of independently executing threads, affinity is impossible because of \newterm{thread churn}.
64That is, threads must be scheduled on multiple processors to obtain high processors utilization because the number of threads $\ggg$ processors.
65
66\item
67\newterm{contention}: safe access of shared objects by multiple processors requires mutual exclusion in the form of locking\footnote{
68Lock-free data-structures is still locking because all forms of locking invoke delays.}
69
70\noindent
71Locking cost and latency increases significantly with the number of processors accessing a shared object.
72\end{enumerate}
73
74SIMS has perfect load-balancing but poor affinity and high contention by the processors, because of the single queue.
75MOMS has poor load-balancing but perfect affinity and no contention, because each processor has its own queue.
76Between these two schedulers are a host of options, \ie adding an optional global, shared ready-queue to MOMS.
77Significant research effort has also looked at load sharing/stealing among queues, when a ready queue is too long or short, respectively.
78These approaches attempt to perform better load-balancing at the cost of affinity and contention.
79Load sharing/stealing schedulers attempt to push/pull work units to/from other ready queues
80
81Note, a computer system is often lightly loaded (or has no load);
82any scheduler can handle this situation, \ie all schedulers are equal in this context.
83As the workload increases, there is a point where some schedulers begin to perform better than others based on the above criteria.
84A poorer scheduler might saturate for some reason and not be able to assign available work to idle processors, \ie processors are spinning when work is available.
85
86
87\section{\CFA programming language}
88
89\todo{Brief discussion on \CFA features used in the thesis.}
[b9537e6]90
[512d3dc]91The \CFA programming language~\cite{cfa:frontpage,cfa:typesystem} extends the C programming language by adding modern safety and productivity features, while maintaining backwards compatibility. Among its productivity features, \CFA supports user-level threading~\cite{Delisle21} allowing programmers to write modern concurrent and parallel programs.
92My previous master's thesis on concurrent in \CFA focused on features and interfaces.
93This Ph.D.\ thesis focuses on performance, introducing \glsxtrshort{api} changes only when required by performance considerations. Specifically, this work concentrates on scheduling and \glsxtrshort{io}. Prior to this work, the \CFA runtime used a strict \glsxtrshort{fifo} \gls{rQ} and  no non-blocking I/O capabilities at the user-thread level.
[b9537e6]94
[512d3dc]95As a research project, this work builds exclusively on newer versions of the Linux operating-system and gcc/clang compilers. While \CFA is released, supporting older versions of Linux ($<$~Ubuntu 16.04) and gcc/clang compilers ($<$~gcc 6.0) is not a goal of this work.
[6e8d446]96
97
98\section{Contributions}
99\label{s:Contributions}
100
101This work provides the following contributions in the area of user-level scheduling in an advanced programming-language runtime-system:
102\begin{enumerate}[leftmargin=*]
103\item
105\begin{itemize}
106\item
107There must exist some form of preemption to prevent CPU-bound threads from gaining exclusive access to one or more processors.
108\item
109There must be a scheduler guarantee that threads are executed after CPU-bound threads are preempted.
110Otherwise, CPU-bound threads can immediately rescheduled, negating the preemption.
111\end{itemize}
112Hence, the runtime/scheduler provides a notion of fairness that is impossible for a programmer to violate.
113\item
114Provide comprehensive scheduling across all forms of preemption and blocking, where threads move from the running to ready or blocked states.
115
116Once a thread stops running, the processor executing it must be rescheduled, if possible.
117Knowing what threads are in the ready state for scheduling is difficult because blocking operations complete asynchronous and with poor notification.
118\item
119Efficiently deal with unbalanced workloads among processors.
120
121Simpler to
122\item
123Efficiently deal with idle processors when there is less work than the available computing capacity.
124\end{enumerate}
Note: See TracBrowser for help on using the repository browser.