1 | \chapter{Introduction}\label{intro} |
---|
2 | \todo{A proper intro} |
---|
3 | |
---|
4 | Any shared system needs scheduling. |
---|
5 | Computer systems share multiple resources across many threads of execution, even on single user computers like a laptop. |
---|
6 | Scheduling occurs at discreet points when there are transitions in a system. |
---|
7 | For example, a thread cycles through the following transitions during its execution. |
---|
8 | \begin{center} |
---|
9 | \input{executionStates.pstex_t} |
---|
10 | \end{center} |
---|
11 | These \newterm{state transition}s are initiated in response to events (\Index{interrupt}s): |
---|
12 | \begin{itemize} |
---|
13 | \item |
---|
14 | entering the system (new $\rightarrow$ ready) |
---|
15 | \item |
---|
16 | timer alarm for preemption (running $\rightarrow$ ready) |
---|
17 | \item |
---|
18 | long term delay versus spinning (running $\rightarrow$ blocked) |
---|
19 | \item |
---|
20 | blocking ends, \ie network or I/O completion (blocked $\rightarrow$ ready) |
---|
21 | \item |
---|
22 | normal completion or error, \ie segment fault (running $\rightarrow$ halted) |
---|
23 | \end{itemize} |
---|
24 | Key to scheduling is that a thread cannot bypass the ``ready'' state during a transition so the scheduler maintains complete control of the system. |
---|
25 | |
---|
26 | In detail, in a computer system with multiple processors and work units, there exists the problem of mapping work onto processors in an efficient manner, called \newterm{scheduling}. |
---|
27 | These systems are normally \newterm{open}, meaning new work arrives from an external source or spawned from an existing work unit. |
---|
28 | Scheduling is a zero-sum game as computer processors normally have a fixed, maximum number of cycles per unit time. |
---|
29 | (Frequency scaling and turbot boost add a degree of complexity that can be ignored in this discussion without loss of generality.) |
---|
30 | |
---|
31 | On a computer system, the scheduler takes a sequence of work requests in the form of threads and attempts to complete the work, subject to performance objectives, such as resource utilization. |
---|
32 | A general-purpose dynamic-scheduler for an open system cannot anticipate future work requests, so its performance is rarely optimal. |
---|
33 | Even with complete knowledge of arrive order and work, an optimal solution is NP (bin packing). |
---|
34 | However, schedulers do take advantage of regularities in work patterns to produce excellent solutions. |
---|
35 | Nevertheless, schedulers are a series of compromises, occasionally with some static or dynamic tuning parameters to enhance specific patterns. |
---|
36 | |
---|
37 | When the workload exceeds the capacity of the processors, \ie work cannot be executed immediately, it is placed on a queue for subsequent service, called a \newterm{ready queue}. |
---|
38 | (In real-life, a work unit (person) can \newterm{balk}, and leave the system rather than wait.) |
---|
39 | Ready queues organize threads for scheduling, which indirectly organizes the work to be performed. |
---|
40 | The structure of ready queues forms a spectrum. |
---|
41 | At one end is the single-queue multi-server (SIMS) and at the other end is the multi-queue multi-server (MOMS). |
---|
42 | \begin{center} |
---|
43 | \begin{tabular}{l|l} |
---|
44 | \multicolumn{1}{c|}{\textbf{SIMS}} & \multicolumn{1}{c}{\textbf{MOMS}} \\ |
---|
45 | \hline |
---|
46 | \raisebox{0.5\totalheight}{\input{SQMS.pstex_t}} & \input{MQMSG.pstex_t} |
---|
47 | \end{tabular} |
---|
48 | \end{center} |
---|
49 | (While \newterm{pipeline} organizations also exist, \ie chaining schedulers together, they do not provide an advantage in this context.) |
---|
50 | |
---|
51 | The three major optimization criteria for a scheduler are: |
---|
52 | \begin{enumerate}[leftmargin=*] |
---|
53 | \item |
---|
54 | \newterm{load balancing}: available work is distributed so no processor is idle when work is available. |
---|
55 | |
---|
56 | \noindent |
---|
57 | Eventual progress for each work unit is often an important consideration, \ie no starvation. |
---|
58 | \item |
---|
59 | \newterm{affinity}: processors access state through a complex memory hierarchy, so it is advantageous to keep a work unit's state on a single or closely bound set of processors. |
---|
60 | |
---|
61 | \noindent |
---|
62 | Essentially, all multi-processor computers have non-uniform memory access (NUMB), with one or more quantized steps to access data at different levels (steps) in the memory hierarchy. |
---|
63 | When a system has a large number of independently executing threads, affinity is impossible because of \newterm{thread churn}. |
---|
64 | That is, threads must be scheduled on multiple processors to obtain high processors utilization because the number of threads $\ggg$ processors. |
---|
65 | |
---|
66 | \item |
---|
67 | \newterm{contention}: safe access of shared objects by multiple processors requires mutual exclusion in the form of locking\footnote{ |
---|
68 | Lock-free data-structures is still locking because all forms of locking invoke delays.} |
---|
69 | |
---|
70 | \noindent |
---|
71 | Locking cost and latency increases significantly with the number of processors accessing a shared object. |
---|
72 | \end{enumerate} |
---|
73 | |
---|
74 | SIMS has perfect load-balancing but poor affinity and high contention by the processors, because of the single queue. |
---|
75 | MOMS has poor load-balancing but perfect affinity and no contention, because each processor has its own queue. |
---|
76 | Between these two schedulers are a host of options, \ie adding an optional global, shared ready-queue to MOMS. |
---|
77 | Significant research effort has also looked at load sharing/stealing among queues, when a ready queue is too long or short, respectively. |
---|
78 | These approaches attempt to perform better load-balancing at the cost of affinity and contention. |
---|
79 | Load sharing/stealing schedulers attempt to push/pull work units to/from other ready queues |
---|
80 | |
---|
81 | Note, a computer system is often lightly loaded (or has no load); |
---|
82 | any scheduler can handle this situation, \ie all schedulers are equal in this context. |
---|
83 | As the workload increases, there is a point where some schedulers begin to perform better than others based on the above criteria. |
---|
84 | A poorer scheduler might saturate for some reason and not be able to assign available work to idle processors, \ie processors are spinning when work is available. |
---|
85 | |
---|
86 | |
---|
87 | \section{\CFA programming language} |
---|
88 | |
---|
89 | \todo{Brief discussion on \CFA features used in the thesis.} |
---|
90 | |
---|
91 | The \CFA programming language~\cite{cfa:frontpage,cfa:typesystem} extends the C programming language by adding modern safety and productivity features, while maintaining backwards compatibility. Among its productivity features, \CFA supports user-level threading~\cite{Delisle21} allowing programmers to write modern concurrent and parallel programs. |
---|
92 | My previous master's thesis on concurrent in \CFA focused on features and interfaces. |
---|
93 | This Ph.D.\ thesis focuses on performance, introducing \glsxtrshort{api} changes only when required by performance considerations. Specifically, this work concentrates on scheduling and \glsxtrshort{io}. Prior to this work, the \CFA runtime used a strict \glsxtrshort{fifo} \gls{rQ} and no non-blocking I/O capabilities at the user-thread level. |
---|
94 | |
---|
95 | As a research project, this work builds exclusively on newer versions of the Linux operating-system and gcc/clang compilers. While \CFA is released, supporting older versions of Linux ($<$~Ubuntu 16.04) and gcc/clang compilers ($<$~gcc 6.0) is not a goal of this work. |
---|
96 | |
---|
97 | |
---|
98 | \section{Contributions} |
---|
99 | \label{s:Contributions} |
---|
100 | |
---|
101 | This work provides the following contributions in the area of user-level scheduling in an advanced programming-language runtime-system: |
---|
102 | \begin{enumerate}[leftmargin=*] |
---|
103 | \item |
---|
104 | Guarantee no thread or set of threads can conspire to prevent other threads from executing. |
---|
105 | \begin{itemize} |
---|
106 | \item |
---|
107 | There must exist some form of preemption to prevent CPU-bound threads from gaining exclusive access to one or more processors. |
---|
108 | \item |
---|
109 | There must be a scheduler guarantee that threads are executed after CPU-bound threads are preempted. |
---|
110 | Otherwise, CPU-bound threads can immediately rescheduled, negating the preemption. |
---|
111 | \end{itemize} |
---|
112 | Hence, the runtime/scheduler provides a notion of fairness that is impossible for a programmer to violate. |
---|
113 | \item |
---|
114 | Provide comprehensive scheduling across all forms of preemption and blocking, where threads move from the running to ready or blocked states. |
---|
115 | |
---|
116 | Once a thread stops running, the processor executing it must be rescheduled, if possible. |
---|
117 | Knowing what threads are in the ready state for scheduling is difficult because blocking operations complete asynchronous and with poor notification. |
---|
118 | \item |
---|
119 | Efficiently deal with unbalanced workloads among processors. |
---|
120 | |
---|
121 | Simpler to |
---|
122 | \item |
---|
123 | Efficiently deal with idle processors when there is less work than the available computing capacity. |
---|
124 | \end{enumerate} |
---|