1 | \chapter{Introduction}\label{intro} |
---|
2 | \section{\CFA programming language} |
---|
3 | |
---|
4 | The \CFA programming language~\cite{cfa:frontpage,cfa:typesystem} extends the C programming language by adding modern safety and productivity features, while maintaining backwards compatibility. |
---|
5 | Among its productivity features, \CFA supports user-level threading~\cite{Delisle21} allowing programmers to write modern concurrent and parallel programs. |
---|
6 | My previous master's thesis on concurrent in \CFA focused on features and interfaces. |
---|
7 | This Ph.D.\ thesis focuses on performance, introducing \glsxtrshort{api} changes only when required by performance considerations. |
---|
8 | Specifically, this work concentrates on scheduling and \glsxtrshort{io}. |
---|
9 | Prior to this work, the \CFA runtime used a strict \glsxtrshort{fifo} \gls{rQ} and no \glsxtrshort{io} capabilities at the user-thread level\footnote{C supports \glsxtrshort{io} capabilities at the kernel level, which means blocking operations block kernel threads where blocking user-level threads whould be more appropriate for \CFA.}. |
---|
10 | |
---|
11 | As a research project, this work builds exclusively on newer versions of the Linux operating-system and gcc/clang compilers. |
---|
12 | While \CFA is released, supporting older versions of Linux ($<$~Ubuntu 16.04) and gcc/clang compilers ($<$~gcc 6.0) is not a goal of this work. |
---|
13 | |
---|
14 | \section{Scheduling} |
---|
15 | Computer systems share multiple resources across many threads of execution, even on single user computers like laptops or smartphones. |
---|
16 | On a computer system with multiple processors and work units, there exists the problem of mapping work onto processors in an efficient manner, called \newterm{scheduling}. |
---|
17 | These systems are normally \newterm{open}, meaning new work arrives from an external source or is spawned from an existing work unit. |
---|
18 | On a computer system, the scheduler takes a sequence of work requests in the form of threads and attempts to complete the work, subject to performance objectives, such as resource utilization. |
---|
19 | A general-purpose dynamic-scheduler for an open system cannot anticipate future work requests, so its performance is rarely optimal. |
---|
20 | With complete knowledge of arrive order and work, creating an optimal solution still effectively needs solving the bin packing problem\cite{wiki:binpak}. |
---|
21 | However, optimal solutions are often not required. |
---|
22 | Schedulers do produce excellent solutions, whitout needing optimality, by taking advantage of regularities in work patterns. |
---|
23 | |
---|
24 | Scheduling occurs at discreet points when there are transitions in a system. |
---|
25 | For example, a thread cycles through the following transitions during its execution. |
---|
26 | \begin{center} |
---|
27 | \input{executionStates.pstex_t} |
---|
28 | \end{center} |
---|
29 | These \newterm{state transition}s are initiated in response to events (\Index{interrupt}s): |
---|
30 | \begin{itemize} |
---|
31 | \item |
---|
32 | entering the system (new $\rightarrow$ ready) |
---|
33 | \item |
---|
34 | timer alarm for preemption (running $\rightarrow$ ready) |
---|
35 | \item |
---|
36 | long term delay versus spinning (running $\rightarrow$ blocked) |
---|
37 | \item |
---|
38 | blocking ends, \ie network or I/O completion (blocked $\rightarrow$ ready) |
---|
39 | \item |
---|
40 | normal completion or error, \ie segment fault (running $\rightarrow$ halted) |
---|
41 | \item |
---|
42 | scheduler assigns a thread to a resource (ready $\rightarrow$ running) |
---|
43 | \end{itemize} |
---|
44 | Key to scheduling is that a thread cannot bypass the ``ready'' state during a transition so the scheduler maintains complete control of the system. |
---|
45 | |
---|
46 | When the workload exceeds the capacity of the processors, \ie work cannot be executed immediately, it is placed on a queue for subsequent service, called a \newterm{ready queue}. |
---|
47 | Ready queues organize threads for scheduling, which indirectly organizes the work to be performed. |
---|
48 | The structure of ready queues can take many different forms. |
---|
49 | Where simple examples include single-queue multi-server (SQMS) and the multi-queue multi-server (MQMS). |
---|
50 | \begin{center} |
---|
51 | \begin{tabular}{l|l} |
---|
52 | \multicolumn{1}{c|}{\textbf{SQMS}} & \multicolumn{1}{c}{\textbf{MQMS}} \\ |
---|
53 | \hline |
---|
54 | \raisebox{0.5\totalheight}{\input{SQMS.pstex_t}} & \input{MQMSG.pstex_t} |
---|
55 | \end{tabular} |
---|
56 | \end{center} |
---|
57 | Beyond these two schedulers are a host of options, \ie adding an optional global, shared queue to MQMS. |
---|
58 | |
---|
59 | The three major optimization criteria for a scheduler are: |
---|
60 | \begin{enumerate}[leftmargin=*] |
---|
61 | \item |
---|
62 | \newterm{load balancing}: available work is distributed so no processor is idle when work is available. |
---|
63 | |
---|
64 | \noindent |
---|
65 | Eventual progress for each work unit is often an important consideration, \ie no starvation. |
---|
66 | \item |
---|
67 | \newterm{affinity}: processors access state through a complex memory hierarchy, so it is advantageous to keep a work unit's state on a single or closely bound set of processors. |
---|
68 | |
---|
69 | \noindent |
---|
70 | Essentially, all multi-processor computers have non-uniform memory access (NUMA), with one or more quantized steps to access data at different levels in the memory hierarchy. |
---|
71 | When a system has a large number of independently executing threads, affinity becomes difficult because of \newterm{thread churn}. |
---|
72 | That is, threads must be scheduled on multiple processors to obtain high processors utilization because the number of threads $\ggg$ processors. |
---|
73 | |
---|
74 | \item |
---|
75 | \newterm{contention}: safe access of shared objects by multiple processors requires mutual exclusion in some form, generally locking\footnote{ |
---|
76 | Lock-free data-structures do not involve locking but incurr similar costs to achieve mutual exclusion.} |
---|
77 | |
---|
78 | \noindent |
---|
79 | Mutual exclusion cost and latency increases significantly with the number of processors accessing a shared object. |
---|
80 | \end{enumerate} |
---|
81 | |
---|
82 | Nevertheless, schedulers are a series of compromises, occasionally with some static or dynamic tuning parameters to enhance specific patterns. |
---|
83 | Scheduling is a zero-sum game as computer processors normally have a fixed, maximum number of cycles per unit time\footnote{Frequency scaling and turbot boost add a degree of complexity that can be ignored in this discussion without loss of generality.}. |
---|
84 | SQMS has perfect load-balancing but poor affinity and high contention by the processors, because of the single queue. |
---|
85 | MQMS has poor load-balancing but perfect affinity and no contention, because each processor has its own queue. |
---|
86 | |
---|
87 | Significant research effort has also looked at load sharing/stealing among queues, when a ready queue is too long or short, respectively. |
---|
88 | These approaches attempt to perform better load-balancing at the cost of affinity and contention. |
---|
89 | Load sharing/stealing schedulers attempt to push/pull work units to/from other ready queues |
---|
90 | |
---|
91 | Note however that while any change comes at a cost, hence the zero-sum game, not all compromises are necessarily equivalent. |
---|
92 | Some schedulers can perform very well only in very specific workload scenarios, others might offer acceptable performance but be applicable to a wider range of workloads. |
---|
93 | Since \CFA attempts to improve the safety and productivity of C, the scheduler presented in this thesis attempts to achieve the same goals. |
---|
94 | More specifically, safety and productivity for scheduling means supporting a wide range of workloads so that programmers can rely on progress guarantees (safety) and more easily achieve acceptable performance (productivity). |
---|
95 | |
---|
96 | |
---|
97 | \section{Contributions}\label{s:Contributions} |
---|
98 | This work provides the following contributions in the area of user-level scheduling in an advanced programming-language runtime-system: |
---|
99 | \begin{enumerate}[leftmargin=*] |
---|
100 | \item |
---|
101 | A scalable scheduling algorithm that offers progress guarantees. |
---|
102 | \item |
---|
103 | An algorithm for load-balancing and idle sleep of processors, including NUMA awareness. |
---|
104 | \item |
---|
105 | Support for user-level \glsxtrshort{io} capabilities based on Linux's @io_uring@. |
---|
106 | \end{enumerate} |
---|