[6e8d446] | 1 | \chapter{Introduction}\label{intro} |
---|
[adf03a6] | 2 | \section{\CFA programming language} |
---|
| 3 | |
---|
| 4 | The \CFA programming language~\cite{cfa:frontpage,cfa:typesystem} extends the C programming language by adding modern safety and productivity features, while maintaining backwards compatibility. |
---|
| 5 | Among its productivity features, \CFA supports user-level threading~\cite{Delisle21} allowing programmers to write modern concurrent and parallel programs. |
---|
| 6 | My previous master's thesis on concurrent in \CFA focused on features and interfaces. |
---|
| 7 | This Ph.D.\ thesis focuses on performance, introducing \glsxtrshort{api} changes only when required by performance considerations. |
---|
| 8 | Specifically, this work concentrates on scheduling and \glsxtrshort{io}. |
---|
| 9 | Prior to this work, the \CFA runtime used a strict \glsxtrshort{fifo} \gls{rQ} and no \glsxtrshort{io} capabilities at the user-thread level\footnote{C supports \glsxtrshort{io} capabilities at the kernel level, which means blocking operations block kernel threads where blocking user-level threads whould be more appropriate for \CFA.}. |
---|
| 10 | |
---|
| 11 | As a research project, this work builds exclusively on newer versions of the Linux operating-system and gcc/clang compilers. |
---|
| 12 | While \CFA is released, supporting older versions of Linux ($<$~Ubuntu 16.04) and gcc/clang compilers ($<$~gcc 6.0) is not a goal of this work. |
---|
| 13 | |
---|
| 14 | \section{Scheduling} |
---|
| 15 | Computer systems share multiple resources across many threads of execution, even on single user computers like laptops or smartphones. |
---|
| 16 | On a computer system with multiple processors and work units, there exists the problem of mapping work onto processors in an efficient manner, called \newterm{scheduling}. |
---|
| 17 | These systems are normally \newterm{open}, meaning new work arrives from an external source or is spawned from an existing work unit. |
---|
| 18 | On a computer system, the scheduler takes a sequence of work requests in the form of threads and attempts to complete the work, subject to performance objectives, such as resource utilization. |
---|
| 19 | A general-purpose dynamic-scheduler for an open system cannot anticipate future work requests, so its performance is rarely optimal. |
---|
| 20 | With complete knowledge of arrive order and work, creating an optimal solution still effectively needs solving the bin packing problem\cite{wiki:binpak}. |
---|
| 21 | However, optimal solutions are often not required. |
---|
| 22 | Schedulers do produce excellent solutions, whitout needing optimality, by taking advantage of regularities in work patterns. |
---|
[b9537e6] | 23 | |
---|
[6e8d446] | 24 | Scheduling occurs at discreet points when there are transitions in a system. |
---|
| 25 | For example, a thread cycles through the following transitions during its execution. |
---|
| 26 | \begin{center} |
---|
| 27 | \input{executionStates.pstex_t} |
---|
| 28 | \end{center} |
---|
| 29 | These \newterm{state transition}s are initiated in response to events (\Index{interrupt}s): |
---|
| 30 | \begin{itemize} |
---|
| 31 | \item |
---|
| 32 | entering the system (new $\rightarrow$ ready) |
---|
| 33 | \item |
---|
| 34 | timer alarm for preemption (running $\rightarrow$ ready) |
---|
| 35 | \item |
---|
| 36 | long term delay versus spinning (running $\rightarrow$ blocked) |
---|
| 37 | \item |
---|
| 38 | blocking ends, \ie network or I/O completion (blocked $\rightarrow$ ready) |
---|
| 39 | \item |
---|
| 40 | normal completion or error, \ie segment fault (running $\rightarrow$ halted) |
---|
[adf03a6] | 41 | \item |
---|
| 42 | scheduler assigns a thread to a resource (ready $\rightarrow$ running) |
---|
[6e8d446] | 43 | \end{itemize} |
---|
| 44 | Key to scheduling is that a thread cannot bypass the ``ready'' state during a transition so the scheduler maintains complete control of the system. |
---|
| 45 | |
---|
| 46 | When the workload exceeds the capacity of the processors, \ie work cannot be executed immediately, it is placed on a queue for subsequent service, called a \newterm{ready queue}. |
---|
| 47 | Ready queues organize threads for scheduling, which indirectly organizes the work to be performed. |
---|
[adf03a6] | 48 | The structure of ready queues can take many different forms. |
---|
| 49 | Where simple examples include single-queue multi-server (SQMS) and the multi-queue multi-server (MQMS). |
---|
[6e8d446] | 50 | \begin{center} |
---|
| 51 | \begin{tabular}{l|l} |
---|
[adf03a6] | 52 | \multicolumn{1}{c|}{\textbf{SQMS}} & \multicolumn{1}{c}{\textbf{MQMS}} \\ |
---|
[6e8d446] | 53 | \hline |
---|
| 54 | \raisebox{0.5\totalheight}{\input{SQMS.pstex_t}} & \input{MQMSG.pstex_t} |
---|
| 55 | \end{tabular} |
---|
| 56 | \end{center} |
---|
[adf03a6] | 57 | Beyond these two schedulers are a host of options, \ie adding an optional global, shared queue to MQMS. |
---|
[6e8d446] | 58 | |
---|
| 59 | The three major optimization criteria for a scheduler are: |
---|
| 60 | \begin{enumerate}[leftmargin=*] |
---|
| 61 | \item |
---|
| 62 | \newterm{load balancing}: available work is distributed so no processor is idle when work is available. |
---|
| 63 | |
---|
| 64 | \noindent |
---|
| 65 | Eventual progress for each work unit is often an important consideration, \ie no starvation. |
---|
| 66 | \item |
---|
| 67 | \newterm{affinity}: processors access state through a complex memory hierarchy, so it is advantageous to keep a work unit's state on a single or closely bound set of processors. |
---|
| 68 | |
---|
| 69 | \noindent |
---|
[adf03a6] | 70 | Essentially, all multi-processor computers have non-uniform memory access (NUMA), with one or more quantized steps to access data at different levels in the memory hierarchy. |
---|
| 71 | When a system has a large number of independently executing threads, affinity becomes difficult because of \newterm{thread churn}. |
---|
[6e8d446] | 72 | That is, threads must be scheduled on multiple processors to obtain high processors utilization because the number of threads $\ggg$ processors. |
---|
| 73 | |
---|
| 74 | \item |
---|
[adf03a6] | 75 | \newterm{contention}: safe access of shared objects by multiple processors requires mutual exclusion in some form, generally locking\footnote{ |
---|
| 76 | Lock-free data-structures do not involve locking but incurr similar costs to achieve mutual exclusion.} |
---|
[6e8d446] | 77 | |
---|
| 78 | \noindent |
---|
[adf03a6] | 79 | Mutual exclusion cost and latency increases significantly with the number of processors accessing a shared object. |
---|
[6e8d446] | 80 | \end{enumerate} |
---|
| 81 | |
---|
[adf03a6] | 82 | Nevertheless, schedulers are a series of compromises, occasionally with some static or dynamic tuning parameters to enhance specific patterns. |
---|
| 83 | Scheduling is a zero-sum game as computer processors normally have a fixed, maximum number of cycles per unit time\footnote{Frequency scaling and turbot boost add a degree of complexity that can be ignored in this discussion without loss of generality.}. |
---|
| 84 | SQMS has perfect load-balancing but poor affinity and high contention by the processors, because of the single queue. |
---|
| 85 | MQMS has poor load-balancing but perfect affinity and no contention, because each processor has its own queue. |
---|
| 86 | |
---|
[6e8d446] | 87 | Significant research effort has also looked at load sharing/stealing among queues, when a ready queue is too long or short, respectively. |
---|
| 88 | These approaches attempt to perform better load-balancing at the cost of affinity and contention. |
---|
| 89 | Load sharing/stealing schedulers attempt to push/pull work units to/from other ready queues |
---|
| 90 | |
---|
[adf03a6] | 91 | Note however that while any change comes at a cost, hence the zero-sum game, not all compromises are necessarily equivalent. |
---|
| 92 | Some schedulers can perform very well only in very specific workload scenarios, others might offer acceptable performance but be applicable to a wider range of workloads. |
---|
| 93 | Since \CFA attempts to improve the safety and productivity of C, the scheduler presented in this thesis attempts to achieve the same goals. |
---|
| 94 | More specifically, safety and productivity for scheduling means supporting a wide range of workloads so that programmers can rely on progress guarantees (safety) and more easily achieve acceptable performance (productivity). |
---|
[6e8d446] | 95 | |
---|
| 96 | |
---|
[adf03a6] | 97 | \section{Contributions}\label{s:Contributions} |
---|
[6e8d446] | 98 | This work provides the following contributions in the area of user-level scheduling in an advanced programming-language runtime-system: |
---|
| 99 | \begin{enumerate}[leftmargin=*] |
---|
| 100 | \item |
---|
[adf03a6] | 101 | A scalable scheduling algorithm that offers progress guarantees. |
---|
[6e8d446] | 102 | \item |
---|
[adf03a6] | 103 | An algorithm for load-balancing and idle sleep of processors, including NUMA awareness. |
---|
[6e8d446] | 104 | \item |
---|
[adf03a6] | 105 | Support for user-level \glsxtrshort{io} capabilities based on Linux's \texttt{io\_uring}. |
---|
[6e8d446] | 106 | \end{enumerate} |
---|