\chapter{Introduction}\label{intro}
\section{\CFA programming language}

The \CFA programming language~\cite{cfa:frontpage,cfa:typesystem} extends the C programming language by adding modern safety and productivity features, while maintaining backwards compatibility.
Among its productivity features, \CFA supports user-level threading~\cite{Delisle21}, allowing programmers to write modern concurrent and parallel programs.
My previous master's thesis on concurrency in \CFA focused on features and interfaces.
This Ph.D.\ thesis focuses on performance, introducing \glsxtrshort{api} changes only when required by performance considerations.
Specifically, this work concentrates on scheduling and \glsxtrshort{io}.
Prior to this work, the \CFA runtime used a strict \glsxtrshort{fifo} \gls{rQ} and had no \glsxtrshort{io} capabilities at the user-thread level\footnote{C supports \glsxtrshort{io} capabilities at the kernel level, which means blocking operations block kernel threads, whereas blocking only the user-level thread would be more appropriate for \CFA.}.
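To make this starting point concrete, the following C sketch shows the essence of such a strict \glsxtrshort{fifo} ready queue: a single intrusive list protected by one lock, which every processor must acquire to push or pop a thread.
The names (@thread_desc@, @rq_push@, @rq_pop@) and layout are hypothetical illustrations, not the actual runtime code.
\begin{lstlisting}[language=C]
#include <stddef.h>
#include <pthread.h>

typedef struct thread_desc {                 // user-thread descriptor
	struct thread_desc * next;               // intrusive link for the ready queue
	/* ... stack, register context, etc. ... */
} thread_desc;

typedef struct {                             // initialize with head = NULL,
	thread_desc * head;                      //   tail = &head, and
	thread_desc ** tail;                     //   lock = PTHREAD_MUTEX_INITIALIZER
	pthread_mutex_t lock;                    // one lock serializes every processor
} ready_queue;

static void rq_push( ready_queue * q, thread_desc * t ) {
	pthread_mutex_lock( &q->lock );
	t->next = NULL;
	*q->tail = t;                            // strict FIFO: append at the tail
	q->tail = &t->next;
	pthread_mutex_unlock( &q->lock );
}

static thread_desc * rq_pop( ready_queue * q ) {
	pthread_mutex_lock( &q->lock );
	thread_desc * t = q->head;               // strict FIFO: remove from the head
	if ( t != NULL ) {
		q->head = t->next;
		if ( q->head == NULL ) q->tail = &q->head;   // queue is now empty
	}
	pthread_mutex_unlock( &q->lock );
	return t;
}
\end{lstlisting}
The strict \glsxtrshort{fifo} order gives strong fairness, but the single lock and single list become a bottleneck as processors are added, which motivates the scheduling work that follows.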

As a research project, this work builds exclusively on newer versions of the Linux operating system and the gcc/clang compilers.
While \CFA is released, supporting older versions of Linux ($<$~Ubuntu 16.04) and gcc/clang compilers ($<$~gcc 6.0) is not a goal of this work.

\section{Scheduling}
Computer systems share multiple resources across many threads of execution, even on single-user computers like laptops or smartphones.
On a computer system with multiple processors and work units, there exists the problem of mapping work onto processors in an efficient manner, called \newterm{scheduling}.
These systems are normally \newterm{open}, meaning new work arrives from an external source or is spawned from an existing work unit.
On a computer system, the scheduler takes a sequence of work requests in the form of threads and attempts to complete the work, subject to performance objectives, such as resource utilization.
A general-purpose dynamic-scheduler for an open system cannot anticipate future work requests, so its performance is rarely optimal.
Even with complete knowledge of arrival order and work, creating an optimal solution still effectively requires solving the bin-packing problem~\cite{wiki:binpak}.
However, optimal solutions are often not required.
Schedulers do produce excellent solutions, without needing optimality, by taking advantage of regularities in work patterns.

Scheduling occurs at discrete points when there are transitions in a system.
For example, a thread cycles through the following transitions during its execution.
\begin{center}
\input{executionStates.pstex_t}
\end{center}
These \newterm{state transition}s are initiated in response to events (\Index{interrupt}s):
\begin{itemize}
\item
entering the system (new $\rightarrow$ ready)
\item
timer alarm for preemption (running $\rightarrow$ ready)
\item
long-term delay versus spinning (running $\rightarrow$ blocked)
\item
blocking ends, \eg network or I/O completion (blocked $\rightarrow$ ready)
\item
normal completion or error, \eg segmentation fault (running $\rightarrow$ halted)
\item
scheduler assigns a thread to a resource (ready $\rightarrow$ running)
\end{itemize}
Key to scheduling is that a thread cannot bypass the ``ready'' state during a transition, so the scheduler maintains complete control of the system.
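As an illustration, these states and transitions can be encoded directly.
The following C sketch, with hypothetical names, validates a transition against the list above, making explicit that the only path onto a processor is ready $\rightarrow$ running.
\begin{lstlisting}[language=C]
#include <stdbool.h>

typedef enum { NEW, READY, RUNNING, BLOCKED, HALTED } exec_state;

// Return true if the transition appears in the list above.
static bool legal_transition( exec_state from, exec_state to ) {
	switch ( from ) {
	  case NEW:     return to == READY;      // entering the system
	  case READY:   return to == RUNNING;    // scheduler assigns a resource
	  case RUNNING: return to == READY       // timer alarm for preemption
	                    || to == BLOCKED     // long-term delay versus spinning
	                    || to == HALTED;     // normal completion or error
	  case BLOCKED: return to == READY;      // blocking ends
	  default:      return false;            // halted is terminal
	}
}
\end{lstlisting}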

When the workload exceeds the capacity of the processors, \ie work cannot be executed immediately, it is placed on a queue for subsequent service, called a \newterm{ready queue}.
Ready queues organize threads for scheduling, which indirectly organizes the work to be performed.
The structure of ready queues can take many different forms, where simple examples include the single-queue multi-server (SQMS) and the multi-queue multi-server (MQMS).
\begin{center}
\begin{tabular}{l|l}
\multicolumn{1}{c|}{\textbf{SQMS}} & \multicolumn{1}{c}{\textbf{MQMS}} \\
\hline
\raisebox{0.5\totalheight}{\input{SQMS.pstex_t}} & \input{MQMSG.pstex_t}
\end{tabular}
\end{center}
Beyond these two schedulers are a host of options, \eg adding an optional global, shared queue to MQMS.
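The contrast between the two organizations can be sketched in C, reusing the hypothetical @ready_queue@ from the earlier sketch: SQMS shares one queue among all processors, while MQMS gives each processor a private queue.
\begin{lstlisting}[language=C]
#define NPROC 16                     // illustrative number of processors

typedef struct {                     // SQMS: one queue shared by all processors
	ready_queue queue;
} sqms_scheduler;

typedef struct {                     // MQMS: one queue per processor
	ready_queue queue[NPROC];
} mqms_scheduler;

// SQMS: any processor can take any ready thread (perfect load balancing),
// but every processor contends on the same lock and a thread may resume
// anywhere (poor affinity).
static thread_desc * sqms_next( sqms_scheduler * s ) {
	return rq_pop( &s->queue );
}

// MQMS: no contention and good affinity, but a processor can sit idle on an
// empty queue while another queue still holds work (poor load balancing).
static thread_desc * mqms_next( mqms_scheduler * s, int proc ) {
	return rq_pop( &s->queue[proc] );
}
\end{lstlisting}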

The three major optimization criteria for a scheduler are:
\begin{enumerate}[leftmargin=*]
\item
\newterm{load balancing}: available work is distributed so no processor is idle when work is available.

\noindent
Eventual progress for each work unit is often an important consideration, \ie no starvation.
\item
\newterm{affinity}: processors access state through a complex memory hierarchy, so it is advantageous to keep a work unit's state on a single or closely bound set of processors.

\noindent
Essentially, all multi-processor computers have non-uniform memory access (NUMA), with one or more quantized steps to access data at different levels in the memory hierarchy.
When a system has a large number of independently executing threads, affinity becomes difficult because of \newterm{thread churn}.
That is, threads must be scheduled on multiple processors to obtain high processor utilization because the number of threads $\ggg$ processors.

\item
\newterm{contention}: safe access of shared objects by multiple processors requires mutual exclusion in some form, generally locking.\footnote{
Lock-free data structures do not involve locking but incur similar costs to achieve mutual exclusion.}

\noindent
Mutual exclusion cost and latency increase significantly with the number of processors accessing a shared object.
\end{enumerate}

Nevertheless, schedulers are a series of compromises, occasionally with some static or dynamic tuning parameters to enhance specific patterns.
Scheduling is a zero-sum game as computer processors normally have a fixed maximum number of cycles per unit time\footnote{Frequency scaling and turbo boost add a degree of complexity that can be ignored in this discussion without loss of generality.}.
SQMS has perfect load-balancing but poor affinity and high contention by the processors, because of the single queue.
MQMS has poor load-balancing but perfect affinity and no contention, because each processor has its own queue.

Significant research effort has also looked at load sharing/stealing among queues, when a ready queue is too long or too short, respectively.
These approaches attempt to perform better load-balancing at the cost of affinity and contention.
Load sharing/stealing schedulers attempt to push/pull work units to/from other ready queues, as sketched below.
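A minimal work-stealing extension of the hypothetical MQMS sketch above illustrates the pull side of this idea: a processor serves its own queue first and only probes other queues when its own is empty, regaining load balancing at the cost of extra lock acquisitions (contention) and cross-processor thread movement (affinity).
\begin{lstlisting}[language=C]
// Pop from the local queue; if it is empty, scan the other queues for work.
static thread_desc * mqms_steal( mqms_scheduler * s, int proc ) {
	thread_desc * t = rq_pop( &s->queue[proc] );   // local first: preserves affinity
	for ( int victim = (proc + 1) % NPROC;
	      t == NULL && victim != proc;
	      victim = (victim + 1) % NPROC ) {
		t = rq_pop( &s->queue[victim] );           // steal: contention and affinity cost
	}
	return t;
}
\end{lstlisting}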

Note, however, that while any change comes at a cost, hence the zero-sum game, not all compromises are necessarily equivalent.
Some schedulers perform very well only in specific workload scenarios, while others offer acceptable performance across a wider range of workloads.
Since \CFA attempts to improve the safety and productivity of C, the scheduler presented in this thesis attempts to achieve the same goals.
More specifically, safety and productivity for scheduling means supporting a wide range of workloads, so that programmers can rely on progress guarantees (safety) and more easily achieve acceptable performance (productivity).

\section{Contributions}\label{s:Contributions}
This work provides the following contributions in the area of user-level scheduling in an advanced programming-language runtime system:
\begin{enumerate}[leftmargin=*]
\item
A scalable scheduling algorithm that offers progress guarantees.
\item
An algorithm for load-balancing and idle sleep of processors, including NUMA awareness.
\item
Support for user-level \glsxtrshort{io} capabilities based on Linux's @io_uring@, the kernel interface sketched below.
\end{enumerate}
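For context, the following self-contained C program shows the raw @io_uring@ interface through the @liburing@ helper library; it is a sketch of the kernel interface the runtime builds on, not of the runtime itself, and error handling is omitted for brevity.
\begin{lstlisting}[language=C]
#include <fcntl.h>
#include <stdio.h>
#include <liburing.h>      // helper library over the io_uring system calls

int main( void ) {
	struct io_uring ring;
	io_uring_queue_init( 8, &ring, 0 );        // 8-entry submission/completion queues

	int fd = open( "/etc/hostname", O_RDONLY );
	char buf[512];
	struct io_uring_sqe * sqe = io_uring_get_sqe( &ring );
	io_uring_prep_read( sqe, fd, buf, sizeof(buf), 0 );  // describe an asynchronous read
	io_uring_submit( &ring );                  // hand the request to the kernel

	struct io_uring_cqe * cqe;
	io_uring_wait_cqe( &ring, &cqe );          // wait for the completion
	printf( "read %d bytes\n", cqe->res );
	io_uring_cqe_seen( &ring, cqe );           // release the completion entry
	io_uring_queue_exit( &ring );
	return 0;
}
\end{lstlisting}
Because submission is asynchronous, a user-level runtime can park only the user thread at the wait, leaving the kernel thread free to run other user threads, addressing the blocking problem identified in the footnote at the start of this chapter.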