% source: doc/theses/thierry_delisle_PhD/thesis/text/intro.tex @ 749cf69

% Last change on this file since 749cf69 was 847bb6f, checked in by Peter A. Buhr <pabuhr@…>:
% proofread chapter text/io.tex, and updates in other chapters

\chapter{Introduction}\label{intro}
\section{\CFA programming language}

The \CFA programming language~\cite{cfa:frontpage,cfa:typesystem} extends the C programming language by adding modern safety and productivity features, while maintaining backwards compatibility.
Among its productivity features, \CFA supports user-level threading~\cite{Delisle21}, allowing programmers to write modern concurrent and parallel programs.
My previous master's thesis on concurrency in \CFA focused on features and interfaces.
This Ph.D.\ thesis focuses on performance, introducing \glsxtrshort{api} changes only when required by performance considerations.
Specifically, this work concentrates on scheduling and \glsxtrshort{io}.
Prior to this work, the \CFA runtime used a strict \glsxtrshort{fifo} \gls{rQ} and had no \glsxtrshort{io} capabilities at the user-thread level\footnote{C supports \glsxtrshort{io} capabilities at the kernel level, which means blocking operations block kernel threads, where blocking user-level threads would be more appropriate for \CFA.}.

As a research project, this work builds exclusively on newer versions of the Linux operating system and gcc/clang compilers.
While \CFA is released, supporting older versions of Linux ($<$~Ubuntu 16.04) and gcc/clang compilers ($<$~gcc 6.0) is not a goal of this work.

\section{Scheduling}
Computer systems share multiple resources across many threads of execution, even on single-user computers like laptops or smartphones.
On a computer system with multiple processors and work units, there exists the problem of mapping work onto processors in an efficient manner, called \newterm{scheduling}.
These systems are normally \newterm{open}, meaning new work arrives from an external source or is spawned from an existing work unit.
On a computer system, the scheduler takes a sequence of work requests in the form of threads and attempts to complete the work, subject to performance objectives, such as resource utilization.
A general-purpose dynamic-scheduler for an open system cannot anticipate future work requests, so its performance is rarely optimal.
Even with complete knowledge of arrival order and work, creating an optimal solution still effectively requires solving the bin-packing problem~\cite{wiki:binpak}.
However, optimal solutions are often not required.
Schedulers do produce excellent solutions, without needing optimality, by taking advantage of regularities in work patterns.

Scheduling occurs at discrete points when there are transitions in a system.
For example, a thread cycles through the following transitions during its execution.
\begin{center}
\input{executionStates.pstex_t}
\end{center}
These \newterm{state transition}s are initiated in response to events (\Index{interrupt}s):
\begin{itemize}
\item
entering the system (new $\rightarrow$ ready)
\item
timer alarm for preemption (running $\rightarrow$ ready)
\item
long-term delay versus spinning (running $\rightarrow$ blocked)
\item
blocking ends, \eg network or I/O completion (blocked $\rightarrow$ ready)
\item
normal completion or error, \eg segmentation fault (running $\rightarrow$ halted)
\item
scheduler assigns a thread to a resource (ready $\rightarrow$ running)
\end{itemize}
Key to scheduling is that a thread cannot bypass the ``ready'' state during a transition, so the scheduler maintains complete control of the system.
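
The transitions listed above can be written down as a small lookup table.
The following C fragment is a minimal sketch for illustration only, not code from the \CFA runtime; the names @thread_state@, @legal@, and @transition_allowed@ are hypothetical.
It encodes exactly the six listed transitions, so a direct move from blocked to running is rejected because it would bypass the ready state.

```c
#include <assert.h>
#include <stdbool.h>

/* Thread states from the diagram; the enumerator names are illustrative. */
enum thread_state { NEW, READY, RUNNING, BLOCKED, HALTED, NUM_STATES };

/* legal[from][to] is true only for the six transitions in the list. */
static const bool legal[NUM_STATES][NUM_STATES] = {
	[NEW]     = { [READY]   = true },   /* entering the system */
	[READY]   = { [RUNNING] = true },   /* scheduler assigns a resource */
	[RUNNING] = { [READY]   = true,     /* timer alarm for preemption */
	              [BLOCKED] = true,     /* long-term delay */
	              [HALTED]  = true },   /* normal completion or error */
	[BLOCKED] = { [READY]   = true },   /* blocking ends */
	/* HALTED has no outgoing transitions, so its row stays all false. */
};

/* Returns true if a thread may move directly from `from` to `to`. */
bool transition_allowed(enum thread_state from, enum thread_state to) {
	assert(from < NUM_STATES && to < NUM_STATES);
	return legal[from][to];
}
```

For example, @transition_allowed(BLOCKED, READY)@ holds, while @transition_allowed(BLOCKED, RUNNING)@ and @transition_allowed(NEW, RUNNING)@ do not: every path into running goes through ready, which is where the scheduler makes its decision.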

When the workload exceeds the capacity of the processors, \ie work cannot be executed immediately, it is placed on a queue for subsequent service, called a \newterm{ready queue}.
Ready queues organize threads for scheduling, which indirectly organizes the work to be performed.
The structure of ready queues can take many different forms.
Simple examples include the single-queue multi-server (SQMS) and the multi-queue multi-server (MQMS) structures.
\begin{center}
\begin{tabular}{l|l}
\multicolumn{1}{c|}{\textbf{SQMS}} & \multicolumn{1}{c}{\textbf{MQMS}} \\
\hline
\raisebox{0.5\totalheight}{\input{SQMS.pstex_t}} & \input{MQMSG.pstex_t}
\end{tabular}
\end{center}
Beyond these two schedulers are a host of options, \eg adding an optional global, shared queue to MQMS.
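
The difference between the two structures can be made concrete with a small sketch in C.
It is an illustration under simplifying assumptions, not the \CFA ready queue: all names (@ready_queue@, @rq_push@, @rq_pop@, @sqms_next@, @mqms_next@) and the constants are hypothetical, thread identifiers are assumed to be non-negative integers, and the code is single-threaded, so the locking a real shared queue needs is omitted.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_CAP 64    /* illustrative fixed capacity */
#define NPROCS    4     /* illustrative processor count */
#define NO_THREAD (-1)  /* returned when a queue is empty */

/* A bounded FIFO of thread identifiers, implemented as a ring buffer. */
struct ready_queue {
	int    threads[QUEUE_CAP];
	size_t head;   /* index of the oldest entry */
	size_t count;  /* number of queued threads */
};

void rq_push(struct ready_queue *q, int tid) {
	assert(q->count < QUEUE_CAP);
	q->threads[(q->head + q->count) % QUEUE_CAP] = tid;
	q->count += 1;
}

int rq_pop(struct ready_queue *q) {
	if (q->count == 0) return NO_THREAD;
	int tid = q->threads[q->head];
	q->head = (q->head + 1) % QUEUE_CAP;
	q->count -= 1;
	return tid;
}

/* SQMS: every processor dequeues from the same shared queue. */
struct sqms { struct ready_queue shared; };

int sqms_next(struct sqms *s, size_t proc) {
	(void)proc;  /* processor identity is irrelevant: one queue serves all */
	return rq_pop(&s->shared);
}

/* MQMS: each processor dequeues only from its own queue. */
struct mqms { struct ready_queue local[NPROCS]; };

int mqms_next(struct mqms *m, size_t proc) {
	assert(proc < NPROCS);
	return rq_pop(&m->local[proc]);
}
```

With SQMS, any processor calling @sqms_next@ obtains the oldest ready thread, whereas with MQMS a processor whose own queue is empty receives @NO_THREAD@ even if another processor's queue holds work.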

The three major optimization criteria for a scheduler are:
\begin{enumerate}[leftmargin=*]
\item
\newterm{load balancing}: available work is distributed so no processor is idle when work is available.

\noindent
Eventual progress for each work unit is often an important consideration, \ie no starvation.
\item
\newterm{affinity}: processors access state through a complex memory hierarchy, so it is advantageous to keep a work unit's state on a single or closely bound set of processors.

\noindent
Essentially, all multi-processor computers have non-uniform memory access (NUMA), with one or more quantized steps to access data at different levels in the memory hierarchy.
When a system has a large number of independently executing threads, affinity becomes difficult because of \newterm{thread churn}.
That is, threads must be scheduled on multiple processors to obtain high processor utilization because the number of threads $\ggg$ processors.

\item
\newterm{contention}: safe access of shared objects by multiple processors requires mutual exclusion in some form, generally locking\footnote{
Lock-free data-structures do not involve locking but incur similar costs to achieve mutual exclusion.}.

\noindent
Mutual exclusion cost and latency increase significantly with the number of processors accessing a shared object.
\end{enumerate}

In practice, schedulers are a series of compromises among these criteria, occasionally with some static or dynamic tuning parameters to enhance specific patterns.
Scheduling is a zero-sum game as computer processors normally have a fixed, maximum number of cycles per unit time\footnote{Frequency scaling and turbo boost add a degree of complexity that can be ignored in this discussion without loss of generality.}.
SQMS has perfect load-balancing but poor affinity and high contention by the processors, because of the single queue.
MQMS has poor load-balancing but perfect affinity and no contention, because each processor has its own queue.

Significant research effort has also looked at load sharing/stealing among queues, when a ready queue is too long or short, respectively.
These approaches attempt to perform better load-balancing at the cost of affinity and contention.
Load sharing/stealing schedulers attempt to push/pull work units to/from other ready queues.
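
The push and pull operations can be sketched in C as follows.
This is a minimal single-threaded illustration, not the algorithm developed in this thesis: the names @place@ and @next_thread@, the @limit@ parameter, and the policies of sharing with the shortest queue and stealing from the first non-empty queue in index order are all assumptions made for the example.
A real scheduler must also synchronize access to the queues, which is where the contention cost arises.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_CAP 64    /* illustrative fixed capacity */
#define NPROCS    4     /* illustrative processor count */
#define NO_THREAD (-1)  /* thread identifiers are assumed non-negative */

/* Per-processor bounded FIFO of thread identifiers (ring buffer). */
struct ready_queue { int threads[QUEUE_CAP]; size_t head, count; };

void rq_push(struct ready_queue *q, int tid) {
	assert(q->count < QUEUE_CAP);
	q->threads[(q->head + q->count) % QUEUE_CAP] = tid;
	q->count += 1;
}

int rq_pop(struct ready_queue *q) {
	if (q->count == 0) return NO_THREAD;
	int tid = q->threads[q->head];
	q->head = (q->head + 1) % QUEUE_CAP;
	q->count -= 1;
	return tid;
}

/* Load sharing: if the local queue already holds `limit` threads,
   push the new thread to the currently shortest queue instead. */
void place(struct ready_queue rq[], size_t self, int tid, size_t limit) {
	size_t target = self;
	if (rq[self].count >= limit) {
		for (size_t p = 0; p < NPROCS; p += 1)
			if (rq[p].count < rq[target].count) target = p;
	}
	rq_push(&rq[target], tid);
}

/* Load stealing: try the local queue first (i == 0), then pull the
   oldest thread from the first non-empty queue of another processor. */
int next_thread(struct ready_queue rq[], size_t self) {
	for (size_t i = 0; i < NPROCS; i += 1) {
		int tid = rq_pop(&rq[(self + i) % NPROCS]);
		if (tid != NO_THREAD) return tid;
	}
	return NO_THREAD;
}
```

Both operations improve load balancing, but a thread moved by @place@ or taken by @next_thread@ runs on a processor other than the one that last held its state, which is the affinity cost described above.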

Note, however, that while any change comes at a cost, hence the zero-sum game, not all compromises are necessarily equivalent.
Some schedulers perform very well only in very specific workload scenarios, while others offer acceptable performance across a wider range of workloads.
Since \CFA attempts to improve the safety and productivity of C, the scheduler presented in this thesis attempts to achieve the same goals.
More specifically, safety and productivity for scheduling mean supporting a wide range of workloads so that programmers can rely on progress guarantees (safety) and more easily achieve acceptable performance (productivity).


\section{Contributions}\label{s:Contributions}
This work provides the following contributions in the area of user-level scheduling in an advanced programming-language runtime-system:
\begin{enumerate}[leftmargin=*]
\item
A scalable scheduling algorithm that offers progress guarantees.
\item
An algorithm for load-balancing and idle sleep of processors, including NUMA awareness.
\item
Support for user-level \glsxtrshort{io} capabilities based on Linux's @io_uring@.
\end{enumerate}