Context Navigation

source: doc/theses/thierry_delisle_PhD/thesis/text/intro.tex

Last change on this file was ddcaff6, checked in by Thierry Delisle <tdelisle@…>, 17 months ago
Last corrections to my thesis... hopefully
Property mode set to `100644`
File size: 13.1 KB

Line
1	\chapter{Introduction}\label{intro}
2
3	\Gls{uthrding} (M:N) is gaining popularity over kernel-level threading (1:1) in many programming languages, like Go~\cite{GITHUB:go}, Java's Project Loom~\cite{MAN:project-loom} and Kotlin~\cite{MAN:kotlin}.
4	The user threading approach is often a better mechanism to express complex concurrent applications by efficiently running 10,000+ threads on multicore systems.
5	Indeed, over-partitioning into small work units with user threading significantly eases load bal\-ancing, while simultaneously providing advanced synchronization and mutual exclusion capabilities.
6	To manage these high levels of concurrency, the underlying runtime must efficiently schedule many user threads across a few kernel threads;
7	which raises the question of how many kernel threads are needed and should the number be dynamically reevaluated.
8	Furthermore, scheduling must prevent kernel threads from blocking, otherwise user-thread parallelism drops.
9	When user-threading parallelism does drop, how and when should idle kernel-threads be put to sleep to avoid wasting CPU resources?
10	Finally, the scheduling system must provide fairness to prevent a user thread from monopolizing a kernel thread;
11	otherwise, other user threads can experience short/long term starvation or kernel threads can deadlock waiting for events to occur on busy kernel threads.
12
13	This thesis analyzes multiple scheduler systems, where each system attempts to fulfill the requirements for \gls{uthrding}.
14	The predominant technique for managing high levels of concurrency is sharding the ready queue with one queue per kernel thread and using some form of work stealing/sharing to dynamically rebalance workload shifts.
15	Preventing kernel blocking is accomplished by transforming kernel locks and I/O operations into user-level operations that do not block the kernel thread or spin up new kernel threads to manage the blocking.
16	Fairness is handled through preemption and/or ad hoc solutions, which leads to coarse-grained fairness with some pathological cases.
17
18	After examining, testing and selecting specific approaches to these scheduling issues, a new scheduler was created and tested in the \CFA (C-for-all) user-threading runtime system.
19	The goal of the new scheduler is to offer increased safety and productivity without sacrificing performance.
20	The quality of the new scheduler is demonstrated by comparing it with other user-threading work-stealing schedulers with the aim of showing equivalent or better performance while offering better fairness.
21
22	Chapter~\ref{intro} defines scheduling and its general goals.
23	Chapter~\ref{existing} discusses how scheduler implementations attempt to achieve these goals, but all implementations optimize some workloads better than others.
24	Chapter~\ref{cfaruntime} presents the relevant aspects of the \CFA runtime system that have a significant effect on the new scheduler design and implementation.
25	Chapter~\ref{core} analyses different scheduler approaches while looking for scheduler mechanisms that provide both performance and fairness.
26	Chapter~\ref{userio} covers the complex mechanisms that must be used to achieve nonblocking I/O to prevent the blocking of \glspl{kthrd}.
27	Chapter~\ref{practice} presents the mechanisms needed to adjust the amount of parallelism, both manually and automatically.
28	Chapters~\ref{microbench} and~\ref{macrobench} present micro and macro benchmarks used to evaluate and compare the new scheduler with similar schedulers.
29
30
31	\section{Scheduling}\label{sched}
32	Computer systems share multiple resources across many threads of execution, even on single-user computers like laptops or smartphones.
33	On a computer system with multiple processors and work units (routines, coroutines, threads, programs, \etc), there exists the problem of mapping many different kinds of work units onto many different kinds of processors efficiently, called \newterm{scheduling}.
34	Scheduling systems are normally \newterm{open}, meaning new work arrives from an external source or is randomly spawned from an existing work unit\footnote{
35	Open systems constrasts to \newterm{closed} systems, where work never arrives from external sources.
36	This definition is a extension of open/closed systems in the field of thermodynamics.
37	}.
38
39	In general, work units without threads, like routines and coroutines, are self-scheduling, while work units with threads, like tasks and programs, are scheduled.
40	For scheduled work-units, a scheduler takes a sequence of threads and attempts to run them to completion, subject to shared resource restrictions and utilization.
41	In an open system, a general-purpose dynamic scheduler cannot anticipate work requests, so its performance is rarely optimal.
42	Even with complete knowledge of arrival order and work, creating an optimal solution is a bin packing problem~\cite{wiki:binpak}.
43	However, optimal solutions are often not required: schedulers often produce excellent solutions, without needing optimality, by taking advantage of regularities in work patterns.
44
45	Scheduling occurs at discrete points when there are transitions in a system.
46	For example, a \at cycles through the following transitions during its execution.
47	\begin{center}
48	\input{executionStates.pstex_t}
49	\end{center}
50	These \newterm{state transition}s are initiated in response to events, \eg blocking, interrupts, errors:
51	\begin{itemize}
52	\item
53	entering the system (new $\rightarrow$ ready)
54	\item
55	scheduler assigns a \at to a computing resource, \eg CPU (ready $\rightarrow$ running)
56	\item
57	timer alarm for preemption (running $\rightarrow$ ready)
58	\item
59	long-term delay versus spinning (running $\rightarrow$ blocked)
60	\item
61	completion of delay, \eg network or I/O completion (blocked $\rightarrow$ ready)
62	\item
63	normal completion or error, \eg segment fault (running $\rightarrow$ halted)
64	\end{itemize}
65	Key to scheduling is that a \at cannot bypass the ``ready'' state during a transition so the scheduler maintains complete control of the system, \ie no self-scheduling among threads.
66
67	When the workload exceeds the capacity of the processors, \ie work cannot be executed immediately, it is placed on a queue for subsequent service, called a \newterm{ready queue}.
68	Ready queues organize threads for scheduling, which indirectly organizes the work to be performed.
69	The structure of ready queues can take many different forms, where the basic two are the single-queue multi-server (SQMS) and the multi-queue multi-server (MQMS).
70	\begin{center}
71	\begin{tabular}{l\|l}
72	\multicolumn{1}{c\|}{\textbf{SQMS}} & \multicolumn{1}{c}{\textbf{MQMS}} \\
73	\hline
74	\raisebox{0.5\totalheight}{\input{SQMS.pstex_t}} & \input{MQMSG.pstex_t}
75	\end{tabular}
76	\end{center}
77	Beyond these two schedulers are a host of options, \eg adding a global shared queue to MQMS or adding multiple private queues with distinct characteristics.
78
79	Once there are multiple resources and ready queues, a scheduler is faced with three major optimization criteria:
80	\begin{enumerate}[leftmargin=*]
81	\item
82	\newterm{load balancing}: available work is distributed so no processor is idle when work is available.
83
84	\noindent
85	Eventual progress for each work unit is often an important consideration, \ie no starvation.
86	\item
87	\newterm{affinity}: processors access state through a complex memory hierarchy, so it is advantageous to keep a work unit's state on a single or closely bound set of processors.
88
89	\noindent
90	Essentially, all multiprocessor computers have non-uniform memory access (NUMA), with one or more quantized steps to access data at different levels in the memory hierarchy.
91	When a system has a large number of independently executing threads, affinity becomes difficult because of \newterm{thread churn}.
92	That is, threads must be scheduled on different processors to obtain high processor utilization because the number of threads $\ggg$ processors.
93
94	\item
95	\newterm{contention}: safe access of shared objects by multiple processors requires mutual exclusion in some form, generally locking.\footnote{
96	Lock-free data-structures do not involve locking but incur similar costs to achieve mutual exclusion.}
97	Mutual exclusion cost and latency increase significantly with the number of processors access\-ing a shared object.
98	\end{enumerate}
99
100	At a high-level, scheduling is considered zero-sum game as computer processors normally have a fixed, maximum number of cycles per unit time.\footnote{
101	Frequency scaling and turbo-boost add a degree of complexity that can be ignored in this discussion without loss of generality.}
102	This almost invariably leads to schedulers needing to pick some \ats over others, opening the door to fairness problems.
103	However, at a lower-level, schedulers can make inefficient or incorrect decisions leading to strictly worse outcomes than what the theoretical zero-sum game suggests.
104	Since it can be difficult to avoid these poor decisions, schedulers are generally a series of compromises, occasionally with some static or dynamic tuning parameters to enhance specific workload patterns.
105	For example, SQMS has perfect load-balancing but poor affinity and high contention by the processors, because of the single queue.
106	While MQMS has poor load-balancing but perfect affinity and no contention, because each processor has its own queue.
107
108	Significant research effort has looked at load balancing by stealing/sharing work units among queues: when a ready queue is too short or long, respectively, load stealing/sharing schedulers attempt to push/pull work units to/from other ready queues.
109	These approaches attempt to perform better load-balancing at the cost of affinity and contention.
110	However, \emph{all} approaches come at a cost, but not all compromises are necessarily equivalent, especially across workloads.
111	Hence, some schedulers perform very well for specific workloads, while others offer acceptable performance over a wider range of workloads.
112
113	\section{\CFA programming language}
114
115	The \CFA programming language~\cite{Cforall,Moss18} extends the C programming language by adding modern safety and productivity features, while maintaining backwards compatibility.
116	Among its productivity features, \CFA supports \gls{uthrding}~\cite{Delisle21} as its fundamental threading model allowing programmers to easily write modern concurrent and parallel programs.
117	My previous master's thesis on concurrency in \CFA focused on features and interfaces~\cite{Delisle18}.
118	This Ph.D.\ thesis focuses on performance, introducing \glsxtrshort{api} changes only when required by performance considerations.
119	Specifically, this work concentrates on advanced thread and \glsxtrshort{io} scheduling.
120	Prior to this work, the \CFA runtime used a strict SQMS \gls{rQ} and provided no nonblocking \glsxtrshort{io} capabilities at the user-thread level.\footnote{
121	C/\CC only support \glsxtrshort{io} capabilities at the kernel level, which means many \io operations block \glspl{kthrd}, reducing parallelism at the user level.}
122
123	Since \CFA attempts to improve the safety and productivity of C, the new scheduler presented in this thesis attempts to achieve the same goals.
124	More specifically, safety and productivity for scheduling mean supporting a wide range of workloads, so that programmers can rely on progress guarantees (safety) and more easily achieve acceptable performance (productivity).
125	The new scheduler also includes support for implicit nonblocking \io, allowing applications to have more user-threads blocking on \io operations than there are \glspl{kthrd}.
126	To complete the scheduler, an idle sleep mechanism is implemented that significantly reduces wasted CPU cycles, which are then available outside the application.
127
128	As a research project, this work runs exclusively on newer versions of the Linux operating system and gcc/clang compilers.
129	The new scheduler implementation uses several optimizations to successfully balance the cost of fairness against performance;
130	some of these optimizations rely on interesting hardware optimizations only present on modern CPUs.
131	The \io implementation is based on the @io_uring@ kernel interface, a recent addition to the Linux kernel, because it purports to handle nonblocking \emph{file} and network \io.
132	This decision allowed an interesting performance and fairness comparison with other threading systems using @select@, @epoll@, \etc.
133	While the current \CFA release supports older versions of Linux ($\ge$~Ubuntu 16.04) and gcc/clang compilers ($\ge$~gcc 6.0), it is not the purpose of this project to find workarounds in these older systems to provide backwards compatibility.
134	The hope is that these new features will soon become mainstream features.
135
136	\section{Contributions}\label{s:Contributions}
137	This work provides the following scheduling contributions for advanced \gls{uthrding} runtime systems:
138	\begin{enumerate}[leftmargin=*]
139	\item
140	A scalable scheduling algorithm that offers progress guarantees.
141	\item
142	Support for user-level \glsxtrshort{io} capabilities based on Linux's @io_uring@.
143	\item
144	An algorithm for load-balancing and idle sleep of processors, including NUMA awareness.
145	\item
146	A mechanism for adding fairness on top of the MQMS algorithm through helping, used both for scalable scheduling algorithm and the user-level \glsxtrshort{io}.
147	\item
148	An optimization of the helping mechanism for load balancing to reduce scheduling costs.
149	\item
150	An optimization for the alternative relaxed-list for load balancing to reduce scheduling costs in embarrassingly parallel cases.
151	\end{enumerate}

Note: See TracBrowser for help on using the repository browser.

Download in other formats: