Context Navigation

source: doc/theses/thierry_delisle_PhD/comp_II/comp_II_PAB.tex @ 41870a5

ADTast-experimentalenumpthread-emulationqualifiedEnum

Last change on this file since 41870a5 was 1c412aa, checked in by Peter A. Buhr <pabuhr@…>, 5 years ago
final comments on Thierry's comp II
Property mode set to `100644`
File size: 40.8 KB

Line
1	\documentclass[11pt]{article}
2	\usepackage{fullpage}
3	\usepackage[T1]{fontenc}
4	\usepackage[utf8]{inputenc}
5	\usepackage{xspace}
6	\usepackage{xcolor}
7	\usepackage{graphicx}
8	\usepackage{epic,eepic}
9	\usepackage{listings} % for code listings
10	\usepackage{glossaries}
11	\usepackage{textcomp}
12	% cfa macros used in the document
13	\input{common}
14
15	\setlist{topsep=6pt,parsep=0pt} % global reduce spacing between points
16	\newcommand{\uC}{$\mu$\CC}
17	\usepackage[breaklinks=true,pagebackref=true,hidelinks]{hyperref}
18	\urlstyle{rm}
19	\setlength{\abovecaptionskip}{5pt plus 3pt minus 2pt}
20	\lstMakeShortInline$% % single-character for \lstinline
21	%\usepackage[margin=1in]{geometry}
22	%\usepackage{float}
23	%\usepackage{breakurl}
24
25	\input{glossary}
26
27	\CFAStyle % use default CFA format-style
28
29	\title{
30	\Huge \vspace*{1in} The \CFA Scheduler\\
31	\huge \vspace*{0.25in} PhD Comprehensive II Research Proposal
32	\vspace*{1in}
33	}
34
35	\author{
36	\huge Thierry Delisle \vspace*{5pt} \\
37	\Large \texttt{tdelisle@uwaterloo.ca} \vspace*{5pt} \\
38	\Large Cheriton School of Computer Science \\
39	\Large University of Waterloo
40	}
41
42	\date{
43	\today
44	}
45
46	\begin{document}
47	\maketitle
48	\cleardoublepage
49
50	\newcommand{\cit}{\textsuperscript{[Citation Needed]}\xspace}
51	\newcommand{\TODO}{{\large\bf\color{red} TODO: }\xspace}
52
53	% ===============================================================================
54	% ===============================================================================
55
56	\tableofcontents
57
58	% ===============================================================================
59	% ===============================================================================
60	\newpage
61	\section{Introduction}
62	\subsection{\CFA and the \CFA concurrency package}
63	\CFA~\cite{Moss18} is a modern, polymorphic, non-object-oriented, concurrent, backwards-compatible extension of the C programming language.
64	It aims to add high-productivity features while maintaining the predictable performance of C.
65	As such, concurrency in \CFA~\cite{Delisle19} aims to offer simple and safe high-level tools while still allowing performant code.
66	\CFA concurrent code is written in the synchronous programming paradigm but uses \glspl{uthrd} in order to achieve the simplicity and maintainability of synchronous programming without sacrificing the efficiency of asynchronous programing.
67	As such, the \CFA \newterm{scheduler} is a preemptive user-level scheduler that maps \glspl{uthrd} onto \glspl{kthrd}.
68
69	\newterm{Scheduling} occurs when execution switches from one thread to another, where the second thread is implicitly chosen by the scheduler.
70	This scheduling is an indirect handoff, as opposed to generators and coroutines which explicitly switch to the next generator and coroutine respectively.
71	The cost of switching between two threads for an indirect handoff has two components:
72	\begin{enumerate}
73	\item
74	the cost of actually context-switching, \ie changing the relevant registers to move execution from one thread to the other,
75	\item
76	and the cost of scheduling, \ie deciding which thread to run next among all the threads ready to run.
77	\end{enumerate}
78	The first cost is generally constant and fixed\footnote{Affecting the constant context-switch cost is whether it is done in one step, after the scheduling, or in two steps, context-switching to a fixed third-thread before scheduling.}, while the scheduling cost can vary based on the system state.
79	Adding multiple \glspl{kthrd} does not fundamentally change the scheduler semantics or requirements, it simply adds new correctness requirements, \ie \newterm{linearizability}\footnote{Meaning however fast the CPU threads run, there is an equivalent sequential order that gives the same result.}, and a new dimension to performance: scalability, where scheduling cost now also depends on contention.
80
81	The more threads switch, the more the administration cost of scheduling becomes noticeable.
82	It is therefore important to build a scheduler with the lowest possible cost and latency.
83	Another important consideration is \newterm{fairness}.
84	In principle, scheduling should give the illusion of perfect fairness, where all threads ready to run are running \emph{simultaneously}.
85	While the illusion of simultaneity is easier to reason about, it can break down if the scheduler allows too much unfairness.
86	Therefore, the scheduler should offer as much fairness as needed to guarantee eventual progress, but use unfairness to help performance.
87	In practice, threads must wait in turn but there can be advantages to unfair scheduling, similar to the the express cash-register at a grocery store.
88
89	The goal of this research is to produce a scheduler that is simple for programmers to understand and offers good performance.
90	Here understandability does not refer to the API but to how much scheduling concerns programmers need to take into account when writing a \CFA concurrent package.
91	Therefore, the main goal of this proposal is :
92	\begin{quote}
93	The \CFA scheduler should be \emph{viable} for \emph{any} workload.
94	\end{quote}
95
96	For a general purpose scheduler, it is impossible to produce an optimal algorithm as it would require knowledge of the future behaviour of threads.
97	As such, scheduling performance is generally either defined by the best case scenario, \ie a workload to which the scheduler is tailored, or the worst case scenario, \ie the scheduler behaves no worst than \emph{X}.
98	For this proposal, the performance is evaluated using the second approach to allow \CFA programmers to rely on scheduling performance.
99	Because there is no optimal scheduler, ultimately \CFA may allow programmers to write their own scheduler; but that is not the subject of this proposal, which considers only the default scheduler.
100	As such, it is important that only programmers with exceptionally high performance requirements should need to write their own scheduler and replace the scheduler in this proposal.
101
102	To achieve the \CFA scheduling goal includes:
103	\begin{enumerate}
104	\item
105	producing a scheduling strategy with sufficient fairness guarantees,
106	\item
107	creating an abstraction layer over the operating system to handle kernel-threads spinning unnecessarily,
108	\item
109	scheduling blocking I/O operations,
110	\item
111	and writing sufficient library tools to allow developers to indirectly use the scheduler, either through tuning knobs or replacing the default scheduler.
112	\end{enumerate}
113
114	% ===============================================================================
115	% ===============================================================================
116
117	\section{\CFA Scheduling}
118	To schedule user-level threads across all workloads, the scheduler has a number of requirements:
119
120	\paragraph{Correctness} As with any other concurrent data structure or algorithm, the correctness requirement is paramount.
121	The scheduler cannot allow threads to be dropped from the ready queue, \ie scheduled but never run, or be executed multiple times when only being scheduled once.
122	Since \CFA concurrency has no spurious wakeup, this definition of correctness also means the scheduler should have no spurious wakeup.
123	The \CFA scheduler must be correct.
124
125	\paragraph{Performance} The performance of a scheduler can generally be measured in terms of scheduling cost, scalability and latency.
126	\newterm{Scheduling cost} is the cost to switch from one thread to another, as mentioned above.
127	For simple applications, where a single kernel thread does most of the scheduling, it is generally the dominating cost.
128	\newterm{Scalability} is the cost of adding multiple kernel threads because it increases the time for context switching because of contention by multiple threads accessing shared resources, \eg the ready queue.
129	Finally, \newterm{tail latency} is service delay and relates to thread fairness.
130	Specifically, latency measures how long a thread waits to run once scheduled and is evaluated in the worst case.
131	The \CFA scheduler should offer good performance for all three metrics.
132
133	\paragraph{Fairness} Like performance, this requirement has several aspect : eventual progress, predictability and performance reliability.
134	\newterm{Eventual progress} guarantees every scheduled thread is eventually run, \ie prevent starvation.
135	As a hard requirement, the \CFA scheduler must guarantee eventual progress, otherwise the above mentioned illusion of simultaneous execution is broken and the scheduler becomes much more complex to reason about.
136	\newterm{Predictability} and \newterm{reliability} means similar workloads achieve similar performance and programmer execution intuition is respected.
137	For example, a thread that yields aggressively should not run more often then other tasks.
138	While this is intuitive, it does not hold true for many work-stealing or feedback based schedulers.
139	The \CFA scheduler must guarantee eventual progress and should be predictable and offer reliable performance.
140
141	\paragraph{Efficiency} Finally, efficient usage of CPU resources is also an important requirement and is discussed in depth towards the end of the proposal.
142	\newterm{Efficiency} means avoiding using CPU cycles when there are no threads to run, and conversely, use all CPUs available when the workload can benefit from it.
143	Balancing these two states is where the complexity lies.
144	The \CFA scheduler should be efficient with respect to the underlying (shared) computer.
145
146	\bigskip To achieve these requirements, I can reject two broad types of scheduling strategies : feedback-based and priority schedulers.
147
148	\subsection{Feedback-Based Schedulers}
149	Many operating systems use schedulers based on feedback in some form, \eg measuring how much CPU a particular thread has used\footnote{Different metrics can be measured but it is not relevant to the discussion.} and schedule threads based on this metric.
150	These strategies are sensible for operating systems but rely on two assumptions for the workload:
151
152	\begin{enumerate}
153	\item Threads live long enough for useful feedback information to be to gathered.
154	\item Threads belong to multiple users so fairness across threads is insufficient.
155	\end{enumerate}
156
157	While these two assumptions generally hold for operating systems, they may not for user-level threading.
158	Since \CFA has the explicit goal of allowing many smaller threads, this can naturally lead to threads with much shorter lifetimes that are only scheduled a few times.
159	Scheduling strategies based on feedback cannot be effective in these cases because there is no opportunity to measure the metrics that underlie the algorithm.
160	Note, the problem of \newterm{feedback convergence} (reacting too slowly to scheduling events) is not specific to short lived threads but can also occur with threads that show drastic changes in scheduling, \eg threads running for long periods of time and then suddenly blocking and unblocking quickly and repeatedly.
161
162	In the context of operating systems, these concerns can be overshadowed by a more pressing concern : security.
163	When multiple users are involved, it is possible some users are malevolent and try to exploit the scheduling strategy to achieve some nefarious objective.
164	Security concerns mean more precise and robust fairness metrics must be used to guarantee fairness across processes created by users as well as threads created within a process.
165	In the case of the \CFA scheduler, every thread runs in the same user space and is controlled by the same user.
166	Fairness across users is therefore a given and it is then possible to safely ignore the possibility that threads are malevolent.
167	This approach allows for a much simpler fairness metric and in this proposal \emph{fairness} is defined as: when multiple threads are cycling through the system, the total ordering of threads being scheduled, \ie pushed onto the ready-queue, should not differ much from the total ordering of threads being executed, \ie popped from the ready-queue.
168
169	Since feedback is not necessarily feasible within the lifetime of all threads and a simple fairness metric can be used, the scheduling strategy proposed for the \CFA runtime does not use per-threads feedback.
170	Feedback in general is not rejected for secondary concerns like idle sleep for kernel threads, but no feedback is used to decide which thread to run next.
171
172	\subsection{Priority Schedulers}
173	Another broad category of schedulers are priority schedulers.
174	In these scheduling strategies, threads have priorities and the runtime schedules the threads with the highest priority before scheduling other threads.
175	Threads with equal priority are scheduled using a secondary strategy, often something simple like round-robin or FIFO.
176	A consequence of priority is that, as long as there is a thread with a higher priority that desires to run, a thread with a lower priority does not run.
177	This possible starving of threads can dramatically increase programming complexity since starving threads and priority inversion (prioritizing a lower priority thread) can both lead to serious problems.
178
179	An important observation is that threads do not need to have explicit priorities for problems to occur.
180	Indeed, any system with multiple ready-queues that attempts to exhaust one queue before accessing the other queues, essentially provide implicit priority, which can encounter starvation problems.
181	For example, a popular scheduling strategy that suffers from implicit priorities is work stealing.
182	\newterm{Work stealing} is generally presented as follows:
183	\begin{enumerate}
184	\item Each processor has a list of ready threads.
185	\item Each processor runs threads from its ready queue first.
186	\item If a processor's ready queue is empty, attempt to run threads from some other processor's ready queue.
187	\end{enumerate}
188
189	In a loaded system\footnote{A \newterm{loaded system} is a system where threads are being run at the same rate they are scheduled.}, if a thread does not yield, block, or preempt for an extended period of time, threads on the same processor's list starve if no other processors exhaust their list.
190
191	Since priorities can be complex for programmers to incorporate into their execution intuition, the scheduling strategy proposed for the \CFA runtime does not use a strategy with either implicit or explicit thread priorities.
192
193	\subsection{Schedulers without feedback or priorities}
194	This proposal conjectures that is is possible to construct a default scheduler for the \CFA runtime that offers good scalability and a simple fairness guarantee that is easy for programmers to reason about.
195	The simplest fairness guarantee is FIFO ordering, \ie threads scheduled first run first.
196	However, enforcing FIFO ordering generally conflicts with scalability across multiple processors because of the additional synchronization.
197	Thankfully, strict FIFO is not needed for sufficient fairness.
198	Since concurrency is inherently non-deterministic, fairness concerns in scheduling are only a problem if a thread repeatedly runs before another thread can run.
199	Some relaxation is possible because non-determinism means programmers already handle ordering problems to produce correct code and hence rely on weak guarantees, \eg that a specific thread will \emph{eventually} run.
200	Since some reordering does not break correctness, the FIFO fairness guarantee can be significantly relaxed without causing problems.
201	For this proposal, the target guarantee is that the \CFA scheduler provides \emph{probable} FIFO ordering, which allows reordering but makes it improbable that threads are reordered far from their position in total ordering.
202
203	The \CFA scheduler fairness is defined as follows:
204	\begin{itemize}
205	\item Given two threads $X$ and $Y$, the odds that thread $X$ runs $N$ times \emph{after} thread $Y$ is scheduled but \emph{before} it is run, decreases exponentially with regard to $N$.
206	\end{itemize}
207	While this is not a bounded guarantee, the probability that unfairness persist for long periods of times decreases exponentially, making persisting unfairness virtually impossible.
208
209	% ===============================================================================
210	% ===============================================================================
211	\section{Proposal Details}
212
213	\subsection{Central Ready Queue} \label{sec:queue}
214	A central ready queue can be built from a FIFO queue, where user threads are pushed onto the queue when they are ready to run, and processors (kernel-threads acting as virtual processors) pop the user threads from the queue and execute them.
215	Alistarh \etal~\cite{alistarh2018relaxed} show it is straightforward to build a relaxed FIFO list that is fast and scalable for loaded or overloaded systems.
216	The described queue uses an array of underlying strictly FIFO queues as shown in Figure~\ref{fig:base}\footnote{For this section, the number of underlying queues is assumed to be constant.
217	Section~\ref{sec:resize} discusses resizing the array.}.
218	Pushing new data is done by selecting one of these underlying queues at random, recording a timestamp for the operation and pushing to the selected queue.
219	Popping is done by selecting two queues at random and popping from the queue with the oldest timestamp.
220	A higher number of underlying queues leads to less contention on each queue and therefore better performance.
221	In a loaded system, it is highly likely the queues are non-empty, \ie several tasks are on each of the underlying queues.
222	This means that selecting a queue at random to pop from is highly likely to yield a queue with available items.
223	In Figure~\ref{fig:base}, ignoring the ellipsis, the chances of getting an empty queue is 2/7 per pick, meaning two random picks yield an item approximately 9 times out of 10.
224
225	\begin{figure}
226	\begin{center}
227	\input{base}
228	\end{center}
229	\caption{Relaxed FIFO list at the base of the scheduler: an array of strictly FIFO lists.
230	The timestamp is in all nodes and cell arrays.}
231	\label{fig:base}
232	\end{figure}
233
234	\begin{figure}
235	\begin{center}
236	\input{empty}
237	\end{center}
238	\caption{``More empty'' state of the queue: the array contains many empty cells.}
239	\label{fig:empty}
240	\end{figure}
241
242	When the ready queue is \emph{more empty}, \ie several of the queues are empty, selecting a random queue for popping is less likely to yield a successful selection and more attempts are needed, resulting in a performance degradation.
243	Figure~\ref{fig:empty} shows an example with fewer elements, where the chances of getting an empty queue is 5/7 per pick, meaning two random picks yield an item only half the time.
244	Since the ready queue is not empty, the pop operation \emph{must} find an element before returning and therefore must retry.
245	Note, the popping kernel thread has no work to do, but CPU cycles are wasted both for available user and kernel threads during the pop operation as the popping thread is using a CPU.
246	Overall performance is therefore influenced by the contention on the underlying queues and pop performance is influenced by the item density.
247
248	This leads to four performance cases for the centralized ready-queue, as depicted in Table~\ref{tab:perfcases}.
249	The number of processors (many or few) refers to the number of kernel threads \emph{actively} attempting to pop user threads from the queues, not the total number of kernel threads.
250	The number of threads (many or few) refers to the number of user threads ready to be run.
251	Many threads means they outnumber processors significantly and most underlying queues have items, few threads mean there are barely more threads than processors and most underlying queues are empty.
252	Cases with fewer threads than processors are discussed in Section~\ref{sec:sleep}.
253
254	\begin{table}
255	\begin{center}
256	\begin{tabular}{\|r\|l\|l\|}
257	\cline{2-3}
258	\multicolumn{1}{r\|}{} & \multicolumn{1}{c\|}{Many Processors} & \multicolumn{1}{c\|}{Few Processors} \\
259	\hline
260	Many Threads & A: good performance & B: good performance \\
261	\hline
262	Few Threads & C: worst performance & D: poor performance \\
263	\hline
264	\end{tabular}
265	\end{center}
266	\caption{Expected performance of the relaxed FIFO list in different cases.}
267	\label{tab:perfcases}
268	\end{table}
269
270	Performance can be improved in case~D (Table~\ref{tab:perfcases}) by adding information to help processors find which inner queues are used.
271	This addition aims to avoid the cost of retrying the pop operation but does not affect contention on the underlying queues and can incur some management cost for both push and pop operations.
272	The approach used to encode this information can vary in density and be either global or local.
273	\newterm{Density} means the information is either packed in a few cachelines or spread across several cachelines, and \newterm{local information} means each thread uses an independent copy instead of a single global, \ie common, source of information.
274
275	For example, Figure~\ref{fig:emptybit} shows a dense bitmask to identify which inner queues are currently in use.
276	This approach means processors can often find user threads in constant time, regardless of how many underlying queues are empty.
277	Furthermore, modern x86 CPUs have extended bit manipulation instructions (BMI2) that allow using the bitmask with very little overhead compared to the randomized selection approach for a filled ready queue, offering good performance even in cases with many empty inner queues.
278	However, this technique has its limits: with a single word\footnote{Word refers here to however many bits can be written atomically.} bitmask, the total number of underlying queues in the ready queue is limited to the number of bits in the word.
279	With a multi-word bitmask, this maximum limit can be increased arbitrarily, but it is not possible to check if the queue is empty by reading the bitmask atomically.
280
281	Finally, a dense bitmap, either single or multi-word, causes additional problems in case C (Table 1), because many processors are continuously scanning the bitmask to find the few available threads.
282	This increased contention on the bitmask(s) reduces performance because of cache misses after updates and the bitmask is updated more frequently by the scanning processors racing to read and/or update that information.
283	This increased update frequency means the information in the bitmask is more often stale before a processor can use it to find an item, \ie mask read says there are available user threads but none on queue.
284
285	\begin{figure}
286	\begin{center}
287	{\resizebox{0.8\textwidth}{!}{\input{emptybit}}}
288	\end{center}
289	\caption{``More empty'' queue with added bitmask to indicate which array cells have items.}
290	\label{fig:emptybit}
291	\end{figure}
292
293	Figure~\ref{fig:emptytree} shows another approach using a hierarchical tree data-structure to reduce contention and has been shown to work in similar cases~\cite{ellen2007snzi}\footnote{This particular paper seems to be patented in the US.
294	How does that affect \CFA? Can I use it in my work?}.
295	However, this approach may lead to poorer performance in case~B (Table~\ref{tab:perfcases}) due to the inherent pointer chasing cost and already low contention cost in that case.
296
297	\begin{figure}
298	\begin{center}
299	{\resizebox{0.8\textwidth}{!}{\input{emptytree}}}
300	\end{center}
301	\caption{``More empty'' queue with added binary search tree indicate which array cells have items.}
302	\label{fig:emptytree}
303	\end{figure}
304
305	Finally, a third approach is to use dense information, similar to the bitmap, but have each thread keep its own independent copy of it.
306	While this approach can offer good scalability \emph{and} low latency, the liveliness of the information can become a problem.
307	In the simple cases, local copies of which underlying queues are empty can become stale and end-up not being useful for the pop operation.
308	A more serious problem is that reliable information is necessary for some parts of this algorithm to be correct.
309	As mentioned in this section, processors must know \emph{reliably} whether the list is empty or not to decide if they can return \texttt{NULL} or if they must keep looking during a pop operation.
310	Section~\ref{sec:sleep} discusses another case where reliable information is required for the algorithm to be correct.
311
312	\begin{figure}
313	\begin{center}
314	\input{emptytls}
315	\end{center}
316	\caption{``More empty'' queue with added per processor bitmask to indicate which array cells have items.}
317	\label{fig:emptytls}
318	\end{figure}
319
320	There is a fundamental tradeoff among these approach.
321	Dense global information about empty underlying queues helps zero-contention cases at the cost of high-contention case.
322	Sparse global information helps high-contention cases but increases latency in zero-contention-cases, to read and ``aggregate'' the information\footnote{Hierarchical structures, \eg binary search tree, effectively aggregate information but follow pointer chains, learning information at each node.
323	Similarly, other sparse schemes need to read multiple cachelines to acquire all the information needed.}.
324	Finally, dense local information has both the advantages of low latency in zero-contention cases and scalability in high-contention cases, however the information can become stale making it difficult to use to ensure correctness.
325	The fact that these solutions have these fundamental limits suggest to me a better solution that attempts to combine these properties in an interesting ways.
326	Also, the lock discussed in Section~\ref{sec:resize} allows for solutions that adapt to the number of processors, which could also prove useful.
327
328	\paragraph{Objectives and Existing Work}
329
330	How much scalability is actually needed is highly debatable.
331	\emph{libfibre}~\cite{libfibre} has compared favorably to other schedulers in webserver tests~\cite{Karsten20} and uses a single atomic counter in its scheduling algorithm similarly to the proposed bitmask.
332	As such, the single atomic instruction on a shared cacheline may be sufficiently performant.
333
334	I have built a prototype of this ready queue in the shape of a data queue, \ie nodes on the queue are structures with a single int representing a thread and intrusive data fields.
335	Using this prototype I ran preliminary performance experiments that confirm the expected performance in Table~\ref{tab:perfcases}.
336	However, these experiments only offer a hint at the actual performance of the scheduler since threads form more complex operations than simple integer nodes, \eg threads are not independent of each other, when a thread blocks some other thread must intervene to wake it.
337
338	I have also integrated this prototype into the \CFA runtime, but have not yet created performance experiments to compare results, as creating one-to-one comparisons between the prototype and the \CFA runtime will be complex.
339
340	\subsection{Dynamic Resizing} \label{sec:resize}
341
342	\begin{figure}
343	\begin{center}
344	\input{system}
345	\end{center}
346	\caption{Global structure of the \CFA runtime system.}
347	\label{fig:system}
348	\end{figure}
349
350	The \CFA runtime system groups processors together as \newterm{clusters}, as shown in Figure~\ref{fig:system}.
351	Threads on a cluster are always scheduled on one of the processors of the cluster.
352	Currently, the runtime handles dynamically adding and removing processors from clusters at any time.
353	Since this is part of the existing design, the proposed scheduler must also support this behaviour.
354	However, dynamically resizing a cluster is considered a rare event associated with setup, tear down and major configuration changes.
355	This assumption is made both in the design of the proposed scheduler as well as in the original design of the \CFA runtime system.
356	As such, the proposed scheduler must honour the correctness of this behaviour but does not have any performance objectives with regard to resizing a cluster.
357	How long adding or removing processors take and how much this disrupts the performance of other threads is considered a secondary concern since it should be amortized over long period of times.
358	However, as mentioned in Section~\ref{sec:queue}, contention on the underlying queues can have a direct impact on performance.
359	The number of underlying queues must therefore be adjusted as the number of processors grows or shrinks.
360	Since the underlying queues are stored in a dense array, changing the number of queues requires resizing the array and expanding the array requires moving it, which can introduce memory reclamation problems if not done correctly.
361
362	\begin{figure}
363	\begin{center}
364	\input{resize}
365	\end{center}
366	\caption{Copy of data structure shown in Figure~\ref{fig:base}.}
367	\label{fig:base2}
368	\end{figure}
369
370	It is important to note how the array is used in this case.
371	While the array cells are modified by every push and pop operation, the array itself, \ie the pointer that would change when resized, is only read during these operations.
372	Therefore the use of this pointer can be described as frequent reads and infrequent writes.
373	This description effectively matches with the description of a reader-writer lock, infrequent but invasive updates among frequent read operations.
374	In the case of the ready queue described above, read operations are operations that push or pop from the ready queue but do not invalidate any references to the ready queue data structures.
375	Writes on the other hand would add or remove inner queues, invalidating references to the array of inner queues in a process.
376	Therefore, the current proposed approach to this problem is to add a per-cluster reader-writer lock around the ready queue to prevent restructuring of the ready-queue data-structure while threads are being pushed or popped.
377
378	There are possible alternatives to the reader-writer lock solution.
379	This problem is effectively a memory reclamation problem and as such there is a large body of research on the subject~\cite{michael2004hazard, brown2015reclaiming}.
380	However, the reader-write lock-solution is simple and can be leveraged to solve other problems (\eg processor ordering and memory reclamation of threads), which makes it an attractive solution.
381
382	\paragraph{Objectives and Existing Work}
383	The lock must offer scalability and performance on par with the actual ready-queue in order not to introduce a new bottleneck.
384	I have already built a lock that fits the desired requirements and preliminary testing show scalability and performance that exceed the target.
385	As such, I do not consider this lock to be a risk for this project.
386
387	\subsection{Idle Sleep} \label{sec:sleep}
388
389	\newterm{Idle sleep} is the process of putting processors to sleep when they have no threads to execute.
390	In this context, processors are kernel threads and sleeping refers to asking the kernel to block a thread.
391	This operation can be achieved with either thread synchronization operations like $pthread_cond_wait$ or using signal operations like $sigsuspend$.
392	The goal of putting idle processors to sleep is:
393	\begin{enumerate}
394	\item
395	reduce contention on the ready queue, since the otherwise idle processors generally contend trying to pop items from the queue,
396	\item
397	give back unneeded CPU time associated with a process to other user processors executing on the computer,
398	\item
399	and reduce energy consumption in cases where more idle kernel-threads translate to idle CPUs, which can cycle down.
400	\end{enumerate}
401	Support for idle sleep broadly involves calling the operating system to block the kernel thread and handling the race between a blocking thread and the waking thread, and handling which kernel thread should sleep or wake up.
402
403	When a processor decides to sleep, there is a race that occurs between it signalling that is going to sleep (so other processors can find sleeping processors) and actually blocking the kernel thread.
404	This operation is equivalent to the classic problem of missing signals when using condition variables: the ``sleepy'' processor indicates its intention to block but has not yet gone to sleep when another processor attempts to wake it up.
405	The waking-up operation sees the blocked process and signals it, but the blocking process is racing to sleep so the signal is missed.
406	In cases where kernel threads are managed as processors on the current cluster, loosing signals is not necessarily critical, because at least some processors on the cluster are awake and may check for more processors eventually.
407	Individual processors always finish scheduling user threads before looking for new work, which means that the last processor to go to sleep cannot miss threads scheduled from inside the cluster (if they do, that demonstrates the ready queue is not linearizable).
408	However, this guarantee does not hold if threads are scheduled from outside the cluster, either due to an external event like timers and I/O, or due to a user (or kernel) thread migrating from a different cluster.
409	In this case, missed signals can lead to the cluster deadlocking\footnote{Clusters should only deadlock in cases where a \CFA programmer \emph{actually} write \CFA code that leads to a deadlock.}.
410	Therefore, it is important that the scheduling of threads include a mechanism where signals \emph{cannot} be missed.
411	For performance reasons, it can be advantageous to have a secondary mechanism that allows signals to be missed in cases where it cannot lead to a deadlock.
412	To be safe, this process must include a ``handshake'' where it is guaranteed that either~: the sleeping processor notices that a user thread is scheduled after the sleeping processor signalled its intent to block or code scheduling threads sees the intent to sleep before scheduling and be able to wake-up the processor.
413	This matter is complicated by the fact that pthreads and Linux offer few tools to implement this solution and no guarantee of ordering of threads waking up for most of these tools.
414
415	Another important issue is avoiding kernel threads sleeping and waking frequently because there is a significant operating-system cost.
416	This scenario happens when a program oscillates between high and low activity, needing most and then less processors.
417	A possible partial solution is to order the processors so that the one which most recently went to sleep is woken up.
418	This allows other sleeping processors to reach deeper sleep state (when these are available) while keeping ``hot'' processors warmer.
419	Note that while this generally means organizing the processors in a stack, I believe that the unique index provided in my reader-writer lock can be reused to strictly order the waking processors, causing a mostly LIFO order.
420	While a strict LIFO stack is probably better, the processor index could prove useful for other reasons, while still offering a sufficiently LIFO ordering.
421
422	A final important aspect of idle sleep is when should processors make the decision to sleep and when is it appropriate for sleeping processors to be woken up.
423	Processors that are unnecessarily unblocked lead to unnecessary contention, CPU usage, and power consumption, while too many sleeping processors can lead to sub-optimal throughput.
424	Furthermore, transitions from sleeping to awake and vice-versa also add unnecessary latency.
425	There is already a wealth of research on the subject~\cite{schillings1996engineering, wiki:thunderherd} and I may use an existing approach for the idle-sleep heuristic in this project, \eg~\cite{Karsten20}.
426
427	\subsection{Asynchronous I/O}
428
429	The final aspect of this proposal is asynchronous I/O.
430	Without it, user threads that execute I/O operations block the underlying kernel thread, which leads to poor throughput.
431	It is preferable to block the user thread performing the I/O and reuse the underlying kernel-thread to run other ready user threads.
432	This approach requires intercepting user-thread calls to I/O operations, redirecting them to an asynchronous I/O interface, and handling the multiplexing/demultiplexing between the synchronous and asynchronous API.
433	As such, there are three components needed to implemented support for asynchronous I/O:
434	\begin{enumerate}
435	\item
436	an OS abstraction layer over the asynchronous interface,
437	\item
438	an event-engine to (de)multiplex the operations,
439	\item
440	and a synchronous interface for users to use.
441	\end{enumerate}
442	None of these components currently exist in \CFA and I will need to build all three for this project.
443
444	\paragraph{OS Abstraction}
445	One fundamental part for converting blocking I/O operations into non-blocking ones is having an underlying asynchronous I/O interface to direct the I/O operations.
446	While there exists many different APIs for asynchronous I/O, it is not part of this proposal to create a novel API.
447	It is sufficient to make one work in the complex context of the \CFA runtime.
448	\uC uses the $select$~\cite{select} as its interface, which handles ttys, pipes and sockets, but not disk.
449	$select$ entails significant complexity and is being replaced in UNIX operating-systems, which make it a less interesting alternative.
450	Another popular interface is $epoll$~\cite{epoll}, which is supposed to be cheaper than $select$.
451	However, $epoll$ also does not handle the file system and anecdotal evidence suggest it has problem with linux pipes and $TTY$s.
452	A popular cross-platform alternative is $libuv$~\cite{libuv}, which offers asynchronous sockets and asynchronous file system operations (among other features).
453	However, as a full-featured library it includes much more than I need and could conflict with other features of \CFA unless significant effort is made to merge them together.
454	A very recent alternative that I am investigating is $io_uring$~\cite{io_uring}.
455	It claims to address some of the issues with $epoll$ and my early investigating suggest that the claim is accurate.
456	$io_uring$ uses a much more general approach where system calls are register to a queue and later executed by the kernel, rather than relying on system calls to return an error instead of blocking and subsequently waiting for changes on file descriptors.
457	I believe this approach allows for fewer problems, \eg the manpage for $open$~\cite{open} states:
458	\begin{quote}
459	Note that [the $O_NONBLOCK$ flag] has no effect for regular files and block devices;
460	that is, I/O operations will (briefly) block when device activity is required, regardless of whether $O_NONBLOCK$ is set.
461	Since $O_NONBLOCK$ semantics might eventually be implemented, applications should not depend upon blocking behavior when specifying this flag for regular files and block devices.
462	\end{quote}
463	This makes approach based on $epoll$/$select$ less reliable since they may not work for every file descriptors.
464	For this reason, I have started to use $io_uring$ as the OS abstraction for the \CFA runtime, until further work shows problems I have not encountered yet.
465	However, only a small subset of the features are available in Ubuntu as of April 2020~\cite{wiki:ubuntu-linux}, which limits performance comparisons.
466	The currently available features are sufficient to start basic socket performance experiments and comparisons.
467
468	\paragraph{Event Engine}
469	Laying on top of the asynchronous interface layer is the event engine.
470	This engine is responsible for multiplexing (batching) the synchronous I/O requests into asynchronous I/O requests and demultiplexing the results to appropriate blocked user-threads.
471	This step can be straightforward for simple cases, but becomes quite complex when there are thousands of user threads performing both reads and writes, possibly on overlapping file descriptors.
472	Decisions that need to be made include:
473	\begin{enumerate}
474	\item
475	whether to poll from a separate kernel thread or a regularly scheduled user thread,
476	\item
477	what should be the ordering used when results satisfy many requests,
478	\item
479	how to handle threads waiting for multiple operations, etc.
480	\end{enumerate}
481
482	\paragraph{Interface}
483	Finally, for these non-blocking I/O components to be available, it is necessary to expose them through a synchronous interface because that is the \CFA concurrent programming style.
484	The interface can add novel features but it is preferable to match the existing POSIX interface when possible to be compatible with existing code.
485	Matching allows C programs written using this interface to be transparently converted to \CFA with minimal effort.
486	Where new functionality is needed, I will create a novel interface extensions to fill gaps and provide advanced features.
487
488
489	% ===============================================================================
490	% ===============================================================================
491	\section{Discussion}
492	I believe that runtime systems and scheduling are still open topics.
493	Many ``state of the art'' production frameworks still use single threaded event-loops because of performance considerations, \eg~\cite{nginx-design}, and, to my knowledge, no widely available system language offers modern threading facilities.
494	I believe the proposed work offers a novel runtime and scheduling package embedded in the \CFA programming language, where existing work only offers fragments that users must assemble themselves when possible.
495
496	% ===============================================================================
497	% ===============================================================================
498	\section{Timeline}
499	\begin{center}
500	\begin{tabular}{ \| r @{--} l \| p{4in} \| }
501	\hline May 2020 & October 2020 & Creation of the performance benchmark. \\
502	\hline November 2020 & March 2021 & Completion of the implementation. \\
503	\hline March 2021 & April 2021 & Final Performance experiments. \\
504	\hline May 2017 & August 2021 & Thesis writing and defense. \\
505	\hline
506	\end{tabular}
507	\end{center}
508
509	% B I B L I O G R A P H Y
510	% -----------------------------
511	\cleardoublepage
512	\phantomsection % allows hyperref to link to the correct page
513	\addcontentsline{toc}{section}{\refname}
514	\lstDeleteShortInline$% turn off as $ used in bibliography entries
515	\bibliographystyle{plain}
516	\bibliography{pl,local}
517
518	% G L O S S A R Y
519	% -----------------------------
520	\cleardoublepage
521	\phantomsection % allows hyperref to link to the correct page
522	\addcontentsline{toc}{section}{Glossary}
523	\printglossary
524
525	\end{document}

Note: See TracBrowser for help on using the repository browser.

Download in other formats: