source:doc/theses/thierry_delisle_PhD/comp_II/comp_II.tex@41efd33

Last change on this file since 41efd33 was 41efd33, checked in by Thierry Delisle <tdelisle@…>, 3 years ago

New version of the introduction

• Property mode set to 100644
File size: 31.2 KB
Line
1\documentclass[11pt,fullpage]{article}
2\usepackage[T1]{fontenc}
3\usepackage[utf8]{inputenc}
4\usepackage{listings}           % for code listings
5\usepackage{xspace}
6\usepackage{xcolor}
7\usepackage{graphicx}
9\usepackage{glossaries}
10\usepackage{textcomp}
11\usepackage{geometry}
12
13% cfa macros used in the document
14\input{common}
15\input{glossary}
16
17\CFAStyle                               % use default CFA format-style
18
19\title{
20        \Huge \vspace*{1in} The \CFA Scheduler\\
21        \huge \vspace*{0.25in} PhD Comprehensive II Research Proposal
22        \vspace*{1in}
23}
24
25\author{
26        \huge Thierry Delisle \\
27        \Large \vspace*{0.1in} \texttt{tdelisle@uwaterloo.ca} \\
28        \Large Cheriton School of Computer Science \\
29        \Large University of Waterloo
30}
31
32\date{
33        \today
34}
35
36\begin{document}
37\maketitle
38\cleardoublepage
39
40\newcommand{\cit}{\textsuperscript{[Citation Needed]}\xspace}
41\newcommand{\TODO}{~\newline{\large\bf\color{red} TODO :}\xspace}
42
43% ===============================================================================
44% ===============================================================================
45
46\tableofcontents
47
48% ===============================================================================
49% ===============================================================================
50\newpage
51\section{Introduction}
52\subsection{\CFA and the \CFA concurrency package}
53\CFA\cit is a modern, polymorphic, non-object-oriented, backwards-compatible extension of the C programming language. It aims to add high productivity features while maintaning the predictible performance of C. As such, concurrency in \CFA\cit aims to offer simple and safe high-level tools while still allowing performant code. Concurrent code is written in the synchronous programming paradigm but uses \glspl{uthrd} in order to achieve the simplicity and maintainability of synchronous programming without sacrificing the efficiency of asynchronous programing. As such, the \CFA scheduler is a preemptive user-level scheduler that maps \glspl{uthrd} onto \glspl{kthrd}.
54
56
57The more threads switch among each other, the more the administrating cost of scheduling becomes noticeable. It is therefore important to attempt to build a scheduler with the lowest possible cost and latency. Another important consideration is to take into account fairness. In principle, scheduling should give the illusion that all threads that are ready to run, are running simultaneously. This illusion is easier to reason about but can break down if the scheduler allows to much unfairness. Therefore, the scheduler should offer as much fairness as needed to guarantee eventual progress, but no more to help performance.
58
59The goal of this research is to produce a scheduler that is simple for programmers to understand and offers good performance. Here understandability does not refer to the API but to how much scheduling concerns programmers need to take into account when writing a \CFA concurrent package. Therefore, the main goal of this proposal is :
60\begin{quote}
61The \CFA scheduler should be \emph{viable} for \emph{any} workload.
62\end{quote}
63
64For a general purpose scheduler, it is impossible to produce an optimal algorithm as it would require knowledge of the future. As such, scheduling performance is generally either defined by the best case scenario, a workload to which the scheduler is tailored, or the worst case scenario i.e., the scheduler behaves no worst than \emph{X}. For this proposal, the performance is evaluated using the second approach to allow \CFA programmers to rely on scheduling performance.
65
66This objective includes producing a scheduling strategy with sufficient fairness guarantees, creating an abstraction layer over the operating system to handle kernel-threads spinning unnecessarily and hide blocking I/O operations, and writing sufficient library tools to allow developers to properly use the scheduler.
67
68% ===============================================================================
69% ===============================================================================
70
71\section{Scheduling for \CFA}
72While the \CFA concurrency package does not have any particular scheduling requirements beyond using \glspl{uthrd}, it is important that only programmers with exceptionnaly high performance requirements should need to write their own scheduler. We can therefore more precisely detail the requirements of the \CFA scheduler :
73
74\paragraph{Correctness} As with any other concurrent data structure or algorithm, the correctness requirement is paramount. The scheduler cannot allow threads to be dropped from the ready-queue, i.e., scheduled but never run, or be executed multiple times when only being scheduled once. Since \CFA does not already allow spurious wakeup, this definition of correctness also means the scheduler should not introduce spurious wakeups. The \CFA scheduler must be correct.
75
76\paragraph{Performance} The performance of a scheduler can generally be mesured in terms of scheduling cost, scalability and latency. Scheduling cost is the cost to switch from one thread to another, as mentioned above. For simple applications where a single kernel thread will do most of the scheduling, it is generally the dominating cost. When adding many kernel threads, scalability can become an issue, effectively increasing the cost of context-switching when contention is high. Finally, a third axis of performance is tail latency. This measurement is related to fairness and mesures how long is needed for a thread to be run once scheduled but evaluated in the worst cases. The \CFA scheduler should offer good performance in all three metrics.
77
78\paragraph{Fairness} Like performance, this requirements has several aspect : eventual progress, predictability and performance reliablility. As a hard requirement, the \CFA must guarantee eventual progress, i.e., prevent starvation, otherwise the above mentioned illusion of simultaneous execution is broken and the scheduler becomes much more complext to reason about. Beyond this requirement, performance should be predictible and reliable. This means similar workload achieve similar performance and programmer intuition is respected. An example of this is : a thread that yield agressively should not run more often then other tasks. While this is intuitive, it does not hold true for many work-stealing or feedback based schedulers. The \CFA scheduler must guarantee eventual progress and should be predictible and offer reliable performance.
79
80\paragraph{Efficiency} Finally, efficient usage of CPU ressources is also an important requirement. This is discussed more in depth towards the end of this proposal. It effectively refers to avoid using CPU power when there are no threads to run, and conversly, use all CPUs available when the workload can benefit from it. Balancing these two states is where the complexity lies. The \CFA scheduler should be efficient.
81
82To achieve these requirements, I believe two broad types of scheduling strategies should be avoided : feedback-based and priority schedulers.
83
84\subsection{Feedback-Based Schedulers}
85Many operating systems use schedulers based on feedback in some form; e.g., measuring how much CPU a particular thread has used\footnote{Different metrics can measured here but it is not relevant to the discussion.} and schedule threads based on this metric. These strategies are sensible for operating systems but rely on two assumptions on the workload :
86
87\begin{enumerate}
88        \item Threads live long enough for useful feedback information to be to gathered.
89        \item Threads belong to multiple users so fairness across threads is insufficient.
90\end{enumerate}
91
92While these two assumptions generally hold for operating systems, they may not for user-level threading. Since \CFA has the explicit goal of allowing many smaller threads, this can naturally lead to threads with much shorter lifetime, only being scheduled a few times. Scheduling strategies based on feedback cannot be effective in these cases because they do not have the opportunity to measure the metrics that underlay the algorithm. Note that the problem of feedback convergence (reacting too slowly to scheduling events) is not specific to short lived threads but can also occur with threads that show drastic changes in scheduling event, e.g., threads running for long periods of time and then suddenly blocking and unblocking quickly and repeatedly.
93
94In the context of operating systems, these concerns can be overshadowed by a more pressing concern : security. When multiple users are involved, it is possible that some users are malevolent and try to exploit the scheduling strategy in order to achieve some nefarious objective. Security concerns mean that more precise and robust fairness metrics must be used to guarantee fairness across users as well as threads. In the case of the \CFA scheduler, every thread runs in the same user-space and are controlled from the same user. Fairness across users is therefore a given and it is then possible to safely ignore the possibility that threads are malevolent. This allows for a much simpler fairness metric and in this proposal fairness'' is considered as follows : when multiple threads are cycling through the system, the total ordering of threads being scheduled, i.e., pushed onto the readyqueue, should not differ much from the total ordering of threads being executed, i.e., popped from the readyqueue.
95
96Since feedback is not necessarily feasible within the lifetime of all threads and a simple fairness metric can be used, the scheduling strategy proposed for the \CFA runtime does not use per-threads feedback. Feedback in general is not rejected for secondary concerns like idle sleep, but no feedback is used to decide which thread to run next.
97
98\subsection{Priority Schedulers}
100
101An important observation to make is that threads do not need to have explicit priorities for problems to be possible. Indeed, any system with multiple ready-queues and attempts to exhaust one queue before accessing the other queues, could encounter starvation problems. A popular scheduling strategy that suffers from implicit priorities is work-stealing. Work-stealing is generally presented as follows :
102
103\begin{itemize}
104        \item Each processor has a list of threads.
105\end{itemize}
106\begin{enumerate}
107        \item Run threads from this'' processor's list.
108        \item If this'' processor's list is empty, run threads from some other processor's list.
109\end{enumerate}
110
111In a loaded system\footnote{A loaded system is a system where threads are being run at the same rate they are scheduled.}, if a thread does not yield or block for an extended period of time, threads on the same processor list starve if no other processors exhaust their list.
112
113Since priorities can be complex to handle for programmers, the scheduling strategy proposed for the \CFA runtime does not use a strategy with either implicit or explicit thread priorities.
114
115\subsection{Schedulers without feedback or priorities}
116I claim that the ideal default scheduler for the \CFA runtime is a scheduler that offers good scalability and a simple fairness guarantee that is easy for programmers to reason about. The simplest fairness guarantee is FIFO ordering, i.e., threads scheduled first run first. However, enforcing FIFO ordering generally conflicts with scalability across multiple processors because of the additionnal synchronization. Thankfully, strict FIFO is not needed for sufficient fairness. Since concurrency is inherently non-deterministic, fairness concerns in scheduling are only a problem if a thread repeatedly runs before another thread can run\footnote{This is because the non-determinism means that programmers must already handle ordering problems in order to produce correct code and already must rely on weak guarantees, for example that a specific thread will \emph{eventually} run.}. Since some reordering does not break correctness, the FIFO fairness guarantee can be significantly relaxed without causing problems. For this proposal, the target guarantee is that the \CFA scheduler guarantees \emph{probable} FIFO ordering, which allows reordering but makes it improbable that threads are reordered far from their position in total ordering.
117
118It is defined as follows :
119\begin{itemize}
120        \item Given two threads $X$ and $Y$, the odds that thread $X$ runs $N$ times \emph{after} thread $Y$ is scheduled but \emph{before} it is run, decreases exponentially with regards to $N$.
121\end{itemize}
122
123While this is not a bounded guarantee, the probability that problems persist for long periods of times decreases exponentially, making persisting problems virtually impossible.
124
125% ===============================================================================
126% ===============================================================================
127% \section{Proposal}
128
130% Using trevor's paper\cit as basis, it is simple to build a relaxed FIFO list that is fast and scalable for loaded or overloaded systems. The described queue uses an array of underlying strictly FIFO queue. Pushing new data is done by selecting one of these underlying queues at random, recording a timestamp for the push and pushing to the selected queue. Popping is done by selecting two queues at random and popping from the queue for which the head has the oldest timestamp. In loaded or overloaded systems, it is higly likely that the queues is far from empty, e.i., several tasks are on each of the underlying queues. This means that selecting a queue at random to pop from is higly likely to yield a queue that is not empty.
131
133
134% \TODO double check the next sentence
135% Preliminary result indicate that the bitmask approach with the BMI2 extension can lead to multi-threaded performance that is contention agnostic in the worst case.
137
138% In cases where this is insufficient, another approach is to use a hiearchical data structure. Creating a tree of nodes to reduce contention has been shown to work in similar cases\cit(SNZI: Scalable NonZero Indicators)\footnote{This particular paper seems to be patented in the US. How does that affect \CFA? Can I use it in my work?}. However, this approach may lead to poorer single-threaded performance due to the inherent pointer chasing, as such, it was not considered as the first approach but as a fallback in case the bitmask approach does not satisfy the performance goals.
139
140% Part of this performance relies on contention being low when there are few threads on the readyqueue. However, this can be assumed reliably if the system handles putting idle processors to sleep, which is addressed in section \ref{sleep}.
141
142% \paragraph{Objectives and Existing Work}
143% How much scalability is actually needed is highly debatable, libfibre\cit is has compared favorably to other schedulers in webserver tests\cit and uses a single atomic counter in its scheduling algorithm similarly to the proposed bitmask. As such the single atomic instruction on a shared cacheline may be sufficiently performant.
144
145% I have built a prototype of this ready-queue (including the bitmask and BMI2 usage, but not the sharded bitmask) and ran performance experiments on it but it is difficult to compare this prototype to a thread scheduler as the prototype is used as a data-queue. I have also integrated this prototype into the \CFA runtime, but have not yet created performance experiments to compare results. I believe that the bitmask approach is currently one of the larger risks of the proposal, early tests lead me to believe it may work but it is not clear that the contention problem can be overcome. The worst-case scenario is a case where the number of processors and the number of ready threads are similar, yet scheduling events are very frequent. Fewer threads should lead to the Idle Sleep mechanism reducing contention while having many threads ready leads to optimal performance. It is difficult to evaluate the likeliness of this worst-case scenario in real workloads. I believe, frequent scheduling events suggest a more bursty'' workload where new work is finely divided among many threads which race to completion. This type of workload would only see a peek of contention close to the end of the work, but no sustained contention. Very fine-grained pipelines are less bursty'', these may lead to more sustained contention. However, they could also easily benefit from a direct hand-off strategy which would circumvent the problem entirely.
146
147% \subsection{Dynamic Resizing}
149
150% There are possible alternatives to the Reader Writer lock solution. This problem is effectively a memory reclamation problem and as such there is a large body of research on the subject. However, the RWlock solution is simple and can be leveraged to solve other problems (e.g. processor ordering and memory reclamation of threads) which makes it an attractive solution.
151
152% \paragraph{Objectives and Existing Work}
153% The lock must offer scalability and performance on par with the actual ready-queue in order not to introduce a new bottle neck. I have already built a lock that fits the desired requirements and preliminary testing show scalability and performance that exceed the target. As such, I do not consider this lock to be a risk on this project.
154
155% \subsection{Idle Sleep} \label{sleep}
156% As mentionned above, idle sleep is the process of putting processors to sleep while they do not have threads to execute. In this context processors are kernel-threads and sleeping refers to asking the kernel to block a thread. This can be achieved with either thread synchronization operations like pthread\_cond\_wait or using signal operations like sigsuspend.
157
158% Support for idle sleep broadly involves calling the operating system to block the kernel thread but also handling the race between the sleeping and the waking up, and handling which kernel thread should sleep or wake-up.
159
160% When a processor decides to sleep, there is a race that occurs between it signalling that it will go to sleep (so other processors can find sleeping processors) and actually blocking the kernel thread. This is equivalent to the classic problem of missing signals when using condition variables, the sleepy'' processor indicates that it will sleep but has not yet gone to sleep, if another processor attempts to wake it up, the waking-up operation may claim nothing needs to be done and the signal will have been missed. In cases where threads are scheduled from processors on the current cluster, loosing signals is not necessarily critical, because at least some processors on the cluster are awake. Individual processors always finish shceduling threads before looking for new work, which means that the last processor to go to sleep cannot miss threads scheduled from inside the cluster (if they do, that demonstrates the ready-queue is not linearizable). However, this guarantee does not hold if threads are shceduled from outside the cluster, either due to an external event like timers and I/O, or due to a thread migrating from a different cluster. In this case, missed signals can lead to the cluster deadlocking where it should not\footnote{Clusters should'' never deadlock, but for this proposal, cases where \CFA users \emph{actually} wrote \CFA code that leads to a deadlock it is considered as a deadlock that should'' happen. }. Therefore, it is important that the scheduling of threads include a mechanism where signals \emph{cannot} be missed. For performance reasons, it can be advantageous to have a secondary mechanism that allows signals to be missed in cases where it cannot lead to a deadlock. To be safe, this process must include a handshake'' where it is guaranteed that either~: the sleepy processor notices that a thread was scheduled after it signalled its intent to block or code scheduling threads well see the intent to sleep before scheduling and be able to wake-up the processor. This matter is complicated by the fact that pthread offers few tools to implement this solution and offers no guarantee of ordering of threads waking up for most of these tools.
161
162% Another issues is trying to avoid kernel sleeping and waking frequently. A possible partial solution is to order the processors so that the one which most recently went to sleep is woken up. This allows other sleeping processors to reach deeper sleep state (when these are available) while keeping hot'' processors warmer. Note that while this generally means organising the processors in a stack, I believe that the unique index provided by the ReaderWriter lock can be reused to strictly order the waking order of processors, causing a LIFO like waking order. While a strict LIFO stack is probably better, using the processor index could proove useful and offer a sufficiently LIFO ordering.
163
164% Finally, another important aspect of Idle Sleep is when should processors make the decision to sleep and when it is appropriate for sleeping processors to be woken up. Processors that are unnecessarily awake lead to unnecessary contention and power consumption, while too many sleeping processors can lead to sub-optimal throughput. Furthermore, transitions from sleeping to awake and vice-versa also add unnecessary latency. There is already a wealth of research on the subject and I do not plan to implement a novel idea for the Idle Sleep heuristic in this project.
165
166% \subsection{Asynchronous I/O}
167% The final aspect of this proposal is asynchronous I/O. Without it, user threads that execute I/O operations will block the underlying kernel thread. This leads to poor throughput, it would be preferrable to block the user-thread and reuse the underlying kernel-thread to run other ready threads. This requires intercepting the user-threads' calls to I/O operations, redirecting them to an asynchronous I/O interface and handling the multiplexing between the synchronous and asynchronous API. As such, these are the three components needed to implemented to support asynchronous I/O : an OS abstraction layer over the asynchronous interface, an event-engine to (de)multiplex the operations and a synchronous interface for users to use. None of these components currently exist in \CFA and I will need to build all three for this project.
168
169% \paragraph{OS Abstraction}
170% One of the fundamental part of this converting blocking I/O operations into non-blocking ones. This relies on having an underlying asynchronous I/O interface to which to direct the I/O operations. While there exists many different APIs for asynchronous I/O, it is not part of this proposal to create a novel API, simply to use an existing one that is sufficient. uC++ uses the \texttt{select} as its interface, which handles pipes and sockets. It entails significant complexity and has performances problems which make it a less interesting alternative. Another interface which is becoming popular recently\cit is \texttt{epoll}. However, epoll also does not handle file system and seems to have problem to linux pipes and \texttt{TTY}s\cit. A very recent alternative that must still be investigated is \texttt{io\_uring}. It claims to address some of the issues with \texttt{epoll} but is too recent to be confident that it does. Finally, a popular cross-platform alternative is \texttt{libuv}, which offers asynchronous sockets and asynchronous file system operations (among other features). However, as a full-featured library it includes much more than what is needed and could conflict with other features of \CFA unless significant efforts are made to merge them together.
171
172% \paragraph{Event-Engine}
173% Laying on top of the asynchronous interface layer is the event-engine. This engine is responsible for multiplexing (batching) the synchronous I/O requests into an asynchronous I/O request and demultiplexing the results onto appropriate blocked threads. This can be straightforward for the simple cases, but can become quite complex. Decisions that will need to be made include : whether to poll from a seperate kernel thread or a regularly scheduled user thread, what should be the ordering used when results satisfy many requests, how to handle threads waiting for multiple operations, etc.
174
175% \paragraph{Interface}
176% Finally, for these components to be available, it is necessary to expose them through a synchronous interface. This can be a novel interface but it is preferrable to attempt to intercept the existing POSIX interface in order to be compatible with existing code. This will allow C programs written using this interface to be transparently converted to \CFA with minimal effeort. Where this is not applicable, a novel interface will be created to fill the gaps.
177
178
179% % ===============================================================================
180% % ===============================================================================
181% \section{Discussion}
182
183
184% % ===============================================================================
185% % ===============================================================================
186% \section{Timeline}
187
188
189\cleardoublepage
190
191% B I B L I O G R A P H Y
192% -----------------------------
194\bibliographystyle{plain}
195\bibliography{pl,local}
196\cleardoublepage
197\phantomsection         % allows hyperref to link to the correct page
198
199% G L O S S A R Y
200% -----------------------------