Changeset 5569a31 for doc/theses/thierry_delisle_PhD/comp_II
- Timestamp: Apr 15, 2020, 11:05:25 AM (4 years ago)
- Branches: ADT, arm-eh, ast-experimental, enum, forall-pointer-decay, jacob/cs343-translation, master, new-ast, new-ast-unique-expr, pthread-emulation, qualifiedEnum
- Children: 4ea5308
- Parents: 34d0a28
- Location: doc/theses/thierry_delisle_PhD/comp_II
- Files: 3 edited
doc/theses/thierry_delisle_PhD/comp_II/comp_II.tex
…
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{xspace}
\usepackage{xcolor}
\usepackage{graphicx}
\usepackage{epic,eepic}
\usepackage{listings}			% for code listings
\usepackage{glossaries}
\usepackage{textcomp}
% cfa macros used in the document
\input{common}

\setlist{topsep=6pt,parsep=0pt}		% global reduce spacing between points
\newcommand{\uC}{$\mu$\CC}
\usepackage[hidelinks]{hyperref}
\setlength{\abovecaptionskip}{5pt plus 3pt minus 2pt}
\lstMakeShortInline$%			% single-character for \lstinline
%\usepackage[margin=1in]{geometry}
%\usepackage{float}

\input{glossary}
…
\author{
	\huge Thierry Delisle \vspace*{5pt} \\
	\Large \texttt{tdelisle@uwaterloo.ca} \vspace*{5pt} \\
	\Large Cheriton School of Computer Science \\
	\Large University of Waterloo
…
\newcommand{\cit}{\textsuperscript{[Citation Needed]}\xspace}
\newcommand{\TODO}{{\large\bf\color{red} TODO: }\xspace}

% ===============================================================================
…
\section{Introduction}
\subsection{\CFA and the \CFA concurrency package}
\CFA\cite{Moss18} is a modern, polymorphic, non-object-oriented, concurrent, backwards-compatible extension of the C programming language.
It aims to add high-productivity features while maintaining the predictable performance of C.
As such, concurrency in \CFA\cite{Delisle19} aims to offer simple and safe high-level tools while still allowing performant code.
\CFA concurrent code is written in the synchronous programming paradigm but uses \glspl{uthrd} in order to achieve the simplicity and maintainability of synchronous programming without sacrificing the efficiency of asynchronous programming.
As such, the \CFA \newterm{scheduler} is a preemptive user-level scheduler that maps \glspl{uthrd} onto \glspl{kthrd}.

\newterm{Scheduling} occurs when execution switches from one thread to another, where the second thread is implicitly chosen by the scheduler.
This scheduling is an indirect handoff, as opposed to generators and coroutines which explicitly switch to the next generator and coroutine respectively.
The cost of switching between two threads for an indirect handoff has two components:
\begin{enumerate}
\item
the cost of actually context-switching, \ie changing the relevant registers to move execution from one thread to the other,
\item
and the cost of scheduling, \ie deciding which thread to run next among all the threads ready to run.
\end{enumerate}
The first cost is generally constant and fixed\footnote{Affecting the constant context-switch cost is whether it is done in one step, after the scheduling, or in two steps, context-switching to a fixed third-thread before scheduling.}, while the scheduling cost can vary based on the system state.
Adding multiple \glspl{kthrd} does not fundamentally change the scheduler semantics or requirements, it simply adds new correctness requirements, \ie \newterm{linearizability}\footnote{Meaning however fast the CPU threads run, there is an equivalent sequential order that gives the same result.}, and a new dimension to performance: scalability, where scheduling cost now also depends on contention.

The more threads switch, the more the administration cost of scheduling becomes noticeable.
It is therefore important to build a scheduler with the lowest possible cost and latency.
Another important consideration is \newterm{fairness}.
In principle, scheduling should give the illusion of perfect fairness, where all threads ready to run are running \emph{simultaneously}.
While the illusion of simultaneity is easier to reason about, it can break down if the scheduler allows too much unfairness.
Therefore, the scheduler should offer as much fairness as needed to guarantee eventual progress, but use unfairness to help performance.
In practice, threads must wait in turn but there can be advantages to unfair scheduling, similar to the express cash-register at a grocery store.

The goal of this research is to produce a scheduler that is simple for programmers to understand and offers good performance.
Here understandability does not refer to the API but to how much scheduling concerns programmers need to take into account when writing a \CFA concurrent package.
Therefore, the main goal of this proposal is:
\begin{quote}
The \CFA scheduler should be \emph{viable} for \emph{any} workload.
\end{quote}

For a general purpose scheduler, it is impossible to produce an optimal algorithm as it would require knowledge of the future behaviour of threads.
As such, scheduling performance is generally either defined by the best case scenario, \ie a workload to which the scheduler is tailored, or the worst case scenario, \ie the scheduler behaves no worse than \emph{X}.
For this proposal, the performance is evaluated using the second approach to allow \CFA programmers to rely on scheduling performance.
Because there is no optimal scheduler, ultimately \CFA may allow programmers to write their own scheduler; but that is not the subject of this proposal, which considers only the default scheduler.
As such, it is important that only programmers with exceptionally high performance requirements should need to write their own scheduler and replace the scheduler in this proposal.

Achieving the \CFA scheduling goal includes:
\begin{enumerate}
\item
producing a scheduling strategy with sufficient fairness guarantees,
\item
creating an abstraction layer over the operating system to handle kernel-threads spinning unnecessarily,
\item
scheduling blocking I/O operations,
\item
and writing sufficient library tools to allow developers to indirectly use the scheduler, either through tuning knobs or replacing the default scheduler.
\end{enumerate}

% ===============================================================================
…
\section{\CFA Scheduling}
To schedule user-level threads across all workloads, the scheduler has a number of requirements:

\paragraph{Correctness} As with any other concurrent data structure or algorithm, the correctness requirement is paramount.
The scheduler cannot allow threads to be dropped from the ready queue, \ie scheduled but never run, or be executed multiple times when only being scheduled once.
Since \CFA concurrency has no spurious wakeup, this definition of correctness also means the scheduler should have no spurious wakeup.
The \CFA scheduler must be correct.
\paragraph{Performance} The performance of a scheduler can generally be measured in terms of scheduling cost, scalability and latency.
\newterm{Scheduling cost} is the cost to switch from one thread to another, as mentioned above.
For simple applications, where a single kernel thread does most of the scheduling, it is generally the dominating cost.
\newterm{Scalability} is the cost of adding multiple kernel threads, which increases the time for context switching because of contention from multiple threads accessing shared resources, \eg the ready queue.
Finally, \newterm{tail latency} is service delay and relates to thread fairness.
Specifically, latency measures how long a thread waits to run once scheduled and is evaluated in the worst case.
The \CFA scheduler should offer good performance for all three metrics.

\paragraph{Fairness} Like performance, this requirement has several aspects: eventual progress, predictability and performance reliability.
\newterm{Eventual progress} guarantees every scheduled thread is eventually run, \ie prevents starvation.
As a hard requirement, the \CFA scheduler must guarantee eventual progress, otherwise the above mentioned illusion of simultaneous execution is broken and the scheduler becomes much more complex to reason about.
\newterm{Predictability} and \newterm{reliability} mean similar workloads achieve similar performance and programmer execution intuition is respected.
For example, a thread that yields aggressively should not run more often than other tasks.
While this is intuitive, it does not hold true for many work-stealing or feedback based schedulers.
The \CFA scheduler must guarantee eventual progress and should be predictable and offer reliable performance.

\paragraph{Efficiency} Finally, efficient usage of CPU resources is also an important requirement and is discussed in depth towards the end of the proposal.
\newterm{Efficiency} means avoiding using CPU cycles when there are no threads to run, and conversely, use all CPUs available when the workload can benefit from it.
Balancing these two states is where the complexity lies.
The \CFA scheduler should be efficient with respect to the underlying (shared) computer.

\bigskip To achieve these requirements, I can reject two broad types of scheduling strategies: feedback-based and priority schedulers.

\subsection{Feedback-Based Schedulers}
Many operating systems use schedulers based on feedback in some form, \eg measuring how much CPU a particular thread has used\footnote{Different metrics can be measured but it is not relevant to the discussion.} and schedule threads based on this metric.
These strategies are sensible for operating systems but rely on two assumptions for the workload:

\begin{enumerate}
	…
\end{enumerate}
While these two assumptions generally hold for operating systems, they may not for user-level threading.
Since \CFA has the explicit goal of allowing many smaller threads, this can naturally lead to threads with much shorter lifetimes that are only scheduled a few times.
Scheduling strategies based on feedback cannot be effective in these cases because there is no opportunity to measure the metrics that underlie the algorithm.
Note, the problem of \newterm{feedback convergence} (reacting too slowly to scheduling events) is not specific to short lived threads but can also occur with threads that show drastic changes in scheduling, \eg threads running for long periods of time and then suddenly blocking and unblocking quickly and repeatedly.

In the context of operating systems, these concerns can be overshadowed by a more pressing concern: security.
When multiple users are involved, it is possible some users are malevolent and try to exploit the scheduling strategy to achieve some nefarious objective.
Security concerns mean more precise and robust fairness metrics must be used to guarantee fairness across processes created by users as well as threads created within a process.
In the case of the \CFA scheduler, every thread runs in the same user space and is controlled by the same user.
Fairness across users is therefore a given and it is then possible to safely ignore the possibility that threads are malevolent.
This approach allows for a much simpler fairness metric and in this proposal \emph{fairness} is defined as: when multiple threads are cycling through the system, the total ordering of threads being scheduled, \ie pushed onto the ready-queue, should not differ much from the total ordering of threads being executed, \ie popped from the ready-queue.

Since feedback is not necessarily feasible within the lifetime of all threads and a simple fairness metric can be used, the scheduling strategy proposed for the \CFA runtime does not use per-thread feedback.
Feedback in general is not rejected for secondary concerns like idle sleep for kernel threads, but no feedback is used to decide which thread to run next.

\subsection{Priority Schedulers}
Another broad category of schedulers is priority schedulers.
In these scheduling strategies, threads have priorities and the runtime schedules the threads with the highest priority before scheduling other threads.
Threads with equal priority are scheduled using a secondary strategy, often something simple like round-robin or FIFO.
A consequence of priority is that, as long as there is a thread with a higher priority that desires to run, a thread with a lower priority does not run.
This possible starving of threads can dramatically increase programming complexity since starving threads and priority inversion (prioritizing a lower priority thread) can both lead to serious problems.
An important observation is that threads do not need to have explicit priorities for problems to occur.
Indeed, any system with multiple ready-queues that attempts to exhaust one queue before accessing the other queues, essentially provides implicit priority, which can encounter starvation problems.
For example, a popular scheduling strategy that suffers from implicit priorities is work stealing.
\newterm{Work stealing} is generally presented as follows:
\begin{enumerate}
\item Each processor has a list of ready threads.
\item Each processor runs threads from its ready queue first.
\item If a processor's ready queue is empty, attempt to run threads from some other processor's ready queue.
\end{enumerate}
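As a minimal sketch only (the helper $pop_front$, the per-processor $ready_queue$ array, and the random victim choice are assumptions for illustration, not part of the proposal), the pop side of this strategy can be written as:
\begin{lstlisting}[language=C]
// Illustrative work-stealing pop: the local-list-first rule is what gives
// local threads implicit priority over threads on other processors' lists.
#include <stddef.h>
#include <stdlib.h>

struct thread;                                       // user thread, opaque here
struct list;                                         // FIFO list of ready threads

extern struct thread * pop_front( struct list * );   // returns NULL if empty
extern struct list * ready_queue[];                  // one list per processor
extern unsigned nprocs;

struct thread * next_thread( unsigned this_proc ) {
	struct thread * t = pop_front( ready_queue[this_proc] );  // local list first
	if ( t ) return t;
	unsigned victim = rand() % nprocs;           // otherwise steal from a victim
	return pop_front( ready_queue[victim] );     // may also return NULL
}
\end{lstlisting}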
In a loaded system\footnote{A \newterm{loaded system} is a system where threads are being run at the same rate they are scheduled.}, if a thread does not yield, block, or preempt for an extended period of time, threads on the same processor's list starve if no other processors exhaust their list.

Since priorities can be complex for programmers to incorporate into their execution intuition, the scheduling strategy proposed for the \CFA runtime does not use a strategy with either implicit or explicit thread priorities.

\subsection{Schedulers without feedback or priorities}
This proposal conjectures that it is possible to construct a default scheduler for the \CFA runtime that offers good scalability and a simple fairness guarantee that is easy for programmers to reason about.
The simplest fairness guarantee is FIFO ordering, \ie threads scheduled first run first.
However, enforcing FIFO ordering generally conflicts with scalability across multiple processors because of the additional synchronization.
Thankfully, strict FIFO is not needed for sufficient fairness.
Since concurrency is inherently non-deterministic, fairness concerns in scheduling are only a problem if a thread repeatedly runs before another thread can run.
Some relaxation is possible because non-determinism means programmers already handle ordering problems to produce correct code and hence rely on weak guarantees, \eg that a specific thread will \emph{eventually} run.
Since some reordering does not break correctness, the FIFO fairness guarantee can be significantly relaxed without causing problems.
For this proposal, the target guarantee is that the \CFA scheduler provides \emph{probable} FIFO ordering, which allows reordering but makes it improbable that threads are reordered far from their position in total ordering.

The \CFA scheduler fairness is defined as follows:
\begin{itemize}
\item Given two threads $X$ and $Y$, the odds that thread $X$ runs $N$ times \emph{after} thread $Y$ is scheduled but \emph{before} it is run, decreases exponentially with regard to $N$.
\end{itemize}
While this is not a bounded guarantee, the probability that unfairness persists for long periods of time decreases exponentially, making persisting unfairness virtually impossible.
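To make the itemized guarantee concrete, here is one hedged reading (the independence assumption is mine, added only for illustration): if each scheduling decision made while \(Y\) is ready fails to pick \(Y\) with probability at most \(p < 1\), independently of earlier decisions, then
\[
\Pr[\, X \mbox{ runs } N \mbox{ times after } Y \mbox{ is scheduled but before } Y \mbox{ runs} \,] \;\le\; p^{N} ,
\]
which decreases exponentially in \(N\), matching the stated guarantee.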
% ===============================================================================

\section{Proposal Details}

\subsection{Central Ready Queue} \label{sec:queue}
A central ready queue can be built from a FIFO queue, where user threads are pushed onto the queue when they are ready to run, and processors (kernel-threads acting as virtual processors) pop the user threads from the queue and execute them.
Alistarh \etal~\cite{alistarh2018relaxed} show it is straightforward to build a relaxed FIFO list that is fast and scalable for loaded or overloaded systems.
The described queue uses an array of underlying strictly FIFO queues as shown in Figure~\ref{fig:base}\footnote{For this section, the number of underlying queues is assumed to be constant.
Section~\ref{sec:resize} discusses resizing the array.}.
Pushing new data is done by selecting one of these underlying queues at random, recording a timestamp for the operation and pushing to the selected queue.
Popping is done by selecting two queues at random and popping from the queue with the oldest timestamp.
A higher number of underlying queues leads to less contention on each queue and therefore better performance.
In a loaded system, it is highly likely the queues are non-empty, \ie several tasks are on each of the underlying queues.
This means that selecting a queue at random to pop from is highly likely to yield a queue with available items.
In Figure~\ref{fig:base}, ignoring the ellipsis, the chance of getting an empty queue is 2/7 per pick, meaning two random picks yield an item approximately 9 times out of 10.
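The 9-out-of-10 figure follows from both picks landing on an empty queue with probability \(2/7 \times 2/7 = 4/49\), roughly 8\%.
As a sketch only (the sub-queue type, $local_push$, $local_pop$, and the per-queue timestamp field are assumptions for illustration, not the actual \CFA data structure), the push and pop operations just described look as follows:
\begin{lstlisting}[language=C]
// Sketch of the relaxed-FIFO push/pop: one random pick to push,
// two random picks to pop, preferring the older (smaller) timestamp.
#include <stdlib.h>
#include <time.h>

#define NQUEUES 64                        // assumed constant here (resizing comes later)

struct thread;                            // user thread, opaque in this sketch
struct subqueue {                         // one strictly FIFO list with its own lock
	unsigned long long ts;            // timestamp of the oldest element, 0 if empty
	// lock, head, tail, ...
};
extern void local_push( struct subqueue *, struct thread *, unsigned long long ts );
extern struct thread * local_pop( struct subqueue * );   // NULL if empty or raced

static struct subqueue queues[NQUEUES];

static unsigned long long now( void ) {   // any monotonic clock works as a timestamp
	struct timespec t;
	clock_gettime( CLOCK_MONOTONIC, &t );
	return t.tv_sec * 1000000000ull + t.tv_nsec;
}

void push( struct thread * thrd ) {
	struct subqueue * q = &queues[ rand() % NQUEUES ];   // single random pick
	local_push( q, thrd, now() );                        // timestamp recorded on push
}

struct thread * pop( void ) {
	for ( ;; ) {                                         // retry: queue known non-empty
		struct subqueue * a = &queues[ rand() % NQUEUES ];
		struct subqueue * b = &queues[ rand() % NQUEUES ];
		// prefer the sub-queue holding the oldest element
		struct subqueue * q = ( a->ts != 0 && ( b->ts == 0 || a->ts <= b->ts ) ) ? a : b;
		struct thread * thrd = local_pop( q );
		if ( thrd ) return thrd;
	}
}
\end{lstlisting}
Here $rand$ stands in for a per-processor random-number generator, and a real implementation also needs a reliable way to detect that the whole ready queue is empty so $pop$ can stop retrying; that issue is revisited below and in Section~\ref{sec:sleep}.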
\begin{figure}
	\begin{center}
		\input{base}
	\end{center}
	\caption{Relaxed FIFO list at the base of the scheduler: an array of strictly FIFO lists.
	The timestamp is in all nodes and cell arrays.}
	\label{fig:base}
\end{figure}

\begin{figure}
	…
\end{figure}

When the ready queue is \emph{more empty}, \ie several of the queues are empty, selecting a random queue for popping is less likely to yield a successful selection and more attempts are needed, resulting in a performance degradation.
Figure~\ref{fig:empty} shows an example with fewer elements, where the chance of getting an empty queue is 5/7 per pick, meaning two random picks yield an item only half the time (both picks hit an empty queue with probability 25/49).
Since the ready queue is not empty, the pop operation \emph{must} find an element before returning and therefore must retry.
Note, the popping kernel thread has no work to do during these retries, yet it occupies a CPU, wasting cycles that could serve available user threads or other kernel threads.
Overall performance is therefore influenced by the contention on the underlying queues and pop performance is influenced by the item density.

This leads to four performance cases for the centralized ready-queue, as depicted in Table~\ref{tab:perfcases}.
The number of processors (many or few) refers to the number of kernel threads \emph{actively} attempting to pop user threads from the queues, not the total number of kernel threads.
The number of threads (many or few) refers to the number of user threads ready to be run.
Many threads means they outnumber processors significantly and most underlying queues have items; few threads means there are barely more threads than processors and most underlying queues are empty.
Cases with fewer threads than processors are discussed in Section~\ref{sec:sleep}.

\begin{table}
	\begin{center}
		\begin{tabular}{…}
			…
			Many Threads & A: good performance & B: good performance \\
			\hline
			Few Threads & C: worst performance & D: poor performance \\
			\hline
		\end{tabular}
	\end{center}
	\caption{Expected performance of the relaxed FIFO list in different cases.}
	\label{tab:perfcases}
\end{table}

Performance can be improved in case~D (Table~\ref{tab:perfcases}) by adding information to help processors find which inner queues are used.
This addition aims to avoid the cost of retrying the pop operation but does not affect contention on the underlying queues and can incur some management cost for both push and pop operations.
The approach used to encode this information can vary in density and be either global or local.
\newterm{Density} means the information is either packed in a few cachelines or spread across several cachelines, and \newterm{local information} means each thread uses an independent copy instead of a single global, \ie common, source of information.

For example, Figure~\ref{fig:emptybit} shows a dense bitmask to identify which inner queues are currently in use.
This approach means processors can often find user threads in constant time, regardless of how many underlying queues are empty.
Furthermore, modern x86 CPUs have extended bit manipulation instructions (BMI2) that allow using the bitmask with very little overhead compared to the randomized selection approach for a filled ready queue, offering good performance even in cases with many empty inner queues.
However, this technique has its limits: with a single word\footnote{Word refers here to however many bits can be written atomically.} bitmask, the total number of underlying queues in the ready queue is limited to the number of bits in the word.
With a multi-word bitmask, this maximum limit can be increased arbitrarily, but it is not possible to check if the queue is empty by reading the bitmask atomically.

Finally, a dense bitmap, either single or multi-word, causes additional problems in case~C (Table~\ref{tab:perfcases}), because many processors are continuously scanning the bitmask to find the few available threads.
This increased contention on the bitmask(s) reduces performance because of cache misses after updates and the bitmask is updated more frequently by the scanning processors racing to read and/or update that information.
This increased update frequency means the information in the bitmask is more often stale before a processor can use it to find an item, \ie a mask read says there are available user threads but none are on the queue.
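A minimal sketch of the single-word variant (illustrative only; the helper names are assumptions, and a compiler builtin, GCC/Clang $__builtin_ctzll$, stands in for the BMI2 instructions mentioned above):
\begin{lstlisting}[language=C]
// Bit i is set while queue i is believed non-empty.  The mask is only a hint:
// it can be momentarily stale, so the pop path must still tolerate an empty pick.
#include <stdatomic.h>
#include <stdlib.h>

#define NQUEUES 64                                   // limited to one machine word

static _Atomic unsigned long long used = 0;          // dense, global information

void mark_nonempty( unsigned i ) {                   // after pushing to an empty queue
	atomic_fetch_or( &used, 1ull << i );
}
void mark_empty( unsigned i ) {                      // after popping the last element
	atomic_fetch_and( &used, ~(1ull << i) );
}

// Pick a non-empty queue, randomized to spread contention; -1 if the snapshot
// says every queue is empty.
int pick_nonempty( void ) {
	unsigned long long snap = atomic_load( &used );
	if ( snap == 0 ) return -1;
	unsigned r = rand() % NQUEUES;               // random starting position
	unsigned long long rot = (snap >> r) | (snap << ((NQUEUES - r) % NQUEUES));
	return ( __builtin_ctzll( rot ) + r ) % NQUEUES;  // first used queue at or after r
}
\end{lstlisting}
The snapshot read is what makes the single-word limit matter: one atomic load answers whether anything is ready, which a multi-word mask cannot do.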
\begin{figure}
	…
\end{figure}

Figure~\ref{fig:emptytree} shows another approach using a hierarchical tree data-structure to reduce contention and has been shown to work in similar cases~\cite{ellen2007snzi}\footnote{This particular paper seems to be patented in the US.
How does that affect \CFA? Can I use it in my work?}.
However, this approach may lead to poorer performance in case~B (Table~\ref{tab:perfcases}) due to the inherent pointer chasing cost and already low contention cost in that case.

\begin{figure}
	…
\end{figure}

Finally, a third approach is to use dense information, similar to the bitmap, but have each thread keep its own independent copy of it.
While this approach can offer good scalability \emph{and} low latency, the liveliness of the information can become a problem.
In the simple cases, local copies of which underlying queues are empty can become stale and end-up not being useful for the pop operation.
A more serious problem is that reliable information is necessary for some parts of this algorithm to be correct.
As mentioned in this section, processors must know \emph{reliably} whether the list is empty or not to decide if they can return \texttt{NULL} or if they must keep looking during a pop operation.
Section~\ref{sec:sleep} discusses another case where reliable information is required for the algorithm to be correct.

\begin{figure}
	\begin{center}
		\input{emptytls}
	\end{center}
	\caption{``More empty'' queue with added per processor bitmask to indicate which array cells have items.}
	…
\end{figure}

There is a fundamental tradeoff among these approaches.
Dense global information about empty underlying queues helps zero-contention cases at the cost of high-contention cases.
Sparse global information helps high-contention cases but increases latency in zero-contention cases, to read and ``aggregate'' the information\footnote{Hierarchical structures, \eg binary search tree, effectively aggregate information but follow pointer chains, learning information at each node.
Similarly, other sparse schemes need to read multiple cachelines to acquire all the information needed.}.
Finally, dense local information has both the advantages of low latency in zero-contention cases and scalability in high-contention cases, however the information can become stale making it difficult to use to ensure correctness.
The fact that these solutions have these fundamental limits suggests to me a better solution that attempts to combine these properties in interesting ways.
Also, the lock discussed in Section~\ref{sec:resize} allows for solutions that adapt to the number of processors, which could also prove useful.

\paragraph{Objectives and Existing Work}

How much scalability is actually needed is highly debatable.
\emph{libfibre}\cite{libfibre} has compared favorably to other schedulers in webserver tests\cite{karstenuser} and uses a single atomic counter in its scheduling algorithm similarly to the proposed bitmask.
As such, the single atomic instruction on a shared cacheline may be sufficiently performant.

I have built a prototype of this ready queue in the shape of a data queue, \ie nodes on the queue are structures with a single int representing a thread and intrusive data fields.
Using this prototype I ran preliminary performance experiments that confirm the expected performance in Table~\ref{tab:perfcases}.
However, these experiments only offer a hint at the actual performance of the scheduler since threads perform more complex operations than simple integer nodes, \eg threads are not independent of each other, when a thread blocks some other thread must intervene to wake it.

I have also integrated this prototype into the \CFA runtime, but have not yet created performance experiments to compare results, as creating one-to-one comparisons between the prototype and the \CFA runtime will be complex.

\subsection{Dynamic Resizing} \label{sec:resize}

\begin{figure}
	…
\end{figure}
The \CFA runtime system groups processors together as \newterm{clusters}, as shown in Figure~\ref{fig:system}.
Threads on a cluster are always scheduled on one of the processors of the cluster.
Currently, the runtime handles dynamically adding and removing processors from clusters at any time.
Since this is part of the existing design, the proposed scheduler must also support this behaviour.
However, dynamically resizing a cluster is considered a rare event associated with setup, tear down and major configuration changes.
This assumption is made both in the design of the proposed scheduler as well as in the original design of the \CFA runtime system.
As such, the proposed scheduler must honour the correctness of this behaviour but does not have any performance objectives with regard to resizing a cluster.
How long adding or removing processors takes and how much this disrupts the performance of other threads is considered a secondary concern since it should be amortized over long periods of time.
However, as mentioned in Section~\ref{sec:queue}, contention on the underlying queues can have a direct impact on performance.
The number of underlying queues must therefore be adjusted as the number of processors grows or shrinks.
Since the underlying queues are stored in a dense array, changing the number of queues requires resizing the array and expanding the array requires moving it, which can introduce memory reclamation problems if not done correctly.

\begin{figure}
	\begin{center}
		\input{resize}
	\end{center}
	\caption{Copy of data structure shown in Figure~\ref{fig:base}.}
	\label{fig:base2}
\end{figure}
It is important to note how the array is used in this case.
While the array cells are modified by every push and pop operation, the array itself, \ie the pointer that would change when resized, is only read during these operations.
Therefore the use of this pointer can be described as frequent reads and infrequent writes.
This description effectively matches the description of a reader-writer lock, infrequent but invasive updates among frequent read operations.
In the case of the ready queue described above, read operations are operations that push or pop from the ready queue but do not invalidate any references to the ready queue data structures.
Writes on the other hand would add or remove inner queues, invalidating references to the array of inner queues in the process.
Therefore, the current proposed approach to this problem is to add a per-cluster reader-writer lock around the ready queue to prevent restructuring of the ready-queue data-structure while threads are being pushed or popped.

There are possible alternatives to the reader-writer lock solution.
This problem is effectively a memory reclamation problem and as such there is a large body of research on the subject\cite{michael2004hazard, brown2015reclaiming}.
However, the reader-writer lock solution is simple and can be leveraged to solve other problems (\eg processor ordering and memory reclamation of threads), which makes it an attractive solution.
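A sketch of this usage pattern, with a POSIX reader-writer lock standing in for the actual lock (the structure layout and function names are assumptions for illustration):
\begin{lstlisting}[language=C]
// Push/pop are "readers" of the queue array; resizing the array is the "writer".
#include <pthread.h>
#include <stdlib.h>

struct thread;
struct subqueue { void * head, * tail; /* lock, timestamp, ... */ };

struct cluster {
	pthread_rwlock_t  lock;          // assumed initialised with pthread_rwlock_init
	struct subqueue * queues;        // dense array of underlying FIFO queues
	unsigned          nqueues;
};

void ready_push( struct cluster * cl, struct thread * thrd ) {
	pthread_rwlock_rdlock( &cl->lock );          // many concurrent pushers/poppers
	// ... pick cl->queues[ rand() % cl->nqueues ] and push as sketched earlier ...
	pthread_rwlock_unlock( &cl->lock );
}

// Called when a processor is added to or removed from the cluster.
void cluster_resize( struct cluster * cl, unsigned nqueues ) {
	pthread_rwlock_wrlock( &cl->lock );          // excludes every push and pop
	// Glossed over: migrating the elements of removed or moved sub-queues.
	cl->queues  = realloc( cl->queues, nqueues * sizeof(struct subqueue) );
	cl->nqueues = nqueues;                       // no reader can hold a stale pointer here
	pthread_rwlock_unlock( &cl->lock );
}
\end{lstlisting}
The interesting part is making the read side of this lock cheap enough that it does not become a new bottleneck, which is the objective discussed next.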
\paragraph{Objectives and Existing Work}
The lock must offer scalability and performance on par with the actual ready-queue in order not to introduce a new bottleneck.
I have already built a lock that fits the desired requirements and preliminary testing shows scalability and performance that exceed the target.
As such, I do not consider this lock to be a risk for this project.

\subsection{Idle Sleep} \label{sec:sleep}
251 252 Finally, another important aspect of Idle Sleep is when should processors make the decision to sleep and when is it appropriate for sleeping processors to be woken up. Processors that are unnecessarily unblocked lead to unnecessary contention and power consumption, while too many sleeping processors can lead to sub-optimal throughput. Furthermore, transitions from sleeping to awake and vice-versa also add unnecessary latency. There is already a wealth of research on the subject and I may use an existing approach for the Idle Sleep heuristic in this project. 386 387 \newterm{Idle sleep} is the process of putting processors to sleep when they have no threads to execute. 388 In this context, processors are kernel threads and sleeping refers to asking the kernel to block a thread. 389 This operation can be achieved with either thread synchronization operations like $pthread_cond_wait$ or using signal operations like $sigsuspend$. 390 The goal of putting idle processors to sleep is: 391 \begin{enumerate} 392 \item 393 reduce contention on the ready queue, since the otherwise idle processors generally contend trying to pop items from the queue, 394 \item 395 give back unneeded CPU time associated with a process to other user processes executing on the computer, 396 \item 397 and reduce energy consumption in cases where more idle kernel-threads translate to idle CPUs, which can cycle down. 398 \end{enumerate} 399 Support for idle sleep broadly involves calling the operating system to block the kernel thread, handling the race between a blocking thread and the waking thread, and deciding which kernel thread should sleep or wake up. 400 401 When a processor decides to sleep, there is a race that occurs between it signalling that it is going to sleep (so other processors can find sleeping processors) and actually blocking the kernel thread. 402 This operation is equivalent to the classic problem of missing signals when using condition variables: the ``sleepy'' processor indicates its intention to block but has not yet gone to sleep when another processor attempts to wake it up. 403 The waking-up operation sees the blocked process and signals it, but the blocking process is racing to sleep so the signal is missed. 404 In cases where kernel threads are managed as processors on the current cluster, losing signals is not necessarily critical, because at least some processors on the cluster are awake and may check for more processors eventually. 405 Individual processors always finish scheduling user threads before looking for new work, which means that the last processor to go to sleep cannot miss threads scheduled from inside the cluster (if it does, that demonstrates the ready queue is not linearizable). 406 However, this guarantee does not hold if threads are scheduled from outside the cluster, either due to an external event like timers and I/O, or due to a user (or kernel) thread migrating from a different cluster. 407 In this case, missed signals can lead to the cluster deadlocking\footnote{Clusters should only deadlock in cases where a \CFA programmer \emph{actually} writes \CFA code that leads to a deadlock.}. 408 Therefore, it is important that the scheduling of threads includes a mechanism where signals \emph{cannot} be missed. 409 For performance reasons, it can be advantageous to have a secondary mechanism that allows signals to be missed in cases where it cannot lead to a deadlock.
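To make the missed-signal race concrete, the following minimal sketch (plain C and pthreads, not the actual \CFA runtime code) shows one way to close it with the handshake described next: the processor publishes its intent to sleep and re-checks the ready queue before blocking, while the waking side checks the published intent after scheduling a thread. The names are illustrative and $ready_queue_nonempty$ is a hypothetical stand-in for the real emptiness check.
\begin{lstlisting}[language=C]
// Sketch: sleep/wake handshake to avoid missed signals (hypothetical names).
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t idle_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  idle_cond = PTHREAD_COND_INITIALIZER;
static bool want_sleep = false;                      // published intent to block

static bool ready_queue_nonempty( void ) { return false; }  // stand-in for the real check

void processor_sleep( void ) {
	pthread_mutex_lock( &idle_lock );
	want_sleep = true;                               // 1. publish intent to sleep
	if ( ! ready_queue_nonempty() ) {                // 2. re-check after publishing
		pthread_cond_wait( &idle_cond, &idle_lock ); // 3. block; the lock is released atomically
	}
	want_sleep = false;
	pthread_mutex_unlock( &idle_lock );
}

void wake_one_processor( void ) {                    // called after pushing a ready thread
	pthread_mutex_lock( &idle_lock );                // cannot run between the re-check and the wait
	if ( want_sleep ) pthread_cond_signal( &idle_cond );
	pthread_mutex_unlock( &idle_lock );
}
\end{lstlisting}
Because the waker must acquire $idle_lock$, which the sleeper holds until $pthread_cond_wait$ releases it, the signal cannot fall between the re-check and the block; the cost is that every wake-up takes the lock, which is why a cheaper, signal-lossy fast path can still be worthwhile in cases where it cannot deadlock.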
410 To be safe, this process must include a ``handshake'' where it is guaranteed that either the sleeping processor notices that a user thread is scheduled after it signalled its intent to block, or the code scheduling the thread sees the intent to sleep before scheduling and is able to wake up the processor. 411 This matter is complicated by the fact that pthreads and Linux offer few tools to implement this solution and, for most of these tools, no guarantee on the order in which threads wake up. 412 413 Another important issue is avoiding kernel threads sleeping and waking frequently because there is a significant operating-system cost. 414 This scenario happens when a program oscillates between high and low activity, needing more and then fewer processors. 415 A possible partial solution is to order the processors so that the one which most recently went to sleep is woken up. 416 This allows other sleeping processors to reach a deeper sleep state (when these are available) while keeping ``hot'' processors warmer. 417 Note that while this generally means organizing the processors in a stack, I believe that the unique index provided in my reader-writer lock can be reused to strictly order the waking processors, causing a mostly LIFO order. 418 While a strict LIFO stack is probably better, the processor index could prove useful for other reasons, while still offering a sufficiently LIFO ordering. 419 420 A final important aspect of idle sleep is when processors should make the decision to sleep and when it is appropriate for sleeping processors to be woken up. 421 Processors that are unnecessarily unblocked lead to unnecessary contention, CPU usage, and power consumption, while too many sleeping processors can lead to sub-optimal throughput. 422 Furthermore, transitions from sleeping to awake and vice-versa also add unnecessary latency. 423 There is already a wealth of research on the subject\cite{schillings1996engineering, wiki:thunderherd} and I may use an existing approach for the idle-sleep heuristic in this project, \eg\cite{karstenuser}. 253 424 254 425 \subsection{Asynchronous I/O} 255 The final aspect of this proposal is asynchronous I/O. Without it, user threads that execute I/O operations block the underlying kernel thread, which leads to poor throughput. It would be preferrable to block the user-thread and reuse the underlying kernel-thread to run other ready threads. This approach requires intercepting the user-threads' calls to I/O operations, redirecting them to an asynchronous I/O interface and handling the multiplexing between the synchronous and asynchronous API. As such, these are the three components needed to implemented support for asynchronous I/O : an OS abstraction layer over the asynchronous interface, an event-engine to (de)multiplex the operations and a synchronous interface for users to use. None of these components currently exist in \CFA and I will need to build all three for this project. 426 427 The final aspect of this proposal is asynchronous I/O. 428 Without it, user threads that execute I/O operations block the underlying kernel thread, which leads to poor throughput. 429 It is preferable to block the user thread performing the I/O and reuse the underlying kernel-thread to run other ready user threads. 430 This approach requires intercepting user-thread calls to I/O operations, redirecting them to an asynchronous I/O interface, and handling the multiplexing/demultiplexing between the synchronous and asynchronous API.
431 As such, there are three components needed to implement support for asynchronous I/O: 432 \begin{enumerate} 433 \item 434 an OS abstraction layer over the asynchronous interface, 435 \item 436 an event-engine to (de)multiplex the operations, 437 \item 438 and a synchronous interface for users to use. 439 \end{enumerate} 440 None of these components currently exist in \CFA and I will need to build all three for this project. 256 441 257 442 \paragraph{OS Abstraction} 258 One fundamental part for converting blocking I/O operations into non-blocking ones is having an underlying asynchronous I/O interface to direct the I/O operations. While there exists many different APIs for asynchronous I/O, it is not part of this proposal to create a novel API. I plan to use an existing one that is sufficient. uC++ uses the \texttt{select} as its interface, which handles ttys, pipes and sockets, but not disk. It entails significant complexity and is being replaced, which make it a less interesting alternative. Another popular interface is \texttt{epoll}\cit, which is supposed to be cheaper than \texttt{select}. However, epoll also does not handle the file system and seems to have problem to linux pipes and \texttt{TTY}s\cit. A very recent alternative that must still be investigated is \texttt{io\_uring}. It claims to address some of the issues with \texttt{epoll} but is too recent to be confident that it does. Finally, a popular cross-platform alternative is \texttt{libuv}, which offers asynchronous sockets and asynchronous file system operations (among other features). However, as a full-featured library it includes much more than what is needed and could conflict with other features of \CFA unless significant effort is made to merge them together. 259 260 \paragraph{Event-Engine} 261 Laying on top of the asynchronous interface layer is the event-engine. This engine is responsible for multiplexing (batching) the synchronous I/O requests into an asynchronous I/O request and demultiplexing the results onto appropriate blocked threads. This step can be straightforward for the simple cases, but can become quite complex. Decisions that need to be made include : whether to poll from a seperate kernel thread or a regularly scheduled user thread, what should be the ordering used when results satisfy many requests, how to handle threads waiting for multiple operations, etc. 443 One fundamental part for converting blocking I/O operations into non-blocking ones is having an underlying asynchronous I/O interface to direct the I/O operations. 444 While there exist many different APIs for asynchronous I/O, it is not part of this proposal to create a novel API. 445 It is sufficient to make one work in the complex context of the \CFA runtime. 446 \uC uses $select$\cite{select} as its interface, which handles ttys, pipes and sockets, but not disk. 447 $select$ entails significant complexity and is being replaced in UNIX operating-systems, which makes it a less interesting alternative. 448 Another popular interface is $epoll$\cite{epoll}, which is supposed to be cheaper than $select$. 449 However, $epoll$ also does not handle the file system and seems to have problems with Linux pipes and $TTY$s\cit. 450 A very recent alternative that I am investigating is $io_uring$\cite{io_uring}. 451 It claims to address some of the issues with $epoll$ but is too recent to be confident that it does.
452 Finally, a popular cross-platform alternative is $libuv$\cite{libuv}, which offers asynchronous sockets and asynchronous file system operations (among other features). 453 However, as a full-featured library it includes much more than I need and could conflict with other features of \CFA unless significant effort is made to merge them together. 454 455 \paragraph{Event Engine} 456 Laying on top of the asynchronous interface layer is the event engine. 457 This engine is responsible for multiplexing (batching) the synchronous I/O requests into asynchronous I/O requests and demultiplexing the results to appropriate blocked user threads. 458 This step can be straightforward for simple cases, but becomes quite complex when there are thousands of user threads performing both reads and writes, possibly on overlapping file descriptors. 459 Decisions that need to be made include: 460 \begin{enumerate} 461 \item 462 whether to poll from a separate kernel thread or a regularly scheduled user thread, 463 \item 464 what should be the ordering used when results satisfy many requests, 465 \item 466 how to handle threads waiting for multiple operations, etc. 467 \end{enumerate} 262 468 263 469 \paragraph{Interface} 264 Finally, for these components to be available, it is necessary to expose them through a synchronous interface. The interface can be novel but it is preferrable to match the existing POSIX interface in order to be compatible with existing code. Matching allows C programs written using this interface to be transparently converted to \CFA with minimal effeort. Where this is not applicable, a novel interface will be created to fill the gaps. 470 Finally, for these non-blocking I/O components to be available, it is necessary to expose them through a synchronous interface because that is the \CFA concurrent programming style. 471 The interface can be novel but it is preferable to match the existing POSIX interface when possible to be compatible with existing code. 472 Matching allows C programs written using this interface to be transparently converted to \CFA with minimal effort. 473 Where new functionality is needed, I will create a novel interface to fill gaps and provide advanced features. 265 474 266 475 -
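As a rough illustration of what the OS-abstraction layer and the synchronous interface could look like together, the following sketch wraps a $read$-style call over $io_uring$ using the $liburing$ helper library (assuming a recent kernel). For brevity the sketch waits for the completion on the calling kernel thread; in the real runtime that wait is exactly where the event engine would park the user thread and reuse the kernel thread. The names $engine_init$ and $async_read$ are illustrative, not an existing API.
\begin{lstlisting}[language=C]
// Sketch: a blocking-style read built on io_uring (via liburing).
#include <liburing.h>
#include <sys/types.h>
#include <stddef.h>

static struct io_uring ring;                      // one ring per event engine

int engine_init( void ) {
	return io_uring_queue_init( 64, &ring, 0 );   // 64 submission-queue entries
}

ssize_t async_read( int fd, void * buf, size_t len, off_t off ) {
	struct io_uring_sqe * sqe = io_uring_get_sqe( &ring );
	if ( sqe == NULL ) return -1;                 // submission queue full
	io_uring_prep_read( sqe, fd, buf, len, off ); // describe the operation
	io_uring_submit( &ring );                     // hand it to the kernel

	struct io_uring_cqe * cqe;
	io_uring_wait_cqe( &ring, &cqe );             // the user thread would be parked here
	ssize_t res = cqe->res;                       // bytes read or -errno
	io_uring_cqe_seen( &ring, cqe );              // release the completion slot
	return res;
}
\end{lstlisting}
Keeping the POSIX-like signature means existing C code could call such a wrapper (or an eventual $read$ shim) essentially unchanged, which is the compatibility goal stated above.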
doc/theses/thierry_delisle_PhD/comp_II/comp_II_PAB.tex
r34d0a28 r5569a31 59 59 \section{Introduction} 60 60 \subsection{\CFA and the \CFA concurrency package} 61 \CFA\cite{Moss18} is a modern, polymorphic, non-object-oriented, concurrent, backwards-compatible extension of the C programming language. It aims to add high-productivity features while maintaining the predictable performance of C. As such, concurrency in \CFA\cit aims to offer simple and safe high-level tools while still allowing performant code. \CFA concurrent code is written in the synchronous programming paradigm but uses \glspl{uthrd} in order to achieve the simplicity and maintainability of synchronous programming without sacrificing the efficiency of asynchronous programing. As such, the \CFA \newterm{scheduler} is a preemptive user-level scheduler that maps \glspl{uthrd} onto \glspl{kthrd}. 62 63 \newterm{Scheduling} occurs when execution switches from one thread to another, where the second thread is implicitly chosen by the scheduler. This scheduling is an indirect handoff, as opposed to generators and coroutines which explicitly switch to the next generator and coroutine respectively. The cost of switching between two threads for an indirect handoff has two components: 61 \CFA\cite{Moss18} is a modern, polymorphic, non-object-oriented, concurrent, backwards-compatible extension of the C programming language. 62 It aims to add high-productivity features while maintaining the predictable performance of C. 63 As such, concurrency in \CFA\cit aims to offer simple and safe high-level tools while still allowing performant code. 64 \CFA concurrent code is written in the synchronous programming paradigm but uses \glspl{uthrd} in order to achieve the simplicity and maintainability of synchronous programming without sacrificing the efficiency of asynchronous programing. 65 As such, the \CFA \newterm{scheduler} is a preemptive user-level scheduler that maps \glspl{uthrd} onto \glspl{kthrd}. 66 67 \newterm{Scheduling} occurs when execution switches from one thread to another, where the second thread is implicitly chosen by the scheduler. 68 This scheduling is an indirect handoff, as opposed to generators and coroutines which explicitly switch to the next generator and coroutine respectively. 69 The cost of switching between two threads for an indirect handoff has two components: 64 70 \begin{enumerate} 65 71 \item … … 68 74 and the cost of scheduling, \ie deciding which thread to run next among all the threads ready to run. 69 75 \end{enumerate} 70 The first cost is generally constant and fixed\footnote{Affecting the constant context-switch cost is whether it is done in one step, after the scheduling, or in two steps, context-switching to a fixed third-thread before scheduling.}, while the scheduling cost can vary based on the system state. Adding multiple \glspl{kthrd} does not fundamentally change the scheduler semantics or requirements, it simply adds new correctness requirements, \ie \newterm{linearizability} meaning however fast the CPU threads run, there is an equivalent sequential order that gives the same result, and a new dimension to performance: scalability, where scheduling cost now also depends on contention. 71 72 The more threads switch, the more the administration cost of scheduling becomes noticeable. It is therefore important to build a scheduler with the lowest possible cost and latency. Another important consideration is \newterm{fairness}. 
In principle, scheduling should give the illusion of perfect fairness, where all threads ready to run are running \emph{simultaneously}. While the illusion of simultaneity is easier to reason about, it can break down if the scheduler allows too much unfairness. Therefore, the scheduler should offer as much fairness as needed to guarantee eventual progress, but use unfairness to help performance. In practice, threads must wait in turn but there can be advantages to unfair scheduling, similar to the the express cash-register at a grocery store. 73 74 The goal of this research is to produce a scheduler that is simple for programmers to understand and offers good performance. Here understandability does not refer to the API but to how much scheduling concerns programmers need to take into account when writing a \CFA concurrent package. Therefore, the main goal of this proposal is : 76 The first cost is generally constant and fixed\footnote{Affecting the constant context-switch cost is whether it is done in one step, after the scheduling, or in two steps, context-switching to a fixed third-thread before scheduling.}, while the scheduling cost can vary based on the system state. 77 Adding multiple \glspl{kthrd} does not fundamentally change the scheduler semantics or requirements, it simply adds new correctness requirements, \ie \newterm{linearizability}\footnote{Meaning however fast the CPU threads run, there is an equivalent sequential order that gives the same result.}, and a new dimension to performance: scalability, where scheduling cost now also depends on contention. 78 79 The more threads switch, the more the administration cost of scheduling becomes noticeable. 80 It is therefore important to build a scheduler with the lowest possible cost and latency. 81 Another important consideration is \newterm{fairness}. 82 In principle, scheduling should give the illusion of perfect fairness, where all threads ready to run are running \emph{simultaneously}. 83 While the illusion of simultaneity is easier to reason about, it can break down if the scheduler allows too much unfairness. 84 Therefore, the scheduler should offer as much fairness as needed to guarantee eventual progress, but use unfairness to help performance. 85 In practice, threads must wait in turn but there can be advantages to unfair scheduling, similar to the the express cash-register at a grocery store. 86 87 The goal of this research is to produce a scheduler that is simple for programmers to understand and offers good performance. 88 Here understandability does not refer to the API but to how much scheduling concerns programmers need to take into account when writing a \CFA concurrent package. 89 Therefore, the main goal of this proposal is : 75 90 \begin{quote} 76 91 The \CFA scheduler should be \emph{viable} for \emph{any} workload. 77 92 \end{quote} 78 93 79 For a general purpose scheduler, it is impossible to produce an optimal algorithm as it would require knowledge of the future behaviour of threads. As such, scheduling performance is generally either defined by the best case scenario, \ie a workload to which the scheduler is tailored, or the worst case scenario, \ie the scheduler behaves no worst than \emph{X}. For this proposal, the performance is evaluated using the second approach to allow \CFA programmers to rely on scheduling performance. 
Because there is no optimal scheduler, ultimately \CFA may allow programmers to write their own scheduler; but that is not the subject of this proposal, which considers only the default scheduler. As such, it is important that only programmers with exceptionally high performance requirements should need to write their own scheduler and replace the scheduler in this proposal. 94 For a general purpose scheduler, it is impossible to produce an optimal algorithm as it would require knowledge of the future behaviour of threads. 95 As such, scheduling performance is generally either defined by the best case scenario, \ie a workload to which the scheduler is tailored, or the worst case scenario, \ie the scheduler behaves no worst than \emph{X}. 96 For this proposal, the performance is evaluated using the second approach to allow \CFA programmers to rely on scheduling performance. 97 Because there is no optimal scheduler, ultimately \CFA may allow programmers to write their own scheduler; but that is not the subject of this proposal, which considers only the default scheduler. 98 As such, it is important that only programmers with exceptionally high performance requirements should need to write their own scheduler and replace the scheduler in this proposal. 80 99 81 100 To achieve the \CFA scheduling goal includes: … … 97 116 To schedule user-level threads across all workloads, the scheduler has a number of requirements: 98 117 99 \paragraph{Correctness} As with any other concurrent data structure or algorithm, the correctness requirement is paramount. The scheduler cannot allow threads to be dropped from the ready queue, \ie scheduled but never run, or be executed multiple times when only being scheduled once. Since \CFA concurrency has no spurious wakeup, this definition of correctness also means the scheduler should have no spurious wakeup. The \CFA scheduler must be correct. 100 101 \paragraph{Performance} The performance of a scheduler can generally be measured in terms of scheduling cost, scalability and latency. \newterm{Scheduling cost} is the cost to switch from one thread to another, as mentioned above. For simple applications, where a single kernel thread does most of the scheduling, it is generally the dominating cost. \newterm{Scalability} is the cost of adding multiple kernel threads because it increases the time for context switching because of contention by multiple threads accessing shared resources, \eg the ready queue. Finally, \newterm{tail latency} is service delay and relates to thread fairness. Specifically, latency measures how long a thread waits to run once scheduled and is evaluated in the worst case. The \CFA scheduler should offer good performance for all three metrics. 102 103 \paragraph{Fairness} Like performance, this requirement has several aspect : eventual progress, predictability and performance reliability. \newterm{Eventual progress} guarantees every scheduled thread is eventually run, \ie prevent starvation. As a hard requirement, the \CFA scheduler must guarantee eventual progress, otherwise the above mentioned illusion of simultaneous execution is broken and the scheduler becomes much more complex to reason about. \newterm{Predictability} and \newterm{reliability} means similar workloads achieve similar performance and programmer execution intuition is respected. For example, a thread that yields aggressively should not run more often then other tasks. While this is intuitive, it does not hold true for many work-stealing or feedback based schedulers. 
The \CFA scheduler must guarantee eventual progress and should be predictable and offer reliable performance. 104 105 \paragraph{Efficiency} Finally, efficient usage of CPU resources is also an important requirement and is discussed in depth towards the end of the proposal. \newterm{Efficiency} means avoiding using CPU cycles when there are no threads to run, and conversely, use all CPUs available when the workload can benefit from it. Balancing these two states is where the complexity lies. The \CFA scheduler should be efficient with respect to the underlying (shared) computer. 118 \paragraph{Correctness} As with any other concurrent data structure or algorithm, the correctness requirement is paramount. 119 The scheduler cannot allow threads to be dropped from the ready queue, \ie scheduled but never run, or be executed multiple times when only being scheduled once. 120 Since \CFA concurrency has no spurious wakeup, this definition of correctness also means the scheduler should have no spurious wakeup. 121 The \CFA scheduler must be correct. 122 123 \paragraph{Performance} The performance of a scheduler can generally be measured in terms of scheduling cost, scalability and latency. 124 \newterm{Scheduling cost} is the cost to switch from one thread to another, as mentioned above. 125 For simple applications, where a single kernel thread does most of the scheduling, it is generally the dominating cost. 126 \newterm{Scalability} is the cost of adding multiple kernel threads because it increases the time for context switching due to contention from multiple threads accessing shared resources, \eg the ready queue. 127 Finally, \newterm{tail latency} is service delay and relates to thread fairness. 128 Specifically, latency measures how long a thread waits to run once scheduled and is evaluated in the worst case. 129 The \CFA scheduler should offer good performance for all three metrics. 130 131 \paragraph{Fairness} Like performance, this requirement has several aspects: eventual progress, predictability and performance reliability. 132 \newterm{Eventual progress} guarantees every scheduled thread is eventually run, \ie prevents starvation. 133 As a hard requirement, the \CFA scheduler must guarantee eventual progress, otherwise the above-mentioned illusion of simultaneous execution is broken and the scheduler becomes much more complex to reason about. 134 \newterm{Predictability} and \newterm{reliability} mean similar workloads achieve similar performance and programmer execution intuition is respected. 135 For example, a thread that yields aggressively should not run more often than other tasks. 136 While this is intuitive, it does not hold true for many work-stealing or feedback-based schedulers. 137 The \CFA scheduler must guarantee eventual progress and should be predictable and offer reliable performance. 138 139 \paragraph{Efficiency} Finally, efficient usage of CPU resources is also an important requirement and is discussed in depth towards the end of the proposal. 140 \newterm{Efficiency} means avoiding using CPU cycles when there are no threads to run, and conversely, using all available CPUs when the workload can benefit from it. 141 Balancing these two states is where the complexity lies. 142 The \CFA scheduler should be efficient with respect to the underlying (shared) computer. 106 143 107 144 \bigskip To achieve these requirements, I can reject two broad types of scheduling strategies : feedback-based and priority schedulers.
108 145 109 146 \subsection{Feedback-Based Schedulers} 110 Many operating systems use schedulers based on feedback in some form, \eg measuring how much CPU a particular thread has used\footnote{Different metrics can be measured but it is not relevant to the discussion.} and schedule threads based on this metric. These strategies are sensible for operating systems but rely on two assumptions for the workload: 147 Many operating systems use schedulers based on feedback in some form, \eg measuring how much CPU a particular thread has used\footnote{Different metrics can be measured but it is not relevant to the discussion.} and schedule threads based on this metric. 148 These strategies are sensible for operating systems but rely on two assumptions for the workload: 111 149 112 150 \begin{enumerate} … … 115 153 \end{enumerate} 116 154 117 While these two assumptions generally hold for operating systems, they may not for user-level threading. Since \CFA has the explicit goal of allowing many smaller threads, this can naturally lead to threads with much shorter lifetimes that are only scheduled a few times. Scheduling strategies based on feedback cannot be effective in these cases because there is no opportunity to measure the metrics that underlie the algorithm. Note, the problem of \newterm{feedback convergence} (reacting too slowly to scheduling events) is not specific to short lived threads but can also occur with threads that show drastic changes in scheduling, \eg threads running for long periods of time and then suddenly blocking and unblocking quickly and repeatedly. 118 119 In the context of operating systems, these concerns can be overshadowed by a more pressing concern : security. When multiple users are involved, it is possible some users are malevolent and try to exploit the scheduling strategy to achieve some nefarious objective. Security concerns mean more precise and robust fairness metrics must be used to guarantee fairness across processes created by users as well as threads created within a process. In the case of the \CFA scheduler, every thread runs in the same user space and is controlled by the same user. Fairness across users is therefore a given and it is then possible to safely ignore the possibility that threads are malevolent. This approach allows for a much simpler fairness metric and in this proposal \emph{fairness} is defined as: when multiple threads are cycling through the system, the total ordering of threads being scheduled, \ie pushed onto the ready-queue, should not differ much from the total ordering of threads being executed, \ie popped from the ready-queue. 120 121 Since feedback is not necessarily feasible within the lifetime of all threads and a simple fairness metric can be used, the scheduling strategy proposed for the \CFA runtime does not use per-threads feedback. Feedback in general is not rejected for secondary concerns like idle sleep for kernel threads, but no feedback is used to decide which thread to run next. 155 While these two assumptions generally hold for operating systems, they may not for user-level threading. 156 Since \CFA has the explicit goal of allowing many smaller threads, this can naturally lead to threads with much shorter lifetimes that are only scheduled a few times. 157 Scheduling strategies based on feedback cannot be effective in these cases because there is no opportunity to measure the metrics that underlie the algorithm. 
158 Note, the problem of \newterm{feedback convergence} (reacting too slowly to scheduling events) is not specific to short lived threads but can also occur with threads that show drastic changes in scheduling, \eg threads running for long periods of time and then suddenly blocking and unblocking quickly and repeatedly. 159 160 In the context of operating systems, these concerns can be overshadowed by a more pressing concern : security. 161 When multiple users are involved, it is possible some users are malevolent and try to exploit the scheduling strategy to achieve some nefarious objective. 162 Security concerns mean more precise and robust fairness metrics must be used to guarantee fairness across processes created by users as well as threads created within a process. 163 In the case of the \CFA scheduler, every thread runs in the same user space and is controlled by the same user. 164 Fairness across users is therefore a given and it is then possible to safely ignore the possibility that threads are malevolent. 165 This approach allows for a much simpler fairness metric and in this proposal \emph{fairness} is defined as: when multiple threads are cycling through the system, the total ordering of threads being scheduled, \ie pushed onto the ready-queue, should not differ much from the total ordering of threads being executed, \ie popped from the ready-queue. 166 167 Since feedback is not necessarily feasible within the lifetime of all threads and a simple fairness metric can be used, the scheduling strategy proposed for the \CFA runtime does not use per-threads feedback. 168 Feedback in general is not rejected for secondary concerns like idle sleep for kernel threads, but no feedback is used to decide which thread to run next. 122 169 123 170 \subsection{Priority Schedulers} 124 Another broad category of schedulers are priority schedulers. In these scheduling strategies, threads have priorities and the runtime schedules the threads with the highest priority before scheduling other threads. Threads with equal priority are scheduled using a secondary strategy, often something simple like round-robin or FIFO. A consequence of priority is that, as long as there is a thread with a higher priority that desires to run, a thread with a lower priority does not run. This possible starving of threads can dramatically increase programming complexity since starving threads and priority inversion (prioritizing a lower priority thread) can both lead to serious problems. 125 126 An important observation is that threads do not need to have explicit priorities for problems to occur. Indeed, any system with multiple ready-queues that attempts to exhaust one queue before accessing the other queues, essentially provide implicit priority, which can encounter starvation problems. For example, a popular scheduling strategy that suffers from implicit priorities is work stealing. \newterm{Work stealing} is generally presented as follows: 171 Another broad category of schedulers are priority schedulers. 172 In these scheduling strategies, threads have priorities and the runtime schedules the threads with the highest priority before scheduling other threads. 173 Threads with equal priority are scheduled using a secondary strategy, often something simple like round-robin or FIFO. 174 A consequence of priority is that, as long as there is a thread with a higher priority that desires to run, a thread with a lower priority does not run. 
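The starvation inherent in strict priorities is easy to see in code. The following sketch uses hypothetical names, with $pop_level$ standing in for a strict FIFO pop within one priority level; it never reaches a lower level while any higher level has work.
\begin{lstlisting}[language=C]
// Sketch: strict-priority pop over per-level FIFO queues.
#include <stddef.h>
#define LEVELS 8

struct thread;                                       // opaque user thread

// Stand-in for a strict FIFO pop within one priority level.
static struct thread * pop_level( int level ) { (void)level; return NULL; }

struct thread * pop_highest_priority( void ) {
	for ( int level = 0; level < LEVELS; level += 1 ) {  // 0 = highest priority
		struct thread * t = pop_level( level );
		if ( t != NULL ) return t;               // lower levels reached only when all higher levels are empty
	}
	return NULL;                                     // every level is empty
}
\end{lstlisting}
The same structure appears implicitly in any scheduler that exhausts one queue before trying another, which is why the work-stealing strategy described next shares the problem.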
175 This possible starving of threads can dramatically increase programming complexity since starving threads and priority inversion (prioritizing a lower priority thread) can both lead to serious problems. 176 177 An important observation is that threads do not need to have explicit priorities for problems to occur. 178 Indeed, any system with multiple ready-queues that attempts to exhaust one queue before accessing the other queues essentially provides implicit priority, which can encounter starvation problems. 179 For example, a popular scheduling strategy that suffers from implicit priorities is work stealing. 180 \newterm{Work stealing} is generally presented as follows: 127 181 \begin{enumerate} 128 182 \item Each processor has a list of ready threads. 129 \item Each processor runs thread from its ready queue first.183 \item Each processor runs threads from its ready queue first. 130 184 \item If a processor's ready queue is empty, attempt to run threads from some other processor's ready queue. 131 185 \end{enumerate} … … 136 190 137 191 \subsection{Schedulers without feedback or priorities} 138 This proposal conjectures that is is possible to construct a default scheduler for the \CFA runtime that offers good scalability and a simple fairness guarantee that is easy for programmers to reason about. The simplest fairness guarantee is FIFO ordering, \ie threads scheduled first run first. However, enforcing FIFO ordering generally conflicts with scalability across multiple processors because of the additional synchronization. Thankfully, strict FIFO is not needed for sufficient fairness. Since concurrency is inherently non-deterministic, fairness concerns in scheduling are only a problem if a thread repeatedly runs before another thread can run. Some relaxation is possible because non-determinism means programmers already handle ordering problems to produce correct code and hence rely on weak guarantees, \eg that a specific thread will \emph{eventually} run. Since some reordering does not break correctness, the FIFO fairness guarantee can be significantly relaxed without causing problems. 192 This proposal conjectures that it is possible to construct a default scheduler for the \CFA runtime that offers good scalability and a simple fairness guarantee that is easy for programmers to reason about. 193 The simplest fairness guarantee is FIFO ordering, \ie threads scheduled first run first. 194 However, enforcing FIFO ordering generally conflicts with scalability across multiple processors because of the additional synchronization. 195 Thankfully, strict FIFO is not needed for sufficient fairness. 196 Since concurrency is inherently non-deterministic, fairness concerns in scheduling are only a problem if a thread repeatedly runs before another thread can run. 197 Some relaxation is possible because non-determinism means programmers already handle ordering problems to produce correct code and hence rely on weak guarantees, \eg that a specific thread will \emph{eventually} run. 198 Since some reordering does not break correctness, the FIFO fairness guarantee can be significantly relaxed without causing problems.
199 For this proposal, the target guarantee is that the \CFA scheduler provides \emph{probable} FIFO ordering, which allows reordering but makes it improbable that threads are reordered far from their position in total ordering. 139 200 140 201 The \CFA scheduler fairness is defined as follows: … … 149 210 150 211 \subsection{Central Ready Queue} \label{sec:queue} 151 A central ready queue can be built from a FIFO queue, where user threads are pushed onto the queue when they are ready to run, and processors (kernel-threads acting as virtual processors) pop the user threads from the queue and execute them. Alistarh \etal~\cite{alistarh2018relaxed} show it is straightforward to build a relaxed FIFO list that is fast and scalable for loaded or overloaded systems. The described queue uses an array of underlying strictly FIFO queues as shown in Figure~\ref{fig:base}\footnote{For this section, the number of underlying queues is assumed to be constant. Section~\ref{sec:resize} discusses resizing the array.}. Pushing new data is done by selecting one of these underlying queues at random, recording a timestamp for the operation and pushing to the selected queue. Popping is done by selecting two queues at random and popping from the queue with the oldest timestamp. A higher number of underlying queues leads to less contention on each queue and therefore better performance. In a loaded system, it is highly likely the queues are non-empty, \ie several tasks are on each of the underlying queues. This means that selecting a queue at random to pop from is highly likely to yield a queue with available items. In Figure~\ref{fig:base}, ignoring the ellipsis, the chances of getting an empty queue is 2/7 per pick, meaning two random picks yield an item approximately 9 times out of 10. 212 A central ready queue can be built from a FIFO queue, where user threads are pushed onto the queue when they are ready to run, and processors (kernel-threads acting as virtual processors) pop the user threads from the queue and execute them. 213 Alistarh \etal~\cite{alistarh2018relaxed} show it is straightforward to build a relaxed FIFO list that is fast and scalable for loaded or overloaded systems. 214 The described queue uses an array of underlying strictly FIFO queues as shown in Figure~\ref{fig:base}\footnote{For this section, the number of underlying queues is assumed to be constant. 215 Section~\ref{sec:resize} discusses resizing the array.}. 216 Pushing new data is done by selecting one of these underlying queues at random, recording a timestamp for the operation and pushing to the selected queue. 217 Popping is done by selecting two queues at random and popping from the queue with the oldest timestamp. 218 A higher number of underlying queues leads to less contention on each queue and therefore better performance. 219 In a loaded system, it is highly likely the queues are non-empty, \ie several tasks are on each of the underlying queues. 220 This means that selecting a queue at random to pop from is highly likely to yield a queue with available items. 221 In Figure~\ref{fig:base}, ignoring the ellipsis, the chances of getting an empty queue is 2/7 per pick, meaning two random picks yield an item approximately 9 times out of 10. 152 222 153 223 \begin{figure} … … 155 225 \input{base} 156 226 \end{center} 157 \caption{Relaxed FIFO list at the base of the scheduler: an array of strictly FIFO lists. 
The timestamp is in all nodes and cell arrays.} 227 \caption{Relaxed FIFO list at the base of the scheduler: an array of strictly FIFO lists. 228 The timestamp is in all nodes and cell arrays.} 158 229 \label{fig:base} 159 230 \end{figure} … … 167 238 \end{figure} 168 239 169 When the ready queue is \emph{more empty}, \ie several of the queues are empty, selecting a random queue for popping is less likely to yield a successful selection and more attempts are needed, resulting in a performance degradation. Figure~\ref{fig:empty} shows an example with fewer elements, where the chances of getting an empty queue is 5/7 per pick, meaning two random picks yield an item only half the time. Since the ready queue is not empty, the pop operation \emph{must} find an element before returning and therefore must retry. Note, the popping kernel thread has no work to do, but CPU cycles are wasted both for available user and kernel threads during the pop operation as the popping thread is using a CPU. Overall performance is therefore influenced by the contention on the underlying queues and pop performance is influenced by the item density. 170 171 This leads to four performance cases for the centralized ready-queue, as depicted in Table~\ref{tab:perfcases}. The number of processors (many or few) refers to the number of kernel threads \emph{actively} attempting to pop user threads from the queues, not the total number of kernel threads. The number of threads (many or few) refers to the number of user threads ready to be run. Many threads means they outnumber processors significantly and most underlying queues have items, few threads mean there are barely more threads than processors and most underlying queues are empty. Cases with fewer threads than processors are discussed in Section~\ref{sec:sleep}. 240 When the ready queue is \emph{more empty}, \ie several of the queues are empty, selecting a random queue for popping is less likely to yield a successful selection and more attempts are needed, resulting in a performance degradation. 241 Figure~\ref{fig:empty} shows an example with fewer elements, where the chances of getting an empty queue is 5/7 per pick, meaning two random picks yield an item only half the time. 242 Since the ready queue is not empty, the pop operation \emph{must} find an element before returning and therefore must retry. 243 Note, the popping kernel thread has no work to do, but CPU cycles are wasted both for available user and kernel threads during the pop operation as the popping thread is using a CPU. 244 Overall performance is therefore influenced by the contention on the underlying queues and pop performance is influenced by the item density. 245 246 This leads to four performance cases for the centralized ready-queue, as depicted in Table~\ref{tab:perfcases}. 247 The number of processors (many or few) refers to the number of kernel threads \emph{actively} attempting to pop user threads from the queues, not the total number of kernel threads. 248 The number of threads (many or few) refers to the number of user threads ready to be run. 249 Many threads means they outnumber processors significantly and most underlying queues have items, few threads mean there are barely more threads than processors and most underlying queues are empty. 250 Cases with fewer threads than processors are discussed in Section~\ref{sec:sleep}. 
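To make the push and pop operations concrete, the following sketch shows a simplified version of the central ready queue described above: an array of strictly FIFO queues, a push to one queue chosen at random, and a pop that peeks at two random queues and takes from the one whose oldest element has the smaller timestamp. It is illustrative only: the number of queues is fixed, each queue uses a plain mutex, the peeks are unlocked and therefore approximate, $rand$ stands in for a per-processor random-number generator, and names such as $ready_push$ and $ready_pop$ are not the runtime's API.
\begin{lstlisting}[language=C]
// Sketch: relaxed-FIFO ready queue built from an array of strict FIFO queues.
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <stddef.h>
#include <time.h>

#define NQUEUES 64                            // assumed fixed here; resizing is discussed later

struct node {                                 // intrusive: lives inside the thread descriptor
	struct node * next;
	unsigned long long ts;                    // timestamp recorded at push
};

struct fifo {
	pthread_mutex_t lock;
	struct node * head, * tail;
	_Atomic unsigned long long oldest;        // timestamp of head, 0 when empty (peeked without the lock)
};

static struct fifo queues[NQUEUES];

void ready_init( void ) {
	for ( int i = 0; i < NQUEUES; i += 1 ) pthread_mutex_init( &queues[i].lock, NULL );
}

static unsigned long long now( void ) {
	struct timespec tp;
	clock_gettime( CLOCK_MONOTONIC, &tp );
	return (unsigned long long)tp.tv_sec * 1000000000ULL + tp.tv_nsec;
}

void ready_push( struct node * n ) {
	struct fifo * q = &queues[ rand() % NQUEUES ];   // one random underlying queue
	n->next = NULL; n->ts = now();
	pthread_mutex_lock( &q->lock );
	if ( q->tail ) q->tail->next = n; else { q->head = n; q->oldest = n->ts; }
	q->tail = n;
	pthread_mutex_unlock( &q->lock );
}

struct node * ready_pop( void ) {                    // assumes the ready queue is non-empty overall
	for ( ;; ) {                                     // retry until an element is found
		struct fifo * a = &queues[ rand() % NQUEUES ];
		struct fifo * b = &queues[ rand() % NQUEUES ];
		unsigned long long ta = a->oldest, tb = b->oldest;  // approximate, lock-free peek
		struct fifo * q = ( ta != 0 && ( tb == 0 || ta <= tb ) ) ? a : b;
		pthread_mutex_lock( &q->lock );
		struct node * n = q->head;                   // may be NULL if the queue emptied since the peek
		if ( n ) {
			q->head = n->next;
			if ( q->head ) q->oldest = q->head->ts; else { q->tail = NULL; q->oldest = 0; }
		}
		pthread_mutex_unlock( &q->lock );
		if ( n ) return n;
	}
}
\end{lstlisting}
The retry loop in $ready_pop$ is exactly where the mostly-empty cases hurt: with few threads, most peeks land on empty queues and the operation retries, which motivates the emptiness information discussed after the table.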
172 251 173 252 \begin{table} … … 179 258 Many Threads & A: good performance & B: good performance \\ 180 259 \hline 181 Few Threads & C: poorperformance & D: poor performance \\260 Few Threads & C: worst performance & D: poor performance \\ 182 261 \hline 183 262 \end{tabular} … … 187 266 \end{table} 188 267 189 Performance can be improved in case~D (Table~\ref{tab:perfcases}) by adding information to help processors find which inner queues are used. This addition aims to avoid the cost of retrying the pop operation but does not affect contention on the underlying queues and can incur some management cost for both push and pop operations. The approach used to encode this information can vary in density and be either global or local. \newterm{Density} means the information is either packed in a few cachelines or spread across several cachelines, and \newterm{local information} means each thread uses an independent copy instead of a single global, \ie common, source of information. 190 191 For example, Figure~\ref{fig:emptybit} shows a dense bitmask to identify which inner queues are currently in use. This approach means processors can often find user threads in constant time, regardless of how many underlying queues are empty. Furthermore, modern x86 CPUs have extended bit manipulation instructions (BMI2) that allow using the bitmask with very little overhead compared to the randomized selection approach for a filled ready queue, offering good performance even in cases with many empty inner queues. However, this technique has its limits: with a single word\footnote{Word refers here to however many bits can be written atomically.} bitmask, the total number of underlying queues in the ready queue is limited to the number of bits in the word. With a multi-word bitmask, this maximum limit can be increased arbitrarily, but it is not possible to check if the queue is empty by reading the bitmask atomically. 192 193 Finally, a dense bitmap, either single or multi-word, causes additional problems in case C (Table 1), because many processors are continuously scanning the bitmask to find the few available threads. This increased contention on the bitmask(s) reduces performance because of cache misses after updates and the bitmask is updated more frequently by the scanning processors racing to read and/or update that information. This increased update frequency means the information in the bitmask is more often stale before a processor can use it to find an item, \ie mask read says available user threads but none on queue. 268 Performance can be improved in case~D (Table~\ref{tab:perfcases}) by adding information to help processors find which inner queues are used. 269 This addition aims to avoid the cost of retrying the pop operation but does not affect contention on the underlying queues and can incur some management cost for both push and pop operations. 270 The approach used to encode this information can vary in density and be either global or local. 271 \newterm{Density} means the information is either packed in a few cachelines or spread across several cachelines, and \newterm{local information} means each thread uses an independent copy instead of a single global, \ie common, source of information. 272 273 For example, Figure~\ref{fig:emptybit} shows a dense bitmask to identify which inner queues are currently in use. 274 This approach means processors can often find user threads in constant time, regardless of how many underlying queues are empty. 
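As a sketch of that constant-time lookup, the snippet below keeps one bit per underlying queue in a single machine word and scans it from a random starting point, so processors do not all converge on the same queue; a GCC/Clang builtin stands in for the hardware bit-scan support mentioned next, and maintaining the mask on push and pop is omitted. The mask is only a hint: by the time the chosen queue is locked it may already be empty, so the pop operation must still tolerate a miss.
\begin{lstlisting}[language=C]
// Sketch: one-word bitmask of (probably) non-empty queues, searched from a random offset.
#include <stdint.h>
#include <stdlib.h>

#define NQUEUES 64                                // limited to the word size in this scheme

static _Atomic uint64_t nonempty;                 // bit i set when queue i is believed non-empty

// Returns the index of a queue believed to be non-empty, or -1 if the mask is empty.
int find_nonempty( void ) {
	uint64_t mask = nonempty;                     // a single atomic word read
	if ( mask == 0 ) return -1;
	unsigned r = rand() % NQUEUES;                // random rotation spreads the contention
	uint64_t rot = ( mask >> r ) | ( mask << ( ( NQUEUES - r ) % NQUEUES ) );
	int bit = __builtin_ctzll( rot );             // first set bit of the rotated mask
	return (int)( ( bit + r ) % NQUEUES );
}
\end{lstlisting}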
275 Furthermore, modern x86 CPUs have extended bit manipulation instructions (BMI2) that allow using the bitmask with very little overhead compared to the randomized selection approach for a filled ready queue, offering good performance even in cases with many empty inner queues. 276 However, this technique has its limits: with a single word\footnote{Word refers here to however many bits can be written atomically.} bitmask, the total number of underlying queues in the ready queue is limited to the number of bits in the word. 277 With a multi-word bitmask, this maximum limit can be increased arbitrarily, but it is not possible to check if the queue is empty by reading the bitmask atomically. 278 279 Finally, a dense bitmap, either single or multi-word, causes additional problems in case C (Table 1), because many processors are continuously scanning the bitmask to find the few available threads. 280 This increased contention on the bitmask(s) reduces performance because of cache misses after updates and the bitmask is updated more frequently by the scanning processors racing to read and/or update that information. 281 This increased update frequency means the information in the bitmask is more often stale before a processor can use it to find an item, \ie mask read says there are available user threads but none on queue. 194 282 195 283 \begin{figure} … … 201 289 \end{figure} 202 290 203 Figure~\ref{fig:emptytree} shows another approach using a hierarchical tree data-structure to reduce contention and has been shown to work in similar cases~\cite{ellen2007snzi}\footnote{This particular paper seems to be patented in the US. How does that affect \CFA? Can I use it in my work?}. However, this approach may lead to poorer performance in case~B (Table~\ref{tab:perfcases}) due to the inherent pointer chasing cost and already low contention cost in that case. 291 Figure~\ref{fig:emptytree} shows another approach using a hierarchical tree data-structure to reduce contention and has been shown to work in similar cases~\cite{ellen2007snzi}\footnote{This particular paper seems to be patented in the US. 292 How does that affect \CFA? Can I use it in my work?}. 293 However, this approach may lead to poorer performance in case~B (Table~\ref{tab:perfcases}) due to the inherent pointer chasing cost and already low contention cost in that case. 204 294 205 295 \begin{figure} … … 211 301 \end{figure} 212 302 213 Finally, a third approach is to use dense information, similar to the bitmap, but have each thread keep its own independent copy of it. While this approach can offer good scalability \emph{and} low latency, the liveliness of the information can become a problem. In the simple cases, local copies of which underlying queues are empty can become stale and end-up not being useful for the pop operation. A more serious problem is that reliable information is necessary for some parts of this algorithm to be correct. As mentioned in this section, processors must know \emph{reliably} whether the list is empty or not to decide if they can return \texttt{NULL} or if they must keep looking during a pop operation. Section~\ref{sec:sleep} discusses another case where reliable information is required for the algorithm to be correct. 303 Finally, a third approach is to use dense information, similar to the bitmap, but have each thread keep its own independent copy of it. 304 While this approach can offer good scalability \emph{and} low latency, the liveliness of the information can become a problem. 
305 In the simple cases, local copies of which underlying queues are empty can become stale and end-up not being useful for the pop operation. 306 A more serious problem is that reliable information is necessary for some parts of this algorithm to be correct. 307 As mentioned in this section, processors must know \emph{reliably} whether the list is empty or not to decide if they can return \texttt{NULL} or if they must keep looking during a pop operation. 308 Section~\ref{sec:sleep} discusses another case where reliable information is required for the algorithm to be correct. 214 309 215 310 \begin{figure} … … 221 316 \end{figure} 222 317 223 There is a fundamental tradeoff among these approach. Dense global information about empty underlying queues helps zero-contention cases at the cost of high-contention case. Sparse global information helps high-contention cases but increases latency in zero-contention-cases, to read and ``aggregate'' the information\footnote{Hierarchical structures, \eg binary search tree, effectively aggregate information but follow pointer chains, learning information at each node. Similarly, other sparse schemes need to read multiple cachelines to acquire all the information needed.}. Finally, dense local information has both the advantages of low latency in zero-contention cases and scalability in high-contention cases, however the information can become stale making it difficult to use to ensure correctness. The fact that these solutions have these fundamental limits suggest to me a better solution that attempts to combine these properties in an interesting ways. Also, the lock discussed in Section~\ref{sec:resize} allows for solutions that adapt to the number of processors, which could also prove useful. 318 There is a fundamental tradeoff among these approach. 319 Dense global information about empty underlying queues helps zero-contention cases at the cost of high-contention case. 320 Sparse global information helps high-contention cases but increases latency in zero-contention-cases, to read and ``aggregate'' the information\footnote{Hierarchical structures, \eg binary search tree, effectively aggregate information but follow pointer chains, learning information at each node. 321 Similarly, other sparse schemes need to read multiple cachelines to acquire all the information needed.}. 322 Finally, dense local information has both the advantages of low latency in zero-contention cases and scalability in high-contention cases, however the information can become stale making it difficult to use to ensure correctness. 323 The fact that these solutions have these fundamental limits suggest to me a better solution that attempts to combine these properties in an interesting ways. 324 Also, the lock discussed in Section~\ref{sec:resize} allows for solutions that adapt to the number of processors, which could also prove useful. 224 325 225 326 \paragraph{Objectives and Existing Work} 226 327 227 How much scalability is actually needed is highly debatable. \emph{libfibre}\cit has compared favorably to other schedulers in webserver tests\cit and uses a single atomic counter in its scheduling algorithm similarly to the proposed bitmask. As such, the single atomic instruction on a shared cacheline may be sufficiently performant. 228 229 I have built a prototype of this ready queue in the shape of a data queue, \ie nodes on the queue are structures with a single int representing a thread and intrusive data fields. 
Using this prototype I ran preliminary performance experiments that confirm the expected performance in Table~\ref{tab:perfcases}. However, these experiments only offer a hint at the actual performance of the scheduler since threads form more complex operations than simple integer nodes, \eg threads are not independent of each other, when a thread blocks some other thread must intervene to wake it. 328 How much scalability is actually needed is highly debatable. 329 \emph{libfibre}\cit has compared favorably to other schedulers in webserver tests\cit and uses a single atomic counter in its scheduling algorithm similarly to the proposed bitmask. 330 As such, the single atomic instruction on a shared cacheline may be sufficiently performant. 331 332 I have built a prototype of this ready queue in the shape of a data queue, \ie nodes on the queue are structures with a single int representing a thread and intrusive data fields. 333 Using this prototype I ran preliminary performance experiments that confirm the expected performance in Table~\ref{tab:perfcases}. 334 However, these experiments only offer a hint at the actual performance of the scheduler since threads form more complex operations than simple integer nodes, \eg threads are not independent of each other, when a thread blocks some other thread must intervene to wake it. 230 335 231 336 I have also integrated this prototype into the \CFA runtime, but have not yet created performance experiments to compare results, as creating one-to-one comparisons between the prototype and the \CFA runtime will be complex. … … 241 346 \end{figure} 242 347 243 The \CFA runtime system groups processors together as \newterm{clusters}, as shown in Figure~\ref{fig:system}. Threads on a cluster are always scheduled on one of the processors of the cluster. Currently, the runtime handles dynamically adding and removing processors from clusters at any time. Since this is part of the existing design, the proposed scheduler must also support this behaviour. However, dynamically resizing a cluster is considered a rare event associated with setup, tear down and major configuration changes. This assumption is made both in the design of the proposed scheduler as well as in the original design of the \CFA runtime system. As such, the proposed scheduler must honour the correctness of this behaviour but does not have any performance objectives with regard to resizing a cluster. How long adding or removing processors take and how much this disrupts the performance of other threads is considered a secondary concern since it should be amortized over long period of times. However, as mentioned in Section~\ref{sec:queue}, contention on the underlying queues can have a direct impact on performance. The number of underlying queues must therefore be adjusted as the number of processors grows or shrinks. Since the underlying queues are stored in a dense array, changing the number of queues requires resizing the array and expanding the array requires moving it, which can introduce memory reclamation problems if not done correctly. 348 The \CFA runtime system groups processors together as \newterm{clusters}, as shown in Figure~\ref{fig:system}. 349 Threads on a cluster are always scheduled on one of the processors of the cluster. 350 Currently, the runtime handles dynamically adding and removing processors from clusters at any time. 351 Since this is part of the existing design, the proposed scheduler must also support this behaviour. 
However, dynamically resizing a cluster is considered a rare event associated with setup, tear down and major configuration changes.
This assumption is made both in the design of the proposed scheduler as well as in the original design of the \CFA runtime system.
As such, the proposed scheduler must honour the correctness of this behaviour but does not have any performance objectives with regard to resizing a cluster.
How long adding or removing processors takes, and how much this disrupts the performance of other threads, is considered a secondary concern since it should be amortized over long periods of time.
However, as mentioned in Section~\ref{sec:queue}, contention on the underlying queues can have a direct impact on performance.
The number of underlying queues must therefore be adjusted as the number of processors grows or shrinks.
Since the underlying queues are stored in a dense array, changing the number of queues requires resizing the array, and expanding the array requires moving it, which can introduce memory-reclamation problems if not done correctly.

\begin{figure}
…
\end{figure}

It is important to note how the array is used in this case.
While the array cells are modified by every push and pop operation, the array itself, \ie the pointer that would change when resized, is only read during these operations.
Therefore the use of this pointer can be described as frequent reads and infrequent writes.
This description effectively matches that of a reader-writer lock: infrequent but invasive updates among frequent read operations.
In the case of the ready queue described above, read operations are operations that push or pop from the ready queue but do not invalidate any references to the ready-queue data structures.
Writes, on the other hand, would add or remove inner queues, invalidating references to the array of inner queues in the process.
Therefore, the current proposed approach to this problem is to add a per-cluster reader-writer lock around the ready queue to prevent restructuring of the ready-queue data structure while threads are being pushed or popped.
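As a rough illustration of this usage pattern, the sketch below guards push and pop with the read side of a reader-writer lock and cluster resizing with the write side.
The standard $pthread_rwlock_t$ stands in for the custom scalable lock this proposal actually calls for, and the $cluster$ fields and helpers are hypothetical.
\begin{lstlisting}[language=C]
// Sketch: push/pop take the read side, resizing the dense queue array takes the write side.
// The cluster layout and helpers are hypothetical; error handling is elided.
#include <pthread.h>
#include <stdlib.h>

struct thread_desc;
struct intrusive_queue;

struct cluster {
    pthread_rwlock_t lock;                     // per-cluster reader-writer lock
    struct intrusive_queue ** queues;          // dense array of underlying queues
    unsigned nqueues;
};

void cluster_push( struct cluster * cl, struct thread_desc * t ) {
    pthread_rwlock_rdlock( &cl->lock );        // frequent, non-exclusive
    /* ... pick an underlying queue and push t onto it ... */
    pthread_rwlock_unlock( &cl->lock );
}

struct thread_desc * cluster_pop( struct cluster * cl ) {
    pthread_rwlock_rdlock( &cl->lock );
    struct thread_desc * t = NULL;
    /* ... pop t from some underlying queue ... */
    pthread_rwlock_unlock( &cl->lock );
    return t;
}

void cluster_resize( struct cluster * cl, unsigned nqueues ) {
    pthread_rwlock_wrlock( &cl->lock );        // rare, exclusive: safe to move the array
    cl->queues  = realloc( cl->queues, nqueues * sizeof( cl->queues[0] ) );
    cl->nqueues = nqueues;
    pthread_rwlock_unlock( &cl->lock );
}
\end{lstlisting}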
There are possible alternatives to the reader-writer lock solution.
This problem is effectively a memory-reclamation problem and as such there is a large body of research on the subject\cit.
However, the reader-writer lock solution is simple and can be leveraged to solve other problems (\eg processor ordering and memory reclamation of threads), which makes it an attractive solution.

\paragraph{Objectives and Existing Work}
The lock must offer scalability and performance on par with the actual ready queue in order not to introduce a new bottleneck.
I have already built a lock that fits the desired requirements and preliminary testing shows scalability and performance that exceed the target.
As such, I do not consider this lock to be a risk for this project.

\subsection{Idle Sleep} \label{sec:sleep}

\newterm{Idle sleep} is the process of putting processors to sleep when they have no threads to execute.
In this context, processors are kernel threads and sleeping refers to asking the kernel to block a thread.
This operation can be achieved with either thread-synchronization operations like $pthread_cond_wait$ or signal operations like $sigsuspend$.
The goal of putting idle processors to sleep is:
\begin{enumerate}
\item
…
and reduce energy consumption in cases where more idle kernel-threads translate to idle CPUs, which can cycle down.
\end{enumerate}
Support for idle sleep broadly involves calling the operating system to block the kernel thread, handling the race between a blocking processor and a waking processor, and deciding which kernel thread should sleep or wake up.
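For reference, the two kernel-level blocking mechanisms mentioned above look roughly as follows in plain POSIX; this is generic usage for illustration, not the \CFA runtime's code, and it assumes a handler is installed for the chosen wake-up signal.
\begin{lstlisting}[language=C]
// Sketch: two standard POSIX ways to block a kernel thread until it is woken.
#include <pthread.h>
#include <signal.h>
#include <stdbool.h>

// 1. Condition variable: the waker sets the flag and calls pthread_cond_signal.
static pthread_mutex_t idle_mtx  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  idle_cond = PTHREAD_COND_INITIALIZER;
static bool work_available = false;

void block_with_condvar( void ) {
    pthread_mutex_lock( &idle_mtx );
    while ( ! work_available )                    // guard against spurious wakeups
        pthread_cond_wait( &idle_cond, &idle_mtx );
    work_available = false;
    pthread_mutex_unlock( &idle_mtx );
}

// 2. Signals: the waker calls pthread_kill( target, SIGUSR1 ).
// Assumes SIGUSR1 is normally blocked and has a (possibly empty) handler installed.
void block_with_sigsuspend( void ) {
    sigset_t wait_mask;
    sigfillset( &wait_mask );
    sigdelset( &wait_mask, SIGUSR1 );
    sigsuspend( &wait_mask );                     // atomically unblock SIGUSR1 and wait for it
}
\end{lstlisting}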
When a processor decides to sleep, there is a race between it signalling that it is going to sleep (so other processors can find sleeping processors) and actually blocking the kernel thread.
This operation is equivalent to the classic problem of missing signals when using condition variables: the ``sleepy'' processor indicates its intention to block but has not yet gone to sleep when another processor attempts to wake it up.
The waking-up operation sees the blocked process and signals it, but the blocking process is racing to sleep so the signal is missed.
In cases where kernel threads are managed as processors on the current cluster, losing signals is not necessarily critical, because at least some processors on the cluster are awake and may eventually check for more work.
Individual processors always finish scheduling user threads before looking for new work, which means that the last processor to go to sleep cannot miss threads scheduled from inside the cluster (if it does, that demonstrates the ready queue is not linearizable).
However, this guarantee does not hold if threads are scheduled from outside the cluster, either due to an external event like timers and I/O, or due to a user (or kernel) thread migrating from a different cluster.
In this case, missed signals can lead to the cluster deadlocking\footnote{Clusters should only deadlock in cases where a \CFA programmer \emph{actually} writes \CFA code that leads to a deadlock.}.
Therefore, it is important that the scheduling of threads include a mechanism where signals \emph{cannot} be missed.
For performance reasons, it can be advantageous to have a secondary mechanism that allows signals to be missed in cases where that cannot lead to a deadlock.
To be safe, this process must include a ``handshake'' that guarantees either that the sleeping processor notices a user thread scheduled after it signalled its intent to block, or that the code scheduling threads sees the intent to sleep before scheduling and is able to wake up the processor.
This matter is complicated by the fact that pthreads and Linux offer few tools to implement this solution and no guarantee on the order in which threads wake up for most of these tools.
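One common shape for such a handshake is sketched below: the processor publishes its intent to sleep, re-checks the ready queue, and only then blocks, while the wake-up path makes the new thread visible first and only then signals.
This is the generic lost-wakeup pattern built on a pthread mutex and condition variable, not necessarily the mechanism the \CFA runtime will use; $ready_queue_nonempty$ and the $processor$ fields are hypothetical.
\begin{lstlisting}[language=C]
// Sketch of a handshake that cannot miss a wake-up signal.
#include <pthread.h>
#include <stdbool.h>

extern bool ready_queue_nonempty( void );     // hypothetical: reliable emptiness check

struct processor {
    pthread_mutex_t lock;
    pthread_cond_t  wake;
    bool intent_to_sleep;
};

void processor_sleep( struct processor * proc ) {
    pthread_mutex_lock( &proc->lock );
    proc->intent_to_sleep = true;             // 1. publish the intent to sleep
    // 2. re-check: any thread made visible before this point is noticed here
    while ( proc->intent_to_sleep && ! ready_queue_nonempty() )
        pthread_cond_wait( &proc->wake, &proc->lock );   // 3. block, releasing the lock atomically
    proc->intent_to_sleep = false;
    pthread_mutex_unlock( &proc->lock );
}

void wake_if_sleeping( struct processor * proc ) {
    // Called only after the newly scheduled thread is visible in the ready queue.
    pthread_mutex_lock( &proc->lock );
    if ( proc->intent_to_sleep ) {
        proc->intent_to_sleep = false;
        pthread_cond_signal( &proc->wake );
    }
    pthread_mutex_unlock( &proc->lock );
}
\end{lstlisting}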
Another important issue is avoiding kernel threads sleeping and waking frequently because there is a significant operating-system cost.
This scenario happens when a program oscillates between high and low activity, needing more and then fewer processors.
A possible partial solution is to order the processors so that the one that most recently went to sleep is woken up first.
This allows other sleeping processors to reach deeper sleep states (when these are available) while keeping ``hot'' processors warmer.
Note that while this generally means organizing the processors in a stack, I believe the unique index provided by my reader-writer lock can be reused to strictly order the waking processors, causing a mostly-LIFO order.
While a strict LIFO stack is probably better, the processor index could prove useful for other reasons, while still offering a sufficiently LIFO ordering.

A final important aspect of idle sleep is when processors should make the decision to sleep and when it is appropriate for sleeping processors to be woken up.
Processors that are unnecessarily unblocked lead to unnecessary contention, CPU usage, and power consumption, while too many sleeping processors can lead to sub-optimal throughput.
Furthermore, transitions from sleeping to awake and vice versa also add unnecessary latency.
There is already a wealth of research on the subject \TODO{(reference Libfibre)} and I may use an existing approach for the idle-sleep heuristic in this project.
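To illustrate the ordering idea, the following sketch keeps idle processors on a LIFO list so the wake-up path always picks the most recently slept, and therefore hottest, processor.
The structures are hypothetical and the locking is deliberately simple; a real implementation would combine this with the handshake above and possibly with the index from the reader-writer lock.
\begin{lstlisting}[language=C]
// Sketch: LIFO list of idle processors; wake-ups prefer the most recently slept one.
#include <pthread.h>
#include <stddef.h>

struct processor;                              // as in the previous sketch

struct idle_node { struct processor * proc; struct idle_node * next; };

struct idle_list {
    pthread_mutex_t lock;
    struct idle_node * top;
};

// Called by a processor just before it blocks: push onto the top of the stack.
void idle_push( struct idle_list * list, struct idle_node * node ) {
    pthread_mutex_lock( &list->lock );
    node->next = list->top;
    list->top  = node;
    pthread_mutex_unlock( &list->lock );
}

// Called by the wake-up path: pop the most recently slept processor, if any.
struct processor * idle_pop( struct idle_list * list ) {
    pthread_mutex_lock( &list->lock );
    struct idle_node * node = list->top;
    if ( node ) list->top = node->next;
    pthread_mutex_unlock( &list->lock );
    return node ? node->proc : NULL;
}
\end{lstlisting}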
\subsection{Asynchronous I/O}

The final aspect of this proposal is asynchronous I/O.
Without it, user threads that execute I/O operations block the underlying kernel thread, which leads to poor throughput.
It is preferable to block the user thread performing the I/O and reuse the underlying kernel thread to run other ready user threads.
This approach requires intercepting user-thread calls to I/O operations, redirecting them to an asynchronous I/O interface, and handling the multiplexing/demultiplexing between the synchronous and asynchronous APIs.
As such, there are three components needed to implement support for asynchronous I/O:
\begin{enumerate}
\item
…

\paragraph{OS Abstraction}
One fundamental part of converting blocking I/O operations into non-blocking ones is having an underlying asynchronous I/O interface to direct the I/O operations to.
While there exist many different APIs for asynchronous I/O, it is not part of this proposal to create a novel API; it is sufficient to make one work in the complex context of the \CFA runtime.
\uC uses $select$ as its interface, which handles ttys, pipes and sockets, but not disk.
$select$ entails significant complexity and is being replaced in UNIX operating systems, which makes it a less interesting alternative.
Another popular interface is $epoll$\cit, which is supposed to be cheaper than $select$.
However, $epoll$ also does not handle the file system and seems to have problems with Linux pipes and $TTY$s\cit.
A very recent alternative that I am investigating is $io_uring$.
It claims to address some of the issues with $epoll$ but is too recent to be confident that it does.
Finally, a popular cross-platform alternative is $libuv$, which offers asynchronous sockets and asynchronous file-system operations (among other features).
However, as a full-featured library it includes much more than I need and could conflict with other features of \CFA unless significant effort is made to merge them together.
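To give a concrete flavour of what this layer has to wrap, the sketch below shows a minimal Linux $epoll$ poller loop of the kind the runtime could sit on top of.
This is ordinary $epoll$ usage for illustration only, not the interface the \CFA runtime will necessarily adopt; $unblock_user_thread$ is a hypothetical hook into the user-level scheduler, and the loop already hints at the demultiplexing role of the event engine discussed next.
\begin{lstlisting}[language=C]
// Sketch: minimal epoll loop; readiness events are demultiplexed to blocked user threads.
#include <sys/epoll.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

extern void unblock_user_thread( void * waiting_thread );   // hypothetical scheduler hook

void poller_loop( void ) {
    int epfd = epoll_create1( 0 );
    if ( epfd < 0 ) { perror( "epoll_create1" ); exit( EXIT_FAILURE ); }

    // Elsewhere, a blocking wrapper registers its file descriptor before parking:
    //   struct epoll_event ev = { .events = EPOLLIN | EPOLLONESHOT, .data.ptr = waiting_thread };
    //   epoll_ctl( epfd, EPOLL_CTL_ADD, fd, &ev );

    for ( ;; ) {
        struct epoll_event events[64];
        int n = epoll_wait( epfd, events, 64, -1 );          // block until some fd is ready
        if ( n < 0 ) {
            if ( errno == EINTR ) continue;
            perror( "epoll_wait" ); exit( EXIT_FAILURE );
        }
        for ( int i = 0; i < n; i += 1 )
            unblock_user_thread( events[i].data.ptr );       // wake the corresponding user thread
    }
}
\end{lstlisting}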
\paragraph{Event Engine}
On top of the asynchronous interface layer sits the event engine.
This engine is responsible for multiplexing (batching) the synchronous I/O requests into asynchronous I/O requests and demultiplexing the results to the appropriate blocked user threads.
This step can be straightforward for simple cases, but becomes quite complex when there are thousands of user threads performing both reads and writes, possibly on overlapping file descriptors.
Decisions that need to be made include:
\begin{enumerate}
\item
…

\paragraph{Interface}
Finally, for these non-blocking I/O components to be available, it is necessary to expose them through a synchronous interface because that is the \CFA concurrent programming style.
The interface can be novel, but it is preferable to match the existing POSIX interface when possible to be compatible with existing code.
Matching allows C programs written using this interface to be transparently converted to \CFA with minimal effort.
Where new functionality is needed, I will create a novel interface to fill gaps and provide advanced features.
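As an illustration of what matching the POSIX interface could look like, the sketch below keeps the familiar $read$ signature while blocking only the calling user thread.
The names $io_request$, $io_submit_read$, $park$ and $this_thread$ are hypothetical placeholders for the event engine and scheduler hooks, not the actual \CFA API, and error paths are simplified.
\begin{lstlisting}[language=C]
// Sketch: a POSIX-shaped read wrapper that parks the user thread, not the kernel thread.
#include <sys/types.h>
#include <stddef.h>

struct io_request {
    int fd;
    void * buf;
    size_t count;
    ssize_t result;              // filled in by the event engine on completion
    void * waiting_thread;       // user thread to unblock on completion
};

extern void * this_thread( void );                       // hypothetical: current user thread
extern void io_submit_read( struct io_request * req );   // hypothetical: hand off to the event engine
extern void park( void );                                // hypothetical: block only the user thread

ssize_t my_read( int fd, void * buf, size_t count ) {    // a real runtime would expose this as read()
    struct io_request req = {
        .fd = fd, .buf = buf, .count = count,
        .waiting_thread = this_thread(),
    };
    io_submit_read( &req );      // the kernel thread stays free to run other user threads
    park();                      // resumed by the event engine once the result is ready
    return req.result;           // same return convention as POSIX read()
}
\end{lstlisting}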
doc/theses/thierry_delisle_PhD/comp_II/local.bib
r34d0a28 r5569a31

% ===============================================================================
% Algorithms
% ===============================================================================
@article{michael2004hazard,
  title     = {Hazard pointers: Safe memory reclamation for lock-free objects},
  author    = {Michael, Maged M},
  journal   = {IEEE Transactions on Parallel and Distributed Systems},
  volume    = {15},
  number    = {6},
  pages     = {491--504},
  year      = {2004},
  publisher = {IEEE}
}

@inproceedings{brown2015reclaiming,
  title     = {Reclaiming memory for lock-free data structures: There has to be a better way},
  author    = {Brown, Trevor Alexander},
  booktitle = {Proceedings of the 2015 ACM Symposium on Principles of Distributed Computing},
  pages     = {261--270},
  year      = {2015}
}

% Trevor's relaxed FIFO list
@inproceedings{alistarh2018relaxed,
…
  year = {2007}
}

% ===============================================================================
% Linux Man Pages
% ===============================================================================
@manual{epoll,
  title = {epoll(7) Linux User's Manual},
  year  = {2019},
  month = {March}
}

@manual{select,
  title = {select(7) Linux User's Manual},
  year  = {2019},
  month = {March}
}

@misc{io_uring,
  title   = {Efficient IO with io\_uring},
  author  = {Axboe, Jens},
  year    = {2019},
  month   = {March},
  version = {0.4},
  url     = {https://kernel.dk/io_uring.pdf}
}

@misc{libuv,
  title = {libuv},
  url   = {https://github.com/libuv/libuv}
}

% ===============================================================================
% MISC
% ===============================================================================

% libfibre web server
@article{karstenuser,
  title  = {User-level Threading: Have Your Cake and Eat It Too},
  author = {Karsten, Martin and Barghi, Saman}
}

% libfibre
@misc{libfibre,
  title  = {libfibre},
  author = {Karsten, Martin},
  url    = {https://git.uwaterloo.ca/mkarsten/libfibre}
}

@article{Delisle19,
  keywords    = {concurrency, Cforall},
  contributer = {pabuhr@plg},
  author      = {Thierry Delisle and Peter A. Buhr},
  title       = {Advanced Control-flow and Concurrency in \textsf{C}$\mathbf{\forall}$},
  year        = {2019},
  journal     = spe,
  pages       = {1-33},
  note        = {submitted}
}

@article{schillings1996engineering,
  title   = {Be engineering insights: Benaphores},
  author  = {Schillings, Benoit},
  journal = {Be Newsletters},
  volume  = {1},
  number  = {26},
  year    = {1996}
}

@misc{wiki:thunderherd,
  author = {{Wikipedia contributors}},
  title  = {Thundering herd problem --- {W}ikipedia{,} The Free Encyclopedia},
  year   = {2020},
  url    = {https://en.wikipedia.org/wiki/Thundering_herd_problem},
  note   = {[Online; accessed 14-April-2020]}
}