Changeset a44514e for doc/theses
- Timestamp:
- Sep 7, 2022, 4:12:00 PM (3 years ago)
- Branches:
- ADT, ast-experimental, master, pthread-emulation
- Children:
- e4855f6
- Parents:
- 7a0f798b
- Location:
- doc/theses/thierry_delisle_PhD
- Files:
- 
      - 15 edited
 
 - 
          
  .gitignore (modified) (1 diff)
- 
          
  thesis/Makefile (modified) (2 diffs)
- 
          
  thesis/glossary.tex (modified) (12 diffs)
- 
          
  thesis/local.bib (modified) (3 diffs)
- 
          
  thesis/text/conclusion.tex (modified) (1 diff)
- 
          
  thesis/text/core.tex (modified) (8 diffs)
- 
          
  thesis/text/eval_macro.tex (modified) (2 diffs)
- 
          
  thesis/text/eval_micro.tex (modified) (10 diffs)
- 
          
  thesis/text/existing.tex (modified) (9 diffs)
- 
          
  thesis/text/front.tex (modified) (3 diffs)
- 
          
  thesis/text/intro.tex (modified) (9 diffs)
- 
          
  thesis/text/io.tex (modified) (11 diffs)
- 
          
  thesis/text/practice.tex (modified) (3 diffs)
- 
          
  thesis/text/runtime.tex (modified) (3 diffs)
- 
          
  thesis/thesis.tex (modified) (1 diff)
 
Legend:
- Unmodified
- Added
- Removed
- 
      doc/theses/thierry_delisle_PhD/.gitignorer7a0f798b ra44514e 20 20 thesis/fig/*.fig.bak 21 21 thesis/thesis.pdf 22 thesis/thesis.tty 22 23 thesis/thesis.ps 23 24 
- 
      doc/theses/thierry_delisle_PhD/thesis/Makefiler7a0f798b ra44514e 11 11 LaTeX = TEXINPUTS=${TeXLIB} && export TEXINPUTS && latex -halt-on-error -output-directory=${Build} 12 12 BibTeX = BIBINPUTS=${TeXLIB} && export BIBINPUTS && bibtex 13 DeTeX = TEXINPUTS=${TeXLIB} && export TEXINPUTS && detex -r 13 14 14 15 MAKEFLAGS = --no-print-directory # --silent … … 144 145 ${LaTeX} $< 145 146 147 %.tty: build/%.dvi 148 dvi2tty -w132 $< > $@ 149 146 150 ## Define the default recipes. 147 151 
- 
      doc/theses/thierry_delisle_PhD/thesis/glossary.texr7a0f798b ra44514e 14 14 % Definitions 15 15 16 \longnewglossaryentry{ thrd}17 {name={ thread}}16 \longnewglossaryentry{at} 17 {name={Thread},text={thread}} 18 18 { 19 Threads created and managed inside user-space. Each thread has its own stack and its own thread of execution. User-level threads are invisible to the underlying operating system.19 Abstract object representing a unit of work. Systems will offer one or more concrete implementations of this concept (\eg \gls{kthrd}, \gls{job}), however, most of the concepts of scheduling are independent of the particular implementations of the work representation. For this reason, this document uses the term \Gls{at} to mean any representation and not one in particular. 20 20 21 \textit{Synonyms : User threads, Lightweight threads, Green threads, Virtual threads, Tasks.}21 \textit{Synonyms : Tasks, Jobs, Blocks.} 22 22 } 23 23 24 24 \longnewglossaryentry{proc} 25 {name={ processor}}25 {name={Processor},text={processor}} 26 26 { 27 Entity that executes the \glspl{at}, \ie the resource being scheduled by the scheduler. In kernel level threading, \ats are kernel threads and \procs are the \glspl{hthrd} on which the kernel threads are scheduled. In user-level threading and in thread pools, \procs are kernel threads. 27 28 29 \textit{Synonyms : Server, Worker.} 28 30 } 29 31 30 32 \longnewglossaryentry{rQ} 31 {name={ ready-queue}}33 {name={Ready Queue}, text={ready-queue}} 32 34 { 33 35 Data structure holding \ats that are ready to be \glslink{atrun}{run}. Often a \glsxtrshort{fifo} queue, but can take many different forms, \eg binary trees are also common. 34 36 } 35 37 36 38 \longnewglossaryentry{uthrding} 37 {name={ user-level threading}}39 {name={User-Level Threading},text={user-level threading}} 38 40 { 39 41 Threading model where a scheduler runs in users space and maps threads managed and created inside the user-space onto \glspl{kthrd}. 40 42 41 43 \textit{Synonyms : User threads, Lightweight threads, Green threads, Virtual threads, Tasks.} … … 43 45 44 46 \longnewglossaryentry{rmr} 45 {name={ remote memory reference}}47 {name={Remote Memory Reference},text={remote memory reference}} 46 48 { 47 49 Reference to an address in memory that is considered \newterm{remote}, as opposed to \emph{local}. This can mean for example: a cache line that is in a cache not shared by the current \gls{hthrd}, a block of memory that belons to a different CPU socket in a \glsxtrshort{numa} context, etc. 48 50 } 49 51 … … 51 53 52 54 \longnewglossaryentry{hthrd} 53 {name={ hardware thread}}55 {name={Hardware Thread},text={hardware thread}} 54 56 { 55 57 Threads representing the underlying hardware directly, \eg the CPU core, or hyper-thread if the hardware supports multiple threads of execution per core. The number of hardware threads is considered to be always fixed to a specific number determined by the hardware. 56 58 57 \textit{Synonyms : }59 \textit{Synonyms : Core, Processing Unit, CPU.} 58 60 } 59 61 60 62 \longnewglossaryentry{kthrd} 61 {name={ kernel-level thread}}63 {name={Kernel-Level Thread},text={kernel-level thread}} 62 64 { 63 65 Threads created and managed inside kernel-space. Each thread has its own stack and its own thread of execution. Kernel-level threads are owned, managed and scheduled by the underlying operating system. … … 67 69 68 70 \longnewglossaryentry{fiber} 69 {name={ fiber}}71 {name={Fiber},text={fiber}} 70 72 { 71 73 Fibers are non-preemptive user-level threads. They share most of the caracteristics of user-level threads except that they cannot be preempted by another fiber. … … 75 77 76 78 \longnewglossaryentry{job} 77 {name={ job}}79 {name={Job},text={job}} 78 80 { 79 81 Unit of work, often sent to a thread pool or worker pool to be executed. Has neither its own stack nor its own thread of execution. … … 83 85 84 86 \longnewglossaryentry{pool} 85 {name={ thread-pool}}87 {name={Thread Pool},text={thread-pool}} 86 88 { 87 Group of homogeneuous threads that loop executing units of works after another.89 Group of homogeneuous threads that loop executing units of works. Often executing \glspl{jobs}. 88 90 89 \textit{Synonyms : }91 \textit{Synonyms : Executor.} 90 92 } 91 93 92 94 \longnewglossaryentry{preemption} 93 {name={ preemption}}95 {name={Preemption},text={preemption}} 94 96 { 95 97 Involuntary context switch imposed on threads at a given rate. … … 98 100 } 99 101 100 101 102 \longnewglossaryentry{at}103 {name={task}}104 {105 Abstract object representing an unit of work. Systems will offer one or more concrete implementations of this concept (\eg \gls{kthrd}, \gls{job}), however, most of the concept of schedulings are independent of the particular implementations of the work representation. For this reason, this document use the term \Gls{at} to mean any representation and not one in particular.106 }107 108 102 \longnewglossaryentry{atsched} 109 103 {name={Scheduling a \gls{at}}} 110 104 { 111 Scheduling a n \gls{at} refers to the act of notifying the scheduler that a task is ready to be ran. When representing the scheduler as a queue of tasks, scheduling is the act of pushing a task onto the end of the queue. This doesn't necesserily meansthe task will ever be allocated CPU time (\gls{atrun}), for example, if the system terminates abruptly, scheduled \glspl{at} will probably never run.105 Scheduling a \at refers to the act of notifying the scheduler that a task is ready to be run. When representing the scheduler as a queue of tasks, scheduling is the act of pushing a task onto the end of the queue. This does not necessarily mean the task will ever be allocated CPU time (\gls{atrun}), for example, if the system terminates abruptly, scheduled \glspl{at} will probably never run. 112 106 113 \textit{Synonyms : None.}107 \textit{Synonyms : Unparking.} 114 108 } 115 109 … … 117 111 {name={Running a \gls{at}}} 118 112 { 119 Running a n \gls{at} refers to the act of allocating CPU time to a task that is ready to run. When representing the scheduler as a queue of tasks, running is the act of poping a task from the front of the queue and putting it onto a \gls{proc}. The \gls{at} can than accomplish some or all of the work it is programmed to do.113 Running a \at refers to the act of allocating CPU time to a task that is ready to run. When representing the scheduler as a queue of tasks, running is the act of popping a task from the front of the queue and putting it onto a \gls{proc}. The \gls{at} can then accomplish some or all of the work it is programmed to do. 120 114 121 115 \textit{Synonyms : None.} … … 123 117 124 118 \longnewglossaryentry{atmig} 125 {name={ migration of \gls{at}}}119 {name={Migration of \glspl{at}}} 126 120 { 127 Migration refers to the idea of an \gls{at} running on a different worker/processor than the last time it was run. It is generally preferable to minimise migration as it incurs cost but any load balancing among workersrequires some amount of migration.121 Migration refers to the idea of an \gls{at} running on a different \proc than the last time it was run. It is generally preferable to minimise migration as it incurs cost but any load balancing among \proc requires some amount of migration. 128 122 129 123 \textit{Synonyms : None.} … … 131 125 132 126 \longnewglossaryentry{atpass} 133 {name={ overtaking \gls{at}}}127 {name={Overtaking \gls{at}}} 134 128 { 135 129 When representing the scheduler as a queue of \glspl{at}, overtaking is the act breaking the FIFO-ness of the queue by moving a \gls{at} in front of some other \gls{at} when it arrived after. This remains true for schedulers that do not use a FIFO queue, when the order in which the \glspl{at} are \glslink{atsched}{scheduled} and \glslink{atrun}{run} in a different order. A \gls{at} is said to \emph{overtake} another if it is run \emph{before} but was \emph{scheduled} after the other \gls{at}. … … 143 137 Blocking an abstract task refers to the act of taking a task that us running on a CPU off the CPU. Unless no other task is ready, this action is generally immediately followed by running an other task. 144 138 145 \textit{Synonyms : None.}139 \textit{Synonyms : Parking.} 146 140 } 147 141 … … 157 151 158 152 \longnewglossaryentry{load} 159 {name={System Load} }153 {name={System Load},text={load}} 160 154 { 161 The load is refers to the rate at which \glspl{at} are \glslink{atsched}{scheduled} versus the rate at which they are \glslink{atrun}{run}. When \glspl{at} are being scheduled faster than they are run, the system is considered \emph{overloaded}. When \glspl{at} are being run faster than they are scheduled, the system is considered \emph{underloaded}. Conrrespondingly, if both rates are equal, the system is considered \emph{loaded}. Note that the system is considered loaded only of the rate at which \glspl{at} are scheduled/run is non-zero, otherwise the system is empty, it has no load. 155 The System Load refers to the rate at which \glspl{at} are \glslink{atsched}{scheduled} versus the rate at which they are \glslink{atrun}{run}. When \glspl{at} are being scheduled faster than they are run, the system is considered \emph{overloaded}. When \glspl{at} are being run faster than they are scheduled, the system is considered \emph{underloaded}. Conrrespondingly, if both rates are equal, the system is considered \emph{loaded}. Note that the system is considered loaded only of the rate at which \glspl{at} are scheduled/run is non-zero, otherwise the system is empty, it has no load. 156 157 \textit{Synonyms : CPU Load, System Load.} 162 158 } 163 159 
- 
      doc/theses/thierry_delisle_PhD/thesis/local.bibr7a0f798b ra44514e 714 714 } 715 715 716 @misc{GITHUB:SchedulingBenchmarks, 717 title = {Scheduling Benchmarks}, 718 author = {Thierry Delisle}, 719 howpublished = {\href{https://github.com/cforall/SchedulingBenchmarks_PhD22}{https://\-github.com/\-cforall/\-SchedulingBenchmarks\_\-PhD22}}, 720 } 721 716 722 % -------------------------------------------------- 717 723 % Tech documents … … 822 828 } 823 829 830 @manual{MAN:eventfd, 831 key = "eventfd", 832 title = "eventfd(2) Linux User's Manual", 833 year = "2019", 834 month = "MArch", 835 } 836 824 837 @manual{MAN:aio, 825 838 key = "aio", … … 934 947 howpublished = "\href{https://en.wikipedia.org/wiki/Zipf%27s_law}{https://\-en.wikipedia.org/\-wiki/\-Zipf\%27s\-\_law}", 935 948 note = "[Online; accessed 5-August-2022]" 949 } 950 951 @misc{wiki:htm, 952 author = "{Wikipedia contributors}", 953 title = "Transactional memory --- {W}ikipedia{,} The Free Encyclopedia", 954 year = "2022", 955 howpublished = "\href{https://en.wikipedia.org/wiki/Zipf%27s_law}{https://\-en.wikipedia.org/\-wiki/\-Zipf\%27s\-\_law}", 956 note = "[Online; accessed 7-September-2022]" 936 957 } 937 958 
- 
      doc/theses/thierry_delisle_PhD/thesis/text/conclusion.texr7a0f798b ra44514e 113 113 In both of these examples, some care is needed to ensure that reads to an address \emph{sometime} retire. 114 114 115 Note, this idea is similar to \newterm{Hardware Transactional Memory}~\cite{ HTM}, which allows groups of instructions to be aborted and rolled-back if they encounter memory conflicts when being retired.115 Note, this idea is similar to \newterm{Hardware Transactional Memory}~\cite{wiki:htm}, which allows groups of instructions to be aborted and rolled-back if they encounter memory conflicts when being retired. 116 116 However, I believe this feature is generally aimed at large groups of instructions. 117 117 A more fine-grained approach may be more amenable by carefully picking which aspects of an algorithm require exact correctness and which do not. 
- 
      doc/theses/thierry_delisle_PhD/thesis/text/core.texr7a0f798b ra44514e 2 2 3 3 Before discussing scheduling in general, where it is important to address systems that are changing states, this document discusses scheduling in a somewhat ideal scenario, where the system has reached a steady state. 4 For this purpose, a steady state is loosely defined as a state where there are always \ glspl{thrd}ready to run and the system has the resources necessary to accomplish the work, \eg, enough workers.4 For this purpose, a steady state is loosely defined as a state where there are always \ats ready to run and the system has the resources necessary to accomplish the work, \eg, enough workers. 5 5 In short, the system is neither overloaded nor underloaded. 6 6 7 7 It is important to discuss the steady state first because it is the easiest case to handle and, relatedly, the case in which the best performance is to be expected. 8 As such, when the system is either overloaded or underloaded, a common approach is to try to adapt the system to this new loadand return to the steady state, \eg, by adding or removing workers.8 As such, when the system is either overloaded or underloaded, a common approach is to try to adapt the system to this new \gls{load} and return to the steady state, \eg, by adding or removing workers. 9 9 Therefore, flaws in scheduling the steady state tend to be pervasive in all states. 10 10 … … 20 20 \end{displayquote} 21 21 22 Applied to threads, this model states that every ready \gls{thrd} immediately runs in parallel with all other ready \glspl{thrd}. While a strict implementation of this model is not feasible, programmers still have expectations about scheduling that come from this model.23 24 In general, the expectation at the center of this model is that ready \ glspl{thrd}do not interfere with each other but simply share the hardware.25 This assumption makes it easier to reason about threading because ready \ glspl{thrd}can be thought of in isolation and the effect of the scheduler can be virtually ignored.26 This expectation of \ gls{thrd}independence means the scheduler is expected to offer two guarantees:22 Applied to \ats, this model states that every ready \at immediately runs in parallel with all other ready \ats. While a strict implementation of this model is not feasible, programmers still have expectations about scheduling that come from this model. 23 24 In general, the expectation at the center of this model is that ready \ats do not interfere with each other but simply share the hardware. 25 This assumption makes it easier to reason about threading because ready \ats can be thought of in isolation and the effect of the scheduler can be virtually ignored. 26 This expectation of \at independence means the scheduler is expected to offer two guarantees: 27 27 \begin{enumerate} 28 \item A fairness guarantee: a \ gls{thrd}that is ready to run is not prevented by another thread.29 \item A performance guarantee: a \ gls{thrd}that wants to start or stop running is not prevented by other threads wanting to do the same.28 \item A fairness guarantee: a \at that is ready to run is not prevented by another thread. 29 \item A performance guarantee: a \at that wants to start or stop running is not prevented by other threads wanting to do the same. 30 30 \end{enumerate} 31 31 32 32 It is important to note that these guarantees are expected only up to a point. 33 \Glspl{ thrd} that are ready to run should not be prevented to do so, but they still share the limited hardware resources.34 Therefore, the guarantee is considered respected if a \ gls{thrd}gets access to a \emph{fair share} of the hardware resources, even if that share is very small.33 \Glspl{at} that are ready to run should not be prevented to do so, but they still share the limited hardware resources. 34 Therefore, the guarantee is considered respected if a \at gets access to a \emph{fair share} of the hardware resources, even if that share is very small. 35 35 36 36 Similar to the performance guarantee, the lack of interference among threads is only relevant up to a point. … … 59 59 For interactive applications that need to run at 60, 90, 120 frames per second, \ats having to wait for several milliseconds to run are effectively starved. 60 60 Therefore load-balancing should be done at a faster pace, one that can detect starvation at the microsecond scale. 61 With that said, this is a much fuzzier requirement since it depends on the number of \procs, the number of \ats and the general loadof the system.61 With that said, this is a much fuzzier requirement since it depends on the number of \procs, the number of \ats and the general \gls{load} of the system. 62 62 63 63 \subsection{Fairness vs Scheduler Locality} \label{fairnessvlocal} … … 68 68 69 69 For a scheduler, having good locality, \ie, having the data local to each \gls{hthrd}, generally conflicts with fairness. 70 Indeed, good locality often requires avoiding the movement of cache lines, while fairness requires dynamically moving a \ gls{thrd}, and as consequence cache lines, to a \gls{hthrd} that is currently available.70 Indeed, good locality often requires avoiding the movement of cache lines, while fairness requires dynamically moving a \at, and as consequence cache lines, to a \gls{hthrd} that is currently available. 71 71 Note that this section discusses \emph{internal locality}, \ie, the locality of the data used by the scheduler versus \emph{external locality}, \ie, how the data used by the application is affected by scheduling. 72 72 External locality is a much more complicated subject and is discussed in the next section. … … 80 80 \input{fairness.pstex_t} 81 81 \vspace*{-10pt} 82 \caption[Fairness vs Locality graph]{Rule of thumb Fairness vs Locality graph \smallskip\newline The importance of Fairness and Locality while a ready \ gls{thrd} awaits running is shown as the time the ready \gls{thrd}waits increases, Ready Time, the chances that its data is still in cache decreases, Locality.83 At the same time, the need for fairness increases since other \ glspl{thrd}may have the chance to run many times, breaking the fairness model.82 \caption[Fairness vs Locality graph]{Rule of thumb Fairness vs Locality graph \smallskip\newline The importance of Fairness and Locality while a ready \at awaits running is shown as the time the ready \at waits increases, Ready Time, the chances that its data is still in cache decreases, Locality. 83 At the same time, the need for fairness increases since other \ats may have the chance to run many times, breaking the fairness model. 84 84 Since the actual values and curves of this graph can be highly variable, the graph is an idealized representation of the two opposing goals.} 85 85 \label{fig:fair} … … 97 97 98 98 \subsubsection{Migration Cost} 99 Another important source of scheduling latency is migration.99 Another important source of scheduling latency is \glslink{atmig}{migration}. 100 100 A \at migrates if it executes on two different \procs consecutively, which is the process discussed in \ref{fairnessvlocal}. 101 101 Migrations can have many different causes, but in certain programs, it can be impossible to limit migration. … … 250 250 To compare subqueues, the timestamp at the head must be compared to the current time, yielding the best-case wait-time for the \at at the head of the queue. 251 251 This new waiting is averaged with the stored average. 252 To further limit migration, a bias can be added to a local subqueue, where a remote subqueue is helped only if its moving average is more than $X$ times the local subqueue's average.252 To further limit \glslink{atmig}{migrations}, a bias can be added to a local subqueue, where a remote subqueue is helped only if its moving average is more than $X$ times the local subqueue's average. 253 253 Tests for this approach indicate the choice of the weight for the moving average or the bias is not important, \ie weights and biases of similar \emph{magnitudes} have similar effects. 254 254 … … 261 261 The good news is that this problem can be mitigated 262 262 263 \subsection{Redundant Timestamps}\ ref{relaxedtimes}263 \subsection{Redundant Timestamps}\label{relaxedtimes} 264 264 The problem with polling remote subqueues is that correctness is critical. 265 265 There must be a consensus among \procs on which subqueues hold which \ats, as the \ats are in constant motion. 
- 
      doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.texr7a0f798b ra44514e 49 49 Threads that are not currently dealing with another request ignore the incoming packet. 50 50 One of the remaining, nonbusy, threads reads the request and sends the response. 51 This implementation can lead to increased CPU loadas threads wake from sleep to potentially process the request.51 This implementation can lead to increased CPU \gls{load} as threads wake from sleep to potentially process the request. 52 52 \end{itemize} 53 53 Here, Memcached is based on an event-based webserver architecture~\cite{Pai99Flash}, using \gls{kthrd}ing to run multiple largely independent event engines, and if needed, spinning up additional kernel threads to handle blocking I/O. … … 273 273 It has two 2.8 GHz Xeon CPUs, and four one-gigabit Ethernet cards. 274 274 \item 275 \todo{switch} 275 Network routing is performed by a HP 2530 10 Gigabit Ethernet switch. 276 276 \item 277 277 A client machine runs two copies of the workload generator. 
- 
      doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.texr7a0f798b ra44514e 6 6 The goal in this chapter is show the \CFA scheduler obtains equivalent performance to other less fair schedulers through the different experiments. 7 7 Note, only the code of the \CFA tests is shown; 8 all tests in the other systems are functionally identical and available online~\cite{ SchedulingBenchmarks}.8 all tests in the other systems are functionally identical and available online~\cite{GITHUB:SchedulingBenchmarks}. 9 9 10 10 \section{Benchmark Environment}\label{microenv} … … 63 63 For this reason, I designed a different push/pop benchmark, called \newterm{Cycle Benchmark}. 64 64 This benchmark arranges a number of \ats into a ring, as seen in Figure~\ref{fig:cycle}, where the ring is a circular singly-linked list. 65 At runtime, each \at unparks the next \at before parkingitself.66 Unparking the next \at pushes that \at onto the ready queue while the ensuing park leads to a \at being popped from the ready queue.65 At runtime, each \at unparks the next \at before \glslink{atblock}{parking} itself. 66 Unparking the next \at pushes that \at onto the ready queue while the ensuing \park leads to a \at being popped from the ready queue. 67 67 68 68 \begin{figure} 69 69 \centering 70 70 \input{cycle.pstex_t} 71 \caption[Cycle benchmark]{Cycle benchmark\smallskip\newline Each \at unparks the next \at in the cycle before parkingitself.}71 \caption[Cycle benchmark]{Cycle benchmark\smallskip\newline Each \at unparks the next \at in the cycle before \glslink{atblock}{parking} itself.} 72 72 \label{fig:cycle} 73 73 \end{figure} 74 74 75 75 Therefore, the underlying runtime cannot rely on the number of ready \ats staying constant over the duration of the experiment. 76 In fact, the total number of \ats waiting on the ready queue is expected to vary because of the race between the next \at unparking and the current \at parking.76 In fact, the total number of \ats waiting on the ready queue is expected to vary because of the race between the next \at \glslink{atsched}{unparking} and the current \at \glslink{atblock}{parking}. 77 77 That is, the runtime cannot anticipate that the current task immediately parks. 78 78 As well, the size of the cycle is also decided based on this race, \eg a small cycle may see the chain of unparks go full circle before the first \at parks because of time-slicing or multiple \procs. 79 79 If this happens, the scheduler push and pop are avoided and the results of the experiment are skewed. 80 (Note, an unpark is like a V on a semaphore, so the subsequentpark (P) may not block.)80 (Note, an \unpark is like a V on a semaphore, so the subsequent \park (P) may not block.) 81 81 Every runtime system must handle this race and cannot optimized away the ready-queue pushes and pops. 82 To prevent any attempt of silently omitting ready-queue operations, the ring of \ats is made big enough so the \ats have time to fully park before being unparked again.82 To prevent any attempt of silently omitting ready-queue operations, the ring of \ats is made big enough so the \ats have time to fully \park before being unparked again. 83 83 Finally, to further mitigate any underlying push/pop optimizations, especially on SMP machines, multiple rings are created in the experiment. 84 84 85 85 Figure~\ref{fig:cycle:code} shows the pseudo code for this benchmark, where each cycle has 5 \ats. 86 There is additional complexity to handle termination (not shown), which requires a binary semaphore or a channel instead of raw @park@/@unpark@and carefully picking the order of the @P@ and @V@ with respect to the loop condition.86 There is additional complexity to handle termination (not shown), which requires a binary semaphore or a channel instead of raw \park/\unpark and carefully picking the order of the @P@ and @V@ with respect to the loop condition. 87 87 88 88 \begin{figure} … … 416 416 An interesting aspect to note here is that the runtimes differ in how they handle this situation. 417 417 Indeed, when a \proc unparks a \at that was last run on a different \proc, the \at could be appended to the ready queue of the local \proc or to the ready queue of the remote \proc, which previously ran the \at. 418 \CFA, Tokio and Go all use the approach of unparkingto the local \proc, while Libfibre unparks to the remote \proc.418 \CFA, Tokio and Go all use the approach of \glslink{atsched}{unparking} to the local \proc, while Libfibre unparks to the remote \proc. 419 419 In this particular benchmark, the inherent chaos of the benchmark, in addition to small memory footprint, means neither approach wins over the other. 420 420 … … 485 485 Up to 32 \procs, after which the other runtime manage to outscale Go. 486 486 487 In conclusion, the objective of this benchmark is to demonstrate that unparking\ats from remote \procs does not cause too much contention on the local queues.487 In conclusion, the objective of this benchmark is to demonstrate that \glslink{atsched}{unparking} \ats from remote \procs does not cause too much contention on the local queues. 488 488 Indeed, the fact that most runtimes achieve some scaling between various \proc count demonstrate migrations do not need to be serialized. 489 489 Again these result demonstrate \CFA achieves satisfactory performance with respect to the other runtimes. … … 491 491 \section{Locality} 492 492 493 As mentioned in the churn benchmark, when unparking a \at, it is possible to eitherunpark to the local or remote ready-queue.\footnote{494 It is also possible to unpark to a third unrelated ready-queue, but without additional knowledge about the situation, it is likely to degrade performance.}493 As mentioned in the churn benchmark, when \glslink{atsched}{unparking} a \at, it is possible to either \unpark to the local or remote ready-queue.\footnote{ 494 It is also possible to \unpark to a third unrelated ready-queue, but without additional knowledge about the situation, it is likely to degrade performance.} 495 495 The locality experiment includes two variations of the churn benchmark, where a data array is added. 496 496 In both variations, before @V@ing the semaphore, each \at calls a @work@ function which increments random cells inside the data array. … … 499 499 Figure~\ref{fig:locality:code} shows pseudo code for this benchmark. 500 500 501 The objective here is to highlight the different decision made by the runtime when unparking.501 The objective here is to highlight the different decision made by the runtime when \glslink{atsched}{unparking}. 502 502 Since each thread unparks a random semaphore, it means that it is unlikely that a \at is unparked from the last \proc it ran on. 503 In the noshare variation, unparkingthe \at on the local \proc is an appropriate choice since the data was last modified on that \proc.504 In the shared variation, unparkingthe \at on a remote \proc is an appropriate choice.505 506 The expectation for this benchmark is to see a performance inversion, where runtimes fare notably better in the variation which matches their unparkingpolicy.503 In the noshare variation, \glslink{atsched}{unparking} the \at on the local \proc is an appropriate choice since the data was last modified on that \proc. 504 In the shared variation, \glslink{atsched}{unparking} the \at on a remote \proc is an appropriate choice. 505 506 The expectation for this benchmark is to see a performance inversion, where runtimes fare notably better in the variation which matches their \glslink{atsched}{unparking} policy. 507 507 This decision should lead to \CFA, Go and Tokio achieving better performance in the share variation while libfibre achieves better performance in noshare. 508 Indeed, \CFA, Go and Tokio have the default policy of unparking \ats on the local \proc, where as libfibre has the default policy of unparking\ats wherever they last ran.508 Indeed, \CFA, Go and Tokio have the default policy of \glslink{atsched}{unparking} \ats on the local \proc, where as libfibre has the default policy of \glslink{atsched}{unparking} \ats wherever they last ran. 509 509 510 510 \begin{figure} … … 554 554 \vrule 555 555 \hspace{3pt} 556 \subfloat[Share]{\label{fig:locality:code:T 1}\usebox\myboxB}556 \subfloat[Share]{\label{fig:locality:code:T2}\usebox\myboxB} 557 557 558 558 \caption[Locality Benchmark : Pseudo Code]{Locality Benchmark : Pseudo Code} … … 566 566 Looking at the left column on Intel, Figures~\ref{fig:locality:jax:share:ops} and \ref{fig:locality:jax:share:ns} show the results for the share variation. 567 567 \CFA and Tokio slightly outperform libfibre, as expected, based on their \ats placement approach. 568 \CFA and Tokio both unpark locally and do not suffer cache misses on the transferred array.568 \CFA and Tokio both \unpark locally and do not suffer cache misses on the transferred array. 569 569 Libfibre on the other hand unparks remotely, and as such the unparked \at is likely to miss on the shared data. 570 570 Go trails behind in this experiment, presumably for the same reasons that were observable in the churn benchmark. … … 640 640 Indeed, in this case, unparking remotely means the unparked \at is less likely to suffer a cache miss on the array, which leaves the \at data structure and the remote queue as the only source of likely cache misses. 641 641 Results show both are amortized fairly well in this case. 642 \CFA and Tokio both unpark locally and as a result suffer a marginal performance degradation from the cache miss on the array.642 \CFA and Tokio both \unpark locally and as a result suffer a marginal performance degradation from the cache miss on the array. 643 643 644 644 Looking at the results for the AMD architecture, Figure~\ref{fig:locality:nasus}, shows results similar to the Intel. … … 651 651 Go still has the same poor performance. 652 652 653 Overall, this benchmark mostly demonstrates the two options available when unparkinga \at.653 Overall, this benchmark mostly demonstrates the two options available when \glslink{atsched}{unparking} a \at. 654 654 Depending on the workload, either of these options can be the appropriate one. 655 655 Since it is prohibitively difficult to dynamically detect which approach is appropriate, all runtimes much choose one of the two and live with the consequences. 
- 
      doc/theses/thierry_delisle_PhD/thesis/text/existing.texr7a0f798b ra44514e 7 7 Workloads that are well-known, consistent, and homogeneous can benefit from a scheduler that is optimized to use this information, while ill-defined, inconsistent, heterogeneous workloads require general non-optimal algorithms. 8 8 A secondary aspect is how much information can be gathered versus how much information must be given as part of the scheduler input. 9 This information adds to the spectrum of scheduling algorithms, going from static schedulers that are well 10 Note, this description includes both information about each request s, \eg time to complete or resources needed, and information about the relationships among request, \eg whether or notsome request must be completed before another request starts.9 This information adds to the spectrum of scheduling algorithms, going from static schedulers that are well-informed from the start, to schedulers that gather most of the information needed, to schedulers that can only rely on very limited information. 10 Note, this description includes both information about each request, \eg time to complete or resources needed, and information about the relationships among request, \eg whether some request must be completed before another request starts. 11 11 12 12 Scheduling physical resources, \eg in an assembly line, is generally amenable to using well-informed scheduling, since information can be gathered much faster than the physical resources can be assigned and workloads are likely to stay stable for long periods of time. … … 29 29 \newterm{Dynamic schedulers} determine \ats dependencies and costs during scheduling, if at all. 30 30 Hence, unlike static scheduling, \ats dependencies are conditional and detected at runtime. 31 This detection takes the form of observing new \ats (s) in the system and determining dependencies from their behaviour, including suspending or halting a \atsthat dynamically detects unfulfilled dependencies.32 Furthermore, each \at shas the responsibility of adding dependent \ats back into the system once dependencies are fulfilled.31 This detection takes the form of observing new \ats in the system and determining dependencies from their behaviour, including suspending or halting a \at that dynamically detects unfulfilled dependencies. 32 Furthermore, each \at has the responsibility of adding dependent \ats back into the system once dependencies are fulfilled. 33 33 As a consequence, the scheduler often has an incomplete view of the system, seeing only \ats with no pending dependencies. 34 34 35 35 \subsection{Explicitly Informed Dynamic Schedulers} 36 While dynamic schedulers may not have an exhaustive list of dependencies for a \at s, some information may be available about each \ats, \eg expected duration, required resources, relative importance, \etc.36 While dynamic schedulers may not have an exhaustive list of dependencies for a \at, some information may be available about each \at, \eg expected duration, required resources, relative importance, \etc. 37 37 When available, a scheduler can then use this information to direct the scheduling decisions. 38 38 For example, when scheduling in a cloud computing context, \ats will commonly have extra information that was manually entered, \eg caps on compute time or \io usage. 39 39 However, in the context of user-level threading, most programmers do not determine or even \emph{predict} this information; 40 at best, the scheduler has only some imprecise information provided by the programmer, \eg, indicating a \at stakes approximately 3--7 seconds to complete, rather than exactly 5 seconds.41 Providing this kind of information is a significant programmer burden especially if the information does not scale with the number of \ats and their complexity.40 at best, the scheduler has only some imprecise information provided by the programmer, \eg, indicating a \at takes approximately 3--7 seconds to complete, rather than exactly 5 seconds. 41 Providing this kind of information is a significant programmer burden, especially if the information does not scale with the number of \ats and their complexity. 42 42 For example, providing an exhaustive list of files read by 5 \ats is an easier requirement then providing an exhaustive list of memory addresses accessed by 10,000 independent \ats. 43 43 … … 46 46 \subsubsection{Priority Scheduling} 47 47 Common information used by schedulers to direct their algorithm is priorities. 48 Each \at s is given a priorityand higher-priority \ats are preferred to lower-priority ones.49 The simplest priority scheduling algorithm is to require that every \at shave a distinct pre-established priority and always run the available \ats with the highest priority.48 Each \at is given a priority, and higher-priority \ats are preferred to lower-priority ones. 49 The simplest priority scheduling algorithm is to require that every \at have a distinct pre-established priority and always run the available \ats with the highest priority. 50 50 Asking programmers to provide an exhaustive set of unique priorities can be prohibitive when the system has a large number of \ats. 51 51 It can therefore be desirable for schedulers to support \ats with identical priorities and/or automatically setting and adjusting priorities for \ats. … … 54 54 55 55 \subsection{Uninformed and Self-Informed Dynamic Schedulers} 56 Several scheduling algorithms do not require programmers to provide additional information on each \at s, and instead make scheduling decisions based solely on internal state and/or information implicitly gathered by the scheduler.56 Several scheduling algorithms do not require programmers to provide additional information on each \at, and instead make scheduling decisions based solely on internal state and/or information implicitly gathered by the scheduler. 57 57 58 58 59 59 \subsubsection{Feedback Scheduling} 60 As mentioned, schedulers may also gather information about each \at sto direct their decisions.60 As mentioned, schedulers may also gather information about each \at to direct their decisions. 61 61 This design effectively moves the scheduler into the realm of \newterm{Control Theory}~\cite{wiki:controltheory}. 62 62 This information gathering does not generally involve programmers, and as such, does not increase programmer burden the same way explicitly provided information may. 63 63 However, some feedback schedulers do allow programmers to offer additional information on certain \ats, in order to direct scheduling decisions. 64 The important distinction being whether or notthe scheduler can function without this additional information.64 The important distinction being whether the scheduler can function without this additional information. 65 65 66 66 67 67 \section{Work Stealing}\label{existing:workstealing} 68 68 One of the most popular scheduling algorithm in practice (see~\ref{existing:prod}) is work stealing. 69 This idea, introduce by \cite{DBLP:conf/fpca/BurtonS81}, effectively has each worker process its local \ats first, but allows the possibility for other workers to steal local \ats if they run out of \ats.70 \cite{DBLP:conf/focs/Blumofe94} introduced the more familiar incarnation of this, where each worker s has a queue of \ats and workers without \ats steal \ats from random workers\footnote{The Burton and Sleep algorithm had trees of \ats and stealonly among neighbours.}.69 This idea, introduced by \cite{DBLP:conf/fpca/BurtonS81}, effectively has each worker process its local \ats first, but allows the possibility for other workers to steal local \ats if they run out of \ats. 70 \cite{DBLP:conf/focs/Blumofe94} introduced the more familiar incarnation of this, where each worker has a queue of \ats and workers without \ats steal \ats from random workers\footnote{The Burton and Sleep algorithm has trees of \ats and steals only among neighbours.}. 71 71 Blumofe and Leiserson also prove worst case space and time requirements for well-structured computations. 72 72 … … 82 82 In its simplest form, work stealing assumes that all \procs are interchangeable and therefore the mapping between \at and \proc is not interesting. 83 83 However, in real-life architectures there are contexts where different \procs can have different characteristics, which makes some mapping more interesting than others. 84 A n common example where this is statically true is architectures with \acrshort{numa}.85 In these cases, it can be relevant to change the scheduler to be cogniz ent of the topology~\cite{vikranth2013topology,min2011hierarchical}.84 A common example where this is statically true is architectures with \glsxtrshort{numa}. 85 In these cases, it can be relevant to change the scheduler to be cognizant of the topology~\cite{vikranth2013topology,min2011hierarchical}. 86 86 Another example is energy usage, where the scheduler is modified to optimize for energy efficiency in addition/instead of performance~\cite{ribic2014energy,torng2016asymmetry}. 87 87 88 88 \paragraph{Complex Machine Architecture} Another aspect that has been examined is how well work stealing is applicable to different machine architectures. 89 This is arguably strongly related to Task Placement but extends into more heterogeneous architectures.90 As \CFA offers no particular support for heterogeneous architecture, this is also a area that is less relevant to this thesis.91 Although tit could be an interesting avenue for future work.89 This is arguably strongly related to Task Placement, but extends into more heterogeneous architectures. 90 As \CFA offers no particular support for heterogeneous architecture, this is also an area that is less relevant to this thesis. 91 Although it could be an interesting avenue for future work. 92 92 93 93 \subsection{Theoretical Results} 94 There is also a large body of research on the theoretical aspects of work stealing. These evaluate, for example, the cost of migration~\cite{DBLP:conf/sigmetrics/SquillanteN91,DBLP:journals/pe/EagerLZ86}, how affinity affects performance~\cite{DBLP:journals/tpds/SquillanteL93,DBLP:journals/mst/AcarBB02,DBLP:journals/ipl/SuksompongLS16} and theoretical models for heterogeneous systems~\cite{DBLP:journals/jpdc/MirchandaneyTS90,DBLP:journals/mst/BenderR02,DBLP:conf/sigmetrics/GastG10}.94 There is also a large body of research on the theoretical aspects of work stealing. These evaluate, for example, the cost of \glslink{atmig}{migration}~\cite{DBLP:conf/sigmetrics/SquillanteN91,DBLP:journals/pe/EagerLZ86}, how affinity affects performance~\cite{DBLP:journals/tpds/SquillanteL93,DBLP:journals/mst/AcarBB02,DBLP:journals/ipl/SuksompongLS16} and theoretical models for heterogeneous systems~\cite{DBLP:journals/jpdc/MirchandaneyTS90,DBLP:journals/mst/BenderR02,DBLP:conf/sigmetrics/GastG10}. 95 95 \cite{DBLP:journals/jacm/BlellochGM99} examines the space bounds of work stealing and \cite{DBLP:journals/siamcomp/BerenbrinkFG03} shows that for under-loaded systems, the scheduler completes its computations in finite time, \ie is \newterm{stable}. 96 96 Others show that work stealing is applicable to various scheduling contexts~\cite{DBLP:journals/mst/AroraBP01,DBLP:journals/anor/TchiboukdjianGT13,DBLP:conf/isaac/TchiboukdjianGTRB10,DBLP:conf/ppopp/AgrawalLS10,DBLP:conf/spaa/AgrawalFLSSU14}. … … 101 101 102 102 \section{Preemption} 103 One last aspect of scheduling is preemption since many schedulers rely on it for some of their guarantees.103 One last aspect of scheduling is preemption, since many schedulers rely on it for some of their guarantees. 104 104 Preemption is the idea of interrupting \ats that have been running too long, effectively injecting suspend points into the application. 105 There are multiple techniques to achieve this effect but they all aim to guarantee that the suspend points in a \ats are never further apart than some fixed duration.105 There are multiple techniques to achieve this effect, but they all aim to guarantee that the suspend points in a \ats are never further apart than some fixed duration. 106 106 While this helps schedulers guarantee that no \ats unfairly monopolizes a worker, preemption can effectively be added to any scheduler. 107 Therefore, the only interesting aspect of preemption for the design of scheduling is whether or notto require it.107 Therefore, the only interesting aspect of preemption for the design of scheduling is whether to require it. 108 108 109 109 \section{Production Schedulers}\label{existing:prod} … … 122 122 The default scheduler used by Linux, the Completely Fair Scheduler~\cite{MAN:linux/cfs,MAN:linux/cfs2}, is a feedback scheduler based on CPU time. 123 123 For each processor, it constructs a Red-Black tree of \ats waiting to run, ordering them by the amount of CPU time used. 124 The \at sthat has used the least CPU time is scheduled.124 The \at that has used the least CPU time is scheduled. 125 125 It also supports the concept of \newterm{Nice values}, which are effectively multiplicative factors on the CPU time used. 126 126 The ordering of \ats is also affected by a group based notion of fairness, where \ats belonging to groups having used less CPU time are preferred to \ats belonging to groups having used more CPU time. 127 Linux achieves load-balancing by regularly monitoring the system state~\cite{MAN:linux/cfs/balancing} and using some heuristic on the load, currently CPU time used in the last millisecond plus a decayed version of the previous time slots~\cite{MAN:linux/cfs/pelt}.128 129 \cite{DBLP:conf/eurosys/LoziLFGQF16} shows that Linux's CFS also does work stealing to balance the workload of each processors, but the paper argues this aspect can be improved significantly.130 The issues highlighted stem from Linux's need to support fairness across \ats \emph{and} across users\footnote{Enforcing fairness across users means that given two users, one with a single \at s and the other with one thousand \ats, the user with a single \atsdoes not receive one thousandth of the CPU time.}, increasing the complexity.127 Linux achieves load-balancing by regularly monitoring the system state~\cite{MAN:linux/cfs/balancing} and using some heuristic on the \gls{load}, currently CPU time used in the last millisecond plus a decayed version of the previous time slots~\cite{MAN:linux/cfs/pelt}. 128 129 \cite{DBLP:conf/eurosys/LoziLFGQF16} shows that Linux's CFS also does work stealing to balance the workload of each \proc, but the paper argues this aspect can be improved significantly. 130 The issues highlighted stem from Linux's need to support fairness across \ats \emph{and} across users\footnote{Enforcing fairness across users means that given two users, one with a single \at and the other with one thousand \ats, the user with a single \at does not receive one thousandth of the CPU time.}, increasing the complexity. 131 131 132 132 Linux also offers a FIFO scheduler, a real-time scheduler, which runs the highest-priority \ats, and a round-robin scheduler, which is an extension of the FIFO-scheduler that adds fixed time slices. \cite{MAN:linux/sched} … … 140 140 Microsoft's Operating System's Scheduler~\cite{MAN:windows/scheduler} is a feedback scheduler with priorities. 141 141 It supports 32 levels of priorities, some of which are reserved for real-time and privileged applications. 142 It schedules \ats based on the highest priorities (lowest number) and how much CPU time each \at shas used.142 It schedules \ats based on the highest priorities (lowest number) and how much CPU time each \at has used. 143 143 The scheduler may also temporarily adjust priorities after certain effects like the completion of I/O requests. 144 144 … … 184 184 Erlang is a functional language that supports concurrency in the form of processes: threads that share no data. 185 185 It uses a kind of round-robin scheduler, with a mix of work sharing and stealing to achieve load balancing~\cite{:erlang}, where under-loaded workers steal from other workers, but overloaded workers also push work to other workers. 186 This migrationlogic is directed by monitoring logic that evaluates the load a few times per seconds.186 This \glslink{atmig}{migration} logic is directed by monitoring logic that evaluates the load a few times per seconds. 187 187 188 188 \paragraph{Intel\textregistered ~Threading Building Blocks} 189 189 \newterm{Thread Building Blocks} (TBB) is Intel's task parallelism \cite{wiki:taskparallel} framework. 190 It runs \newterm{jobs}, which are uninterrupt able \ats that must always run to completion, on a pool of worker threads.190 It runs \newterm{jobs}, which are uninterruptible \ats that must always run to completion, on a pool of worker threads. 191 191 TBB's scheduler is a variation of randomized work-stealing that also supports higher-priority graph-like dependencies~\cite{MAN:tbb/scheduler}. 192 It schedules \ats as follows (where \textit{t} is the last \at scompleted):192 It schedules \ats as follows (where \textit{t} is the last \at completed): 193 193 \begin{displayquote} 194 194 \begin{enumerate} 
- 
      doc/theses/thierry_delisle_PhD/thesis/text/front.texr7a0f798b ra44514e 124 124 125 125 User-Level threading (M:N) is gaining popularity over kernel-level threading (1:1) in many programming languages. 126 The user threading approach is often a better mechanism to express complex concurrent applications by efficiently running 10,000+ threads on multi -core systems.126 The user threading approach is often a better mechanism to express complex concurrent applications by efficiently running 10,000+ threads on multicore systems. 127 127 Indeed, over-partitioning into small work-units with user threading significantly eases load bal\-ancing, while simultaneously providing advanced synchronization and mutual exclusion capabilities. 128 128 To manage these high levels of concurrency, the underlying runtime must efficiently schedule many user threads across a few kernel threads; … … 135 135 This thesis analyses multiple scheduler systems, where each system attempts to fulfill the necessary requirements for user-level threading. 136 136 The predominant technique for managing high levels of concurrency is sharding the ready-queue with one queue per kernel-thread and using some form of work stealing/sharing to dynamically rebalance workload shifts. 137 Preventing kernel blocking is accomplish by transforming kernel locks and I/O operations into user-level operations that do not block the kernel thread or spin up new kernel threads to manage the blocking.137 Preventing kernel blocking is accomplished by transforming kernel locks and I/O operations into user-level operations that do not block the kernel thread or spin up new kernel threads to manage the blocking. 138 138 Fairness is handled through preemption and/or ad-hoc solutions, which leads to coarse-grained fairness with some pathological cases. 139 139 … … 146 146 The new scheduler also includes support for implicit nonblocking \io, allowing applications to have more user-threads blocking on \io operations than there are \glspl{kthrd}. 147 147 The implementation is based on @io_uring@, a recent addition to the Linux kernel, and achieves the same performance and fairness as systems using @select@, @epoll@, \etc. 148 To complete the scheduler, an idle sleep mechanism is implemented that significantly reduces wasted CPU cycles, which are then available outside ofthe application.148 To complete the scheduler, an idle sleep mechanism is implemented that significantly reduces wasted CPU cycles, which are then available outside the application. 149 149 150 150 \cleardoublepage 
- 
      doc/theses/thierry_delisle_PhD/thesis/text/intro.texr7a0f798b ra44514e 2 2 3 3 \Gls{uthrding} (M:N) is gaining popularity over kernel-level threading (1:1) in many programming languages. 4 The user threading approach is often a better mechanism to express complex concurrent applications by efficiently running 10,000+ threads on multi -core systems.4 The user threading approach is often a better mechanism to express complex concurrent applications by efficiently running 10,000+ threads on multicore systems. 5 5 Indeed, over-partitioning into small work-units with user threading significantly eases load bal\-ancing, while simultaneously providing advanced synchronization and mutual exclusion capabilities. 6 6 To manage these high levels of concurrency, the underlying runtime must efficiently schedule many user threads across a few kernel threads; 7 which begs ofthe question of how many kernel threads are needed and should the number be dynamically reevaluated.7 which begs the question of how many kernel threads are needed and should the number be dynamically reevaluated. 8 8 Furthermore, scheduling must prevent kernel threads from blocking, otherwise user-thread parallelism drops. 9 9 When user-threading parallelism does drop, how and when should idle kernel-threads be put to sleep to avoid wasting CPU resources. … … 13 13 This thesis analyses multiple scheduler systems, where each system attempts to fulfill the necessary requirements for \gls{uthrding}. 14 14 The predominant technique for managing high levels of concurrency is sharding the ready-queue with one queue per kernel-thread and using some form of work stealing/sharing to dynamically rebalance workload shifts. 15 Preventing kernel blocking is accomplish by transforming kernel locks and I/O operations into user-level operations that do not block the kernel thread or spin up new kernel threads to manage the blocking.15 Preventing kernel blocking is accomplished by transforming kernel locks and I/O operations into user-level operations that do not block the kernel thread or spin up new kernel threads to manage the blocking. 16 16 Fairness is handled through preemption and/or ad-hoc solutions, which leads to coarse-grained fairness with some pathological cases. 17 17 18 18 After examining, testing and selecting specific approaches to these scheduling issues, a completely new scheduler was created and tested in the \CFA (C-for-all) user-threading runtime-system. 19 19 The goal of the new scheduler is to offer increased safety and productivity without sacrificing performance. 20 The quality of the new scheduler is demonstrated by comparing it with other user-threading work-stealing schedulers with the aim of showing equivalent or better performance while offering better fairness.20 The quality of the new scheduler is demonstrated by comparing it with other user-threading work-stealing schedulers with, the aim of showing equivalent or better performance while offering better fairness. 21 21 22 22 Chapter~\ref{intro} defines scheduling and its general goals. 23 23 Chapter~\ref{existing} discusses how scheduler implementations attempt to achieve these goals, but all implementations optimize some workloads better than others. 24 Chapter~\ref{cfaruntime} presents the relevant aspects of the \CFA runtime system that have a significant affect on the new scheduler design and implementation.24 Chapter~\ref{cfaruntime} presents the relevant aspects of the \CFA runtime system that have a significant effect on the new scheduler design and implementation. 25 25 Chapter~\ref{core} analyses different scheduler approaches, while looking for scheduler mechanisms that provide both performance and fairness. 26 26 Chapter~\ref{userio} covers the complex mechanisms that must be used to achieve nonblocking I/O to prevent the blocking of \glspl{kthrd}. … … 31 31 \section{Scheduling}\label{sched} 32 32 Computer systems share multiple resources across many threads of execution, even on single-user computers like laptops or smartphones. 33 On a computer system with multiple processors and work units (routines, coroutines, threads, programs, \etc), there exists the problem of mapping many different kinds of work units onto many different kinds of processors in an efficient manner, called \newterm{scheduling}.33 On a computer system with multiple processors and work units (routines, coroutines, threads, programs, \etc), there exists the problem of mapping many different kinds of work units onto many different kinds of processors efficiently, called \newterm{scheduling}. 34 34 Scheduling systems are normally \newterm{open}, meaning new work arrives from an external source or is randomly spawned from an existing work unit. 35 35 In general, work units without threads, like routines and coroutines, are self-scheduling, while work units with threads, like tasks and programs, are scheduled. … … 39 39 However, optimal solutions are often not required: schedulers often produce excellent solutions, without needing optimality, by taking advantage of regularities in work patterns. 40 40 41 Scheduling occurs at discre etpoints when there are transitions in a system.42 For example, a threadcycles through the following transitions during its execution.41 Scheduling occurs at discrete points when there are transitions in a system. 42 For example, a \at cycles through the following transitions during its execution. 43 43 \begin{center} 44 44 \input{executionStates.pstex_t} … … 49 49 entering the system (new $\rightarrow$ ready) 50 50 \item 51 scheduler assigns a threadto a computing resource, \eg CPU (ready $\rightarrow$ running)51 scheduler assigns a \at to a computing resource, \eg CPU (ready $\rightarrow$ running) 52 52 \item 53 53 timer alarm for preemption (running $\rightarrow$ ready) … … 59 59 normal completion or error, \eg segment fault (running $\rightarrow$ halted) 60 60 \end{itemize} 61 Key to scheduling is that a threadcannot bypass the ``ready'' state during a transition so the scheduler maintains complete control of the system, \ie no self-scheduling among threads.61 Key to scheduling is that a \at cannot bypass the ``ready'' state during a transition so the scheduler maintains complete control of the system, \ie no self-scheduling among threads. 62 62 63 63 When the workload exceeds the capacity of the processors, \ie work cannot be executed immediately, it is placed on a queue for subsequent service, called a \newterm{ready queue}. … … 71 71 \end{tabular} 72 72 \end{center} 73 Beyond these two schedulers are a host of options, \eg adding a n global shared queue to MQMS or adding multiple private queues with distinccharacteristics.73 Beyond these two schedulers are a host of options, \eg adding a global shared queue to MQMS or adding multiple private queues with distinct characteristics. 74 74 75 75 Once there are multiple resources and ready queues, a scheduler is faced with three major optimization criteria: … … 86 86 Essentially, all multi-processor computers have non-uniform memory access (NUMA), with one or more quantized steps to access data at different levels in the memory hierarchy. 87 87 When a system has a large number of independently executing threads, affinity becomes difficult because of \newterm{thread churn}. 88 That is, threads must be scheduled on different processors to obtain high processor sutilization because the number of threads $\ggg$ processors.88 That is, threads must be scheduled on different processors to obtain high processor utilization because the number of threads $\ggg$ processors. 89 89 90 90 \item … … 118 118 More specifically, safety and productivity for scheduling means supporting a wide range of workloads so that programmers can rely on progress guarantees (safety) and more easily achieve acceptable performance (productivity). 119 119 The new scheduler also includes support for implicit nonblocking \io, allowing applications to have more user-threads blocking on \io operations than there are \glspl{kthrd}. 120 To complete the scheduler, an idle sleep mechanism is implemented that significantly reduces wasted CPU cycles, which are then available outside ofthe application.120 To complete the scheduler, an idle sleep mechanism is implemented that significantly reduces wasted CPU cycles, which are then available outside the application. 121 121 122 122 As a research project, this work builds exclusively on newer versions of the Linux operating-system and gcc/clang compilers. 
- 
      doc/theses/thierry_delisle_PhD/thesis/text/io.texr7a0f798b ra44514e 1 1 \chapter{User Level \io}\label{userio} 2 As mentioned in Section~\ref{prev:io}, user-level \io requires multiplexing the \io operations of many \ glspl{thrd}onto fewer \glspl{proc} using asynchronous \io operations.2 As mentioned in Section~\ref{prev:io}, user-level \io requires multiplexing the \io operations of many \ats onto fewer \glspl{proc} using asynchronous \io operations. 3 3 Different operating systems offer various forms of asynchronous operations and, as mentioned in Chapter~\ref{intro}, this work is exclusively focused on the Linux operating-system. 4 4 … … 14 14 It does not mean an operation returning \lstinline{EAGAIN} succeeds on the next try. 15 15 For example, a ready read may only return a subset of requested bytes and the read must be issues again for the remaining bytes, at which point it may return \lstinline{EAGAIN}.} 16 This mechanism is also crucial in determining when all \ glspl{thrd}are blocked and the application \glspl{kthrd} can now block.16 This mechanism is also crucial in determining when all \ats are blocked and the application \glspl{kthrd} can now block. 17 17 18 18 There are three options to monitor file descriptors in Linux:\footnote{ … … 33 33 Often the I/O manager has a timeout, polls, or is sent a signal on changes to mitigate this problem. 34 34 35 \begin{comment}36 From: Tim Brecht <brecht@uwaterloo.ca>37 Subject: Re: FD sets38 Date: Wed, 6 Jul 2022 00:29:41 +000039 40 Large number of open files41 --------------------------42 43 In order to be able to use more than the default number of open file44 descriptors you may need to:45 46 o increase the limit on the total number of open files /proc/sys/fs/file-max47 (on Linux systems)48 49 o increase the size of FD_SETSIZE50 - the way I often do this is to figure out which include file __FD_SETSIZE51 is defined in, copy that file into an appropriate directory in ./include,52 and then modify it so that if you use -DBIGGER_FD_SETSIZE the larger size53 gets used54 55 For example on a RH 9.0 distribution I've copied56 /usr/include/bits/typesizes.h into ./include/i386-linux/bits/typesizes.h57 58 Then I modify typesizes.h to look something like:59 60 #ifdef BIGGER_FD_SETSIZE61 #define __FD_SETSIZE 3276762 #else63 #define __FD_SETSIZE 102464 #endif65 66 Note that the since I'm moving and testing the userver on may different67 machines the Makefiles are set up to use -I ./include/$(HOSTTYPE)68 69 This way if you redefine the FD_SETSIZE it will get used instead of the70 default original file.71 \end{comment}35 % \begin{comment} 36 % From: Tim Brecht <brecht@uwaterloo.ca> 37 % Subject: Re: FD sets 38 % Date: Wed, 6 Jul 2022 00:29:41 +0000 39 40 % Large number of open files 41 % -------------------------- 42 43 % In order to be able to use more than the default number of open file 44 % descriptors you may need to: 45 46 % o increase the limit on the total number of open files /proc/sys/fs/file-max 47 % (on Linux systems) 48 49 % o increase the size of FD_SETSIZE 50 % - the way I often do this is to figure out which include file __FD_SETSIZE 51 % is defined in, copy that file into an appropriate directory in ./include, 52 % and then modify it so that if you use -DBIGGER_FD_SETSIZE the larger size 53 % gets used 54 55 % For example on a RH 9.0 distribution I've copied 56 % /usr/include/bits/typesizes.h into ./include/i386-linux/bits/typesizes.h 57 58 % Then I modify typesizes.h to look something like: 59 60 % #ifdef BIGGER_FD_SETSIZE 61 % #define __FD_SETSIZE 32767 62 % #else 63 % #define __FD_SETSIZE 1024 64 % #endif 65 66 % Note that the since I'm moving and testing the userver on may different 67 % machines the Makefiles are set up to use -I ./include/$(HOSTTYPE) 68 69 % This way if you redefine the FD_SETSIZE it will get used instead of the 70 % default original file. 71 % \end{comment} 72 72 73 73 \paragraph{\lstinline{poll}} is the next oldest option, and takes as input an array of structures containing the FD numbers rather than their position in an array of bits, allowing a more compact input for interest sets that contain widely spaced FDs. … … 139 139 \subsection{Extra Kernel Threads}\label{io:morethreads} 140 140 Finally, if the operating system does not offer a satisfactory form of asynchronous \io operations, an ad-hoc solution is to create a pool of \glspl{kthrd} and delegate operations to it to avoid blocking \glspl{proc}, which is a compromise for multiplexing. 141 In the worst case, where all \ glspl{thrd}are consistently blocking on \io, it devolves into 1-to-1 threading.142 However, regardless of the frequency of \io operations, it achieves the fundamental goal of not blocking \glspl{proc} when \ glspl{thrd}are ready to run.141 In the worst case, where all \ats are consistently blocking on \io, it devolves into 1-to-1 threading. 142 However, regardless of the frequency of \io operations, it achieves the fundamental goal of not blocking \glspl{proc} when \ats are ready to run. 143 143 This approach is used by languages like Go~\cite{GITHUB:go}, frameworks like libuv~\cite{libuv}, and web servers like Apache~\cite{apache} and NGINX~\cite{nginx}, since it has the advantage that it can easily be used across multiple operating systems. 144 144 This advantage is especially relevant for languages like Go, which offer a homogeneous \glsxtrshort{api} across all platforms. … … 155 155 \section{Event-Engine} 156 156 An event engine's responsibility is to use the kernel interface to multiplex many \io operations onto few \glspl{kthrd}. 157 In concrete terms, this means \ glspl{thrd} enter the engine through an interface, the event engine then starts an operation and parks the calling \glspl{thrd}, returning control to the \gls{proc}.158 The parked \ glspl{thrd}are then rescheduled by the event engine once the desired operation has completed.157 In concrete terms, this means \ats enter the engine through an interface, the event engine then starts an operation and parks the calling \ats, returning control to the \gls{proc}. 158 The parked \ats are then rescheduled by the event engine once the desired operation has completed. 159 159 160 160 \subsection{\lstinline{io_uring} in depth}\label{iouring} … … 268 268 \subsubsection{Private Instances} 269 269 The private approach creates one ring instance per \gls{proc}, \ie one-to-one coupling. 270 This alleviates the need for synchronization on the submissions, requiring only that \ glspl{thrd}are not time-sliced during submission steps.271 This requirement is the same as accessing @thread_local@ variables, where a \ gls{thrd}is accessing kernel-thread data, is time-sliced, and continues execution on another kernel thread but is now accessing the wrong data.270 This alleviates the need for synchronization on the submissions, requiring only that \ats are not time-sliced during submission steps. 271 This requirement is the same as accessing @thread_local@ variables, where a \at is accessing kernel-thread data, is time-sliced, and continues execution on another kernel thread but is now accessing the wrong data. 272 272 This failure is the serially reusable problem~\cite{SeriallyReusable}. 273 273 Hence, allocated SQEs must be submitted to the same ring on the same \gls{proc}, which effectively forces the application to submit SQEs in allocation order.\footnote{ 274 To remove this requirement, a \ gls{thrd} needs the ability to ``yield to a specific \gls{proc}'', \ie,park with the guarantee it unparks on a specific \gls{proc}, \ie the \gls{proc} attached to the correct ring.}274 To remove this requirement, a \at needs the ability to ``yield to a specific \gls{proc}'', \ie, \park with the guarantee it unparks on a specific \gls{proc}, \ie the \gls{proc} attached to the correct ring.} 275 275 From the subsystem's point of view, the allocation and submission are sequential, greatly simplifying both. 276 276 In this design, allocation and submission form a partitioned ring buffer as shown in Figure~\ref{fig:pring}. 277 277 Once added to the ring buffer, the attached \gls{proc} has a significant amount of flexibility with regards to when to perform the system call. 278 Possible options are: when the \gls{proc} runs out of \ glspl{thrd} to run, after running a given number of \glspl{thrd}, \etc.278 Possible options are: when the \gls{proc} runs out of \ats to run, after running a given number of \ats, \etc. 279 279 280 280 \begin{figure} … … 288 288 289 289 This approach has the advantage that it does not require much of the synchronization needed in a shared approach. 290 However, this benefit means \ glspl{thrd} submitting \io operations have less flexibility: they cannotpark or yield, and several exceptional cases are handled poorly.291 Instances running out of SQEs cannot run \ glspl{thrd}wanting to do \io operations.292 In this case, the \io \ gls{thrd}needs to be moved to a different \gls{proc}, and the only current way of achieving this is to @yield()@ hoping to be scheduled on a different \gls{proc} with free SQEs, which is not guaranteed.290 However, this benefit means \ats submitting \io operations have less flexibility: they cannot \park or yield, and several exceptional cases are handled poorly. 291 Instances running out of SQEs cannot run \ats wanting to do \io operations. 292 In this case, the \io \at needs to be moved to a different \gls{proc}, and the only current way of achieving this is to @yield()@ hoping to be scheduled on a different \gls{proc} with free SQEs, which is not guaranteed. 293 293 294 294 A more involved version of this approach tries to solve these problems using a pattern called \newterm{helping}. 295 \ Glspl{thrd} that cannot submit \io operations, either because of an allocation failure or migrationto a different \gls{proc} between allocation and submission, create an \io object and add it to a list of pending submissions per \gls{proc} and a list of pending allocations, probably per cluster.296 While there is still the strong coupling between \glspl{proc} and @io_uring@ instances, these data structures allow moving \ glspl{thrd}to a specific \gls{proc}, when the current \gls{proc} cannot fulfill the \io request.297 298 Imagine a simple scenario with two \ glspl{thrd} on two \glspl{proc}, where one \gls{thrd} submits an \io operation and then sets a flag, while the other \gls{thrd}spins until the flag is set.299 Assume both \ glspl{thrd} are running on the same \gls{proc}, and the \io \gls{thrd} is preempted between allocation and submission, moved to the second \gls{proc}, and the original \gls{proc} starts running the spinning \gls{thrd}.300 In this case, the helping solution has the \io \ gls{thrd}append an \io object to the submission list of the first \gls{proc}, where the allocation was made.301 No other \gls{proc} can help the \ gls{thrd}since @io_uring@ instances are strongly coupled to \glspl{proc}.302 However, the \io \gls{proc} is unable to help because it is executing the spinning \ gls{thrd}resulting in a deadlock.303 While this example is artificial, in the presence of many \ glspl{thrd}, it is possible for this problem to arise ``in the wild''.295 \ats that cannot submit \io operations, either because of an allocation failure or \glslink{atmig}{migration} to a different \gls{proc} between allocation and submission, create an \io object and add it to a list of pending submissions per \gls{proc} and a list of pending allocations, probably per cluster. 296 While there is still the strong coupling between \glspl{proc} and @io_uring@ instances, these data structures allow moving \ats to a specific \gls{proc}, when the current \gls{proc} cannot fulfill the \io request. 297 298 Imagine a simple scenario with two \ats on two \glspl{proc}, where one \at submits an \io operation and then sets a flag, while the other \at spins until the flag is set. 299 Assume both \ats are running on the same \gls{proc}, and the \io \at is preempted between allocation and submission, moved to the second \gls{proc}, and the original \gls{proc} starts running the spinning \at. 300 In this case, the helping solution has the \io \at append an \io object to the submission list of the first \gls{proc}, where the allocation was made. 301 No other \gls{proc} can help the \at since @io_uring@ instances are strongly coupled to \glspl{proc}. 302 However, the \io \gls{proc} is unable to help because it is executing the spinning \at resulting in a deadlock. 303 While this example is artificial, in the presence of many \ats, it is possible for this problem to arise ``in the wild''. 304 304 Furthermore, this pattern is difficult to reliably detect and avoid. 305 Once in this situation, the only escape is to interrupted the spinning \ gls{thrd}, either directly or via some regular preemption, \eg time slicing.306 Having to interrupt \ glspl{thrd}for this purpose is costly, the latency can be large between interrupts, and the situation may be hard to detect.305 Once in this situation, the only escape is to interrupted the spinning \at, either directly or via some regular preemption, \eg time slicing. 306 Having to interrupt \ats for this purpose is costly, the latency can be large between interrupts, and the situation may be hard to detect. 307 307 Interrupts are needed here entirely because the \gls{proc} is tied to an instance it is not using. 308 Therefore, a more satisfying solution is for the \ gls{thrd}submitting the operation to notice that the instance is unused and simply go ahead and use it.308 Therefore, a more satisfying solution is for the \at submitting the operation to notice that the instance is unused and simply go ahead and use it. 309 309 This approach is presented shortly. 310 310 311 311 \subsubsection{Public Instances} 312 312 The public approach creates decoupled pools of @io_uring@ instances and processors, \ie without one-to-one coupling. 313 \ Glspl{thrd}attempting an \io operation pick one of the available instances and submit the operation to that instance.314 Since there is no coupling between @io_uring@ instances and \glspl{proc} in this approach, \ glspl{thrd}running on more than one \gls{proc} can attempt to submit to the same instance concurrently.313 \ats attempting an \io operation pick one of the available instances and submit the operation to that instance. 314 Since there is no coupling between @io_uring@ instances and \glspl{proc} in this approach, \ats running on more than one \gls{proc} can attempt to submit to the same instance concurrently. 315 315 Because @io_uring@ effectively sets the amount of sharding needed to avoid contention on its internal locks, performance in this approach is based on two aspects: 316 316 \begin{itemize} … … 327 327 The only added complexity is that the number of SQEs is fixed, which means allocation can fail. 328 328 329 Allocation failures need to be pushed to a routing algorithm: \ glspl{thrd}attempting \io operations must not be directed to @io_uring@ instances without sufficient SQEs available.329 Allocation failures need to be pushed to a routing algorithm: \ats attempting \io operations must not be directed to @io_uring@ instances without sufficient SQEs available. 330 330 Furthermore, the routing algorithm should block operations up-front, if none of the instances have available SQEs. 331 331 332 Once an SQE is allocated, \ glspl{thrd}insert the \io request information, and keep track of the SQE index and the instance it belongs to.332 Once an SQE is allocated, \ats insert the \io request information, and keep track of the SQE index and the instance it belongs to. 333 333 334 334 Once an SQE is filled in, it is added to the submission ring buffer, an operation that is not thread-safe, and then the kernel must be notified using the @io_uring_enter@ system call. … … 338 338 Since multiple SQEs can be submitted to the kernel at once, it is important to strike a balance between batching and latency. 339 339 Operations that are ready to be submitted should be batched together in few system calls, but at the same time, operations should not be left pending for long period of times before being submitted. 340 Balancing submission can be handled by either designating one of the submitting \ glspl{thrd} as the being responsible for the system call for the current batch of SQEs or by having some other party regularly submitting all ready SQEs, \eg, the poller \gls{thrd}mentioned later in this section.341 342 Ideally, when multiple \ glspl{thrd} attempt to submit operations to the same @io_uring@ instance, all requests should be batched together and one of the \glspl{thrd}is designated to do the system call on behalf of the others, called the \newterm{submitter}.340 Balancing submission can be handled by either designating one of the submitting \ats as the being responsible for the system call for the current batch of SQEs or by having some other party regularly submitting all ready SQEs, \eg, the poller \at mentioned later in this section. 341 342 Ideally, when multiple \ats attempt to submit operations to the same @io_uring@ instance, all requests should be batched together and one of the \ats is designated to do the system call on behalf of the others, called the \newterm{submitter}. 343 343 However, in practice, \io requests must be handed promptly so there is a need to guarantee everything missed by the current submitter is seen by the next one. 344 Indeed, as long as there is a ``next'' submitter, \ glspl{thrd}submitting new \io requests can move on, knowing that some future system call includes their request.344 Indeed, as long as there is a ``next'' submitter, \ats submitting new \io requests can move on, knowing that some future system call includes their request. 345 345 Once the system call is done, the submitter must also free SQEs so that the allocator can reused them. 346 346 347 347 Finally, the completion side is much simpler since the @io_uring@ system-call enforces a natural synchronization point. 348 Polling simply needs to regularly do the system call, go through the produced CQEs and communicate the result back to the originating \ glspl{thrd}.348 Polling simply needs to regularly do the system call, go through the produced CQEs and communicate the result back to the originating \ats. 349 349 Since CQEs only own a signed 32 bit result, in addition to the copy of the @user_data@ field, all that is needed to communicate the result is a simple future~\cite{wiki:future}. 350 350 If the submission side does not designate submitters, polling can also submit all SQEs as it is polling events. 351 A simple approach to polling is to allocate a \ gls{thrd} per @io_uring@ instance and simply let the poller \glspl{thrd}poll their respective instances when scheduled.351 A simple approach to polling is to allocate a \at per @io_uring@ instance and simply let the poller \ats poll their respective instances when scheduled. 352 352 353 353 With the pool of SEQ instances approach, the big advantage is that it is fairly flexible. 354 It does not impose restrictions on what \ glspl{thrd}submitting \io operations can and cannot do between allocations and submissions.354 It does not impose restrictions on what \ats submitting \io operations can and cannot do between allocations and submissions. 355 355 It also can gracefully handle running out of resources, SQEs or the kernel returning @EBUSY@. 356 356 The down side to this approach is that many of the steps used for submitting need complex synchronization to work properly. 357 The routing and allocation algorithm needs to keep track of which ring instances have available SQEs, block incoming requests if no instance is available, prevent barging if \ glspl{thrd}are already queued up waiting for SQEs and handle SQEs being freed.357 The routing and allocation algorithm needs to keep track of which ring instances have available SQEs, block incoming requests if no instance is available, prevent barging if \ats are already queued up waiting for SQEs and handle SQEs being freed. 358 358 The submission side needs to safely append SQEs to the ring buffer, correctly handle chains, make sure no SQE is dropped or left pending forever, notify the allocation side when SQEs can be reused, and handle the kernel returning @EBUSY@. 359 359 All this synchronization has a significant cost, and compared to the private-instance approach, this synchronization is entirely overhead. … … 370 370 371 371 In this approach, each cluster, see Figure~\ref{fig:system}, owns a pool of @io_uring@ instances managed by an \newterm{arbiter}. 372 When a \ gls{thrd}attempts to issue an \io operation, it ask for an instance from the arbiter and issues requests to that instance.373 This instance is now bound to the \gls{proc} the \ gls{thrd}is running on.372 When a \at attempts to issue an \io operation, it ask for an instance from the arbiter and issues requests to that instance. 373 This instance is now bound to the \gls{proc} the \at is running on. 374 374 This binding is kept until the arbiter decides to revoke it, taking back the instance and reverting the \gls{proc} to its initial state with respect to \io. 375 375 This tight coupling means that synchronization can be minimal since only one \gls{proc} can use the instance at a time, akin to the private instances approach. … … 380 380 \item The current \gls{proc} does not hold an instance. 381 381 \item The current instance does not have sufficient SQEs to satisfy the request. 382 \item The current \gls{proc} has a wrong instance, this happens if the submitting \ gls{thrd}context-switched between allocation and submission, called \newterm{external submissions}.382 \item The current \gls{proc} has a wrong instance, this happens if the submitting \at context-switched between allocation and submission, called \newterm{external submissions}. 383 383 \end{enumerate} 384 384 However, even when the arbiter is not directly needed, \glspl{proc} need to make sure that their instance ownership is not being revoked, which is accomplished by a lock-\emph{less} handshake.\footnote{ 
- 
      doc/theses/thierry_delisle_PhD/thesis/text/practice.texr7a0f798b ra44514e 108 108 Because idle sleep is spurious, this data structure has strict performance requirements, in addition to strict correctness requirements. 109 109 Next, some mechanism is needed to block \glspl{kthrd}, \eg @pthread_cond_wait@ on a pthread semaphore. 110 The complexity here is to support \at parking and unparking, user-level locking, timers, \io operations, and all other \CFA features with minimal complexity.110 The complexity here is to support \at \glslink{atblock}{parking} and \glslink{atsched}{unparking}, user-level locking, timers, \io operations, and all other \CFA features with minimal complexity. 111 111 Finally, the scheduler needs a heuristic to determine when to block and unblock an appropriate number of \procs. 112 112 However, this third challenge is outside the scope of this thesis because developing a general heuristic is complex enough to justify its own work. … … 125 125 126 126 \subsection{\lstinline{pthread_mutex}/\lstinline{pthread_cond}} 127 The classic option is to use some combination of the pthread mutual exclusion and synchronization locks, allowing a safe park/unpark of a \gls{kthrd} to/from a @pthread_cond@.127 The classic option is to use some combination of the pthread mutual exclusion and synchronization locks, allowing a safe \park/\unpark of a \gls{kthrd} to/from a @pthread_cond@. 128 128 While this approach works for \glspl{kthrd} waiting among themselves, \io operations do not provide a mechanism to signal @pthread_cond@s. 129 129 For \io results to wake a \proc waiting on a @pthread_cond@ means a different \glspl{kthrd} must be woken up first, which then signals the \proc. … … 137 137 138 138 \subsection{Event FDs} 139 Another interesting approach is to use an event file descriptor\cite{ eventfd}.139 Another interesting approach is to use an event file descriptor\cite{MAN:eventfd}. 140 140 This Linux feature is a file descriptor that behaves like \io, \ie, uses @read@ and @write@, but also behaves like a semaphore. 141 141 Indeed, all reads and writes must use a word-sized values, \ie 64 or 32 bits. 
- 
      doc/theses/thierry_delisle_PhD/thesis/text/runtime.texr7a0f798b ra44514e 13 13 \section{M:N Threading}\label{prev:model} 14 14 15 Threading in \CFA is based on \Gls{uthrding}, where \ glspl{thrd} are the representation of a unit of work. As such, \CFA programmers should expect these units to be fairly inexpensive, \ie programmers should be able to create a large number of \glspl{thrd} and switch among \glspl{thrd}liberally without many concerns for performance.15 Threading in \CFA is based on \Gls{uthrding}, where \ats are the representation of a unit of work. As such, \CFA programmers should expect these units to be fairly inexpensive, \ie programmers should be able to create a large number of \ats and switch among \ats liberally without many concerns for performance. 16 16 17 17 The \CFA M:N threading models is implemented using many user-level threads mapped onto fewer \glspl{kthrd}. 18 18 The user-level threads have the same semantic meaning as a \glspl{kthrd} in the 1:1 model: they represent an independent thread of execution with its own stack. 19 The difference is that user-level threads do not have a corresponding object in the kernel; they are handled by the runtime in user space and scheduled onto \glspl{kthrd}, referred to as \glspl{proc} in this document. \Glspl{proc} run a \ gls{thrd} until it context switches out, it then chooses a different \gls{thrd}to run.19 The difference is that user-level threads do not have a corresponding object in the kernel; they are handled by the runtime in user space and scheduled onto \glspl{kthrd}, referred to as \glspl{proc} in this document. \Glspl{proc} run a \at until it context switches out, it then chooses a different \at to run. 20 20 21 21 \section{Clusters} 22 22 \CFA allows the option to group user-level threading, in the form of clusters. 23 Both \ glspl{thrd}and \glspl{proc} belong to a specific cluster.24 \Glspl{ thrd} are only scheduled onto \glspl{proc} in the same cluster and scheduling is done independently of other clusters.23 Both \ats and \glspl{proc} belong to a specific cluster. 24 \Glspl{at} are only scheduled onto \glspl{proc} in the same cluster and scheduling is done independently of other clusters. 25 25 Figure~\ref{fig:system} shows an overview of the \CFA runtime, which allows programmers to tightly control parallelism. 26 26 It also opens the door to handling effects like NUMA, by pinning clusters to a specific NUMA node\footnote{This capability is not currently implemented in \CFA, but the only hurdle left is creating a generic interface for CPU masks.}. … … 30 30 \input{system.pstex_t} 31 31 \end{center} 32 \caption[Overview of the \CFA runtime]{Overview of the \CFA runtime \newline \Glspl{ thrd} are scheduled inside a particular cluster and run on the \glspl{proc} that belong to the cluster. The discrete-event manager, which handles preemption and timeout, is a \gls{proc} that lives outside any cluster and does not run \glspl{thrd}.}32 \caption[Overview of the \CFA runtime]{Overview of the \CFA runtime \newline \Glspl{at} are scheduled inside a particular cluster and run on the \glspl{proc} that belong to the cluster. The discrete-event manager, which handles preemption and timeout, is a \gls{proc} that lives outside any cluster and does not run \ats.} 33 33 \label{fig:system} 34 34 \end{figure} … … 38 38 39 39 \section{\glsxtrshort{io}}\label{prev:io} 40 Prior to this work, the \CFA runtime did not add any particular support for \glsxtrshort{io} operations. While all \glsxtrshort{io} operations available in C are available in \CFA, \glsxtrshort{io} operations are designed for the POSIX threading model~\cite{pthreads}. Using these 1:1 threading operations in an M:N threading model means \glsxtrshort{io} operations block \glspl{proc} instead of \ glspl{thrd}. While this can work in certain cases, it limits the number of concurrent operations to the number of \glspl{proc} rather than \glspl{thrd}. It also means deadlock can occur because all \glspl{proc} are blocked even if at least one \gls{thrd}is ready to run. A simple example of this type of deadlock would be as follows:40 Prior to this work, the \CFA runtime did not add any particular support for \glsxtrshort{io} operations. While all \glsxtrshort{io} operations available in C are available in \CFA, \glsxtrshort{io} operations are designed for the POSIX threading model~\cite{pthreads}. Using these 1:1 threading operations in an M:N threading model means \glsxtrshort{io} operations block \glspl{proc} instead of \ats. While this can work in certain cases, it limits the number of concurrent operations to the number of \glspl{proc} rather than \ats. It also means deadlock can occur because all \glspl{proc} are blocked even if at least one \at is ready to run. A simple example of this type of deadlock would be as follows: 41 41 42 42 \begin{quote} 43 Given a simple network program with 2 \ glspl{thrd} and a single \gls{proc}, one \gls{thrd} sends network requests to a server and the other \gls{thrd}waits for a response from the server.44 If the second \ gls{thrd}races ahead, it may wait for responses to requests that have not been sent yet.45 In theory, this should not be a problem, even if the second \ gls{thrd} waits, because the first \gls{thrd}is still ready to run and should be able to get CPU time to send the request.46 With M:N threading, while the first \ gls{thrd} is ready, the lone \gls{proc} \emph{cannot} run the first \gls{thrd} if it is blocked in the \glsxtrshort{io} operation of the second \gls{thrd}.43 Given a simple network program with 2 \ats and a single \gls{proc}, one \at sends network requests to a server and the other \at waits for a response from the server. 44 If the second \at races ahead, it may wait for responses to requests that have not been sent yet. 45 In theory, this should not be a problem, even if the second \at waits, because the first \at is still ready to run and should be able to get CPU time to send the request. 46 With M:N threading, while the first \at is ready, the lone \gls{proc} \emph{cannot} run the first \at if it is blocked in the \glsxtrshort{io} operation of the second \at. 47 47 If this happen, the system is in a synchronization deadlock\footnote{In this example, the deadlock could be resolved if the server sends unprompted messages to the client. 48 48 However, this solution is neither general nor appropriate even in this simple case.}. 49 49 \end{quote} 50 50 51 Therefore, one of the objective of this work is to introduce \emph{User-Level \glsxtrshort{io}}, which like \glslink{uthrding}{User-Level \emph{Threading}}, blocks \ glspl{thrd} rather than \glspl{proc} when doing \glsxtrshort{io} operations.52 This feature entails multiplexing the \glsxtrshort{io} operations of many \ glspl{thrd}onto fewer \glspl{proc}.51 Therefore, one of the objective of this work is to introduce \emph{User-Level \glsxtrshort{io}}, which like \glslink{uthrding}{User-Level \emph{Threading}}, blocks \ats rather than \glspl{proc} when doing \glsxtrshort{io} operations. 52 This feature entails multiplexing the \glsxtrshort{io} operations of many \ats onto fewer \glspl{proc}. 53 53 The multiplexing requires a single \gls{proc} to execute multiple \glsxtrshort{io} operations in parallel. 54 54 This requirement cannot be done with operations that block \glspl{proc}, \ie \glspl{kthrd}, since the first operation would prevent starting new operations for its blocking duration. 
- 
      doc/theses/thierry_delisle_PhD/thesis/thesis.texr7a0f798b ra44514e 211 211 \newcommand\proc{\gls{proc}\xspace}% 212 212 \newcommand\procs{\glspl{proc}\xspace}% 213 \newcommand\park{\glslink{atblock}{park}\xspace}% 214 \newcommand\unpark{\glslink{atsched}{unpark}\xspace}% 213 215 214 216 %====================================================================== 
  Note:
 See   TracChangeset
 for help on using the changeset viewer.
  