Timestamp:
Sep 7, 2022, 4:12:00 PM (20 months ago)
Author:
Thierry Delisle <tdelisle@…>
Branches:
ADT, ast-experimental, master, pthread-emulation
Children:
e4855f6
Parents:
7a0f798b
Message:

A whole bunch of small changes:
trying to setup a version that I can pass through a spell checker.
Fixing a whole bunch of grammar errors

Location:
doc/theses/thierry_delisle_PhD/thesis/text
Files:
10 edited

  • doc/theses/thierry_delisle_PhD/thesis/text/conclusion.tex

    r7a0f798b ra44514e  
    113113In both of these examples, some care is needed to ensure that reads to an address \emph{sometime} retire.
    114114
    115 Note, this idea is similar to \newterm{Hardware Transactional Memory}~\cite{HTM}, which allows groups of instructions to be aborted and rolled-back if they encounter memory conflicts when being retired.
     115Note, this idea is similar to \newterm{Hardware Transactional Memory}~\cite{wiki:htm}, which allows groups of instructions to be aborted and rolled-back if they encounter memory conflicts when being retired.
    116116However, I believe this feature is generally aimed at large groups of instructions.
    117117A more fine-grained approach may be more suitable, carefully picking which aspects of an algorithm require exact correctness and which do not.
  • doc/theses/thierry_delisle_PhD/thesis/text/core.tex

    r7a0f798b ra44514e  
    22
    33Before discussing scheduling in general, where it is important to address systems that are changing states, this document discusses scheduling in a somewhat ideal scenario, where the system has reached a steady state.
    4 For this purpose, a steady state is loosely defined as a state where there are always \glspl{thrd} ready to run and the system has the resources necessary to accomplish the work, \eg, enough workers.
     4For this purpose, a steady state is loosely defined as a state where there are always \ats ready to run and the system has the resources necessary to accomplish the work, \eg, enough workers.
    55In short, the system is neither overloaded nor underloaded.
    66
    77It is important to discuss the steady state first because it is the easiest case to handle and, relatedly, the case in which the best performance is to be expected.
    8 As such, when the system is either overloaded or underloaded, a common approach is to try to adapt the system to this new load and return to the steady state, \eg, by adding or removing workers.
     8As such, when the system is either overloaded or underloaded, a common approach is to try to adapt the system to this new \gls{load} and return to the steady state, \eg, by adding or removing workers.
    99Therefore, flaws in scheduling the steady state tend to be pervasive in all states.
    1010
     
    2020\end{displayquote}
    2121
    22 Applied to threads, this model states that every ready \gls{thrd} immediately runs in parallel with all other ready \glspl{thrd}. While a strict implementation of this model is not feasible, programmers still have expectations about scheduling that come from this model.
    23 
    24 In general, the expectation at the center of this model is that ready \glspl{thrd} do not interfere with each other but simply share the hardware.
    25 This assumption makes it easier to reason about threading because ready \glspl{thrd} can be thought of in isolation and the effect of the scheduler can be virtually ignored.
    26 This expectation of \gls{thrd} independence means the scheduler is expected to offer two guarantees:
     22Applied to \ats, this model states that every ready \at immediately runs in parallel with all other ready \ats. While a strict implementation of this model is not feasible, programmers still have expectations about scheduling that come from this model.
     23
     24In general, the expectation at the center of this model is that ready \ats do not interfere with each other but simply share the hardware.
     25This assumption makes it easier to reason about threading because ready \ats can be thought of in isolation and the effect of the scheduler can be virtually ignored.
     26This expectation of \at independence means the scheduler is expected to offer two guarantees:
    2727\begin{enumerate}
    28         \item A fairness guarantee: a \gls{thrd} that is ready to run is not prevented by another thread.
    29         \item A performance guarantee: a \gls{thrd} that wants to start or stop running is not prevented by other threads wanting to do the same.
     28        \item A fairness guarantee: a \at that is ready to run is not prevented from doing so by another thread.
     29        \item A performance guarantee: a \at that wants to start or stop running is not prevented by other threads wanting to do the same.
    3030\end{enumerate}
    3131
    3232It is important to note that these guarantees are expected only up to a point.
    33 \Glspl{thrd} that are ready to run should not be prevented to do so, but they still share the limited hardware resources.
    34 Therefore, the guarantee is considered respected if a \gls{thrd} gets access to a \emph{fair share} of the hardware resources, even if that share is very small.
     33\Glspl{at} that are ready to run should not be prevented from doing so, but they still share the limited hardware resources.
     34Therefore, the guarantee is considered respected if a \at gets access to a \emph{fair share} of the hardware resources, even if that share is very small.
    3535
    3636Similar to the performance guarantee, the lack of interference among threads is only relevant up to a point.
     
    5959For interactive applications that need to run at 60, 90, 120 frames per second, \ats having to wait for several milliseconds to run are effectively starved.
    6060Therefore load-balancing should be done at a faster pace, one that can detect starvation at the microsecond scale.
    61 With that said, this is a much fuzzier requirement since it depends on the number of \procs, the number of \ats and the general load of the system.
     61With that said, this is a much fuzzier requirement since it depends on the number of \procs, the number of \ats and the general \gls{load} of the system.
    6262
    6363\subsection{Fairness vs Scheduler Locality} \label{fairnessvlocal}
     
    6868
    6969For a scheduler, having good locality, \ie, having the data local to each \gls{hthrd}, generally conflicts with fairness.
    70 Indeed, good locality often requires avoiding the movement of cache lines, while fairness requires dynamically moving a \gls{thrd}, and as consequence cache lines, to a \gls{hthrd} that is currently available.
     70Indeed, good locality often requires avoiding the movement of cache lines, while fairness requires dynamically moving a \at, and as a consequence cache lines, to a \gls{hthrd} that is currently available.
    7171Note that this section discusses \emph{internal locality}, \ie, the locality of the data used by the scheduler versus \emph{external locality}, \ie, how the data used by the application is affected by scheduling.
    7272External locality is a much more complicated subject and is discussed in the next section.
     
    8080        \input{fairness.pstex_t}
    8181        \vspace*{-10pt}
    82         \caption[Fairness vs Locality graph]{Rule of thumb Fairness vs Locality graph \smallskip\newline The importance of Fairness and Locality while a ready \gls{thrd} awaits running is shown as the time the ready \gls{thrd} waits increases, Ready Time, the chances that its data is still in cache decreases, Locality.
    83         At the same time, the need for fairness increases since other \glspl{thrd} may have the chance to run many times, breaking the fairness model.
     82        \caption[Fairness vs Locality graph]{Rule of thumb Fairness vs Locality graph \smallskip\newline The importance of Fairness and Locality while a ready \at awaits running: as the time the ready \at waits (Ready Time) increases, the chance that its data is still in cache (Locality) decreases.
     83        At the same time, the need for fairness increases since other \ats may have the chance to run many times, breaking the fairness model.
    8484        Since the actual values and curves of this graph can be highly variable, the graph is an idealized representation of the two opposing goals.}
    8585        \label{fig:fair}
     
    9797
    9898\subsubsection{Migration Cost}
    99 Another important source of scheduling latency is migration.
     99Another important source of scheduling latency is \glslink{atmig}{migration}.
    100100A \at migrates if it executes on two different \procs consecutively, which is the process discussed in \ref{fairnessvlocal}.
    101101Migrations can have many different causes, but in certain programs, it can be impossible to limit migration.
     
    250250To compare subqueues, the timestamp at the head must be compared to the current time, yielding the best-case wait-time for the \at at the head of the queue.
    251251This new wait-time is averaged with the stored average.
    252 To further limit migration, a bias can be added to a local subqueue, where a remote subqueue is helped only if its moving average is more than $X$ times the local subqueue's average.
     252To further limit \glslink{atmig}{migrations}, a bias can be added to a local subqueue, where a remote subqueue is helped only if its moving average is more than $X$ times the local subqueue's average.
    253253Tests for this approach indicate the choice of the weight for the moving average or the bias is not important, \ie weights and biases of similar \emph{magnitudes} have similar effects.
    254254
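
To make the helping rule above concrete, the following is a minimal C sketch of the moving average and the bias test; the field names and the weight/bias constants are illustrative assumptions, not the \CFA implementation.

    #include <stdint.h>
    #include <stdbool.h>

    // Per-subqueue state: the timestamp stored with the head element and a
    // moving average of observed best-case wait-times (same clock units).
    struct subqueue_stats {
        uint64_t head_timestamp;
        uint64_t avg_wait;
    };

    enum { WEIGHT = 8, BIAS = 4 };   // illustrative; the text notes exact magnitudes matter little

    // Fold the current best-case wait-time of the head element into the average.
    static void update_average( struct subqueue_stats * s, uint64_t now ) {
        uint64_t wait = now - s->head_timestamp;
        s->avg_wait = (s->avg_wait * (WEIGHT - 1) + wait) / WEIGHT;
    }

    // Helping rule: only help a remote subqueue if its average wait is more than
    // BIAS times the local average, limiting needless migrations.
    static bool should_help( const struct subqueue_stats * local, const struct subqueue_stats * remote ) {
        return remote->avg_wait > BIAS * local->avg_wait;
    }
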
     
    261261The good news is that this problem can be mitigated.
    262262
    263 \subsection{Redundant Timestamps}\ref{relaxedtimes}
     263\subsection{Redundant Timestamps}\label{relaxedtimes}
    264264The problem with polling remote subqueues is that correctness is critical.
    265265There must be a consensus among \procs on which subqueues hold which \ats, as the \ats are in constant motion.
  • doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex

    r7a0f798b ra44514e  
    4949Threads that are not currently dealing with another request ignore the incoming packet.
    5050One of the remaining, nonbusy, threads reads the request and sends the response.
    51 This implementation can lead to increased CPU load as threads wake from sleep to potentially process the request.
     51This implementation can lead to increased CPU \gls{load} as threads wake from sleep to potentially process the request.
    5252\end{itemize}
    5353Here, Memcached is based on an event-based webserver architecture~\cite{Pai99Flash}, using \gls{kthrd}ing to run multiple largely independent event engines, and if needed, spinning up additional kernel threads to handle blocking I/O.
     
    273273It has two 2.8 GHz Xeon CPUs, and four one-gigabit Ethernet cards.
    274274\item
    275 \todo{switch}
     275Network routing is performed by an HP 2530 10 Gigabit Ethernet switch.
    276276\item
    277277A client machine runs two copies of the workload generator.
  • doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex

    r7a0f798b ra44514e  
    66The goal in this chapter is to show that the \CFA scheduler obtains equivalent performance to other less fair schedulers through the different experiments.
    77Note, only the code of the \CFA tests is shown;
    8 all tests in the other systems are functionally identical and available online~\cite{SchedulingBenchmarks}.
     8all tests in the other systems are functionally identical and available online~\cite{GITHUB:SchedulingBenchmarks}.
    99
    1010\section{Benchmark Environment}\label{microenv}
     
    6363For this reason, I designed a different push/pop benchmark, called \newterm{Cycle Benchmark}.
    6464This benchmark arranges a number of \ats into a ring, as seen in Figure~\ref{fig:cycle}, where the ring is a circular singly-linked list.
    65 At runtime, each \at unparks the next \at before parking itself.
    66 Unparking the next \at pushes that \at onto the ready queue while the ensuing park leads to a \at being popped from the ready queue.
     65At runtime, each \at unparks the next \at before \glslink{atblock}{parking} itself.
     66Unparking the next \at pushes that \at onto the ready queue while the ensuing \park leads to a \at being popped from the ready queue.
    6767
    6868\begin{figure}
    6969        \centering
    7070        \input{cycle.pstex_t}
    71         \caption[Cycle benchmark]{Cycle benchmark\smallskip\newline Each \at unparks the next \at in the cycle before parking itself.}
     71        \caption[Cycle benchmark]{Cycle benchmark\smallskip\newline Each \at unparks the next \at in the cycle before \glslink{atblock}{parking} itself.}
    7272        \label{fig:cycle}
    7373\end{figure}
    7474
    7575Therefore, the underlying runtime cannot rely on the number of ready \ats staying constant over the duration of the experiment.
    76 In fact, the total number of \ats waiting on the ready queue is expected to vary because of the race between the next \at unparking and the current \at parking.
     76In fact, the total number of \ats waiting on the ready queue is expected to vary because of the race between the next \at \glslink{atsched}{unparking} and the current \at \glslink{atblock}{parking}.
    7777That is, the runtime cannot anticipate that the current task immediately parks.
    7878As well, the size of the cycle is also decided based on this race, \eg a small cycle may see the chain of unparks go full circle before the first \at parks because of time-slicing or multiple \procs.
    7979If this happens, the scheduler push and pop are avoided and the results of the experiment are skewed.
    80 (Note, an unpark is like a V on a semaphore, so the subsequent park (P) may not block.)
     80(Note, an \unpark is like a V on a semaphore, so the subsequent \park (P) may not block.)
    8181Every runtime system must handle this race and cannot optimize away the ready-queue pushes and pops.
    82 To prevent any attempt of silently omitting ready-queue operations, the ring of \ats is made big enough so the \ats have time to fully park before being unparked again.
     82To prevent any attempt of silently omitting ready-queue operations, the ring of \ats is made big enough so the \ats have time to fully \park before being unparked again.
    8383Finally, to further mitigate any underlying push/pop optimizations, especially on SMP machines, multiple rings are created in the experiment.
    8484
    8585Figure~\ref{fig:cycle:code} shows the pseudo code for this benchmark, where each cycle has 5 \ats.
    86 There is additional complexity to handle termination (not shown), which requires a binary semaphore or a channel instead of raw @park@/@unpark@ and carefully picking the order of the @P@ and @V@ with respect to the loop condition.
     86There is additional complexity to handle termination (not shown), which requires a binary semaphore or a channel instead of raw \park/\unpark and carefully picking the order of the @P@ and @V@ with respect to the loop condition.
    8787
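
The following is a self-contained C sketch of such a ring, using POSIX threads and semaphores in place of the runtimes' park/unpark; the ring size, iteration count, and the simple termination scheme are illustrative and differ from the pseudo code in Figure~\ref{fig:cycle:code}.

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    #define RING  5          // \ats per cycle, as in the pseudo code
    #define LOOPS 100000     // iterations before stopping (illustrative)

    static sem_t sems[RING];                     // one semaphore per ring member

    static void * member( void * arg ) {
        long id = (long)arg;
        for ( int i = 0; i < LOOPS; i += 1 ) {
            sem_post( &sems[(id + 1) % RING] );  // "unpark" the next member (V)
            sem_wait( &sems[id] );               // "park" until unparked (P)
        }
        return NULL;
    }

    int main( void ) {
        pthread_t thrds[RING];
        for ( int i = 0; i < RING; i += 1 ) sem_init( &sems[i], 0, 0 );
        for ( long i = 0; i < RING; i += 1 ) pthread_create( &thrds[i], NULL, member, (void *)i );
        for ( int i = 0; i < RING; i += 1 ) pthread_join( thrds[i], NULL );
        printf( "cycle benchmark done\n" );
        return 0;
    }

Because every member posts before waiting, this sketch cannot deadlock, matching the note above that an unpark (V) may prevent the subsequent park (P) from blocking.
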
    8888\begin{figure}
     
    416416An interesting aspect to note here is that the runtimes differ in how they handle this situation.
    417417Indeed, when a \proc unparks a \at that was last run on a different \proc, the \at could be appended to the ready queue of the local \proc or to the ready queue of the remote \proc, which previously ran the \at.
    418 \CFA, Tokio and Go all use the approach of unparking to the local \proc, while Libfibre unparks to the remote \proc.
     418\CFA, Tokio and Go all use the approach of \glslink{atsched}{unparking} to the local \proc, while Libfibre unparks to the remote \proc.
    419419In this particular benchmark, the inherent chaos of the benchmark, in addition to small memory footprint, means neither approach wins over the other.
    420420
     
    485485Up to 32 \procs, after which the other runtimes manage to outscale Go.
    486486
    487 In conclusion, the objective of this benchmark is to demonstrate that unparking \ats from remote \procs does not cause too much contention on the local queues.
     487In conclusion, the objective of this benchmark is to demonstrate that \glslink{atsched}{unparking} \ats from remote \procs does not cause too much contention on the local queues.
    488488Indeed, the fact that most runtimes achieve some scaling across various \proc counts demonstrates that migrations do not need to be serialized.
    489489Again, these results demonstrate that \CFA achieves satisfactory performance with respect to the other runtimes.
     
    491491\section{Locality}
    492492
    493 As mentioned in the churn benchmark, when unparking a \at, it is possible to either unpark to the local or remote ready-queue.\footnote{
    494 It is also possible to unpark to a third unrelated ready-queue, but without additional knowledge about the situation, it is likely to degrade performance.}
     493As mentioned in the churn benchmark, when \glslink{atsched}{unparking} a \at, it is possible to either \unpark to the local or remote ready-queue.\footnote{
     494It is also possible to \unpark to a third unrelated ready-queue, but without additional knowledge about the situation, it is likely to degrade performance.}
    495495The locality experiment includes two variations of the churn benchmark, where a data array is added.
    496496In both variations, before @V@ing the semaphore, each \at calls a @work@ function which increments random cells inside the data array.
     
    499499Figure~\ref{fig:locality:code} shows pseudo code for this benchmark.
    500500
    501 The objective here is to highlight the different decision made by the runtime when unparking.
     501The objective here is to highlight the different decisions made by each runtime when \glslink{atsched}{unparking}.
    502502Since each thread unparks a random semaphore, it is unlikely that a \at is unparked from the last \proc it ran on.
    503 In the noshare variation, unparking the \at on the local \proc is an appropriate choice since the data was last modified on that \proc.
    504 In the shared variation, unparking the \at on a remote \proc is an appropriate choice.
    505 
    506 The expectation for this benchmark is to see a performance inversion, where runtimes fare notably better in the variation which matches their unparking policy.
     503In the noshare variation, \glslink{atsched}{unparking} the \at on the local \proc is an appropriate choice since the data was last modified on that \proc.
     504In the shared variation, \glslink{atsched}{unparking} the \at on a remote \proc is an appropriate choice.
     505
     506The expectation for this benchmark is to see a performance inversion, where runtimes fare notably better in the variation which matches their \glslink{atsched}{unparking} policy.
    507507This decision should lead to \CFA, Go and Tokio achieving better performance in the share variation while libfibre achieves better performance in noshare.
    508 Indeed, \CFA, Go and Tokio have the default policy of unparking \ats on the local \proc, where as libfibre has the default policy of unparking \ats wherever they last ran.
     508Indeed, \CFA, Go and Tokio have the default policy of \glslink{atsched}{unparking} \ats on the local \proc, whereas libfibre has the default policy of \glslink{atsched}{unparking} \ats wherever they last ran.
    509509
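
A loose C fragment of the structure just described, using POSIX semaphores; the array sizes, iteration logic, and names are assumptions, and termination handling is omitted exactly as in the thesis's pseudo code.

    #include <semaphore.h>
    #include <stdlib.h>

    #define NTHRD 4                        // number of \ats (illustrative)
    #define CELLS 4096                     // cells per data array (illustrative)

    static sem_t sems[NTHRD];              // one semaphore per \at
    static unsigned * arrays[NTHRD];       // which array each \at touches; the share/noshare
                                           // variations differ in whether this data was just
                                           // modified by the unparking \at or is private

    // The work function: increment random cells, so performance depends on the
    // array being warm in the cache of the \proc running the \at.
    static void work( unsigned * data ) {
        for ( int i = 0; i < 64; i += 1 )
            data[rand() % CELLS] += 1;
    }

    // Body of each \at: work, V a random semaphore (unpark), P the private one (park).
    // Termination handling is omitted, as in Figure~\ref{fig:locality:code}.
    static void member( long id ) {
        for ( ;; ) {
            work( arrays[id] );
            sem_post( &sems[rand() % NTHRD] );
            sem_wait( &sems[id] );
        }
    }
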
    510510\begin{figure}
     
    554554\vrule
    555555\hspace{3pt}
    556 \subfloat[Share]{\label{fig:locality:code:T1}\usebox\myboxB}
     556\subfloat[Share]{\label{fig:locality:code:T2}\usebox\myboxB}
    557557
    558558\caption[Locality Benchmark : Pseudo Code]{Locality Benchmark : Pseudo Code}
     
    566566Looking at the left column on Intel, Figures~\ref{fig:locality:jax:share:ops} and \ref{fig:locality:jax:share:ns} show the results for the share variation.
    567567\CFA and Tokio slightly outperform libfibre, as expected, based on their \ats placement approach.
    568 \CFA and Tokio both unpark locally and do not suffer cache misses on the transferred array.
     568\CFA and Tokio both \unpark locally and do not suffer cache misses on the transferred array.
    569569Libfibre on the other hand unparks remotely, and as such the unparked \at is likely to miss on the shared data.
    570570Go trails behind in this experiment, presumably for the same reasons that were observable in the churn benchmark.
     
    640640Indeed, in this case, unparking remotely means the unparked \at is less likely to suffer a cache miss on the array, which leaves the \at data structure and the remote queue as the only sources of likely cache misses.
    641641Results show both are amortized fairly well in this case.
    642 \CFA and Tokio both unpark locally and as a result suffer a marginal performance degradation from the cache miss on the array.
     642\CFA and Tokio both \unpark locally and as a result suffer a marginal performance degradation from the cache miss on the array.
    643643
    644644The results for the AMD architecture, shown in Figure~\ref{fig:locality:nasus}, are similar to the Intel results.
     
    651651Go still has the same poor performance.
    652652
    653 Overall, this benchmark mostly demonstrates the two options available when unparking a \at.
     653Overall, this benchmark mostly demonstrates the two options available when \glslink{atsched}{unparking} a \at.
    654654Depending on the workload, either of these options can be the appropriate one.
    655655Since it is prohibitively difficult to dynamically detect which approach is appropriate, all runtimes must choose one of the two and live with the consequences.
  • doc/theses/thierry_delisle_PhD/thesis/text/existing.tex

    r7a0f798b ra44514e  
    77Workloads that are well-known, consistent, and homogeneous can benefit from a scheduler that is optimized to use this information, while ill-defined, inconsistent, heterogeneous workloads require general non-optimal algorithms.
    88A secondary aspect is how much information can be gathered versus how much information must be given as part of the scheduler input.
    9 This information adds to the spectrum of scheduling algorithms, going from static schedulers that are well informed from the start, to schedulers that gather most of the information needed, to schedulers that can only rely on very limited information.
    10 Note, this description includes both information about each requests, \eg time to complete or resources needed, and information about the relationships among request, \eg whether or not some request must be completed before another request starts.
     9This information adds to the spectrum of scheduling algorithms, going from static schedulers that are well-informed from the start, to schedulers that gather most of the information needed, to schedulers that can only rely on very limited information.
     10Note, this description includes both information about each request, \eg time to complete or resources needed, and information about the relationships among requests, \eg whether some request must be completed before another request starts.
    1111
    1212Scheduling physical resources, \eg in an assembly line, is generally amenable to using well-informed scheduling, since information can be gathered much faster than the physical resources can be assigned and workloads are likely to stay stable for long periods of time.
     
    2929\newterm{Dynamic schedulers} determine \ats dependencies and costs during scheduling, if at all.
    3030Hence, unlike static scheduling, \ats dependencies are conditional and detected at runtime.
    31 This detection takes the form of observing new \ats(s) in the system and determining dependencies from their behaviour, including suspending or halting a \ats that dynamically detects unfulfilled dependencies.
    32 Furthermore, each \ats has the responsibility of adding dependent \ats back into the system once dependencies are fulfilled.
     31This detection takes the form of observing new \ats in the system and determining dependencies from their behaviour, including suspending or halting a \at that dynamically detects unfulfilled dependencies.
     32Furthermore, each \at has the responsibility of adding dependent \ats back into the system once dependencies are fulfilled.
    3333As a consequence, the scheduler often has an incomplete view of the system, seeing only \ats with no pending dependencies.
    3434
    3535\subsection{Explicitly Informed Dynamic Schedulers}
    36 While dynamic schedulers may not have an exhaustive list of dependencies for a \ats, some information may be available about each \ats, \eg expected duration, required resources, relative importance, \etc.
     36While dynamic schedulers may not have an exhaustive list of dependencies for a \at, some information may be available about each \at, \eg expected duration, required resources, relative importance, \etc.
    3737When available, a scheduler can then use this information to direct the scheduling decisions.
    3838For example, when scheduling in a cloud computing context, \ats will commonly have extra information that was manually entered, \eg caps on compute time or \io usage.
    3939However, in the context of user-level threading, most programmers do not determine or even \emph{predict} this information;
    40 at best, the scheduler has only some imprecise information provided by the programmer, \eg, indicating a \ats takes approximately 3--7 seconds to complete, rather than exactly 5 seconds.
    41 Providing this kind of information is a significant programmer burden especially if the information does not scale with the number of \ats and their complexity.
     40at best, the scheduler has only some imprecise information provided by the programmer, \eg, indicating a \at takes approximately 3--7 seconds to complete, rather than exactly 5 seconds.
     41Providing this kind of information is a significant programmer burden, especially if the information does not scale with the number of \ats and their complexity.
    4242For example, providing an exhaustive list of files read by 5 \ats is an easier requirement than providing an exhaustive list of memory addresses accessed by 10,000 independent \ats.
    4343
     
    4646\subsubsection{Priority Scheduling}
    4747A common form of information used by schedulers to direct their algorithm is priorities.
    48 Each \ats is given a priority and higher-priority \ats are preferred to lower-priority ones.
    49 The simplest priority scheduling algorithm is to require that every \ats have a distinct pre-established priority and always run the available \ats with the highest priority.
     48Each \at is given a priority, and higher-priority \ats are preferred to lower-priority ones.
     49The simplest priority scheduling algorithm is to require that every \at have a distinct pre-established priority and always run the available \ats with the highest priority.
    5050Asking programmers to provide an exhaustive set of unique priorities can be prohibitive when the system has a large number of \ats.
    5151It can therefore be desirable for schedulers to support \ats with identical priorities and/or automatically setting and adjusting priorities for \ats.
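
As a minimal illustration of the simplest rule described above, the following C sketch picks the highest-priority ready task from a linked list; the types and the convention that larger numbers mean higher priority are assumptions.

    #include <stddef.h>

    struct task {
        int priority;                 // assumption: larger value = higher priority
        struct task * next;
    };

    // Always run the available task with the highest priority; with distinct,
    // pre-established priorities there are never ties to break.
    static struct task * pick_next( struct task * ready_list ) {
        struct task * best = NULL;
        for ( struct task * t = ready_list; t != NULL; t = t->next )
            if ( best == NULL || t->priority > best->priority ) best = t;
        return best;
    }
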
     
    5454
    5555\subsection{Uninformed and Self-Informed Dynamic Schedulers}
    56 Several scheduling algorithms do not require programmers to provide additional information on each \ats, and instead make scheduling decisions based solely on internal state and/or information implicitly gathered by the scheduler.
     56Several scheduling algorithms do not require programmers to provide additional information on each \at, and instead make scheduling decisions based solely on internal state and/or information implicitly gathered by the scheduler.
    5757
    5858
    5959\subsubsection{Feedback Scheduling}
    60 As mentioned, schedulers may also gather information about each \ats to direct their decisions.
     60As mentioned, schedulers may also gather information about each \at to direct their decisions.
    6161This design effectively moves the scheduler into the realm of \newterm{Control Theory}~\cite{wiki:controltheory}.
    6262This information gathering does not generally involve programmers, and as such, does not increase programmer burden the same way explicitly provided information may.
    6363However, some feedback schedulers do allow programmers to offer additional information on certain \ats, in order to direct scheduling decisions.
    64 The important distinction being whether or not the scheduler can function without this additional information.
     64The important distinction is whether the scheduler can function without this additional information.
    6565
    6666
    6767\section{Work Stealing}\label{existing:workstealing}
    6868One of the most popular scheduling algorithms in practice (see~\ref{existing:prod}) is work stealing.
    69 This idea, introduce by \cite{DBLP:conf/fpca/BurtonS81}, effectively has each worker process its local \ats first, but allows the possibility for other workers to steal local \ats if they run out of \ats.
    70 \cite{DBLP:conf/focs/Blumofe94} introduced the more familiar incarnation of this, where each workers has a queue of \ats and workers without \ats steal \ats from random workers\footnote{The Burton and Sleep algorithm had trees of \ats and steal only among neighbours.}.
     69This idea, introduced by \cite{DBLP:conf/fpca/BurtonS81}, effectively has each worker process its local \ats first, but allows the possibility for other workers to steal local \ats if they run out of \ats.
     70\cite{DBLP:conf/focs/Blumofe94} introduced the more familiar incarnation of this, where each worker has a queue of \ats and workers without \ats steal \ats from random workers\footnote{The Burton and Sleep algorithm has trees of \ats and steals only among neighbours.}.
    7171Blumofe and Leiserson also prove worst case space and time requirements for well-structured computations.
    7272
     
    8282In its simplest form, work stealing assumes that all \procs are interchangeable and therefore the mapping between \at and \proc is not interesting.
    8383However, in real-life architectures there are contexts where different \procs can have different characteristics, which makes some mappings more interesting than others.
    84 An common example where this is statically true is architectures with \acrshort{numa}.
    85 In these cases, it can be relevant to change the scheduler to be cognizent of the topology~\cite{vikranth2013topology,min2011hierarchical}.
     84A common example where this is statically true is architectures with \glsxtrshort{numa}.
     85In these cases, it can be relevant to change the scheduler to be cognizant of the topology~\cite{vikranth2013topology,min2011hierarchical}.
    8686Another example is energy usage, where the scheduler is modified to optimize for energy efficiency in addition/instead of performance~\cite{ribic2014energy,torng2016asymmetry}.
    8787
    8888\paragraph{Complex Machine Architecture} Another aspect that has been examined is how well work stealing is applicable to different machine architectures.
    89 This is arguably strongly related to Task Placement but extends into more heterogeneous architectures.
    90 As \CFA offers no particular support for heterogeneous architecture, this is also a area that is less relevant to this thesis.
    91 Althought it could be an interesting avenue for future work.
     89This is arguably strongly related to Task Placement, but extends into more heterogeneous architectures.
     90As \CFA offers no particular support for heterogeneous architecture, this is also an area that is less relevant to this thesis.
     91It could, however, be an interesting avenue for future work.
    9292
    9393\subsection{Theoretical Results}
    94 There is also a large body of research on the theoretical aspects of work stealing. These evaluate, for example, the cost of migration~\cite{DBLP:conf/sigmetrics/SquillanteN91,DBLP:journals/pe/EagerLZ86}, how affinity affects performance~\cite{DBLP:journals/tpds/SquillanteL93,DBLP:journals/mst/AcarBB02,DBLP:journals/ipl/SuksompongLS16} and theoretical models for heterogeneous systems~\cite{DBLP:journals/jpdc/MirchandaneyTS90,DBLP:journals/mst/BenderR02,DBLP:conf/sigmetrics/GastG10}.
     94There is also a large body of research on the theoretical aspects of work stealing. These evaluate, for example, the cost of \glslink{atmig}{migration}~\cite{DBLP:conf/sigmetrics/SquillanteN91,DBLP:journals/pe/EagerLZ86}, how affinity affects performance~\cite{DBLP:journals/tpds/SquillanteL93,DBLP:journals/mst/AcarBB02,DBLP:journals/ipl/SuksompongLS16} and theoretical models for heterogeneous systems~\cite{DBLP:journals/jpdc/MirchandaneyTS90,DBLP:journals/mst/BenderR02,DBLP:conf/sigmetrics/GastG10}.
    9595\cite{DBLP:journals/jacm/BlellochGM99} examines the space bounds of work stealing and \cite{DBLP:journals/siamcomp/BerenbrinkFG03} shows that for under-loaded systems, the scheduler completes its computations in finite time, \ie is \newterm{stable}.
    9696Others show that work stealing is applicable to various scheduling contexts~\cite{DBLP:journals/mst/AroraBP01,DBLP:journals/anor/TchiboukdjianGT13,DBLP:conf/isaac/TchiboukdjianGTRB10,DBLP:conf/ppopp/AgrawalLS10,DBLP:conf/spaa/AgrawalFLSSU14}.
     
    101101
    102102\section{Preemption}
    103 One last aspect of scheduling is preemption since many schedulers rely on it for some of their guarantees.
     103One last aspect of scheduling is preemption, since many schedulers rely on it for some of their guarantees.
    104104Preemption is the idea of interrupting \ats that have been running too long, effectively injecting suspend points into the application.
    105 There are multiple techniques to achieve this effect but they all aim to guarantee that the suspend points in a \ats are never further apart than some fixed duration.
     105There are multiple techniques to achieve this effect, but they all aim to guarantee that the suspend points in a \at are never further apart than some fixed duration.
    106106While this helps schedulers guarantee that no \at unfairly monopolizes a worker, preemption can effectively be added to any scheduler.
    107 Therefore, the only interesting aspect of preemption for the design of scheduling is whether or not to require it.
     107Therefore, the only interesting aspect of preemption for the design of scheduling is whether to require it.
    108108
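
One common way to obtain such suspend points on Linux is a periodic timer signal that sets a flag checked by the runtime; the following C sketch illustrates the technique in general and is not a description of any particular runtime's mechanism (period and names are assumptions).

    #include <signal.h>
    #include <sys/time.h>

    static volatile sig_atomic_t preempt_flag = 0;

    // Signal handler: mark that the running \at has exhausted its time slice.
    static void on_alarm( int sig ) {
        (void)sig;
        preempt_flag = 1;
    }

    // Arm a periodic SIGALRM so suspend points are never further apart than the period.
    static void start_preemption( void ) {
        struct sigaction sa = { 0 };
        sa.sa_handler = on_alarm;
        sigaction( SIGALRM, &sa, NULL );

        struct itimerval it = { 0 };
        it.it_interval.tv_usec = 10000;   // period: 10 ms (illustrative)
        it.it_value.tv_usec    = 10000;   // first expiry
        setitimer( ITIMER_REAL, &it, NULL );
    }

    // Injected suspend point: called often enough by the runtime that a flagged
    // \at yields back to the scheduler within one period.
    static void maybe_yield( void ) {
        if ( preempt_flag ) {
            preempt_flag = 0;
            // yield to the scheduler here
        }
    }
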
    109109\section{Production Schedulers}\label{existing:prod}
     
    122122The default scheduler used by Linux, the Completely Fair Scheduler~\cite{MAN:linux/cfs,MAN:linux/cfs2}, is a feedback scheduler based on CPU time.
    123123For each processor, it constructs a Red-Black tree of \ats waiting to run, ordering them by the amount of CPU time used.
    124 The \ats that has used the least CPU time is scheduled.
     124The \at that has used the least CPU time is scheduled.
    125125It also supports the concept of \newterm{Nice values}, which are effectively multiplicative factors on the CPU time used.
    126126The ordering of \ats is also affected by a group-based notion of fairness, where \ats belonging to groups having used less CPU time are preferred to \ats belonging to groups having used more CPU time.
    127 Linux achieves load-balancing by regularly monitoring the system state~\cite{MAN:linux/cfs/balancing} and using some heuristic on the load, currently CPU time used in the last millisecond plus a decayed version of the previous time slots~\cite{MAN:linux/cfs/pelt}.
    128 
    129 \cite{DBLP:conf/eurosys/LoziLFGQF16} shows that Linux's CFS also does work stealing to balance the workload of each processors, but the paper argues this aspect can be improved significantly.
    130 The issues highlighted stem from Linux's need to support fairness across \ats \emph{and} across users\footnote{Enforcing fairness across users means that given two users, one with a single \ats and the other with one thousand \ats, the user with a single \ats does not receive one thousandth of the CPU time.}, increasing the complexity.
     127Linux achieves load-balancing by regularly monitoring the system state~\cite{MAN:linux/cfs/balancing} and using some heuristic on the \gls{load}, currently CPU time used in the last millisecond plus a decayed version of the previous time slots~\cite{MAN:linux/cfs/pelt}.
     128
     129\cite{DBLP:conf/eurosys/LoziLFGQF16} shows that Linux's CFS also does work stealing to balance the workload of each \proc, but the paper argues this aspect can be improved significantly.
     130The issues highlighted stem from Linux's need to support fairness across \ats \emph{and} across users\footnote{Enforcing fairness across users means that given two users, one with a single \at and the other with one thousand \ats, the user with a single \at does not receive one thousandth of the CPU time.}, increasing the complexity.
    131131
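
As a toy illustration of the ordering rule just described (explicitly not the kernel's implementation), the comparison that would order such a tree can be sketched as follows; the nice-derived weight is an assumption.

    #include <stdbool.h>
    #include <stdint.h>

    struct cfs_task {
        uint64_t cpu_time_used;   // CPU time consumed so far
        double   nice_factor;     // multiplicative weight derived from the nice value (assumption)
    };

    // The task whose weighted CPU time is smallest is scheduled next.
    static bool runs_before( const struct cfs_task * a, const struct cfs_task * b ) {
        return a->cpu_time_used * a->nice_factor < b->cpu_time_used * b->nice_factor;
    }
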
    132132Linux also offers a FIFO scheduler, a real-time scheduler, which runs the highest-priority \ats, and a round-robin scheduler, which is an extension of the FIFO-scheduler that adds fixed time slices. \cite{MAN:linux/sched}
     
    140140Microsoft's Operating System's Scheduler~\cite{MAN:windows/scheduler} is a feedback scheduler with priorities.
    141141It supports 32 levels of priorities, some of which are reserved for real-time and privileged applications.
    142 It schedules \ats based on the highest priorities (lowest number) and how much CPU time each \ats has used.
     142It schedules \ats based on the highest priorities (lowest number) and how much CPU time each \at has used.
    143143The scheduler may also temporarily adjust priorities after certain effects like the completion of I/O requests.
    144144
     
    184184Erlang is a functional language that supports concurrency in the form of processes: threads that share no data.
    185185It uses a kind of round-robin scheduler, with a mix of work sharing and stealing to achieve load balancing~\cite{:erlang}, where under-loaded workers steal from other workers, but overloaded workers also push work to other workers.
    186 This migration logic is directed by monitoring logic that evaluates the load a few times per seconds.
     186This \glslink{atmig}{migration} logic is directed by monitoring that evaluates the load a few times per second.
    187187
    188188\paragraph{Intel\textregistered ~Threading Building Blocks}
    189189\newterm{Thread Building Blocks} (TBB) is Intel's task parallelism \cite{wiki:taskparallel} framework.
    190 It runs \newterm{jobs}, which are uninterruptable \ats that must always run to completion, on a pool of worker threads.
     190It runs \newterm{jobs}, which are uninterruptible \ats that must always run to completion, on a pool of worker threads.
    191191TBB's scheduler is a variation of randomized work-stealing that also supports higher-priority graph-like dependencies~\cite{MAN:tbb/scheduler}.
    192 It schedules \ats as follows (where \textit{t} is the last \ats completed):
     192It schedules \ats as follows (where \textit{t} is the last \at completed):
    193193\begin{displayquote}
    194194        \begin{enumerate}
  • doc/theses/thierry_delisle_PhD/thesis/text/front.tex

    r7a0f798b ra44514e  
    124124
    125125User-Level threading (M:N) is gaining popularity over kernel-level threading (1:1) in many programming languages.
    126 The user threading approach is often a better mechanism to express complex concurrent applications by efficiently running 10,000+ threads on multi-core systems.
     126The user threading approach is often a better mechanism to express complex concurrent applications by efficiently running 10,000+ threads on multicore systems.
    127127Indeed, over-partitioning into small work-units with user threading significantly eases load bal\-ancing, while simultaneously providing advanced synchronization and mutual exclusion capabilities.
    128128To manage these high levels of concurrency, the underlying runtime must efficiently schedule many user threads across a few kernel threads;
     
    135135This thesis analyses multiple scheduler systems, where each system attempts to fulfill the necessary requirements for user-level threading.
    136136The predominant technique for managing high levels of concurrency is sharding the ready-queue with one queue per kernel-thread and using some form of work stealing/sharing to dynamically rebalance workload shifts.
    137 Preventing kernel blocking is accomplish by transforming kernel locks and I/O operations into user-level operations that do not block the kernel thread or spin up new kernel threads to manage the blocking.
     137Preventing kernel blocking is accomplished by transforming kernel locks and I/O operations into user-level operations that do not block the kernel thread or spin up new kernel threads to manage the blocking.
    138138Fairness is handled through preemption and/or ad-hoc solutions, which leads to coarse-grained fairness with some pathological cases.
    139139
     
    146146The new scheduler also includes support for implicit nonblocking \io, allowing applications to have more user-threads blocking on \io operations than there are \glspl{kthrd}.
    147147The implementation is based on @io_uring@, a recent addition to the Linux kernel, and achieves the same performance and fairness as systems using @select@, @epoll@, \etc.
    148 To complete the scheduler, an idle sleep mechanism is implemented that significantly reduces wasted CPU cycles, which are then available outside of the application.
     148To complete the scheduler, an idle sleep mechanism is implemented that significantly reduces wasted CPU cycles, which are then available outside the application.
    149149
    150150\cleardoublepage
  • doc/theses/thierry_delisle_PhD/thesis/text/intro.tex

    r7a0f798b ra44514e  
    22
    33\Gls{uthrding} (M:N) is gaining popularity over kernel-level threading (1:1) in many programming languages.
    4 The user threading approach is often a better mechanism to express complex concurrent applications by efficiently running 10,000+ threads on multi-core systems.
     4The user threading approach is often a better mechanism to express complex concurrent applications by efficiently running 10,000+ threads on multicore systems.
    55Indeed, over-partitioning into small work-units with user threading significantly eases load bal\-ancing, while simultaneously providing advanced synchronization and mutual exclusion capabilities.
    66To manage these high levels of concurrency, the underlying runtime must efficiently schedule many user threads across a few kernel threads;
    7 which begs of the question of how many kernel threads are needed and should the number be dynamically reevaluated.
     7which begs the question of how many kernel threads are needed and whether the number should be dynamically reevaluated.
    88Furthermore, scheduling must prevent kernel threads from blocking, otherwise user-thread parallelism drops.
    99When user-threading parallelism does drop, how and when should idle kernel-threads be put to sleep to avoid wasting CPU resources?
     
    1313This thesis analyses multiple scheduler systems, where each system attempts to fulfill the necessary requirements for \gls{uthrding}.
    1414The predominant technique for managing high levels of concurrency is sharding the ready-queue with one queue per kernel-thread and using some form of work stealing/sharing to dynamically rebalance workload shifts.
    15 Preventing kernel blocking is accomplish by transforming kernel locks and I/O operations into user-level operations that do not block the kernel thread or spin up new kernel threads to manage the blocking.
     15Preventing kernel blocking is accomplished by transforming kernel locks and I/O operations into user-level operations that do not block the kernel thread or spin up new kernel threads to manage the blocking.
    1616Fairness is handled through preemption and/or ad-hoc solutions, which leads to coarse-grained fairness with some pathological cases.
    1717
    1818After examining, testing and selecting specific approaches to these scheduling issues, a completely new scheduler was created and tested in the \CFA (C-for-all) user-threading runtime-system.
    1919The goal of the new scheduler is to offer increased safety and productivity without sacrificing performance.
    20 The quality of the new scheduler is demonstrated by comparing it with other user-threading work-stealing schedulers with the aim of showing equivalent or better performance while offering better fairness.
     20The quality of the new scheduler is demonstrated by comparing it with other user-threading work-stealing schedulers, with the aim of showing equivalent or better performance while offering better fairness.
    2121
    2222Chapter~\ref{intro} defines scheduling and its general goals.
    2323Chapter~\ref{existing} discusses how scheduler implementations attempt to achieve these goals, but all implementations optimize some workloads better than others.
    24 Chapter~\ref{cfaruntime} presents the relevant aspects of the \CFA runtime system that have a significant affect on the new scheduler design and implementation.
     24Chapter~\ref{cfaruntime} presents the relevant aspects of the \CFA runtime system that have a significant effect on the new scheduler design and implementation.
    2525Chapter~\ref{core} analyses different scheduler approaches, while looking for scheduler mechanisms that provide both performance and fairness.
    2626Chapter~\ref{userio} covers the complex mechanisms that must be used to achieve nonblocking I/O to prevent the blocking of \glspl{kthrd}.
     
    3131\section{Scheduling}\label{sched}
    3232Computer systems share multiple resources across many threads of execution, even on single-user computers like laptops or smartphones.
    33 On a computer system with multiple processors and work units (routines, coroutines, threads, programs, \etc), there exists the problem of mapping many different kinds of work units onto many different kinds of processors in an efficient manner, called \newterm{scheduling}.
     33On a computer system with multiple processors and work units (routines, coroutines, threads, programs, \etc), there exists the problem of mapping many different kinds of work units onto many different kinds of processors efficiently, called \newterm{scheduling}.
    3434Scheduling systems are normally \newterm{open}, meaning new work arrives from an external source or is randomly spawned from an existing work unit.
    3535In general, work units without threads, like routines and coroutines, are self-scheduling, while work units with threads, like tasks and programs, are scheduled.
     
    3939However, optimal solutions are often not required: schedulers often produce excellent solutions, without needing optimality, by taking advantage of regularities in work patterns.
    4040
    41 Scheduling occurs at discreet points when there are transitions in a system.
    42 For example, a thread cycles through the following transitions during its execution.
     41Scheduling occurs at discrete points when there are transitions in a system.
     42For example, a \at cycles through the following transitions during its execution.
    4343\begin{center}
    4444\input{executionStates.pstex_t}
     
    4949entering the system (new $\rightarrow$ ready)
    5050\item
    51 scheduler assigns a thread to a computing resource, \eg CPU (ready $\rightarrow$ running)
     51scheduler assigns a \at to a computing resource, \eg CPU (ready $\rightarrow$ running)
    5252\item
    5353timer alarm for preemption (running $\rightarrow$ ready)
     
    5959normal completion or error, \eg segment fault (running $\rightarrow$ halted)
    6060\end{itemize}
    61 Key to scheduling is that a thread cannot bypass the ``ready'' state during a transition so the scheduler maintains complete control of the system, \ie no self-scheduling among threads.
     61Key to scheduling is that a \at cannot bypass the ``ready'' state during a transition so the scheduler maintains complete control of the system, \ie no self-scheduling among threads.
    6262
    6363When the workload exceeds the capacity of the processors, \ie work cannot be executed immediately, it is placed on a queue for subsequent service, called a \newterm{ready queue}.
     
    7171\end{tabular}
    7272\end{center}
    73 Beyond these two schedulers are a host of options, \eg adding an global shared queue to MQMS or adding multiple private queues with distinc characteristics.
     73Beyond these two schedulers are a host of options, \eg adding a global shared queue to MQMS or adding multiple private queues with distinct characteristics.
    7474
    7575Once there are multiple resources and ready queues, a scheduler is faced with three major optimization criteria:
     
    8686Essentially, all multi-processor computers have non-uniform memory access (NUMA), with one or more quantized steps to access data at different levels in the memory hierarchy.
    8787When a system has a large number of independently executing threads, affinity becomes difficult because of \newterm{thread churn}.
    88 That is, threads must be scheduled on different processors to obtain high processors utilization because the number of threads $\ggg$ processors.
     88That is, threads must be scheduled on different processors to obtain high processor utilization because the number of threads $\ggg$ processors.
    8989
    9090\item
     
    118118More specifically, safety and productivity for scheduling means supporting a wide range of workloads so that programmers can rely on progress guarantees (safety) and more easily achieve acceptable performance (productivity).
    119119The new scheduler also includes support for implicit nonblocking \io, allowing applications to have more user-threads blocking on \io operations than there are \glspl{kthrd}.
    120 To complete the scheduler, an idle sleep mechanism is implemented that significantly reduces wasted CPU cycles, which are then available outside of the application.
     120To complete the scheduler, an idle sleep mechanism is implemented that significantly reduces wasted CPU cycles, which are then available outside the application.
    121121
    122122As a research project, this work builds exclusively on newer versions of the Linux operating-system and gcc/clang compilers.
  • doc/theses/thierry_delisle_PhD/thesis/text/io.tex

    r7a0f798b ra44514e  
    11\chapter{User Level \io}\label{userio}
    2 As mentioned in Section~\ref{prev:io}, user-level \io requires multiplexing the \io operations of many \glspl{thrd} onto fewer \glspl{proc} using asynchronous \io operations.
     2As mentioned in Section~\ref{prev:io}, user-level \io requires multiplexing the \io operations of many \ats onto fewer \glspl{proc} using asynchronous \io operations.
    33Different operating systems offer various forms of asynchronous operations and, as mentioned in Chapter~\ref{intro}, this work is exclusively focused on the Linux operating-system.
    44
     
    1414It does not mean an operation returning \lstinline{EAGAIN} succeeds on the next try.
    1515For example, a ready read may only return a subset of requested bytes and the read must be issued again for the remaining bytes, at which point it may return \lstinline{EAGAIN}.}
    16 This mechanism is also crucial in determining when all \glspl{thrd} are blocked and the application \glspl{kthrd} can now block.
     16This mechanism is also crucial in determining when all \ats are blocked and the application \glspl{kthrd} can now block.
    1717
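
The retry behaviour described in the footnote can be made concrete with a small C helper for a nonblocking file descriptor; the function name and policy are illustrative.

    #include <errno.h>
    #include <unistd.h>

    // Read up to len bytes from a nonblocking fd.  A "ready" fd does not
    // guarantee a full read: the call may return fewer bytes than requested,
    // and a retry may then fail with EAGAIN until the fd is ready again.
    static ssize_t read_some( int fd, char * buf, size_t len ) {
        size_t done = 0;
        while ( done < len ) {
            ssize_t r = read( fd, buf + done, len - done );
            if ( r > 0 ) { done += (size_t)r; continue; }          // partial read: keep going
            if ( r == 0 ) break;                                   // end of file / peer closed
            if ( errno == EINTR ) continue;                        // interrupted: retry immediately
            if ( errno == EAGAIN || errno == EWOULDBLOCK ) break;  // wait for the next readiness event
            return -1;                                             // real error
        }
        return (ssize_t)done;
    }
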
    1818There are three options to monitor file descriptors in Linux:\footnote{
     
    3333Often the I/O manager has a timeout, polls, or is sent a signal on changes to mitigate this problem.
    3434
    35 \begin{comment}
    36 From: Tim Brecht <brecht@uwaterloo.ca>
    37 Subject: Re: FD sets
    38 Date: Wed, 6 Jul 2022 00:29:41 +0000
    39 
    40 Large number of open files
    41 --------------------------
    42 
    43 In order to be able to use more than the default number of open file
    44 descriptors you may need to:
    45 
    46 o increase the limit on the total number of open files /proc/sys/fs/file-max
    47   (on Linux systems)
    48 
    49 o increase the size of FD_SETSIZE
    50   - the way I often do this is to figure out which include file __FD_SETSIZE
    51     is defined in, copy that file into an appropriate directory in ./include,
    52     and then modify it so that if you use -DBIGGER_FD_SETSIZE the larger size
    53     gets used
    54 
    55   For example on a RH 9.0 distribution I've copied
    56   /usr/include/bits/typesizes.h into ./include/i386-linux/bits/typesizes.h
    57 
    58   Then I modify typesizes.h to look something like:
    59 
    60   #ifdef BIGGER_FD_SETSIZE
    61   #define __FD_SETSIZE            32767
    62   #else
    63   #define __FD_SETSIZE            1024
    64   #endif
    65 
    66   Note that the since I'm moving and testing the userver on may different
    67   machines the Makefiles are set up to use -I ./include/$(HOSTTYPE)
    68 
    69   This way if you redefine the FD_SETSIZE it will get used instead of the
    70   default original file.
    71 \end{comment}
     35% \begin{comment}
     36% From: Tim Brecht <brecht@uwaterloo.ca>
     37% Subject: Re: FD sets
     38% Date: Wed, 6 Jul 2022 00:29:41 +0000
     39
     40% Large number of open files
     41% --------------------------
     42
     43% In order to be able to use more than the default number of open file
     44% descriptors you may need to:
     45
     46% o increase the limit on the total number of open files /proc/sys/fs/file-max
     47%   (on Linux systems)
     48
     49% o increase the size of FD_SETSIZE
     50%   - the way I often do this is to figure out which include file __FD_SETSIZE
     51%     is defined in, copy that file into an appropriate directory in ./include,
     52%     and then modify it so that if you use -DBIGGER_FD_SETSIZE the larger size
     53%     gets used
     54
     55%   For example on a RH 9.0 distribution I've copied
     56%   /usr/include/bits/typesizes.h into ./include/i386-linux/bits/typesizes.h
     57
     58%   Then I modify typesizes.h to look something like:
     59
     60%   #ifdef BIGGER_FD_SETSIZE
     61%   #define __FD_SETSIZE            32767
     62%   #else
     63%   #define __FD_SETSIZE            1024
     64%   #endif
     65
     66%   Note that the since I'm moving and testing the userver on may different
     67%   machines the Makefiles are set up to use -I ./include/$(HOSTTYPE)
     68
     69%   This way if you redefine the FD_SETSIZE it will get used instead of the
     70%   default original file.
     71% \end{comment}
    7272
    7373\paragraph{\lstinline{poll}} is the next oldest option, and takes as input an array of structures containing the FD numbers rather than their position in an array of bits, allowing a more compact input for interest sets that contain widely spaced FDs.
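
A minimal C example of the interface just described; the descriptors and the timeout are illustrative.

    #include <poll.h>

    // With poll, the interest set is an array of pollfd structures, so its size
    // scales with the number of FDs of interest, not with the largest FD number.
    static int wait_readable( int fd_a, int fd_b ) {
        struct pollfd fds[2] = {
            { .fd = fd_a, .events = POLLIN },
            { .fd = fd_b, .events = POLLIN },
        };
        int n = poll( fds, 2, 1000 );      // timeout in milliseconds
        // on return, fds[i].revents indicates which descriptors are ready
        return n;                          // 0 = timeout, -1 = error, otherwise ready count
    }
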
     
    139139\subsection{Extra Kernel Threads}\label{io:morethreads}
    140140Finally, if the operating system does not offer a satisfactory form of asynchronous \io operations, an ad-hoc solution is to create a pool of \glspl{kthrd} and delegate operations to it to avoid blocking \glspl{proc}, which is a compromise for multiplexing.
    141 In the worst case, where all \glspl{thrd} are consistently blocking on \io, it devolves into 1-to-1 threading.
    142 However, regardless of the frequency of \io operations, it achieves the fundamental goal of not blocking \glspl{proc} when \glspl{thrd} are ready to run.
     141In the worst case, where all \ats are consistently blocking on \io, it devolves into 1-to-1 threading.
     142However, regardless of the frequency of \io operations, it achieves the fundamental goal of not blocking \glspl{proc} when \ats are ready to run.
    143143This approach is used by languages like Go~\cite{GITHUB:go}, frameworks like libuv~\cite{libuv}, and web servers like Apache~\cite{apache} and NGINX~\cite{nginx}, since it has the advantage that it can easily be used across multiple operating systems.
    144144This advantage is especially relevant for languages like Go, which offer a homogeneous \glsxtrshort{api} across all platforms.
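
The following C sketch illustrates the delegation idea; the request type, helper names, and the one-thread-per-request structure are simplifications for exposition, not how Go, libuv, or the \CFA runtime implement the pattern.

\begin{lstlisting}
#include <pthread.h>
#include <semaphore.h>
#include <unistd.h>

typedef struct {
	int fd; void * buf; size_t len;                    // the blocking operation to perform
	ssize_t result;                                    // filled in by the worker
	sem_t done;                                        // signalled when the operation completes
} io_req;

static void * io_worker( void * arg ) {                // runs on a pool kernel thread
	io_req * req = arg;
	req->result = read( req->fd, req->buf, req->len ); // blocks only this kernel thread
	sem_post( &req->done );
	return NULL;
}

static ssize_t delegated_read( int fd, void * buf, size_t len ) {
	io_req req = { .fd = fd, .buf = buf, .len = len };
	sem_init( &req.done, 0, 0 );
	pthread_t worker;                                  // stand-in for handing the request to an existing pool
	pthread_create( &worker, NULL, io_worker, &req );
	sem_wait( &req.done );                             // a user-level runtime parks the calling thread here
	pthread_join( worker, NULL );
	sem_destroy( &req.done );
	return req.result;
}
\end{lstlisting}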
     
    155155\section{Event-Engine}
    156156An event engine's responsibility is to use the kernel interface to multiplex many \io operations onto few \glspl{kthrd}.
    157 In concrete terms, this means \glspl{thrd} enter the engine through an interface, the event engine then starts an operation and parks the calling \glspl{thrd}, returning control to the \gls{proc}.
    158 The parked \glspl{thrd} are then rescheduled by the event engine once the desired operation has completed.
     157In concrete terms, this means \ats enter the engine through an interface; the event engine then starts an operation and parks the calling \ats, returning control to the \gls{proc}.
     158The parked \ats are then rescheduled by the event engine once the desired operation has completed.
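
The shape of this interface can be sketched as follows; the names are hypothetical and all synchronization is elided, the only point being the control flow of submit, \park, and resume.

\begin{lstlisting}
#include <stddef.h>
#include <sys/types.h>                              // ssize_t

struct future;                                      // holds the result and the parked thread
enum io_op { IO_READ, IO_WRITE };

struct future * engine_submit( enum io_op, int fd, void * buf, size_t len ); // start the operation
void park( void );                                  // hand control back to the processor
ssize_t future_result( struct future * );           // result recorded on completion

ssize_t async_read( int fd, void * buf, size_t len ) {
	struct future * f = engine_submit( IO_READ, fd, buf, len );
	park();                                         // rescheduled by the engine once the operation completes
	return future_result( f );
}
\end{lstlisting}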
    159159
    160160\subsection{\lstinline{io_uring} in depth}\label{iouring}
     
    268268\subsubsection{Private Instances}
    269269The private approach creates one ring instance per \gls{proc}, \ie one-to-one coupling.
    270 This alleviates the need for synchronization on the submissions, requiring only that \glspl{thrd} are not time-sliced during submission steps.
    271 This requirement is the same as accessing @thread_local@ variables, where a \gls{thrd} is accessing kernel-thread data, is time-sliced, and continues execution on another kernel thread but is now accessing the wrong data.
     270This alleviates the need for synchronization on the submissions, requiring only that \ats are not time-sliced during submission steps.
     271This requirement is the same as accessing @thread_local@ variables, where a \at is accessing kernel-thread data, is time-sliced, and continues execution on another kernel thread but is now accessing the wrong data.
    272272This failure is the serially reusable problem~\cite{SeriallyReusable}.
    273273Hence, allocated SQEs must be submitted to the same ring on the same \gls{proc}, which effectively forces the application to submit SQEs in allocation order.\footnote{
    274 To remove this requirement, a \gls{thrd} needs the ability to ``yield to a specific \gls{proc}'', \ie, park with the guarantee it unparks on a specific \gls{proc}, \ie the \gls{proc} attached to the correct ring.}
     274To remove this requirement, a \at needs the ability to ``yield to a specific \gls{proc}'', \ie, \park with the guarantee it unparks on a specific \gls{proc}, \ie the \gls{proc} attached to the correct ring.}
    275275From the subsystem's point of view, the allocation and submission are sequential, greatly simplifying both.
    276276In this design, allocation and submission form a partitioned ring buffer as shown in Figure~\ref{fig:pring}.
    277277Once added to the ring buffer, the attached \gls{proc} has a significant amount of flexibility with regards to when to perform the system call.
    278 Possible options are: when the \gls{proc} runs out of \glspl{thrd} to run, after running a given number of \glspl{thrd}, \etc.
     278Possible options are: when the \gls{proc} runs out of \ats to run, after running a given number of \ats, \etc.
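
For example, assuming the liburing helper library, the per-\gls{proc} fast path looks roughly as follows; because the ring is private, no locking is needed, only the guarantee that the \at is not migrated between allocation and submission.

\begin{lstlisting}
#include <liburing.h>

static void submit_read( struct io_uring * ring, int fd, void * buf, unsigned len, void * waiter ) {
	struct io_uring_sqe * sqe = io_uring_get_sqe( ring ); // allocation step; NULL if the ring is full
	if ( sqe == NULL ) return;                            // out of SQEs: the problem case discussed below
	io_uring_prep_read( sqe, fd, buf, len, 0 );           // fill in the request
	io_uring_sqe_set_data( sqe, waiter );                 // identifies the parked thread on completion
	io_uring_submit( ring );                              // the processor may instead batch and delay this call
}
\end{lstlisting}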
    279279
    280280\begin{figure}
     
    288288
    289289This approach has the advantage that it does not require much of the synchronization needed in a shared approach.
    290 However, this benefit means \glspl{thrd} submitting \io operations have less flexibility: they cannot park or yield, and several exceptional cases are handled poorly.
    291 Instances running out of SQEs cannot run \glspl{thrd} wanting to do \io operations.
    292 In this case, the \io \gls{thrd} needs to be moved to a different \gls{proc}, and the only current way of achieving this is to @yield()@ hoping to be scheduled on a different \gls{proc} with free SQEs, which is not guaranteed.
     290However, this benefit means \ats submitting \io operations have less flexibility: they cannot \park or yield, and several exceptional cases are handled poorly.
     291Instances running out of SQEs cannot run \ats wanting to do \io operations.
     292In this case, the \io \at needs to be moved to a different \gls{proc}, and the only current way of achieving this is to @yield()@ hoping to be scheduled on a different \gls{proc} with free SQEs, which is not guaranteed.
    293293
    294294A more involved version of this approach tries to solve these problems using a pattern called \newterm{helping}.
    295 \Glspl{thrd} that cannot submit \io operations, either because of an allocation failure or migration to a different \gls{proc} between allocation and submission, create an \io object and add it to a list of pending submissions per \gls{proc} and a list of pending allocations, probably per cluster.
    296 While there is still the strong coupling between \glspl{proc} and @io_uring@ instances, these data structures allow moving \glspl{thrd} to a specific \gls{proc}, when the current \gls{proc} cannot fulfill the \io request.
    297 
    298 Imagine a simple scenario with two \glspl{thrd} on two \glspl{proc}, where one \gls{thrd} submits an \io operation and then sets a flag, while the other \gls{thrd} spins until the flag is set.
    299 Assume both \glspl{thrd} are running on the same \gls{proc}, and the \io \gls{thrd} is preempted between allocation and submission, moved to the second \gls{proc}, and the original \gls{proc} starts running the spinning \gls{thrd}.
    300 In this case, the helping solution has the \io \gls{thrd} append an \io object to the submission list of the first \gls{proc}, where the allocation was made.
    301 No other \gls{proc} can help the \gls{thrd} since @io_uring@ instances are strongly coupled to \glspl{proc}.
    302 However, the \io \gls{proc} is unable to help because it is executing the spinning \gls{thrd} resulting in a deadlock.
    303 While this example is artificial, in the presence of many \glspl{thrd}, it is possible for this problem to arise ``in the wild''.
     295\ats that cannot submit \io operations, either because of an allocation failure or \glslink{atmig}{migration} to a different \gls{proc} between allocation and submission, create an \io object and add it to a list of pending submissions per \gls{proc} and a list of pending allocations, probably per cluster.
     296While there is still the strong coupling between \glspl{proc} and @io_uring@ instances, these data structures allow moving \ats to a specific \gls{proc}, when the current \gls{proc} cannot fulfill the \io request.
     297
     298Imagine a simple scenario with two \ats on two \glspl{proc}, where one \at submits an \io operation and then sets a flag, while the other \at spins until the flag is set.
     299Assume both \ats are running on the same \gls{proc}, and the \io \at is preempted between allocation and submission, moved to the second \gls{proc}, and the original \gls{proc} starts running the spinning \at.
     300In this case, the helping solution has the \io \at append an \io object to the submission list of the first \gls{proc}, where the allocation was made.
     301No other \gls{proc} can help the \at since @io_uring@ instances are strongly coupled to \glspl{proc}.
     302However, the \io \gls{proc} is unable to help because it is executing the spinning \at, resulting in a deadlock.
     303While this example is artificial, in the presence of many \ats, it is possible for this problem to arise ``in the wild''.
    304304Furthermore, this pattern is difficult to reliably detect and avoid.
    305 Once in this situation, the only escape is to interrupted the spinning \gls{thrd}, either directly or via some regular preemption, \eg time slicing.
    306 Having to interrupt \glspl{thrd} for this purpose is costly, the latency can be large between interrupts, and the situation may be hard to detect.
     305Once in this situation, the only escape is to interrupt the spinning \at, either directly or via some regular preemption, \eg time slicing.
     306Having to interrupt \ats for this purpose is costly, the latency can be large between interrupts, and the situation may be hard to detect.
    307307Interrupts are needed here entirely because the \gls{proc} is tied to an instance it is not using.
    308 Therefore, a more satisfying solution is for the \gls{thrd} submitting the operation to notice that the instance is unused and simply go ahead and use it.
     308Therefore, a more satisfying solution is for the \at submitting the operation to notice that the instance is unused and simply go ahead and use it.
    309309This approach is presented shortly.
    310310
    311311\subsubsection{Public Instances}
    312312The public approach creates decoupled pools of @io_uring@ instances and processors, \ie without one-to-one coupling.
    313 \Glspl{thrd} attempting an \io operation pick one of the available instances and submit the operation to that instance.
    314 Since there is no coupling between @io_uring@ instances and \glspl{proc} in this approach, \glspl{thrd} running on more than one \gls{proc} can attempt to submit to the same instance concurrently.
     313\ats attempting an \io operation pick one of the available instances and submit the operation to that instance.
     314Since there is no coupling between @io_uring@ instances and \glspl{proc} in this approach, \ats running on more than one \gls{proc} can attempt to submit to the same instance concurrently.
    315315Because @io_uring@ effectively sets the amount of sharding needed to avoid contention on its internal locks, performance in this approach is based on two aspects:
    316316\begin{itemize}
     
    327327The only added complexity is that the number of SQEs is fixed, which means allocation can fail.
    328328
    329 Allocation failures need to be pushed to a routing algorithm: \glspl{thrd} attempting \io operations must not be directed to @io_uring@ instances without sufficient SQEs available.
     329Allocation failures need to be pushed to a routing algorithm: \ats attempting \io operations must not be directed to @io_uring@ instances without sufficient SQEs available.
    330330Furthermore, the routing algorithm should block operations up-front, if none of the instances have available SQEs.
    331331
    332 Once an SQE is allocated, \glspl{thrd} insert the \io request information, and keep track of the SQE index and the instance it belongs to.
     332Once an SQE is allocated, \ats insert the \io request information, and keep track of the SQE index and the instance it belongs to.
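
A sketch of this bookkeeping follows; the @instance@ type and @try_alloc_sqe@ helper are assumed names for this illustration, not part of @io_uring@ or the \CFA runtime.

\begin{lstlisting}
#include <stdbool.h>

struct instance;                                          // a shared io_uring ring plus bookkeeping
bool try_alloc_sqe( struct instance *, unsigned * idx );  // assumed helper: reserve one SQE slot

struct alloc_token { struct instance * inst; unsigned idx; }; // remembers where the SQE lives

static bool route_and_alloc( struct instance * pool[], unsigned n, struct alloc_token * out ) {
	for ( unsigned i = 0; i < n; i += 1 ) {
		if ( try_alloc_sqe( pool[i], &out->idx ) ) {      // skip instances without free SQEs
			out->inst = pool[i];
			return true;
		}
	}
	return false;                  // no SQEs anywhere: the routing layer must block the operation up-front
}
\end{lstlisting}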
    333333
    334334Once an SQE is filled in, it is added to the submission ring buffer, an operation that is not thread-safe, and then the kernel must be notified using the @io_uring_enter@ system call.
     
    338338Since multiple SQEs can be submitted to the kernel at once, it is important to strike a balance between batching and latency.
    339339Operations that are ready to be submitted should be batched together in a few system calls, but at the same time, operations should not be left pending for long periods of time before being submitted.
    340 Balancing submission can be handled by either designating one of the submitting \glspl{thrd} as the being responsible for the system call for the current batch of SQEs or by having some other party regularly submitting all ready SQEs, \eg, the poller \gls{thrd} mentioned later in this section.
    341 
    342 Ideally, when multiple \glspl{thrd} attempt to submit operations to the same @io_uring@ instance, all requests should be batched together and one of the \glspl{thrd} is designated to do the system call on behalf of the others, called the \newterm{submitter}.
     340Balancing submission can be handled by either designating one of the submitting \ats as being responsible for the system call for the current batch of SQEs or by having some other party regularly submit all ready SQEs, \eg, the poller \at mentioned later in this section.
     341
     342Ideally, when multiple \ats attempt to submit operations to the same @io_uring@ instance, all requests should be batched together and one of the \ats is designated to do the system call on behalf of the others, called the \newterm{submitter}.
    343343However, in practice, \io requests must be handled promptly, so there is a need to guarantee everything missed by the current submitter is seen by the next one.
    344 Indeed, as long as there is a ``next'' submitter, \glspl{thrd} submitting new \io requests can move on, knowing that some future system call includes their request.
     344Indeed, as long as there is a ``next'' submitter, \ats submitting new \io requests can move on, knowing that some future system call includes their request.
    345345Once the system call is done, the submitter must also free SQEs so that the allocator can reuse them.
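
One way to provide this ``next submitter'' guarantee is a small flag-based handoff, sketched below; this is an illustrative pattern rather than the exact \CFA implementation, and @do_enter@ stands in for the @io_uring_enter@ call plus the SQE clean-up.

\begin{lstlisting}
#include <stdatomic.h>
#include <stdbool.h>

struct shared_ring {
	atomic_bool submitting;                     // true while some thread owns the system call
	atomic_bool dirty;                          // set whenever new SQEs have been published
};

void do_enter( struct shared_ring * );          // assumed helper: io_uring_enter and freeing SQEs

void request_submit( struct shared_ring * r ) {
	atomic_store( &r->dirty, true );            // publish: there is work to submit
	for ( ;; ) {
		bool expected = false;
		if ( !atomic_compare_exchange_strong( &r->submitting, &expected, true ) )
			return;                             // a submitter exists; it or the next one sees 'dirty'
		while ( atomic_exchange( &r->dirty, false ) )
			do_enter( r );                      // keep submitting while new work keeps arriving
		atomic_store( &r->submitting, false );  // release the submitter role
		if ( !atomic_load( &r->dirty ) ) return; // nothing was missed
		// otherwise retry to become the submitter again, so no request is left pending forever
	}
}
\end{lstlisting}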
    346346
    347347Finally, the completion side is much simpler since the @io_uring@ system-call enforces a natural synchronization point.
    348 Polling simply needs to regularly do the system call, go through the produced CQEs and communicate the result back to the originating \glspl{thrd}.
     348Polling simply needs to regularly do the system call, go through the produced CQEs and communicate the result back to the originating \ats.
    349349Since CQEs only carry a signed 32-bit result, in addition to a copy of the @user_data@ field, all that is needed to communicate the result is a simple future~\cite{wiki:future}.
    350350If the submission side does not designate submitters, polling can also submit all SQEs as it is polling events.
    351 A simple approach to polling is to allocate a \gls{thrd} per @io_uring@ instance and simply let the poller \glspl{thrd} poll their respective instances when scheduled.
     351A simple approach to polling is to allocate a \at per @io_uring@ instance and simply let the poller \ats poll their respective instances when scheduled.
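
Assuming liburing and a future type whose @fulfil@ operation records the result and reschedules the waiting \at (both hypothetical names here), the poller loop reduces to the following sketch.

\begin{lstlisting}
#include <liburing.h>

struct future;                                    // assumed: result slot plus parked thread
void fulfil( struct future *, int result );       // assumed: record the result and unpark the thread

static void poll_completions( struct io_uring * ring ) {
	struct io_uring_cqe * cqe;
	unsigned head, seen = 0;
	io_uring_for_each_cqe( ring, head, cqe ) {            // walk the CQEs currently available
		fulfil( io_uring_cqe_get_data( cqe ), cqe->res ); // hand back the signed 32-bit result
		seen += 1;
	}
	io_uring_cq_advance( ring, seen );                    // mark the CQEs as consumed
}
\end{lstlisting}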
    352352
    353353With the pool of shared instances approach, the big advantage is that it is fairly flexible.
    354 It does not impose restrictions on what \glspl{thrd} submitting \io operations can and cannot do between allocations and submissions.
     354It does not impose restrictions on what \ats submitting \io operations can and cannot do between allocations and submissions.
    355355It can also gracefully handle running out of resources, either SQEs or the kernel returning @EBUSY@.
    356356The downside to this approach is that many of the steps used for submitting need complex synchronization to work properly.
    357 The routing and allocation algorithm needs to keep track of which ring instances have available SQEs, block incoming requests if no instance is available, prevent barging if \glspl{thrd} are already queued up waiting for SQEs and handle SQEs being freed.
     357The routing and allocation algorithm needs to keep track of which ring instances have available SQEs, block incoming requests if no instance is available, prevent barging if \ats are already queued up waiting for SQEs, and handle SQEs being freed.
    358358The submission side needs to safely append SQEs to the ring buffer, correctly handle chains, make sure no SQE is dropped or left pending forever, notify the allocation side when SQEs can be reused, and handle the kernel returning @EBUSY@.
    359359All this synchronization has a significant cost, and compared to the private-instance approach, this synchronization is entirely overhead.
     
    370370
    371371In this approach, each cluster (see Figure~\ref{fig:system}) owns a pool of @io_uring@ instances managed by an \newterm{arbiter}.
    372 When a \gls{thrd} attempts to issue an \io operation, it ask for an instance from the arbiter and issues requests to that instance.
    373 This instance is now bound to the \gls{proc} the \gls{thrd} is running on.
     372When a \at attempts to issue an \io operation, it asks for an instance from the arbiter and issues requests to that instance.
     373This instance is now bound to the \gls{proc} the \at is running on.
    374374This binding is kept until the arbiter decides to revoke it, taking back the instance and reverting the \gls{proc} to its initial state with respect to \io.
    375375This tight coupling means that synchronization can be minimal since only one \gls{proc} can use the instance at a time, akin to the private instances approach.
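
In pseudo-C, the common path looks as follows; the @instance@ type, the @ring@ field, and @arbiter_acquire@ are illustrative names for this sketch, not the runtime's actual identifiers.

\begin{lstlisting}
struct instance;                                             // an io_uring ring plus bookkeeping
struct arbiter;                                              // owns the cluster-wide pool of instances
struct processor { struct instance * ring; };                // the binding, if any

struct instance * arbiter_acquire( struct arbiter *, struct processor * ); // assumed slow-path helper

static struct instance * current_instance( struct processor * p, struct arbiter * a ) {
	if ( p->ring == NULL )                    // first I/O on this processor, or the instance was revoked
		p->ring = arbiter_acquire( a, p );    // slow path: involve the arbiter
	return p->ring;                           // fast path: no synchronization with other processors
}
\end{lstlisting}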
     
    380380        \item The current \gls{proc} does not hold an instance.
    381381        \item The current instance does not have sufficient SQEs to satisfy the request.
    382         \item The current \gls{proc} has a wrong instance, this happens if the submitting \gls{thrd} context-switched between allocation and submission, called \newterm{external submissions}.
     382        \item The current \gls{proc} has the wrong instance; this happens if the submitting \at context-switched between allocation and submission, called \newterm{external submissions}.
    383383\end{enumerate}
    384384However, even when the arbiter is not directly needed, \glspl{proc} need to make sure that their instance ownership is not being revoked, which is accomplished by a lock-\emph{less} handshake.\footnote{
  • doc/theses/thierry_delisle_PhD/thesis/text/practice.tex

    r7a0f798b ra44514e  
    108108Because idle sleep is spurious, this data structure has strict performance requirements, in addition to strict correctness requirements.
    109109Next, some mechanism is needed to block \glspl{kthrd}, \eg @pthread_cond_wait@ on a pthread condition variable.
    110 The complexity here is to support \at parking and unparking, user-level locking, timers, \io operations, and all other \CFA features with minimal complexity.
     110The complexity here is to support \at \glslink{atblock}{parking} and \glslink{atsched}{unparking}, user-level locking, timers, \io operations, and all other \CFA features with minimal complexity.
    111111Finally, the scheduler needs a heuristic to determine when to block and unblock an appropriate number of \procs.
    112112However, this third challenge is outside the scope of this thesis because developing a general heuristic is complex enough to justify its own work.
     
    125125
    126126\subsection{\lstinline{pthread_mutex}/\lstinline{pthread_cond}}
    127 The classic option is to use some combination of the pthread mutual exclusion and synchronization locks, allowing a safe park/unpark of a \gls{kthrd} to/from a @pthread_cond@.
     127The classic option is to use some combination of the pthread mutual exclusion and synchronization locks, allowing a safe \park/\unpark of a \gls{kthrd} to/from a @pthread_cond@.
    128128While this approach works for \glspl{kthrd} waiting among themselves, \io operations do not provide a mechanism to signal @pthread_cond@s.
    129129For \io results to wake a \proc waiting on a @pthread_cond@, a different \gls{kthrd} must be woken up first, which then signals the \proc.
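
A minimal C sketch of the \park/\unpark among \glspl{kthrd} follows; the flag guards against both spurious wake-ups and a wake-up posted before the sleeper actually blocks, but note that nothing in an \io completion can invoke the wake side directly, which is the limitation discussed above.

\begin{lstlisting}
#include <pthread.h>
#include <stdbool.h>

struct idle_gate {
	pthread_mutex_t lock;
	pthread_cond_t  cond;
	bool wake;                                // pending wake-up
};

void kthrd_park( struct idle_gate * g ) {
	pthread_mutex_lock( &g->lock );
	while ( ! g->wake )                       // re-check: spurious wake-ups are allowed
		pthread_cond_wait( &g->cond, &g->lock );
	g->wake = false;
	pthread_mutex_unlock( &g->lock );
}

void kthrd_unpark( struct idle_gate * g ) {
	pthread_mutex_lock( &g->lock );
	g->wake = true;
	pthread_cond_signal( &g->cond );
	pthread_mutex_unlock( &g->lock );
}
\end{lstlisting}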
     
    137137
    138138\subsection{Event FDs}
    139 Another interesting approach is to use an event file descriptor\cite{eventfd}.
     139Another interesting approach is to use an event file descriptor\cite{MAN:eventfd}.
    140140This Linux feature is a file descriptor that behaves like \io, \ie, uses @read@ and @write@, but also behaves like a semaphore.
    141141Indeed, all reads and writes must use word-sized values, \ie 64 or 32 bits.
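
For example, an eventfd can serve as a simple binary semaphore for an idle \proc; the helper names below are illustrative.

\begin{lstlisting}
#include <sys/eventfd.h>
#include <unistd.h>
#include <stdint.h>

static int efd;                                   // one eventfd per idle-sleep context

void idle_init( void ) { efd = eventfd( 0, 0 ); }

void idle_sleep( void ) {                         // park: blocks until the counter is non-zero
	uint64_t val;
	read( efd, &val, sizeof(val) );               // consumes (resets) the counter
}

void idle_wake( void ) {                          // unpark: makes a blocked read return
	uint64_t one = 1;
	write( efd, &one, sizeof(one) );              // adds to the counter
}
\end{lstlisting}

Because an eventfd is itself a file descriptor, it can also be placed in an @epoll@ or @io_uring@ interest set, allowing \io completions and explicit wake-ups to share one blocking mechanism.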
  • doc/theses/thierry_delisle_PhD/thesis/text/runtime.tex

    r7a0f798b ra44514e  
    1313\section{M:N Threading}\label{prev:model}
    1414
    15 Threading in \CFA is based on \Gls{uthrding}, where \glspl{thrd} are the representation of a unit of work. As such, \CFA programmers should expect these units to be fairly inexpensive, \ie programmers should be able to create a large number of \glspl{thrd} and switch among \glspl{thrd} liberally without many concerns for performance.
     15Threading in \CFA is based on \Gls{uthrding}, where \ats are the representation of units of work. As such, \CFA programmers should expect these units to be fairly inexpensive, \ie programmers should be able to create a large number of \ats and switch among them liberally without much concern for performance.
    1616
    1717The \CFA M:N threading model is implemented using many user-level threads mapped onto fewer \glspl{kthrd}.
    1818The user-level threads have the same semantic meaning as \glspl{kthrd} in the 1:1 model: they represent an independent thread of execution with its own stack.
    19 The difference is that user-level threads do not have a corresponding object in the kernel; they are handled by the runtime in user space and scheduled onto \glspl{kthrd}, referred to as \glspl{proc} in this document. \Glspl{proc} run a \gls{thrd} until it context switches out, it then chooses a different \gls{thrd} to run.
     19The difference is that user-level threads do not have a corresponding object in the kernel; they are handled by the runtime in user space and scheduled onto \glspl{kthrd}, referred to as \glspl{proc} in this document. \Glspl{proc} run a \at until it context switches out, and then choose a different \at to run.
    2020
    2121\section{Clusters}
    2222\CFA allows the option to group user-level threading, in the form of clusters.
    23 Both \glspl{thrd} and \glspl{proc} belong to a specific cluster.
    24 \Glspl{thrd} are only scheduled onto \glspl{proc} in the same cluster and scheduling is done independently of other clusters.
     23Both \ats and \glspl{proc} belong to a specific cluster.
     24\Glspl{at} are only scheduled onto \glspl{proc} in the same cluster and scheduling is done independently of other clusters.
    2525Figure~\ref{fig:system} shows an overview of the \CFA runtime, which allows programmers to tightly control parallelism.
    2626It also opens the door to handling effects like NUMA, by pinning clusters to a specific NUMA node\footnote{This capability is not currently implemented in \CFA, but the only hurdle left is creating a generic interface for CPU masks.}.
     
    3030                \input{system.pstex_t}
    3131        \end{center}
    32         \caption[Overview of the \CFA runtime]{Overview of the \CFA runtime \newline \Glspl{thrd} are scheduled inside a particular cluster and run on the \glspl{proc} that belong to the cluster. The discrete-event manager, which handles preemption and timeout, is a \gls{proc} that lives outside any cluster and does not run \glspl{thrd}.}
     32        \caption[Overview of the \CFA runtime]{Overview of the \CFA runtime \newline \Glspl{at} are scheduled inside a particular cluster and run on the \glspl{proc} that belong to the cluster. The discrete-event manager, which handles preemption and timeout, is a \gls{proc} that lives outside any cluster and does not run \ats.}
    3333        \label{fig:system}
    3434\end{figure}
     
    3838
    3939\section{\glsxtrshort{io}}\label{prev:io}
    40 Prior to this work, the \CFA runtime did not add any particular support for \glsxtrshort{io} operations. While all \glsxtrshort{io} operations available in C are available in \CFA, \glsxtrshort{io} operations are designed for the POSIX threading model~\cite{pthreads}. Using these 1:1 threading operations in an M:N threading model means \glsxtrshort{io} operations block \glspl{proc} instead of \glspl{thrd}. While this can work in certain cases, it limits the number of concurrent operations to the number of \glspl{proc} rather than \glspl{thrd}. It also means deadlock can occur because all \glspl{proc} are blocked even if at least one \gls{thrd} is ready to run. A simple example of this type of deadlock would be as follows:
     40Prior to this work, the \CFA runtime did not add any particular support for \glsxtrshort{io} operations. While all \glsxtrshort{io} operations available in C are available in \CFA, \glsxtrshort{io} operations are designed for the POSIX threading model~\cite{pthreads}. Using these 1:1 threading operations in an M:N threading model means \glsxtrshort{io} operations block \glspl{proc} instead of \ats. While this can work in certain cases, it limits the number of concurrent operations to the number of \glspl{proc} rather than \ats. It also means deadlock can occur because all \glspl{proc} are blocked even if at least one \at is ready to run. A simple example of this type of deadlock would be as follows:
    4141
    4242\begin{quote}
    43 Given a simple network program with 2 \glspl{thrd} and a single \gls{proc}, one \gls{thrd} sends network requests to a server and the other \gls{thrd} waits for a response from the server.
    44 If the second \gls{thrd} races ahead, it may wait for responses to requests that have not been sent yet.
    45 In theory, this should not be a problem, even if the second \gls{thrd} waits, because the first \gls{thrd} is still ready to run and should be able to get CPU time to send the request.
    46 With M:N threading, while the first \gls{thrd} is ready, the lone \gls{proc} \emph{cannot} run the first \gls{thrd} if it is blocked in the \glsxtrshort{io} operation of the second \gls{thrd}.
     43Given a simple network program with 2 \ats and a single \gls{proc}, one \at sends network requests to a server and the other \at waits for a response from the server.
     44If the second \at races ahead, it may wait for responses to requests that have not been sent yet.
     45In theory, this should not be a problem, even if the second \at waits, because the first \at is still ready to run and should be able to get CPU time to send the request.
     46With M:N threading, while the first \at is ready, the lone \gls{proc} \emph{cannot} run the first \at if it is blocked in the \glsxtrshort{io} operation of the second \at.
    4747If this happens, the system is in a synchronization deadlock\footnote{In this example, the deadlock could be resolved if the server sends unprompted messages to the client.
    4848However, this solution is neither general nor appropriate even in this simple case.}.
    4949\end{quote}
    5050
    51 Therefore, one of the objective of this work is to introduce \emph{User-Level \glsxtrshort{io}}, which like \glslink{uthrding}{User-Level \emph{Threading}}, blocks \glspl{thrd} rather than \glspl{proc} when doing \glsxtrshort{io} ope      rations.
    52 This feature entails multiplexing the \glsxtrshort{io} operations of many \glspl{thrd} onto fewer \glspl{proc}.
     51Therefore, one of the objectives of this work is to introduce \emph{User-Level \glsxtrshort{io}}, which like \glslink{uthrding}{User-Level \emph{Threading}}, blocks \ats rather than \glspl{proc} when doing \glsxtrshort{io} operations.
     52This feature entails multiplexing the \glsxtrshort{io} operations of many \ats onto fewer \glspl{proc}.
    5353The multiplexing requires a single \gls{proc} to execute multiple \glsxtrshort{io} operations in parallel.
    5454This requirement cannot be met with operations that block \glspl{proc}, \ie \glspl{kthrd}, since the first operation would prevent starting new operations for its blocking duration.