Timestamp:
Sep 7, 2022, 4:12:00 PM (20 months ago)
Author:
Thierry Delisle <tdelisle@…>
Branches:
ADT, ast-experimental, master, pthread-emulation
Children:
e4855f6
Parents:
7a0f798b
Message:

A whole bunch of small changes:
trying to setup a version that I can pass through a spell checker.
Fixing a whole bunch of grammar errors

Location:
doc/theses/thierry_delisle_PhD/thesis/text
Files:
10 edited

  • doc/theses/thierry_delisle_PhD/thesis/text/conclusion.tex

    r7a0f798b ra44514e  
    113113In both of these examples, some care is needed to ensure that reads to an address \emph{sometime} retire.
    114114
    115 Note, this idea is similar to \newterm{Hardware Transactional Memory}~\cite{HTM}, which allows groups of instructions to be aborted and rolled-back if they encounter memory conflicts when being retired.
     115Note, this idea is similar to \newterm{Hardware Transactional Memory}~\cite{wiki:htm}, which allows groups of instructions to be aborted and rolled-back if they encounter memory conflicts when being retired.
    116116However, I believe this feature is generally aimed at large groups of instructions.
    117117A more fine-grained approach may be more suitable, carefully picking which aspects of an algorithm require exact correctness and which do not.
  • doc/theses/thierry_delisle_PhD/thesis/text/core.tex

    r7a0f798b ra44514e  
    22
    33Before discussing scheduling in general, where it is important to address systems that are changing states, this document discusses scheduling in a somewhat ideal scenario, where the system has reached a steady state.
    4 For this purpose, a steady state is loosely defined as a state where there are always \glspl{thrd} ready to run and the system has the resources necessary to accomplish the work, \eg, enough workers.
     4For this purpose, a steady state is loosely defined as a state where there are always \ats ready to run and the system has the resources necessary to accomplish the work, \eg, enough workers.
    55In short, the system is neither overloaded nor underloaded.
    66
    77It is important to discuss the steady state first because it is the easiest case to handle and, relatedly, the case in which the best performance is to be expected.
    8 As such, when the system is either overloaded or underloaded, a common approach is to try to adapt the system to this new load and return to the steady state, \eg, by adding or removing workers.
     8As such, when the system is either overloaded or underloaded, a common approach is to try to adapt the system to this new \gls{load} and return to the steady state, \eg, by adding or removing workers.
    99Therefore, flaws in scheduling the steady state tend to be pervasive in all states.
    1010
     
    2020\end{displayquote}
    2121
    22 Applied to threads, this model states that every ready \gls{thrd} immediately runs in parallel with all other ready \glspl{thrd}. While a strict implementation of this model is not feasible, programmers still have expectations about scheduling that come from this model.
    23 
    24 In general, the expectation at the center of this model is that ready \glspl{thrd} do not interfere with each other but simply share the hardware.
    25 This assumption makes it easier to reason about threading because ready \glspl{thrd} can be thought of in isolation and the effect of the scheduler can be virtually ignored.
    26 This expectation of \gls{thrd} independence means the scheduler is expected to offer two guarantees:
     22Applied to \ats, this model states that every ready \at immediately runs in parallel with all other ready \ats. While a strict implementation of this model is not feasible, programmers still have expectations about scheduling that come from this model.
     23
     24In general, the expectation at the center of this model is that ready \ats do not interfere with each other but simply share the hardware.
     25This assumption makes it easier to reason about threading because ready \ats can be thought of in isolation and the effect of the scheduler can be virtually ignored.
     26This expectation of \at independence means the scheduler is expected to offer two guarantees:
    2727\begin{enumerate}
    28         \item A fairness guarantee: a \gls{thrd} that is ready to run is not prevented by another thread.
    29         \item A performance guarantee: a \gls{thrd} that wants to start or stop running is not prevented by other threads wanting to do the same.
     28        \item A fairness guarantee: a \at that is ready to run is not prevented from doing so by another thread.
     29        \item A performance guarantee: a \at that wants to start or stop running is not prevented by other threads wanting to do the same.
    3030\end{enumerate}
    3131
    3232It is important to note that these guarantees are expected only up to a point.
    33 \Glspl{thrd} that are ready to run should not be prevented to do so, but they still share the limited hardware resources.
    34 Therefore, the guarantee is considered respected if a \gls{thrd} gets access to a \emph{fair share} of the hardware resources, even if that share is very small.
     33\Glspl{at} that are ready to run should not be prevented from doing so, but they still share the limited hardware resources.
     34Therefore, the guarantee is considered respected if a \at gets access to a \emph{fair share} of the hardware resources, even if that share is very small.
    3535
    3636Similar to the performance guarantee, the lack of interference among threads is only relevant up to a point.
     
    5959For interactive applications that need to run at 60, 90, 120 frames per second, \ats having to wait for several milliseconds to run are effectively starved.
    6060Therefore load-balancing should be done at a faster pace, one that can detect starvation at the microsecond scale.
    61 With that said, this is a much fuzzier requirement since it depends on the number of \procs, the number of \ats and the general load of the system.
     61With that said, this is a much fuzzier requirement since it depends on the number of \procs, the number of \ats and the general \gls{load} of the system.
    6262
    6363\subsection{Fairness vs Scheduler Locality} \label{fairnessvlocal}
     
    6868
    6969For a scheduler, having good locality, \ie, having the data local to each \gls{hthrd}, generally conflicts with fairness.
    70 Indeed, good locality often requires avoiding the movement of cache lines, while fairness requires dynamically moving a \gls{thrd}, and as consequence cache lines, to a \gls{hthrd} that is currently available.
     70Indeed, good locality often requires avoiding the movement of cache lines, while fairness requires dynamically moving a \at, and as a consequence cache lines, to a \gls{hthrd} that is currently available.
    7171Note that this section discusses \emph{internal locality}, \ie, the locality of the data used by the scheduler versus \emph{external locality}, \ie, how the data used by the application is affected by scheduling.
    7272External locality is a much more complicated subject and is discussed in the next section.
     
    8080        \input{fairness.pstex_t}
    8181        \vspace*{-10pt}
    82         \caption[Fairness vs Locality graph]{Rule of thumb Fairness vs Locality graph \smallskip\newline The importance of Fairness and Locality while a ready \gls{thrd} awaits running is shown as the time the ready \gls{thrd} waits increases, Ready Time, the chances that its data is still in cache decreases, Locality.
    83         At the same time, the need for fairness increases since other \glspl{thrd} may have the chance to run many times, breaking the fairness model.
     82        \caption[Fairness vs Locality graph]{Rule of thumb Fairness vs Locality graph \smallskip\newline The importance of Fairness and Locality while a ready \at awaits running: as the time the ready \at waits (Ready Time) increases, the chance that its data is still in cache (Locality) decreases.
     83        At the same time, the need for fairness increases since other \ats may have the chance to run many times, breaking the fairness model.
    8484        Since the actual values and curves of this graph can be highly variable, the graph is an idealized representation of the two opposing goals.}
    8585        \label{fig:fair}
     
    9797
    9898\subsubsection{Migration Cost}
    99 Another important source of scheduling latency is migration.
     99Another important source of scheduling latency is \glslink{atmig}{migration}.
    100100A \at migrates if it executes on two different \procs consecutively, which is the process discussed in \ref{fairnessvlocal}.
    101101Migrations can have many different causes, but in certain programs, it can be impossible to limit migration.
     
    250250To compare subqueues, the timestamp at the head must be compared to the current time, yielding the best-case wait-time for the \at at the head of the queue.
    251251This new wait-time is averaged with the stored average.
    252 To further limit migration, a bias can be added to a local subqueue, where a remote subqueue is helped only if its moving average is more than $X$ times the local subqueue's average.
     252To further limit \glslink{atmig}{migrations}, a bias can be added to a local subqueue, where a remote subqueue is helped only if its moving average is more than $X$ times the local subqueue's average.
    253253Tests for this approach indicate the choice of the weight for the moving average or the bias is not important, \ie weights and biases of similar \emph{magnitudes} have similar effects.
    254254
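
To make the helping rule above concrete, the following is a minimal C sketch of the moving average and the bias test; the field names and the weight/bias constants are illustrative assumptions, not the \CFA implementation.

    #include <stdint.h>
    #include <stdbool.h>

    // Per-subqueue state: the timestamp stored with the head element and a
    // moving average of observed best-case wait-times (same clock units).
    struct subqueue_stats {
        uint64_t head_timestamp;
        uint64_t avg_wait;
    };

    enum { WEIGHT = 8, BIAS = 4 };   // illustrative; the text notes exact magnitudes matter little

    // Fold the current best-case wait-time of the head element into the average.
    static void update_average( struct subqueue_stats * s, uint64_t now ) {
        uint64_t wait = now - s->head_timestamp;
        s->avg_wait = (s->avg_wait * (WEIGHT - 1) + wait) / WEIGHT;
    }

    // Helping rule: only help a remote subqueue if its average wait is more than
    // BIAS times the local average, limiting needless migrations.
    static bool should_help( const struct subqueue_stats * local, const struct subqueue_stats * remote ) {
        return remote->avg_wait > BIAS * local->avg_wait;
    }
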
     
    261261The good news is that this problem can be mitigated.
    262262
    263 \subsection{Redundant Timestamps}\ref{relaxedtimes}
     263\subsection{Redundant Timestamps}\label{relaxedtimes}
    264264The problem with polling remote subqueues is that correctness is critical.
    265265There must be a consensus among \procs on which subqueues hold which \ats, as the \ats are in constant motion.
  • doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex

    r7a0f798b ra44514e  
    4949Threads that are not currently dealing with another request ignore the incoming packet.
    5050One of the remaining, nonbusy, threads reads the request and sends the response.
    51 This implementation can lead to increased CPU load as threads wake from sleep to potentially process the request.
     51This implementation can lead to increased CPU \gls{load} as threads wake from sleep to potentially process the request.
    5252\end{itemize}
    5353Here, Memcached is based on an event-based webserver architecture~\cite{Pai99Flash}, using \gls{kthrd}ing to run multiple largely independent event engines, and if needed, spinning up additional kernel threads to handle blocking I/O.
     
    273273It has two 2.8 GHz Xeon CPUs, and four one-gigabit Ethernet cards.
    274274\item
    275 \todo{switch}
     275Network routing is performed by an HP 2530 10 Gigabit Ethernet switch.
    276276\item
    277277A client machine runs two copies of the workload generator.
  • doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex

    r7a0f798b ra44514e  
    66The goal in this chapter is to show that the \CFA scheduler obtains equivalent performance to other less fair schedulers through the different experiments.
    77Note, only the code of the \CFA tests is shown;
    8 all tests in the other systems are functionally identical and available online~\cite{SchedulingBenchmarks}.
     8all tests in the other systems are functionally identical and available online~\cite{GITHUB:SchedulingBenchmarks}.
    99
    1010\section{Benchmark Environment}\label{microenv}
     
    6363For this reason, I designed a different push/pop benchmark, called \newterm{Cycle Benchmark}.
    6464This benchmark arranges a number of \ats into a ring, as seen in Figure~\ref{fig:cycle}, where the ring is a circular singly-linked list.
    65 At runtime, each \at unparks the next \at before parking itself.
    66 Unparking the next \at pushes that \at onto the ready queue while the ensuing park leads to a \at being popped from the ready queue.
     65At runtime, each \at unparks the next \at before \glslink{atblock}{parking} itself.
     66Unparking the next \at pushes that \at onto the ready queue while the ensuing \park leads to a \at being popped from the ready queue.
    6767
    6868\begin{figure}
    6969        \centering
    7070        \input{cycle.pstex_t}
    71         \caption[Cycle benchmark]{Cycle benchmark\smallskip\newline Each \at unparks the next \at in the cycle before parking itself.}
     71        \caption[Cycle benchmark]{Cycle benchmark\smallskip\newline Each \at unparks the next \at in the cycle before \glslink{atblock}{parking} itself.}
    7272        \label{fig:cycle}
    7373\end{figure}
    7474
    7575Therefore, the underlying runtime cannot rely on the number of ready \ats staying constant over the duration of the experiment.
    76 In fact, the total number of \ats waiting on the ready queue is expected to vary because of the race between the next \at unparking and the current \at parking.
     76In fact, the total number of \ats waiting on the ready queue is expected to vary because of the race between the next \at \glslink{atsched}{unparking} and the current \at \glslink{atblock}{parking}.
    7777That is, the runtime cannot anticipate that the current task immediately parks.
    7878As well, the size of the cycle is also decided based on this race, \eg a small cycle may see the chain of unparks go full circle before the first \at parks because of time-slicing or multiple \procs.
    7979If this happens, the scheduler push and pop are avoided and the results of the experiment are skewed.
    80 (Note, an unpark is like a V on a semaphore, so the subsequent park (P) may not block.)
     80(Note, an \unpark is like a V on a semaphore, so the subsequent \park (P) may not block.)
    8181Every runtime system must handle this race and cannot optimize away the ready-queue pushes and pops.
    82 To prevent any attempt of silently omitting ready-queue operations, the ring of \ats is made big enough so the \ats have time to fully park before being unparked again.
     82To prevent any attempt of silently omitting ready-queue operations, the ring of \ats is made big enough so the \ats have time to fully \park before being unparked again.
    8383Finally, to further mitigate any underlying push/pop optimizations, especially on SMP machines, multiple rings are created in the experiment.
    8484
    8585Figure~\ref{fig:cycle:code} shows the pseudo code for this benchmark, where each cycle has 5 \ats.
    86 There is additional complexity to handle termination (not shown), which requires a binary semaphore or a channel instead of raw @park@/@unpark@ and carefully picking the order of the @P@ and @V@ with respect to the loop condition.
     86There is additional complexity to handle termination (not shown), which requires a binary semaphore or a channel instead of raw \park/\unpark and carefully picking the order of the @P@ and @V@ with respect to the loop condition.
    8787
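
The following is a self-contained C sketch of such a ring, using POSIX threads and semaphores in place of the runtimes' park/unpark; the ring size, iteration count, and the simple termination scheme are illustrative and differ from the pseudo code in Figure~\ref{fig:cycle:code}.

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    #define RING  5          // \ats per cycle, as in the pseudo code
    #define LOOPS 100000     // iterations before stopping (illustrative)

    static sem_t sems[RING];                     // one semaphore per ring member

    static void * member( void * arg ) {
        long id = (long)arg;
        for ( int i = 0; i < LOOPS; i += 1 ) {
            sem_post( &sems[(id + 1) % RING] );  // "unpark" the next member (V)
            sem_wait( &sems[id] );               // "park" until unparked (P)
        }
        return NULL;
    }

    int main( void ) {
        pthread_t thrds[RING];
        for ( int i = 0; i < RING; i += 1 ) sem_init( &sems[i], 0, 0 );
        for ( long i = 0; i < RING; i += 1 ) pthread_create( &thrds[i], NULL, member, (void *)i );
        for ( int i = 0; i < RING; i += 1 ) pthread_join( thrds[i], NULL );
        printf( "cycle benchmark done\n" );
        return 0;
    }

Because every member posts before waiting, this sketch cannot deadlock, matching the note above that an unpark (V) may prevent the subsequent park (P) from blocking.
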
    8888\begin{figure}
     
    416416An interesting aspect to note here is that the runtimes differ in how they handle this situation.
    417417Indeed, when a \proc unparks a \at that was last run on a different \proc, the \at could be appended to the ready queue of the local \proc or to the ready queue of the remote \proc, which previously ran the \at.
    418 \CFA, Tokio and Go all use the approach of unparking to the local \proc, while Libfibre unparks to the remote \proc.
     418\CFA, Tokio and Go all use the approach of \glslink{atsched}{unparking} to the local \proc, while Libfibre unparks to the remote \proc.
    419419In this particular benchmark, the inherent chaos of the benchmark, in addition to small memory footprint, means neither approach wins over the other.
    420420
     
    485485Up to 32 \procs, after which the other runtimes manage to outscale Go.
    486486
    487 In conclusion, the objective of this benchmark is to demonstrate that unparking \ats from remote \procs does not cause too much contention on the local queues.
     487In conclusion, the objective of this benchmark is to demonstrate that \glslink{atsched}{unparking} \ats from remote \procs does not cause too much contention on the local queues.
    488488Indeed, the fact that most runtimes achieve some scaling across various \proc counts demonstrates that migrations do not need to be serialized.
    489489Again, these results demonstrate that \CFA achieves satisfactory performance with respect to the other runtimes.
     
    491491\section{Locality}
    492492
    493 As mentioned in the churn benchmark, when unparking a \at, it is possible to either unpark to the local or remote ready-queue.\footnote{
    494 It is also possible to unpark to a third unrelated ready-queue, but without additional knowledge about the situation, it is likely to degrade performance.}
     493As mentioned in the churn benchmark, when \glslink{atsched}{unparking} a \at, it is possible to either \unpark to the local or remote ready-queue.\footnote{
     494It is also possible to \unpark to a third unrelated ready-queue, but without additional knowledge about the situation, it is likely to degrade performance.}
    495495The locality experiment includes two variations of the churn benchmark, where a data array is added.
    496496In both variations, before @V@ing the semaphore, each \at calls a @work@ function which increments random cells inside the data array.
     
    499499Figure~\ref{fig:locality:code} shows pseudo code for this benchmark.
    500500
    501 The objective here is to highlight the different decision made by the runtime when unparking.
     501The objective here is to highlight the different decisions made by each runtime when \glslink{atsched}{unparking}.
    502502Since each thread unparks a random semaphore, it is unlikely that a \at is unparked from the last \proc it ran on.
    503 In the noshare variation, unparking the \at on the local \proc is an appropriate choice since the data was last modified on that \proc.
    504 In the shared variation, unparking the \at on a remote \proc is an appropriate choice.
    505 
    506 The expectation for this benchmark is to see a performance inversion, where runtimes fare notably better in the variation which matches their unparking policy.
     503In the noshare variation, \glslink{atsched}{unparking} the \at on the local \proc is an appropriate choice since the data was last modified on that \proc.
     504In the shared variation, \glslink{atsched}{unparking} the \at on a remote \proc is an appropriate choice.
     505
     506The expectation for this benchmark is to see a performance inversion, where runtimes fare notably better in the variation which matches their \glslink{atsched}{unparking} policy.
    507507This decision should lead to \CFA, Go and Tokio achieving better performance in the share variation while libfibre achieves better performance in noshare.
    508 Indeed, \CFA, Go and Tokio have the default policy of unparking \ats on the local \proc, where as libfibre has the default policy of unparking \ats wherever they last ran.
     508Indeed, \CFA, Go and Tokio have the default policy of \glslink{atsched}{unparking} \ats on the local \proc, whereas libfibre has the default policy of \glslink{atsched}{unparking} \ats wherever they last ran.
    509509
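
A loose C fragment of the structure just described, using POSIX semaphores; the array sizes, iteration logic, and names are assumptions, and termination handling is omitted exactly as in the thesis's pseudo code.

    #include <semaphore.h>
    #include <stdlib.h>

    #define NTHRD 4                        // number of \ats (illustrative)
    #define CELLS 4096                     // cells per data array (illustrative)

    static sem_t sems[NTHRD];              // one semaphore per \at
    static unsigned * arrays[NTHRD];       // which array each \at touches; the share/noshare
                                           // variations differ in whether this data was just
                                           // modified by the unparking \at or is private

    // The work function: increment random cells, so performance depends on the
    // array being warm in the cache of the \proc running the \at.
    static void work( unsigned * data ) {
        for ( int i = 0; i < 64; i += 1 )
            data[rand() % CELLS] += 1;
    }

    // Body of each \at: work, V a random semaphore (unpark), P the private one (park).
    // Termination handling is omitted, as in Figure~\ref{fig:locality:code}.
    static void member( long id ) {
        for ( ;; ) {
            work( arrays[id] );
            sem_post( &sems[rand() % NTHRD] );
            sem_wait( &sems[id] );
        }
    }
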
    510510\begin{figure}
     
    554554\vrule
    555555\hspace{3pt}
    556 \subfloat[Share]{\label{fig:locality:code:T1}\usebox\myboxB}
     556\subfloat[Share]{\label{fig:locality:code:T2}\usebox\myboxB}
    557557
    558558\caption[Locality Benchmark : Pseudo Code]{Locality Benchmark : Pseudo Code}
     
    566566Looking at the left column on Intel, Figures~\ref{fig:locality:jax:share:ops} and \ref{fig:locality:jax:share:ns} show the results for the share variation.
    567567\CFA and Tokio slightly outperform libfibre, as expected, based on their \ats placement approach.
    568 \CFA and Tokio both unpark locally and do not suffer cache misses on the transferred array.
     568\CFA and Tokio both \unpark locally and do not suffer cache misses on the transferred array.
    569569Libfibre on the other hand unparks remotely, and as such the unparked \at is likely to miss on the shared data.
    570570Go trails behind in this experiment, presumably for the same reasons that were observable in the churn benchmark.
     
    640640Indeed, in this case, unparking remotely means the unparked \at is less likely to suffer a cache miss on the array, which leaves the \at data structure and the remote queue as the only sources of likely cache misses.
    641641Results show both are amortized fairly well in this case.
    642 \CFA and Tokio both unpark locally and as a result suffer a marginal performance degradation from the cache miss on the array.
     642\CFA and Tokio both \unpark locally and as a result suffer a marginal performance degradation from the cache miss on the array.
    643643
    644644The results for the AMD architecture, shown in Figure~\ref{fig:locality:nasus}, are similar to the Intel results.
     
    651651Go still has the same poor performance.
    652652
    653 Overall, this benchmark mostly demonstrates the two options available when unparking a \at.
     653Overall, this benchmark mostly demonstrates the two options available when \glslink{atsched}{unparking} a \at.
    654654Depending on the workload, either of these options can be the appropriate one.
    655655Since it is prohibitively difficult to dynamically detect which approach is appropriate, all runtimes must choose one of the two and live with the consequences.
  • doc/theses/thierry_delisle_PhD/thesis/text/existing.tex

    r7a0f798b ra44514e  
    77Workloads that are well-known, consistent, and homogeneous can benefit from a scheduler that is optimized to use this information, while ill-defined, inconsistent, heterogeneous workloads require general non-optimal algorithms.
    88A secondary aspect is how much information can be gathered versus how much information must be given as part of the scheduler input.
    9 This information adds to the spectrum of scheduling algorithms, going from static schedulers that are well informed from the start, to schedulers that gather most of the information needed, to schedulers that can only rely on very limited information.
    10 Note, this description includes both information about each requests, \eg time to complete or resources needed, and information about the relationships among request, \eg whether or not some request must be completed before another request starts.
     9This information adds to the spectrum of scheduling algorithms, going from static schedulers that are well-informed from the start, to schedulers that gather most of the information needed, to schedulers that can only rely on very limited information.
     10Note, this description includes both information about each request, \eg time to complete or resources needed, and information about the relationships among requests, \eg whether some request must be completed before another request starts.
    1111
    1212Scheduling physical resources, \eg in an assembly line, is generally amenable to using well-informed scheduling, since information can be gathered much faster than the physical resources can be assigned and workloads are likely to stay stable for long periods of time.
     
    2929\newterm{Dynamic schedulers} determine \ats dependencies and costs during scheduling, if at all.
    3030Hence, unlike static scheduling, \ats dependencies are conditional and detected at runtime.
    31 This detection takes the form of observing new \ats(s) in the system and determining dependencies from their behaviour, including suspending or halting a \ats that dynamically detects unfulfilled dependencies.
    32 Furthermore, each \ats has the responsibility of adding dependent \ats back into the system once dependencies are fulfilled.
     31This detection takes the form of observing new \ats in the system and determining dependencies from their behaviour, including suspending or halting a \at that dynamically detects unfulfilled dependencies.
     32Furthermore, each \at has the responsibility of adding dependent \ats back into the system once dependencies are fulfilled.
    3333As a consequence, the scheduler often has an incomplete view of the system, seeing only \ats with no pending dependencies.
    3434
    3535\subsection{Explicitly Informed Dynamic Schedulers}
    36 While dynamic schedulers may not have an exhaustive list of dependencies for a \ats, some information may be available about each \ats, \eg expected duration, required resources, relative importance, \etc.
     36While dynamic schedulers may not have an exhaustive list of dependencies for a \at, some information may be available about each \at, \eg expected duration, required resources, relative importance, \etc.
    3737When available, a scheduler can then use this information to direct the scheduling decisions.
    3838For example, when scheduling in a cloud computing context, \ats will commonly have extra information that was manually entered, \eg caps on compute time or \io usage.
    3939However, in the context of user-level threading, most programmers do not determine or even \emph{predict} this information;
    40 at best, the scheduler has only some imprecise information provided by the programmer, \eg, indicating a \ats takes approximately 3--7 seconds to complete, rather than exactly 5 seconds.
    41 Providing this kind of information is a significant programmer burden especially if the information does not scale with the number of \ats and their complexity.
     40at best, the scheduler has only some imprecise information provided by the programmer, \eg, indicating a \at takes approximately 3--7 seconds to complete, rather than exactly 5 seconds.
     41Providing this kind of information is a significant programmer burden, especially if the information does not scale with the number of \ats and their complexity.
    4242For example, providing an exhaustive list of files read by 5 \ats is an easier requirement than providing an exhaustive list of memory addresses accessed by 10,000 independent \ats.
    4343
     
    4646\subsubsection{Priority Scheduling}
    4747A common form of information used by schedulers to direct their algorithm is priorities.
    48 Each \ats is given a priority and higher-priority \ats are preferred to lower-priority ones.
    49 The simplest priority scheduling algorithm is to require that every \ats have a distinct pre-established priority and always run the available \ats with the highest priority.
     48Each \at is given a priority, and higher-priority \ats are preferred to lower-priority ones.
     49The simplest priority scheduling algorithm is to require that every \at have a distinct pre-established priority and always run the available \ats with the highest priority.
    5050Asking programmers to provide an exhaustive set of unique priorities can be prohibitive when the system has a large number of \ats.
    5151It can therefore be desirable for schedulers to support \ats with identical priorities and/or automatically setting and adjusting priorities for \ats.
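
As a minimal illustration of the simplest rule described above, the following C sketch picks the highest-priority ready task from a linked list; the types and the convention that larger numbers mean higher priority are assumptions.

    #include <stddef.h>

    struct task {
        int priority;                 // assumption: larger value = higher priority
        struct task * next;
    };

    // Always run the available task with the highest priority; with distinct,
    // pre-established priorities there are never ties to break.
    static struct task * pick_next( struct task * ready_list ) {
        struct task * best = NULL;
        for ( struct task * t = ready_list; t != NULL; t = t->next )
            if ( best == NULL || t->priority > best->priority ) best = t;
        return best;
    }
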
     
    5454
    5555\subsection{Uninformed and Self-Informed Dynamic Schedulers}
    56 Several scheduling algorithms do not require programmers to provide additional information on each \ats, and instead make scheduling decisions based solely on internal state and/or information implicitly gathered by the scheduler.
     56Several scheduling algorithms do not require programmers to provide additional information on each \at, and instead make scheduling decisions based solely on internal state and/or information implicitly gathered by the scheduler.
    5757
    5858
    5959\subsubsection{Feedback Scheduling}
    60 As mentioned, schedulers may also gather information about each \ats to direct their decisions.
     60As mentioned, schedulers may also gather information about each \at to direct their decisions.
    6161This design effectively moves the scheduler into the realm of \newterm{Control Theory}~\cite{wiki:controltheory}.
    6262This information gathering does not generally involve programmers, and as such, does not increase programmer burden the same way explicitly provided information may.
    6363However, some feedback schedulers do allow programmers to offer additional information on certain \ats, in order to direct scheduling decisions.
    64 The important distinction being whether or not the scheduler can function without this additional information.
     64The important distinction is whether the scheduler can function without this additional information.
    6565
    6666
    6767\section{Work Stealing}\label{existing:workstealing}
    6868One of the most popular scheduling algorithms in practice (see~\ref{existing:prod}) is work stealing.
    69 This idea, introduce by \cite{DBLP:conf/fpca/BurtonS81}, effectively has each worker process its local \ats first, but allows the possibility for other workers to steal local \ats if they run out of \ats.
    70 \cite{DBLP:conf/focs/Blumofe94} introduced the more familiar incarnation of this, where each workers has a queue of \ats and workers without \ats steal \ats from random workers\footnote{The Burton and Sleep algorithm had trees of \ats and steal only among neighbours.}.
     69This idea, introduced by \cite{DBLP:conf/fpca/BurtonS81}, effectively has each worker process its local \ats first, but allows the possibility for other workers to steal local \ats if they run out of \ats.
     70\cite{DBLP:conf/focs/Blumofe94} introduced the more familiar incarnation of this, where each worker has a queue of \ats and workers without \ats steal \ats from random workers\footnote{The Burton and Sleep algorithm has trees of \ats and steals only among neighbours.}.
    7171Blumofe and Leiserson also prove worst case space and time requirements for well-structured computations.
    7272
     
    8282In its simplest form, work stealing assumes that all \procs are interchangeable and therefore the mapping between \at and \proc is not interesting.
    8383However, in real-life architectures there are contexts where different \procs can have different characteristics, which makes some mappings more interesting than others.
    84 An common example where this is statically true is architectures with \acrshort{numa}.
    85 In these cases, it can be relevant to change the scheduler to be cognizent of the topology~\cite{vikranth2013topology,min2011hierarchical}.
     84A common example where this is statically true is architectures with \glsxtrshort{numa}.
     85In these cases, it can be relevant to change the scheduler to be cognizant of the topology~\cite{vikranth2013topology,min2011hierarchical}.
    8686Another example is energy usage, where the scheduler is modified to optimize for energy efficiency in addition/instead of performance~\cite{ribic2014energy,torng2016asymmetry}.
    8787
    8888\paragraph{Complex Machine Architecture} Another aspect that has been examined is how well work stealing is applicable to different machine architectures.
    89 This is arguably strongly related to Task Placement but extends into more heterogeneous architectures.
    90 As \CFA offers no particular support for heterogeneous architecture, this is also a area that is less relevant to this thesis.
    91 Althought it could be an interesting avenue for future work.
     89This is arguably strongly related to Task Placement, but extends into more heterogeneous architectures.
     90As \CFA offers no particular support for heterogeneous architecture, this is also an area that is less relevant to this thesis.
     91It could, however, be an interesting avenue for future work.
    9292
    9393\subsection{Theoretical Results}
    94 There is also a large body of research on the theoretical aspects of work stealing. These evaluate, for example, the cost of migration~\cite{DBLP:conf/sigmetrics/SquillanteN91,DBLP:journals/pe/EagerLZ86}, how affinity affects performance~\cite{DBLP:journals/tpds/SquillanteL93,DBLP:journals/mst/AcarBB02,DBLP:journals/ipl/SuksompongLS16} and theoretical models for heterogeneous systems~\cite{DBLP:journals/jpdc/MirchandaneyTS90,DBLP:journals/mst/BenderR02,DBLP:conf/sigmetrics/GastG10}.
     94There is also a large body of research on the theoretical aspects of work stealing. These evaluate, for example, the cost of \glslink{atmig}{migration}~\cite{DBLP:conf/sigmetrics/SquillanteN91,DBLP:journals/pe/EagerLZ86}, how affinity affects performance~\cite{DBLP:journals/tpds/SquillanteL93,DBLP:journals/mst/AcarBB02,DBLP:journals/ipl/SuksompongLS16} and theoretical models for heterogeneous systems~\cite{DBLP:journals/jpdc/MirchandaneyTS90,DBLP:journals/mst/BenderR02,DBLP:conf/sigmetrics/GastG10}.
    9595\cite{DBLP:journals/jacm/BlellochGM99} examines the space bounds of work stealing and \cite{DBLP:journals/siamcomp/BerenbrinkFG03} shows that for under-loaded systems, the scheduler completes its computations in finite time, \ie is \newterm{stable}.
    9696Others show that work stealing is applicable to various scheduling contexts~\cite{DBLP:journals/mst/AroraBP01,DBLP:journals/anor/TchiboukdjianGT13,DBLP:conf/isaac/TchiboukdjianGTRB10,DBLP:conf/ppopp/AgrawalLS10,DBLP:conf/spaa/AgrawalFLSSU14}.
     
    101101
    102102\section{Preemption}
    103 One last aspect of scheduling is preemption since many schedulers rely on it for some of their guarantees.
     103One last aspect of scheduling is preemption, since many schedulers rely on it for some of their guarantees.
    104104Preemption is the idea of interrupting \ats that have been running too long, effectively injecting suspend points into the application.
    105 There are multiple techniques to achieve this effect but they all aim to guarantee that the suspend points in a \ats are never further apart than some fixed duration.
     105There are multiple techniques to achieve this effect, but they all aim to guarantee that the suspend points in a \at are never further apart than some fixed duration.
    106106While this helps schedulers guarantee that no \at unfairly monopolizes a worker, preemption can effectively be added to any scheduler.
    107 Therefore, the only interesting aspect of preemption for the design of scheduling is whether or not to require it.
     107Therefore, the only interesting aspect of preemption for the design of scheduling is whether to require it.
    108108
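
One common way to obtain such suspend points on Linux is a periodic timer signal that sets a flag checked by the runtime; the following C sketch illustrates the technique in general and is not a description of any particular runtime's mechanism (period and names are assumptions).

    #include <signal.h>
    #include <sys/time.h>

    static volatile sig_atomic_t preempt_flag = 0;

    // Signal handler: mark that the running \at has exhausted its time slice.
    static void on_alarm( int sig ) {
        (void)sig;
        preempt_flag = 1;
    }

    // Arm a periodic SIGALRM so suspend points are never further apart than the period.
    static void start_preemption( void ) {
        struct sigaction sa = { 0 };
        sa.sa_handler = on_alarm;
        sigaction( SIGALRM, &sa, NULL );

        struct itimerval it = { 0 };
        it.it_interval.tv_usec = 10000;   // period: 10 ms (illustrative)
        it.it_value.tv_usec    = 10000;   // first expiry
        setitimer( ITIMER_REAL, &it, NULL );
    }

    // Injected suspend point: called often enough by the runtime that a flagged
    // \at yields back to the scheduler within one period.
    static void maybe_yield( void ) {
        if ( preempt_flag ) {
            preempt_flag = 0;
            // yield to the scheduler here
        }
    }
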
    109109\section{Production Schedulers}\label{existing:prod}
     
    122122The default scheduler used by Linux, the Completely Fair Scheduler~\cite{MAN:linux/cfs,MAN:linux/cfs2}, is a feedback scheduler based on CPU time.
    123123For each processor, it constructs a Red-Black tree of \ats waiting to run, ordering them by the amount of CPU time used.
    124 The \ats that has used the least CPU time is scheduled.
     124The \at that has used the least CPU time is scheduled.
    125125It also supports the concept of \newterm{Nice values}, which are effectively multiplicative factors on the CPU time used.
    126126The ordering of \ats is also affected by a group-based notion of fairness, where \ats belonging to groups having used less CPU time are preferred to \ats belonging to groups having used more CPU time.
    127 Linux achieves load-balancing by regularly monitoring the system state~\cite{MAN:linux/cfs/balancing} and using some heuristic on the load, currently CPU time used in the last millisecond plus a decayed version of the previous time slots~\cite{MAN:linux/cfs/pelt}.
    128 
    129 \cite{DBLP:conf/eurosys/LoziLFGQF16} shows that Linux's CFS also does work stealing to balance the workload of each processors, but the paper argues this aspect can be improved significantly.
    130 The issues highlighted stem from Linux's need to support fairness across \ats \emph{and} across users\footnote{Enforcing fairness across users means that given two users, one with a single \ats and the other with one thousand \ats, the user with a single \ats does not receive one thousandth of the CPU time.}, increasing the complexity.
     127Linux achieves load-balancing by regularly monitoring the system state~\cite{MAN:linux/cfs/balancing} and using some heuristic on the \gls{load}, currently CPU time used in the last millisecond plus a decayed version of the previous time slots~\cite{MAN:linux/cfs/pelt}.
     128
     129\cite{DBLP:conf/eurosys/LoziLFGQF16} shows that Linux's CFS also does work stealing to balance the workload of each \proc, but the paper argues this aspect can be improved significantly.
     130The issues highlighted stem from Linux's need to support fairness across \ats \emph{and} across users\footnote{Enforcing fairness across users means that given two users, one with a single \at and the other with one thousand \ats, the user with a single \at does not receive one thousandth of the CPU time.}, increasing the complexity.
    131131
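
As a toy illustration of the ordering rule just described (explicitly not the kernel's implementation), the comparison that would order such a tree can be sketched as follows; the nice-derived weight is an assumption.

    #include <stdbool.h>
    #include <stdint.h>

    struct cfs_task {
        uint64_t cpu_time_used;   // CPU time consumed so far
        double   nice_factor;     // multiplicative weight derived from the nice value (assumption)
    };

    // The task whose weighted CPU time is smallest is scheduled next.
    static bool runs_before( const struct cfs_task * a, const struct cfs_task * b ) {
        return a->cpu_time_used * a->nice_factor < b->cpu_time_used * b->nice_factor;
    }
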
    132132Linux also offers a FIFO scheduler, a real-time scheduler, which runs the highest-priority \ats, and a round-robin scheduler, which is an extension of the FIFO-scheduler that adds fixed time slices. \cite{MAN:linux/sched}
     
    140140Microsoft's Operating System's Scheduler~\cite{MAN:windows/scheduler} is a feedback scheduler with priorities.
    141141It supports 32 levels of priorities, some of which are reserved for real-time and privileged applications.
    142 It schedules \ats based on the highest priorities (lowest number) and how much CPU time each \ats has used.
     142It schedules \ats based on the highest priorities (lowest number) and how much CPU time each \at has used.
    143143The scheduler may also temporarily adjust priorities after certain effects like the completion of I/O requests.
    144144
     
    184184Erlang is a functional language that supports concurrency in the form of processes: threads that share no data.
    185185It uses a kind of round-robin scheduler, with a mix of work sharing and stealing to achieve load balancing~\cite{:erlang}, where under-loaded workers steal from other workers, but overloaded workers also push work to other workers.
    186 This migration logic is directed by monitoring logic that evaluates the load a few times per seconds.
     186This \glslink{atmig}{migration} logic is directed by monitoring that evaluates the load a few times per second.
    187187
    188188\paragraph{Intel\textregistered ~Threading Building Blocks}
    189189\newterm{Thread Building Blocks} (TBB) is Intel's task parallelism \cite{wiki:taskparallel} framework.
    190 It runs \newterm{jobs}, which are uninterruptable \ats that must always run to completion, on a pool of worker threads.
     190It runs \newterm{jobs}, which are uninterruptible \ats that must always run to completion, on a pool of worker threads.
    191191TBB's scheduler is a variation of randomized work-stealing that also supports higher-priority graph-like dependencies~\cite{MAN:tbb/scheduler}.
    192 It schedules \ats as follows (where \textit{t} is the last \ats completed):
     192It schedules \ats as follows (where \textit{t} is the last \at completed):
    193193\begin{displayquote}
    194194        \begin{enumerate}
  • doc/theses/thierry_delisle_PhD/thesis/text/front.tex

    r7a0f798b ra44514e  
    124124
    125125User-Level threading (M:N) is gaining popularity over kernel-level threading (1:1) in many programming languages.
    126 The user threading approach is often a better mechanism to express complex concurrent applications by efficiently running 10,000+ threads on multi-core systems.
     126The user threading approach is often a better mechanism to express complex concurrent applications by efficiently running 10,000+ threads on multicore systems.
    127127Indeed, over-partitioning into small work-units with user threading significantly eases load bal\-ancing, while simultaneously providing advanced synchronization and mutual exclusion capabilities.
    128128To manage these high levels of concurrency, the underlying runtime must efficiently schedule many user threads across a few kernel threads;
     
    135135This thesis analyses multiple scheduler systems, where each system attempts to fulfill the necessary requirements for user-level threading.
    136136The predominant technique for managing high levels of concurrency is sharding the ready-queue with one queue per kernel-thread and using some form of work stealing/sharing to dynamically rebalance workload shifts.
    137 Preventing kernel blocking is accomplish by transforming kernel locks and I/O operations into user-level operations that do not block the kernel thread or spin up new kernel threads to manage the blocking.
     137Preventing kernel blocking is accomplished by transforming kernel locks and I/O operations into user-level operations that do not block the kernel thread or spin up new kernel threads to manage the blocking.
    138138Fairness is handled through preemption and/or ad-hoc solutions, which leads to coarse-grained fairness with some pathological cases.
    139139
     
    146146The new scheduler also includes support for implicit nonblocking \io, allowing applications to have more user-threads blocking on \io operations than there are \glspl{kthrd}.
    147147The implementation is based on @io_uring@, a recent addition to the Linux kernel, and achieves the same performance and fairness as systems using @select@, @epoll@, \etc.
    148 To complete the scheduler, an idle sleep mechanism is implemented that significantly reduces wasted CPU cycles, which are then available outside of the application.
     148To complete the scheduler, an idle sleep mechanism is implemented that significantly reduces wasted CPU cycles, which are then available outside the application.
    149149
    150150\cleardoublepage
  • doc/theses/thierry_delisle_PhD/thesis/text/intro.tex

    r7a0f798b ra44514e  
    22
    33\Gls{uthrding} (M:N) is gaining popularity over kernel-level threading (1:1) in many programming languages.
    4 The user threading approach is often a better mechanism to express complex concurrent applications by efficiently running 10,000+ threads on multi-core systems.
     4The user threading approach is often a better mechanism to express complex concurrent applications by efficiently running 10,000+ threads on multicore systems.
    55Indeed, over-partitioning into small work-units with user threading significantly eases load bal\-ancing, while simultaneously providing advanced synchronization and mutual exclusion capabilities.
    66To manage these high levels of concurrency, the underlying runtime must efficiently schedule many user threads across a few kernel threads;
    7 which begs of the question of how many kernel threads are needed and should the number be dynamically reevaluated.
     7which begs the question of how many kernel threads are needed and whether the number should be dynamically reevaluated.
    88Furthermore, scheduling must prevent kernel threads from blocking, otherwise user-thread parallelism drops.
    99When user-threading parallelism does drop, how and when should idle kernel-threads be put to sleep to avoid wasting CPU resources?
     
    1313This thesis analyses multiple scheduler systems, where each system attempts to fulfill the necessary requirements for \gls{uthrding}.
    1414The predominant technique for managing high levels of concurrency is sharding the ready-queue with one queue per kernel-thread and using some form of work stealing/sharing to dynamically rebalance workload shifts.
    15 Preventing kernel blocking is accomplish by transforming kernel locks and I/O operations into user-level operations that do not block the kernel thread or spin up new kernel threads to manage the blocking.
     15Preventing kernel blocking is accomplished by transforming kernel locks and I/O operations into user-level operations that do not block the kernel thread or spin up new kernel threads to manage the blocking.
    1616Fairness is handled through preemption and/or ad-hoc solutions, which leads to coarse-grained fairness with some pathological cases.
    1717
    1818After examining, testing and selecting specific approaches to these scheduling issues, a completely new scheduler was created and tested in the \CFA (C-for-all) user-threading runtime-system.
    1919The goal of the new scheduler is to offer increased safety and productivity without sacrificing performance.
    20 The quality of the new scheduler is demonstrated by comparing it with other user-threading work-stealing schedulers with the aim of showing equivalent or better performance while offering better fairness.
     20The quality of the new scheduler is demonstrated by comparing it with other user-threading work-stealing schedulers, with the aim of showing equivalent or better performance while offering better fairness.
    2121
    2222Chapter~\ref{intro} defines scheduling and its general goals.
    2323Chapter~\ref{existing} discusses how scheduler implementations attempt to achieve these goals, but all implementations optimize some workloads better than others.
    24 Chapter~\ref{cfaruntime} presents the relevant aspects of the \CFA runtime system that have a significant affect on the new scheduler design and implementation.
     24Chapter~\ref{cfaruntime} presents the relevant aspects of the \CFA runtime system that have a significant effect on the new scheduler design and implementation.
    2525Chapter~\ref{core} analyses different scheduler approaches, while looking for scheduler mechanisms that provide both performance and fairness.
    2626Chapter~\ref{userio} covers the complex mechanisms that must be used to achieve nonblocking I/O to prevent the blocking of \glspl{kthrd}.
     
    3131\section{Scheduling}\label{sched}
    3232Computer systems share multiple resources across many threads of execution, even on single-user computers like laptops or smartphones.
    33 On a computer system with multiple processors and work units (routines, coroutines, threads, programs, \etc), there exists the problem of mapping many different kinds of work units onto many different kinds of processors in an efficient manner, called \newterm{scheduling}.
     33On a computer system with multiple processors and work units (routines, coroutines, threads, programs, \etc), there exists the problem of mapping many different kinds of work units onto many different kinds of processors efficiently, called \newterm{scheduling}.
    3434Scheduling systems are normally \newterm{open}, meaning new work arrives from an external source or is randomly spawned from an existing work unit.
    3535In general, work units without threads, like routines and coroutines, are self-scheduling, while work units with threads, like tasks and programs, are scheduled.
     
    3939However, optimal solutions are often not required: schedulers often produce excellent solutions, without needing optimality, by taking advantage of regularities in work patterns.
    4040
    41 Scheduling occurs at discreet points when there are transitions in a system.
    42 For example, a thread cycles through the following transitions during its execution.
     41Scheduling occurs at discrete points when there are transitions in a system.
     42For example, a \at cycles through the following transitions during its execution.
    4343\begin{center}
    4444\input{executionStates.pstex_t}
     
    4949entering the system (new $\rightarrow$ ready)
    5050\item
    51 scheduler assigns a thread to a computing resource, \eg CPU (ready $\rightarrow$ running)
     51scheduler assigns a \at to a computing resource, \eg CPU (ready $\rightarrow$ running)
    5252\item
    5353timer alarm for preemption (running $\rightarrow$ ready)
     
    5959normal completion or error, \eg segment fault (running $\rightarrow$ halted)
    6060\end{itemize}
    61 Key to scheduling is that a thread cannot bypass the ``ready'' state during a transition so the scheduler maintains complete control of the system, \ie no self-scheduling among threads.
     61Key to scheduling is that a \at cannot bypass the ``ready'' state during a transition so the scheduler maintains complete control of the system, \ie no self-scheduling among threads.
    6262
    6363When the workload exceeds the capacity of the processors, \ie work cannot be executed immediately, it is placed on a queue for subsequent service, called a \newterm{ready queue}.
     
    7171\end{tabular}
    7272\end{center}
    73 Beyond these two schedulers are a host of options, \eg adding an global shared queue to MQMS or adding multiple private queues with distinc characteristics.
     73Beyond these two schedulers are a host of options, \eg adding a global shared queue to MQMS or adding multiple private queues with distinct characteristics.
    7474
    7575Once there are multiple resources and ready queues, a scheduler is faced with three major optimization criteria:
     
    8686Essentially, all multi-processor computers have non-uniform memory access (NUMA), with one or more quantized steps to access data at different levels in the memory hierarchy.
    8787When a system has a large number of independently executing threads, affinity becomes difficult because of \newterm{thread churn}.
    88 That is, threads must be scheduled on different processors to obtain high processors utilization because the number of threads $\ggg$ processors.
     88That is, threads must be scheduled on different processors to obtain high processor utilization because the number of threads $\ggg$ processors.
    8989
    9090\item
     
    118118More specifically, safety and productivity for scheduling means supporting a wide range of workloads so that programmers can rely on progress guarantees (safety) and more easily achieve acceptable performance (productivity).
    119119The new scheduler also includes support for implicit nonblocking \io, allowing applications to have more user-threads blocking on \io operations than there are \glspl{kthrd}.
    120 To complete the scheduler, an idle sleep mechanism is implemented that significantly reduces wasted CPU cycles, which are then available outside of the application.
     120To complete the scheduler, an idle sleep mechanism is implemented that significantly reduces wasted CPU cycles, which are then available outside the application.
    121121
    122122As a research project, this work builds exclusively on newer versions of the Linux operating-system and gcc/clang compilers.
  • doc/theses/thierry_delisle_PhD/thesis/text/io.tex

    r7a0f798b ra44514e  
    11\chapter{User Level \io}\label{userio}
    2 As mentioned in Section~\ref{prev:io}, user-level \io requires multiplexing the \io operations of many \glspl{thrd} onto fewer \glspl{proc} using asynchronous \io operations.
     2As mentioned in Section~\ref{prev:io}, user-level \io requires multiplexing the \io operations of many \ats onto fewer \glspl{proc} using asynchronous \io operations.
    33Different operating systems offer various forms of asynchronous operations and, as mentioned in Chapter~\ref{intro}, this work is exclusively focused on the Linux operating-system.
    44
     
    1414It does not mean an operation returning \lstinline{EAGAIN} succeeds on the next try.
    1515For example, a ready read may only return a subset of requested bytes and the read must be issued again for the remaining bytes, at which point it may return \lstinline{EAGAIN}.}
    16 This mechanism is also crucial in determining when all \glspl{thrd} are blocked and the application \glspl{kthrd} can now block.
     16This mechanism is also crucial in determining when all \ats are blocked and the application \glspl{kthrd} can now block.
    1717
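
The retry behaviour described in the footnote can be made concrete with a small C helper for a nonblocking file descriptor; the function name and policy are illustrative.

    #include <errno.h>
    #include <unistd.h>

    // Read up to len bytes from a nonblocking fd.  A "ready" fd does not
    // guarantee a full read: the call may return fewer bytes than requested,
    // and a retry may then fail with EAGAIN until the fd is ready again.
    static ssize_t read_some( int fd, char * buf, size_t len ) {
        size_t done = 0;
        while ( done < len ) {
            ssize_t r = read( fd, buf + done, len - done );
            if ( r > 0 ) { done += (size_t)r; continue; }          // partial read: keep going
            if ( r == 0 ) break;                                   // end of file / peer closed
            if ( errno == EINTR ) continue;                        // interrupted: retry immediately
            if ( errno == EAGAIN || errno == EWOULDBLOCK ) break;  // wait for the next readiness event
            return -1;                                             // real error
        }
        return (ssize_t)done;
    }
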
    1818There are three options to monitor file descriptors in Linux:\footnote{
     
    3333Often the I/O manager has a timeout, polls, or is sent a signal on changes to mitigate this problem.
    3434
    35 \begin{comment}
    36 From: Tim Brecht <brecht@uwaterloo.ca>
    37 Subject: Re: FD sets
    38 Date: Wed, 6 Jul 2022 00:29:41 +0000
    39 
    40 Large number of open files
    41 --------------------------
    42 
    43 In order to be able to use more than the default number of open file
    44 descriptors you may need to:
    45 
    46 o increase the limit on the total number of open files /proc/sys/fs/file-max
    47   (on Linux systems)
    48 
    49 o increase the size of FD_SETSIZE
    50   - the way I often do this is to figure out which include file __FD_SETSIZE
    51     is defined in, copy that file into an appropriate directory in ./include,
    52     and then modify it so that if you use -DBIGGER_FD_SETSIZE the larger size
    53     gets used
    54 
    55   For example on a RH 9.0 distribution I've copied
    56   /usr/include/bits/typesizes.h into ./include/i386-linux/bits/typesizes.h
    57 
    58   Then I modify typesizes.h to look something like:
    59 
    60   #ifdef BIGGER_FD_SETSIZE
    61   #define __FD_SETSIZE            32767
    62   #else
    63   #define __FD_SETSIZE            1024
    64   #endif
    65 
    66   Note that the since I'm moving and testing the userver on may different
    67   machines the Makefiles are set up to use -I ./include/$(HOSTTYPE)
    68 
    69   This way if you redefine the FD_SETSIZE it will get used instead of the
    70   default original file.
    71 \end{comment}
     35% \begin{comment}
     36% From: Tim Brecht <brecht@uwaterloo.ca>
     37% Subject: Re: FD sets
     38% Date: Wed, 6 Jul 2022 00:29:41 +0000
     39
     40% Large number of open files
     41% --------------------------
     42
     43% In order to be able to use more than the default number of open file
     44% descriptors you may need to:
     45
     46% o increase the limit on the total number of open files /proc/sys/fs/file-max
     47%   (on Linux systems)
     48
     49% o increase the size of FD_SETSIZE
     50%   - the way I often do this is to figure out which include file __FD_SETSIZE
     51%     is defined in, copy that file into an appropriate directory in ./include,
     52%     and then modify it so that if you use -DBIGGER_FD_SETSIZE the larger size
     53%     gets used
     54
     55%   For example on a RH 9.0 distribution I've copied
     56%   /usr/include/bits/typesizes.h into ./include/i386-linux/bits/typesizes.h
     57
     58%   Then I modify typesizes.h to look something like:
     59
     60%   #ifdef BIGGER_FD_SETSIZE
     61%   #define __FD_SETSIZE            32767
     62%   #else
     63%   #define __FD_SETSIZE            1024
     64%   #endif
     65
     66%   Note that the since I'm moving and testing the userver on may different
     67%   machines the Makefiles are set up to use -I ./include/$(HOSTTYPE)
     68
     69%   This way if you redefine the FD_SETSIZE it will get used instead of the
     70%   default original file.
     71% \end{comment}
    7272
    7373\paragraph{\lstinline{poll}} is the next oldest option, and takes as input an array of structures containing the FD numbers rather than their position in an array of bits, allowing a more compact input for interest sets that contain widely spaced FDs.
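
A minimal C example of the interface just described; the descriptors and the timeout are illustrative.

    #include <poll.h>

    // With poll, the interest set is an array of pollfd structures, so its size
    // scales with the number of FDs of interest, not with the largest FD number.
    static int wait_readable( int fd_a, int fd_b ) {
        struct pollfd fds[2] = {
            { .fd = fd_a, .events = POLLIN },
            { .fd = fd_b, .events = POLLIN },
        };
        int n = poll( fds, 2, 1000 );      // timeout in milliseconds
        // on return, fds[i].revents indicates which descriptors are ready
        return n;                          // 0 = timeout, -1 = error, otherwise ready count
    }
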
     
    139139\subsection{Extra Kernel Threads}\label{io:morethreads}
    140140Finally, if the operating system does not offer a satisfactory form of asynchronous \io operations, an ad-hoc solution is to create a pool of \glspl{kthrd} and delegate operations to it to avoid blocking \glspl{proc}, which is a compromise for multiplexing.
    141 In the worst case, where all \glspl{thrd} are consistently blocking on \io, it devolves into 1-to-1 threading.
    142 However, regardless of the frequency of \io operations, it achieves the fundamental goal of not blocking \glspl{proc} when \glspl{thrd} are ready to run.
     141In the worst case, where all \ats are consistently blocking on \io, it devolves into 1-to-1 threading.
     142However, regardless of the frequency of \io operations, it achieves the fundamental goal of not blocking \glspl{proc} when \ats are ready to run.
    143143This approach is used by languages like Go~\cite{GITHUB:go}, frameworks like libuv~\cite{libuv}, and web servers like Apache~\cite{apache} and NGINX~\cite{nginx}, since it has the advantage that it can easily be used across multiple operating systems.
    144144This advantage is especially relevant for languages like Go, which offer a homogeneous \glsxtrshort{api} across all platforms.
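
The following C sketch illustrates the delegation idea; the request type, helper names, and the one-thread-per-request structure are simplifications for exposition, not how Go, libuv, or the \CFA runtime implement the pattern.

\begin{lstlisting}
#include <pthread.h>
#include <semaphore.h>
#include <unistd.h>

typedef struct {
	int fd; void * buf; size_t len;                    // the blocking operation to perform
	ssize_t result;                                    // filled in by the worker
	sem_t done;                                        // signalled when the operation completes
} io_req;

static void * io_worker( void * arg ) {                // runs on a pool kernel thread
	io_req * req = arg;
	req->result = read( req->fd, req->buf, req->len ); // blocks only this kernel thread
	sem_post( &req->done );
	return NULL;
}

static ssize_t delegated_read( int fd, void * buf, size_t len ) {
	io_req req = { .fd = fd, .buf = buf, .len = len };
	sem_init( &req.done, 0, 0 );
	pthread_t worker;                                  // stand-in for handing the request to an existing pool
	pthread_create( &worker, NULL, io_worker, &req );
	sem_wait( &req.done );                             // a user-level runtime parks the calling thread here
	pthread_join( worker, NULL );
	sem_destroy( &req.done );
	return req.result;
}
\end{lstlisting}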
     
    155155\section{Event-Engine}
    156156An event engine's responsibility is to use the kernel interface to multiplex many \io operations onto few \glspl{kthrd}.
    157 In concrete terms, this means \glspl{thrd} enter the engine through an interface, the event engine then starts an operation and parks the calling \glspl{thrd}, returning control to the \gls{proc}.
    158 The parked \glspl{thrd} are then rescheduled by the event engine once the desired operation has completed.
     157In concrete terms, this means \ats enter the engine through an interface; the event engine then starts an operation and parks the calling \ats, returning control to the \gls{proc}.
     158The parked \ats are then rescheduled by the event engine once the desired operation has completed.
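
The shape of this interface can be sketched as follows; the names are hypothetical and all synchronization is elided, the only point being the control flow of submit, \park, and resume.

\begin{lstlisting}
#include <stddef.h>
#include <sys/types.h>                              // ssize_t

struct future;                                      // holds the result and the parked thread
enum io_op { IO_READ, IO_WRITE };

struct future * engine_submit( enum io_op, int fd, void * buf, size_t len ); // start the operation
void park( void );                                  // hand control back to the processor
ssize_t future_result( struct future * );           // result recorded on completion

ssize_t async_read( int fd, void * buf, size_t len ) {
	struct future * f = engine_submit( IO_READ, fd, buf, len );
	park();                                         // rescheduled by the engine once the operation completes
	return future_result( f );
}
\end{lstlisting}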
    159159
    160160\subsection{\lstinline{io_uring} in depth}\label{iouring}
     
    268268\subsubsection{Private Instances}
    269269The private approach creates one ring instance per \gls{proc}, \ie one-to-one coupling.
    270 This alleviates the need for synchronization on the submissions, requiring only that \glspl{thrd} are not time-sliced during submission steps.
    271 This requirement is the same as accessing @thread_local@ variables, where a \gls{thrd} is accessing kernel-thread data, is time-sliced, and continues execution on another kernel thread but is now accessing the wrong data.
     270This alleviates the need for synchronization on the submissions, requiring only that \ats are not time-sliced during submission steps.
     271This requirement is the same as accessing @thread_local@ variables, where a \at is accessing kernel-thread data, is time-sliced, and continues execution on another kernel thread but is now accessing the wrong data.
    272272This failure is the serially reusable problem~\cite{SeriallyReusable}.
    273273Hence, allocated SQEs must be submitted to the same ring on the same \gls{proc}, which effectively forces the application to submit SQEs in allocation order.\footnote{
    274 To remove this requirement, a \gls{thrd} needs the ability to ``yield to a specific \gls{proc}'', \ie, park with the guarantee it unparks on a specific \gls{proc}, \ie the \gls{proc} attached to the correct ring.}
     274To remove this requirement, a \at needs the ability to ``yield to a specific \gls{proc}'', \ie, \park with the guarantee it unparks on a specific \gls{proc}, \ie the \gls{proc} attached to the correct ring.}
    275275From the subsystem's point of view, the allocation and submission are sequential, greatly simplifying both.
    276276In this design, allocation and submission form a partitioned ring buffer as shown in Figure~\ref{fig:pring}.
    277277Once added to the ring buffer, the attached \gls{proc} has a significant amount of flexibility with regards to when to perform the system call.
    278 Possible options are: when the \gls{proc} runs out of \glspl{thrd} to run, after running a given number of \glspl{thrd}, \etc.
     278Possible options are: when the \gls{proc} runs out of \ats to run, after running a given number of \ats, \etc.
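
For example, assuming the liburing helper library, the per-\gls{proc} fast path looks roughly as follows; because the ring is private, no locking is needed, only the guarantee that the \at is not migrated between allocation and submission.

\begin{lstlisting}
#include <liburing.h>

static void submit_read( struct io_uring * ring, int fd, void * buf, unsigned len, void * waiter ) {
	struct io_uring_sqe * sqe = io_uring_get_sqe( ring ); // allocation step; NULL if the ring is full
	if ( sqe == NULL ) return;                            // out of SQEs: the problem case discussed below
	io_uring_prep_read( sqe, fd, buf, len, 0 );           // fill in the request
	io_uring_sqe_set_data( sqe, waiter );                 // identifies the parked thread on completion
	io_uring_submit( ring );                              // the processor may instead batch and delay this call
}
\end{lstlisting}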
    279279
    280280\begin{figure}
     
    288288
    289289This approach has the advantage that it does not require much of the synchronization needed in a shared approach.
    290 However, this benefit means \glspl{thrd} submitting \io operations have less flexibility: they cannot park or yield, and several exceptional cases are handled poorly.
    291 Instances running out of SQEs cannot run \glspl{thrd} wanting to do \io operations.
    292 In this case, the \io \gls{thrd} needs to be moved to a different \gls{proc}, and the only current way of achieving this is to @yield()@ hoping to be scheduled on a different \gls{proc} with free SQEs, which is not guaranteed.
     290However, this benefit means \ats submitting \io operations have less flexibility: they cannot \park or yield, and several exceptional cases are handled poorly.
     291Instances running out of SQEs cannot run \ats wanting to do \io operations.
     292In this case, the \io \at needs to be moved to a different \gls{proc}, and the only current way of achieving this is to @yield()@ hoping to be scheduled on a different \gls{proc} with free SQEs, which is not guaranteed.
    293293
    294294A more involved version of this approach tries to solve these problems using a pattern called \newterm{helping}.
    295 \Glspl{thrd} that cannot submit \io operations, either because of an allocation failure or migration to a different \gls{proc} between allocation and submission, create an \io object and add it to a list of pending submissions per \gls{proc} and a list of pending allocations, probably per cluster.
    296 While there is still the strong coupling between \glspl{proc} and @io_uring@ instances, these data structures allow moving \glspl{thrd} to a specific \gls{proc}, when the current \gls{proc} cannot fulfill the \io request.
    297 
    298 Imagine a simple scenario with two \glspl{thrd} on two \glspl{proc}, where one \gls{thrd} submits an \io operation and then sets a flag, while the other \gls{thrd} spins until the flag is set.
    299 Assume both \glspl{thrd} are running on the same \gls{proc}, and the \io \gls{thrd} is preempted between allocation and submission, moved to the second \gls{proc}, and the original \gls{proc} starts running the spinning \gls{thrd}.
    300 In this case, the helping solution has the \io \gls{thrd} append an \io object to the submission list of the first \gls{proc}, where the allocation was made.
    301 No other \gls{proc} can help the \gls{thrd} since @io_uring@ instances are strongly coupled to \glspl{proc}.
    302 However, the \io \gls{proc} is unable to help because it is executing the spinning \gls{thrd} resulting in a deadlock.
    303 While this example is artificial, in the presence of many \glspl{thrd}, it is possible for this problem to arise ``in the wild''.
     295\ats that cannot submit \io operations, either because of an allocation failure or \glslink{atmig}{migration} to a different \gls{proc} between allocation and submission, create an \io object and add it to a list of pending submissions per \gls{proc} and a list of pending allocations, probably per cluster.
     296While there is still the strong coupling between \glspl{proc} and @io_uring@ instances, these data structures allow moving \ats to a specific \gls{proc}, when the current \gls{proc} cannot fulfill the \io request.
     297
     298Imagine a simple scenario with two \ats on two \glspl{proc}, where one \at submits an \io operation and then sets a flag, while the other \at spins until the flag is set.
     299Assume both \ats are running on the same \gls{proc}, and the \io \at is preempted between allocation and submission, moved to the second \gls{proc}, and the original \gls{proc} starts running the spinning \at.
     300In this case, the helping solution has the \io \at append an \io object to the submission list of the first \gls{proc}, where the allocation was made.
     301No other \gls{proc} can help the \at since @io_uring@ instances are strongly coupled to \glspl{proc}.
     302However, the \io \gls{proc} is unable to help because it is executing the spinning \at, resulting in a deadlock.
     303While this example is artificial, in the presence of many \ats, it is possible for this problem to arise ``in the wild''.
    304304Furthermore, this pattern is difficult to reliably detect and avoid.
    305 Once in this situation, the only escape is to interrupted the spinning \gls{thrd}, either directly or via some regular preemption, \eg time slicing.
    306 Having to interrupt \glspl{thrd} for this purpose is costly, the latency can be large between interrupts, and the situation may be hard to detect.
     305Once in this situation, the only escape is to interrupt the spinning \at, either directly or via some regular preemption, \eg time slicing.
     306Having to interrupt \ats for this purpose is costly, the latency can be large between interrupts, and the situation may be hard to detect.
    307307Interrupts are needed here entirely because the \gls{proc} is tied to an instance it is not using.
    308 Therefore, a more satisfying solution is for the \gls{thrd} submitting the operation to notice that the instance is unused and simply go ahead and use it.
     308Therefore, a more satisfying solution is for the \at submitting the operation to notice that the instance is unused and simply go ahead and use it.
    309309This approach is presented shortly.
    310310
    311311\subsubsection{Public Instances}
    312312The public approach creates decoupled pools of @io_uring@ instances and processors, \ie without one-to-one coupling.
    313 \Glspl{thrd} attempting an \io operation pick one of the available instances and submit the operation to that instance.
    314 Since there is no coupling between @io_uring@ instances and \glspl{proc} in this approach, \glspl{thrd} running on more than one \gls{proc} can attempt to submit to the same instance concurrently.
     313\ats attempting an \io operation pick one of the available instances and submit the operation to that instance.
     314Since there is no coupling between @io_uring@ instances and \glspl{proc} in this approach, \ats running on more than one \gls{proc} can attempt to submit to the same instance concurrently.
    315315Because @io_uring@ effectively sets the amount of sharding needed to avoid contention on its internal locks, performance in this approach is based on two aspects:
    316316\begin{itemize}
     
    327327The only added complexity is that the number of SQEs is fixed, which means allocation can fail.
    328328
    329 Allocation failures need to be pushed to a routing algorithm: \glspl{thrd} attempting \io operations must not be directed to @io_uring@ instances without sufficient SQEs available.
     329Allocation failures need to be pushed to a routing algorithm: \ats attempting \io operations must not be directed to @io_uring@ instances without sufficient SQEs available.
    330330Furthermore, the routing algorithm should block operations up-front, if none of the instances have available SQEs.
    331331
    332 Once an SQE is allocated, \glspl{thrd} insert the \io request information, and keep track of the SQE index and the instance it belongs to.
     332Once an SQE is allocated, \ats insert the \io request information, and keep track of the SQE index and the instance it belongs to.
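
A sketch of this bookkeeping follows; the @instance@ type and @try_alloc_sqe@ helper are assumed names for this illustration, not part of @io_uring@ or the \CFA runtime.

\begin{lstlisting}
#include <stdbool.h>

struct instance;                                          // a shared io_uring ring plus bookkeeping
bool try_alloc_sqe( struct instance *, unsigned * idx );  // assumed helper: reserve one SQE slot

struct alloc_token { struct instance * inst; unsigned idx; }; // remembers where the SQE lives

static bool route_and_alloc( struct instance * pool[], unsigned n, struct alloc_token * out ) {
	for ( unsigned i = 0; i < n; i += 1 ) {
		if ( try_alloc_sqe( pool[i], &out->idx ) ) {      // skip instances without free SQEs
			out->inst = pool[i];
			return true;
		}
	}
	return false;                  // no SQEs anywhere: the routing layer must block the operation up-front
}
\end{lstlisting}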
    333333
    334334Once an SQE is filled in, it is added to the submission ring buffer, an operation that is not thread-safe, and then the kernel must be notified using the @io_uring_enter@ system call.
     
    338338Since multiple SQEs can be submitted to the kernel at once, it is important to strike a balance between batching and latency.
    339339Operations that are ready to be submitted should be batched together in a few system calls, but at the same time, operations should not be left pending for long periods of time before being submitted.
    340 Balancing submission can be handled by either designating one of the submitting \glspl{thrd} as the being responsible for the system call for the current batch of SQEs or by having some other party regularly submitting all ready SQEs, \eg, the poller \gls{thrd} mentioned later in this section.
    341 
    342 Ideally, when multiple \glspl{thrd} attempt to submit operations to the same @io_uring@ instance, all requests should be batched together and one of the \glspl{thrd} is designated to do the system call on behalf of the others, called the \newterm{submitter}.
     340Balancing submission can be handled by either designating one of the submitting \ats as being responsible for the system call for the current batch of SQEs or by having some other party regularly submit all ready SQEs, \eg, the poller \at mentioned later in this section.
     341
     342Ideally, when multiple \ats attempt to submit operations to the same @io_uring@ instance, all requests should be batched together and one of the \ats is designated to do the system call on behalf of the others, called the \newterm{submitter}.
    343343However, in practice, \io requests must be handled promptly, so there is a need to guarantee everything missed by the current submitter is seen by the next one.
    344 Indeed, as long as there is a ``next'' submitter, \glspl{thrd} submitting new \io requests can move on, knowing that some future system call includes their request.
     344Indeed, as long as there is a ``next'' submitter, \ats submitting new \io requests can move on, knowing that some future system call includes their request.
    345345Once the system call is done, the submitter must also free SQEs so that the allocator can reuse them.
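
One way to provide this ``next submitter'' guarantee is a small flag-based handoff, sketched below; this is an illustrative pattern rather than the exact \CFA implementation, and @do_enter@ stands in for the @io_uring_enter@ call plus the SQE clean-up.

\begin{lstlisting}
#include <stdatomic.h>
#include <stdbool.h>

struct shared_ring {
	atomic_bool submitting;                     // true while some thread owns the system call
	atomic_bool dirty;                          // set whenever new SQEs have been published
};

void do_enter( struct shared_ring * );          // assumed helper: io_uring_enter and freeing SQEs

void request_submit( struct shared_ring * r ) {
	atomic_store( &r->dirty, true );            // publish: there is work to submit
	for ( ;; ) {
		bool expected = false;
		if ( !atomic_compare_exchange_strong( &r->submitting, &expected, true ) )
			return;                             // a submitter exists; it or the next one sees 'dirty'
		while ( atomic_exchange( &r->dirty, false ) )
			do_enter( r );                      // keep submitting while new work keeps arriving
		atomic_store( &r->submitting, false );  // release the submitter role
		if ( !atomic_load( &r->dirty ) ) return; // nothing was missed
		// otherwise retry to become the submitter again, so no request is left pending forever
	}
}
\end{lstlisting}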
    346346
    347347Finally, the completion side is much simpler since the @io_uring@ system-call enforces a natural synchronization point.
    348 Polling simply needs to regularly do the system call, go through the produced CQEs and communicate the result back to the originating \glspl{thrd}.
     348Polling simply needs to regularly do the system call, go through the produced CQEs and communicate the result back to the originating \ats.
    349349Since CQEs only carry a signed 32-bit result, in addition to a copy of the @user_data@ field, all that is needed to communicate the result is a simple future~\cite{wiki:future}.
    350350If the submission side does not designate submitters, polling can also submit all SQEs as it is polling events.
    351 A simple approach to polling is to allocate a \gls{thrd} per @io_uring@ instance and simply let the poller \glspl{thrd} poll their respective instances when scheduled.
     351A simple approach to polling is to allocate a \at per @io_uring@ instance and simply let the poller \ats poll their respective instances when scheduled.
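
Assuming liburing and a future type whose @fulfil@ operation records the result and reschedules the waiting \at (both hypothetical names here), the poller loop reduces to the following sketch.

\begin{lstlisting}
#include <liburing.h>

struct future;                                    // assumed: result slot plus parked thread
void fulfil( struct future *, int result );       // assumed: record the result and unpark the thread

static void poll_completions( struct io_uring * ring ) {
	struct io_uring_cqe * cqe;
	unsigned head, seen = 0;
	io_uring_for_each_cqe( ring, head, cqe ) {            // walk the CQEs currently available
		fulfil( io_uring_cqe_get_data( cqe ), cqe->res ); // hand back the signed 32-bit result
		seen += 1;
	}
	io_uring_cq_advance( ring, seen );                    // mark the CQEs as consumed
}
\end{lstlisting}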
    352352
    353353With the pool of shared instances approach, the big advantage is that it is fairly flexible.
    354 It does not impose restrictions on what \glspl{thrd} submitting \io operations can and cannot do between allocations and submissions.
     354It does not impose restrictions on what \ats submitting \io operations can and cannot do between allocations and submissions.
    355355It can also gracefully handle running out of resources, either SQEs or the kernel returning @EBUSY@.
    356356The downside to this approach is that many of the steps used for submitting need complex synchronization to work properly.
    357 The routing and allocation algorithm needs to keep track of which ring instances have available SQEs, block incoming requests if no instance is available, prevent barging if \glspl{thrd} are already queued up waiting for SQEs and handle SQEs being freed.
     357The routing and allocation algorithm needs to keep track of which ring instances have available SQEs, block incoming requests if no instance is available, prevent barging if \ats are already queued up waiting for SQEs, and handle SQEs being freed.
    358358The submission side needs to safely append SQEs to the ring buffer, correctly handle chains, make sure no SQE is dropped or left pending forever, notify the allocation side when SQEs can be reused, and handle the kernel returning @EBUSY@.
    359359All this synchronization has a significant cost, and compared to the private-instance approach, this synchronization is entirely overhead.
     
    370370
    371371In this approach, each cluster (see Figure~\ref{fig:system}) owns a pool of @io_uring@ instances managed by an \newterm{arbiter}.
    372 When a \gls{thrd} attempts to issue an \io operation, it ask for an instance from the arbiter and issues requests to that instance.
    373 This instance is now bound to the \gls{proc} the \gls{thrd} is running on.
     372When a \at attempts to issue an \io operation, it asks for an instance from the arbiter and issues requests to that instance.
     373This instance is now bound to the \gls{proc} the \at is running on.
    374374This binding is kept until the arbiter decides to revoke it, taking back the instance and reverting the \gls{proc} to its initial state with respect to \io.
    375375This tight coupling means that synchronization can be minimal since only one \gls{proc} can use the instance at a time, akin to the private instances approach.
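
In pseudo-C, the common path looks as follows; the @instance@ type, the @ring@ field, and @arbiter_acquire@ are illustrative names for this sketch, not the runtime's actual identifiers.

\begin{lstlisting}
struct instance;                                             // an io_uring ring plus bookkeeping
struct arbiter;                                              // owns the cluster-wide pool of instances
struct processor { struct instance * ring; };                // the binding, if any

struct instance * arbiter_acquire( struct arbiter *, struct processor * ); // assumed slow-path helper

static struct instance * current_instance( struct processor * p, struct arbiter * a ) {
	if ( p->ring == NULL )                    // first I/O on this processor, or the instance was revoked
		p->ring = arbiter_acquire( a, p );    // slow path: involve the arbiter
	return p->ring;                           // fast path: no synchronization with other processors
}
\end{lstlisting}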
     
    380380        \item The current \gls{proc} does not hold an instance.
    381381        \item The current instance does not have sufficient SQEs to satisfy the request.
    382         \item The current \gls{proc} has a wrong instance, this happens if the submitting \gls{thrd} context-switched between allocation and submission, called \newterm{external submissions}.
     382        \item The current \gls{proc} has the wrong instance; this happens if the submitting \at context-switched between allocation and submission, called \newterm{external submissions}.
    383383\end{enumerate}
    384384However, even when the arbiter is not directly needed, \glspl{proc} need to make sure that their instance ownership is not being revoked, which is accomplished by a lock-\emph{less} handshake.\footnote{
  • doc/theses/thierry_delisle_PhD/thesis/text/practice.tex

    r7a0f798b ra44514e  
    108108Because idle sleep is spurious, this data structure has strict performance requirements, in addition to strict correctness requirements.
    109109Next, some mechanism is needed to block \glspl{kthrd}, \eg @pthread_cond_wait@ on a pthread condition variable.
    110 The complexity here is to support \at parking and unparking, user-level locking, timers, \io operations, and all other \CFA features with minimal complexity.
     110The complexity here is to support \at \glslink{atblock}{parking} and \glslink{atsched}{unparking}, user-level locking, timers, \io operations, and all other \CFA features with minimal complexity.
    111111Finally, the scheduler needs a heuristic to determine when to block and unblock an appropriate number of \procs.
    112112However, this third challenge is outside the scope of this thesis because developing a general heuristic is complex enough to justify its own work.
     
    125125
    126126\subsection{\lstinline{pthread_mutex}/\lstinline{pthread_cond}}
    127 The classic option is to use some combination of the pthread mutual exclusion and synchronization locks, allowing a safe park/unpark of a \gls{kthrd} to/from a @pthread_cond@.
     127The classic option is to use some combination of the pthread mutual exclusion and synchronization locks, allowing a safe \park/\unpark of a \gls{kthrd} to/from a @pthread_cond@.
    128128While this approach works for \glspl{kthrd} waiting among themselves, \io operations do not provide a mechanism to signal @pthread_cond@s.
    129129For \io results to wake a \proc waiting on a @pthread_cond@, a different \gls{kthrd} must be woken up first, which then signals the \proc.
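
A minimal C sketch of the \park/\unpark among \glspl{kthrd} follows; the flag guards against both spurious wake-ups and a wake-up posted before the sleeper actually blocks, but note that nothing in an \io completion can invoke the wake side directly, which is the limitation discussed above.

\begin{lstlisting}
#include <pthread.h>
#include <stdbool.h>

struct idle_gate {
	pthread_mutex_t lock;
	pthread_cond_t  cond;
	bool wake;                                // pending wake-up
};

void kthrd_park( struct idle_gate * g ) {
	pthread_mutex_lock( &g->lock );
	while ( ! g->wake )                       // re-check: spurious wake-ups are allowed
		pthread_cond_wait( &g->cond, &g->lock );
	g->wake = false;
	pthread_mutex_unlock( &g->lock );
}

void kthrd_unpark( struct idle_gate * g ) {
	pthread_mutex_lock( &g->lock );
	g->wake = true;
	pthread_cond_signal( &g->cond );
	pthread_mutex_unlock( &g->lock );
}
\end{lstlisting}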
     
    137137
    138138\subsection{Event FDs}
    139 Another interesting approach is to use an event file descriptor\cite{eventfd}.
     139Another interesting approach is to use an event file descriptor\cite{MAN:eventfd}.
    140140This Linux feature is a file descriptor that behaves like \io, \ie, uses @read@ and @write@, but also behaves like a semaphore.
    141141Indeed, all reads and writes must use word-sized values, \ie 64 or 32 bits.
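
For example, an eventfd can serve as a simple binary semaphore for an idle \proc; the helper names below are illustrative.

\begin{lstlisting}
#include <sys/eventfd.h>
#include <unistd.h>
#include <stdint.h>

static int efd;                                   // one eventfd per idle-sleep context

void idle_init( void ) { efd = eventfd( 0, 0 ); }

void idle_sleep( void ) {                         // park: blocks until the counter is non-zero
	uint64_t val;
	read( efd, &val, sizeof(val) );               // consumes (resets) the counter
}

void idle_wake( void ) {                          // unpark: makes a blocked read return
	uint64_t one = 1;
	write( efd, &one, sizeof(one) );              // adds to the counter
}
\end{lstlisting}

Because an eventfd is itself a file descriptor, it can also be placed in an @epoll@ or @io_uring@ interest set, allowing \io completions and explicit wake-ups to share one blocking mechanism.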
  • doc/theses/thierry_delisle_PhD/thesis/text/runtime.tex

    r7a0f798b ra44514e  
    1313\section{M:N Threading}\label{prev:model}
    1414
    15 Threading in \CFA is based on \Gls{uthrding}, where \glspl{thrd} are the representation of a unit of work. As such, \CFA programmers should expect these units to be fairly inexpensive, \ie programmers should be able to create a large number of \glspl{thrd} and switch among \glspl{thrd} liberally without many concerns for performance.
     15Threading in \CFA is based on \Gls{uthrding}, where \ats are the representation of units of work. As such, \CFA programmers should expect these units to be fairly inexpensive, \ie programmers should be able to create a large number of \ats and switch among them liberally without much concern for performance.
    1616
    1717The \CFA M:N threading model is implemented using many user-level threads mapped onto fewer \glspl{kthrd}.
    1818The user-level threads have the same semantic meaning as \glspl{kthrd} in the 1:1 model: they represent an independent thread of execution with its own stack.
    19 The difference is that user-level threads do not have a corresponding object in the kernel; they are handled by the runtime in user space and scheduled onto \glspl{kthrd}, referred to as \glspl{proc} in this document. \Glspl{proc} run a \gls{thrd} until it context switches out, it then chooses a different \gls{thrd} to run.
     19The difference is that user-level threads do not have a corresponding object in the kernel; they are handled by the runtime in user space and scheduled onto \glspl{kthrd}, referred to as \glspl{proc} in this document. \Glspl{proc} run a \at until it context switches out, and then choose a different \at to run.
    2020
    2121\section{Clusters}
    2222\CFA allows the option to group user-level threading, in the form of clusters.
    23 Both \glspl{thrd} and \glspl{proc} belong to a specific cluster.
    24 \Glspl{thrd} are only scheduled onto \glspl{proc} in the same cluster and scheduling is done independently of other clusters.
     23Both \ats and \glspl{proc} belong to a specific cluster.
     24\Glspl{at} are only scheduled onto \glspl{proc} in the same cluster and scheduling is done independently of other clusters.
    2525Figure~\ref{fig:system} shows an overview of the \CFA runtime, which allows programmers to tightly control parallelism.
    2626It also opens the door to handling effects like NUMA, by pinning clusters to a specific NUMA node\footnote{This capability is not currently implemented in \CFA, but the only hurdle left is creating a generic interface for CPU masks.}.
     
    3030                \input{system.pstex_t}
    3131        \end{center}
    32         \caption[Overview of the \CFA runtime]{Overview of the \CFA runtime \newline \Glspl{thrd} are scheduled inside a particular cluster and run on the \glspl{proc} that belong to the cluster. The discrete-event manager, which handles preemption and timeout, is a \gls{proc} that lives outside any cluster and does not run \glspl{thrd}.}
     32        \caption[Overview of the \CFA runtime]{Overview of the \CFA runtime \newline \Glspl{at} are scheduled inside a particular cluster and run on the \glspl{proc} that belong to the cluster. The discrete-event manager, which handles preemption and timeout, is a \gls{proc} that lives outside any cluster and does not run \ats.}
    3333        \label{fig:system}
    3434\end{figure}
     
    3838
    3939\section{\glsxtrshort{io}}\label{prev:io}
    40 Prior to this work, the \CFA runtime did not add any particular support for \glsxtrshort{io} operations. While all \glsxtrshort{io} operations available in C are available in \CFA, \glsxtrshort{io} operations are designed for the POSIX threading model~\cite{pthreads}. Using these 1:1 threading operations in an M:N threading model means \glsxtrshort{io} operations block \glspl{proc} instead of \glspl{thrd}. While this can work in certain cases, it limits the number of concurrent operations to the number of \glspl{proc} rather than \glspl{thrd}. It also means deadlock can occur because all \glspl{proc} are blocked even if at least one \gls{thrd} is ready to run. A simple example of this type of deadlock would be as follows:
     40Prior to this work, the \CFA runtime did not add any particular support for \glsxtrshort{io} operations. While all \glsxtrshort{io} operations available in C are available in \CFA, \glsxtrshort{io} operations are designed for the POSIX threading model~\cite{pthreads}. Using these 1:1 threading operations in an M:N threading model means \glsxtrshort{io} operations block \glspl{proc} instead of \ats. While this can work in certain cases, it limits the number of concurrent operations to the number of \glspl{proc} rather than \ats. It also means deadlock can occur because all \glspl{proc} are blocked even if at least one \at is ready to run. A simple example of this type of deadlock would be as follows:
    4141
    4242\begin{quote}
    43 Given a simple network program with 2 \glspl{thrd} and a single \gls{proc}, one \gls{thrd} sends network requests to a server and the other \gls{thrd} waits for a response from the server.
    44 If the second \gls{thrd} races ahead, it may wait for responses to requests that have not been sent yet.
    45 In theory, this should not be a problem, even if the second \gls{thrd} waits, because the first \gls{thrd} is still ready to run and should be able to get CPU time to send the request.
    46 With M:N threading, while the first \gls{thrd} is ready, the lone \gls{proc} \emph{cannot} run the first \gls{thrd} if it is blocked in the \glsxtrshort{io} operation of the second \gls{thrd}.
     43Given a simple network program with 2 \ats and a single \gls{proc}, one \at sends network requests to a server and the other \at waits for a response from the server.
     44If the second \at races ahead, it may wait for responses to requests that have not been sent yet.
     45In theory, this should not be a problem, even if the second \at waits, because the first \at is still ready to run and should be able to get CPU time to send the request.
     46With M:N threading, while the first \at is ready, the lone \gls{proc} \emph{cannot} run the first \at if it is blocked in the \glsxtrshort{io} operation of the second \at.
    4747If this happens, the system is in a synchronization deadlock\footnote{In this example, the deadlock could be resolved if the server sends unprompted messages to the client.
    4848However, this solution is neither general nor appropriate even in this simple case.}.
    4949\end{quote}
    5050
    51 Therefore, one of the objective of this work is to introduce \emph{User-Level \glsxtrshort{io}}, which like \glslink{uthrding}{User-Level \emph{Threading}}, blocks \glspl{thrd} rather than \glspl{proc} when doing \glsxtrshort{io} ope      rations.
    52 This feature entails multiplexing the \glsxtrshort{io} operations of many \glspl{thrd} onto fewer \glspl{proc}.
     51Therefore, one of the objectives of this work is to introduce \emph{User-Level \glsxtrshort{io}}, which like \glslink{uthrding}{User-Level \emph{Threading}}, blocks \ats rather than \glspl{proc} when doing \glsxtrshort{io} operations.
     52This feature entails multiplexing the \glsxtrshort{io} operations of many \ats onto fewer \glspl{proc}.
    5353The multiplexing requires a single \gls{proc} to execute multiple \glsxtrshort{io} operations in parallel.
    5454This requirement cannot be met with operations that block \glspl{proc}, \ie \glspl{kthrd}, since the first operation would prevent starting new operations for its blocking duration.