Timestamp:
Sep 13, 2022, 3:07:25 PM (20 months ago)
Author:
Thierry Delisle <tdelisle@…>
Branches:
ADT, ast-experimental, master, pthread-emulation
Children:
3606fe4
Parents:
1c334d1
Message:

Ran second spell/grammar checker and now the cows have come home

File:
1 edited

  • doc/theses/thierry_delisle_PhD/thesis/text/core.tex

    r1c334d1 rfc96890  
    1313To match expectations, the design must offer the programmer sufficient guarantees so that, as long as they respect the execution mental model, the system also respects this model.
    1414
    15 For threading, a simple and common execution mental model is the ``Ideal multi-tasking CPU'' :
     15For threading, a simple and common execution mental model is the ``ideal multi-tasking CPU'':
    1616
    1717\begin{displayquote}[Linux CFS\cite{MAN:linux/cfs}]
    18         {[The]} ``Ideal multi-tasking CPU'' is a (non-existent  :-)) CPU that has 100\% physical power and which can run each task at precise equal speed, in parallel, each at [an equal fraction of the] speed.  For example: if there are 2 running tasks, then it runs each at 50\% physical power --- i.e., actually in parallel.
     18        {[The]} ``ideal multi-tasking CPU'' is a (non-existent  :-)) CPU that has 100\% physical power and which can run each task at precise equal speed, in parallel, each at [an equal fraction of the] speed.  For example: if there are 2 running tasks, then it runs each at 50\% physical power --- i.e., actually in parallel.
    1919        \label{q:LinuxCFS}
    2020\end{displayquote}
     
    2222Applied to \ats, this model states that every ready \at immediately runs in parallel with all other ready \ats. While a strict implementation of this model is not feasible, programmers still have expectations about scheduling that come from this model.
    2323
    24 In general, the expectation at the center of this model is that ready \ats do not interfere with each other but simply share the hardware.
     24In general, the expectation at the centre of this model is that ready \ats do not interfere with each other but simply share the hardware.
    2525This assumption makes it easier to reason about threading because ready \ats can be thought of in isolation and the effect of the scheduler can be virtually ignored.
    2626This expectation of \at independence means the scheduler is expected to offer two guarantees:
     
    3131
    3232It is important to note that these guarantees are expected only up to a point.
    33 \Glspl{at} that are ready to run should not be prevented to do so, but they still share the limited hardware resources.
     33\Glspl{at} that are ready to run should not be prevented from doing so, but they still share the limited hardware resources.
    3434Therefore, the guarantee is considered respected if a \at gets access to a \emph{fair share} of the hardware resources, even if that share is very small.
    3535
     
    9292\subsubsection{Scalability}
    9393The most basic performance challenge of a scheduler is scalability.
    94 Given a large number of \procs and an even larger number of \ats, scalability measures how fast \procs can enqueue and dequeues \ats.
    95 One could expect that doubling the number of \procs would double the rate at which \ats are dequeued, but contention on the internal data structure of the scheduler can lead to worst improvements.
     94Given a large number of \procs and an even larger number of \ats, scalability measures how fast \procs can enqueue and dequeue \ats.
     95One could expect that doubling the number of \procs would double the rate at which \ats are dequeued, but contention on the internal data structure of the scheduler can diminish the improvements.
    9696While the ready queue itself can be sharded to alleviate the main source of contention, auxiliary scheduling features, \eg counting ready \ats, can also be sources of contention.
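For example, maintaining the count of ready \ats in a single global counter forces every \proc to contend on one cache line, while a sharded counter makes updates contention-free at the cost of approximate reads. The following C sketch illustrates the idea; all names, \eg \texttt{NSHARD} and \texttt{ready\_count}, are hypothetical and not part of the \CFA runtime:

\begin{lstlisting}[language=C]
#include <stdatomic.h>

#define NSHARD 64
struct shard {
	_Atomic long count;
	char pad[64 - sizeof(_Atomic long)];   // pad to a cache line to avoid false sharing
};
static struct shard ready_count[NSHARD];

// Each proc updates only its own shard, so updates never contend.
void ready_inc( unsigned proc_id ) {
	atomic_fetch_add_explicit( &ready_count[proc_id % NSHARD].count, 1, memory_order_relaxed );
}

// Readers pay the aggregation cost instead; the total is approximate by design.
long ready_total( void ) {
	long sum = 0;
	for ( unsigned i = 0; i < NSHARD; i += 1 )
		sum += atomic_load_explicit( &ready_count[i].count, memory_order_relaxed );
	return sum;
}
\end{lstlisting}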
    9797
     
    108108The problem is a single point of contention when adding/removing \ats.
    109109As shown in the evaluation sections, most production schedulers do scale when adding \glspl{hthrd}.
    110 The solution to this problem is to shard the ready queue: create multiple \emph{sub-queues} forming the logical ready queue and the sub-queues are accessed by multiple \glspl{hthrd} without interfering.
     110The solution to this problem is to shard the ready queue: create multiple \emph{sub-queues} that together form the logical ready queue and can be accessed by multiple \glspl{hthrd} without interfering.
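A minimal C sketch of this shape, with hypothetical names and a plain POSIX spinlock standing in for whatever synchronization the real scheduler uses:

\begin{lstlisting}[language=C]
#include <pthread.h>

struct thread_desc;                        // a ready thread (opaque here)

// One sub-queue: an intrusive FIFO protected by its own lock.
struct subqueue {
	pthread_spinlock_t lock;
	struct thread_desc * head, * tail;
};

// The logical ready queue is an array of sub-queues; hardware threads
// operating on different sub-queues never touch shared state.
struct ready_queue {
	struct subqueue * queues;
	unsigned nqueues;
};
\end{lstlisting}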
    111111
    112112Before going into the design of \CFA's scheduler, it is relevant to discuss two sharding solutions that served as inspiration for the scheduler in this thesis.
     
    114114\subsection{Work-Stealing}
    115115
    116 As mentioned in \ref{existing:workstealing}, a popular sharding approach for the ready-queue is work-stealing.
     116As mentioned in \ref{existing:workstealing}, a popular sharding approach for the ready queue is work-stealing.
    117117In this approach, each \gls{proc} has its own local sub-queue, and \glspl{proc} only access each other's sub-queues if they run out of work on their own sub-queue.
    118118The interesting aspect of work stealing happens in the steady-state scheduling case, \ie all \glspl{proc} have work and no load balancing is needed.
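This split between the steady state and the stealing path can be expressed as a short sketch; the helper routines (\texttt{pop\_front}, \texttt{pop\_back}, \texttt{random\_victim}) are assumed rather than given:

\begin{lstlisting}[language=C]
#include <stddef.h>

struct subqueue;
struct thread_desc;
struct processor { struct subqueue * local; /* ... */ };

extern struct thread_desc * pop_front( struct subqueue * );  // owner end
extern struct thread_desc * pop_back( struct subqueue * );   // thief end
extern struct processor * random_victim( void );

struct thread_desc * next_thread( struct processor * this ) {
	// Steady state: only the local sub-queue is touched.
	struct thread_desc * t = pop_front( this->local );
	// Out of work: repeatedly pick a victim and try to steal.
	while ( t == NULL ) {
		struct processor * victim = random_victim();
		t = pop_back( victim->local );   // steal from the opposite end
	}
	return t;
}
\end{lstlisting}

In the steady-state case the stealing loop is never entered, which is what makes work-stealing cheap when every \gls{proc} has work.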
     
    134134Timestamps are added to each element of a sub-queue.
    135135\item
    136 A \gls{proc} randomly tests ready-queues until it has acquired one or two queues.
     136A \gls{proc} randomly tests ready queues until it has acquired one or two queues.
    137137\item
    138138If two queues are acquired, the older of the two \ats is dequeued from the front of the acquired queues.
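The steps above can be summarized in a C sketch; locking helpers such as \texttt{try\_lock\_random} are hypothetical, and a real implementation must also avoid acquiring the same sub-queue twice:

\begin{lstlisting}[language=C]
#include <stddef.h>
#include <stdint.h>

struct thread_desc { uint64_t timestamp; /* enqueue time */ };
struct subqueue { struct thread_desc * head; /* lock, tail, ... */ };
struct ready_queue { struct subqueue * queues; unsigned nqueues; };

extern struct subqueue * try_lock_random( struct ready_queue * );
extern struct thread_desc * pop_front( struct subqueue * );
extern void unlock( struct subqueue * );

struct thread_desc * dequeue( struct ready_queue * rq ) {
	struct subqueue * q1 = NULL, * q2 = NULL;
	while ( q1 == NULL ) q1 = try_lock_random( rq );   // acquire at least one queue
	q2 = try_lock_random( rq );                        // opportunistically try a second
	// With two queues, dequeue from the one with the older (smaller timestamp) head.
	if ( q2 && q2->head && ( q1->head == NULL || q2->head->timestamp < q1->head->timestamp ) ) {
		struct subqueue * tmp = q1; q1 = q2; q2 = tmp;
	}
	struct thread_desc * t = pop_front( q1 );
	if ( q2 ) unlock( q2 );
	unlock( q1 );
	return t;
}
\end{lstlisting}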
     
    246246
    247247A simple solution to this problem is to use an exponential moving average\cite{wiki:ma} (MA) instead of a raw timestamp, as shown in Figure~\ref{fig:base-ma}.
    248 Note, that this is more complex because the \at at the head of a sub-queue is still waiting, so its wait time has not ended.
     248Note that this is more complex because the \at at the head of a sub-queue is still waiting, so its wait time has not ended.
    249249Therefore, the exponential moving average is an average of how long each dequeued \at has waited.
    250250To compare sub-queues, the timestamp at the head must be compared to the current time, yielding the best-case wait time for the \at at the head of the queue.
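A sketch of this bookkeeping follows; the smoothing weight \texttt{ALPHA} is an arbitrary illustrative constant, and combining the head's best-case wait with the average via a maximum is only one plausible choice:

\begin{lstlisting}[language=C]
#include <stddef.h>
#include <stdint.h>

struct thread_desc { uint64_t timestamp; /* enqueue time */ };
struct subqueue { struct thread_desc * head; double avg_wait; /* ... */ };

#define ALPHA 0.125   // hypothetical smoothing weight

// On each dequeue, fold the completed wait into the moving average:
//   avg = ALPHA * waited + (1 - ALPHA) * avg
void update_average( struct subqueue * q, struct thread_desc * t, uint64_t now ) {
	uint64_t waited = now - t->timestamp;
	q->avg_wait = ALPHA * (double)waited + (1.0 - ALPHA) * q->avg_wait;
}

// To compare sub-queues, the head's best-case wait (it is still
// waiting, so its wait time has not ended) supplements the average.
double current_wait( struct subqueue * q, uint64_t now ) {
	if ( q->head == NULL ) return 0.0;
	double head_wait = (double)( now - q->head->timestamp );
	return head_wait > q->avg_wait ? head_wait : q->avg_wait;
}
\end{lstlisting}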
     
    259259Conversely, the active sub-queues do not benefit much from helping since starvation is already a non-issue.
    260260This puts this algorithm in the awkward situation of paying for a largely unnecessary cost.
    261 The good news is that this problem can be mitigated
     261The good news is that this problem can be mitigated.
    262262
    263263\subsection{Redundant Timestamps}\label{relaxedtimes}
     
    279279        \input{base_ts2.pstex_t}
    280280        \caption[\CFA design with Redundant Timestamps]{\CFA design with Redundant Timestamps \smallskip\newline An array is added containing a copy of the timestamps.
    281         These timestamps are written to with relaxed atomics, so there is no order among concurrent memory accesses, leading to fewer cache invalidations.}
     281        These timestamps are written-to with relaxed atomics, so there is no order among concurrent memory accesses, leading to fewer cache invalidations.}
    282282        \label{fig:base-ts2}
    283283\end{figure}
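The publish/poll pattern behind this array can be shown with C11 relaxed atomics; the bound and names are hypothetical:

\begin{lstlisting}[language=C]
#include <stdatomic.h>
#include <stdint.h>

#define MAX_QUEUES 256                       // hypothetical bound

// Flat array of timestamp copies, separate from the sub-queues.
static _Atomic(uint64_t) ts_copy[MAX_QUEUES];

// Owners publish with a relaxed store: no ordering is enforced
// among concurrent accesses, so it compiles to a plain write.
void publish_ts( unsigned qid, uint64_t ts ) {
	atomic_store_explicit( &ts_copy[qid], ts, memory_order_relaxed );
}

// Helpers poll with a relaxed load; a slightly stale value is
// acceptable because helping only needs to be correct often enough.
uint64_t peek_ts( unsigned qid ) {
	return atomic_load_explicit( &ts_copy[qid], memory_order_relaxed );
}
\end{lstlisting}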
     
    292292With redundant timestamps, this scheduling algorithm achieves both the fairness and performance requirements on most machines.
    293293The problem is that the cost of polling and helping is not necessarily consistent across each \gls{hthrd}.
    294 For example, on machines with a CPU containing multiple hyper threads and cores and multiple CPU sockets, cache misses can be satisfied from the caches on the same (local) CPU, or by a CPU on a different (remote) socket.
     294For example, on machines with multiple CPU sockets, where each CPU contains multiple hyper-threads and cores, cache misses can be satisfied from the caches on the same (local) CPU, or by a CPU on a different (remote) socket.
    295295Cache misses satisfied by a remote CPU have significantly higher latency than from the local CPU.
    296296However, these delays are not specific to systems with multiple CPUs.
     
    344344Therefore, the approach used in the \CFA scheduler is to have per-\proc sub-queues, but have an explicit data structure to track which cache substructure each sub-queue is tied to.
    345345This tracking requires some finesse because reading this data structure must lead to fewer cache misses than not having the data structure in the first place.
    346 A key element however is that, like the timestamps for helping, reading the cache instance mapping only needs to give the correct result \emph{often enough}.
     346A key element, however, is that, like the timestamps for helping, reading the cache instance mapping only needs to give the correct result \emph{often enough}.
    347347Therefore, the algorithm can be built as follows: before enqueuing or dequeuing a \at, each \proc queries the CPU id and the corresponding cache instance.
    348348Since sub-queues are tied to \procs, each \proc can then update the cache instance mapped to the local sub-queue(s).
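A sketch of this query-and-update step, assuming Linux's \texttt{sched\_getcpu} and a CPU-to-cache-instance table built at startup (all other names are hypothetical):

\begin{lstlisting}[language=C]
#define _GNU_SOURCE
#include <sched.h>        // sched_getcpu(), Linux-specific
#include <stdatomic.h>

#define MAX_CPUS   512    // hypothetical bounds
#define MAX_QUEUES 256

static unsigned cache_of[MAX_CPUS];               // CPU id -> cache instance, filled at startup
static _Atomic(unsigned) subq_cache[MAX_QUEUES];  // cache instance last seen by each sub-queue's owner

struct processor { unsigned qid; /* ... */ };

// Before enqueuing or dequeuing, query the current CPU and refresh
// the mapping for the local sub-queue; because the result only needs
// to be correct often enough, stale reads by other procs are tolerated.
void refresh_cache_mapping( struct processor * this ) {
	int cpu = sched_getcpu();
	if ( cpu >= 0 )
		atomic_store_explicit( &subq_cache[this->qid], cache_of[cpu], memory_order_relaxed );
}
\end{lstlisting}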