Timestamp:
Sep 13, 2022, 3:07:25 PM (20 months ago)
Author:
Thierry Delisle <tdelisle@…>
Branches:
ADT, ast-experimental, master, pthread-emulation
Children:
3606fe4
Parents:
1c334d1
Message:

Ran second spell/grammar checker and now the cows have come home

File:
1 edited

  • doc/theses/thierry_delisle_PhD/thesis/text/core.tex

    r1c334d1 rfc96890  
    1313To match expectations, the design must offer the programmer sufficient guarantees so that, as long as they respect the execution mental model, the system also respects this model.
    1414
    15 For threading, a simple and common execution mental model is the ``Ideal multi-tasking CPU'' :
     15For threading, a simple and common execution mental model is the ``ideal multi-tasking CPU'':
    1616
    1717\begin{displayquote}[Linux CFS\cite{MAN:linux/cfs}]
    18         {[The]} ``Ideal multi-tasking CPU'' is a (non-existent  :-)) CPU that has 100\% physical power and which can run each task at precise equal speed, in parallel, each at [an equal fraction of the] speed.  For example: if there are 2 running tasks, then it runs each at 50\% physical power --- i.e., actually in parallel.
     18        {[The]} ``ideal multi-tasking CPU'' is a (non-existent  :-)) CPU that has 100\% physical power and which can run each task at precise equal speed, in parallel, each at [an equal fraction of the] speed.  For example: if there are 2 running tasks, then it runs each at 50\% physical power --- i.e., actually in parallel.
    1919        \label{q:LinuxCFS}
    2020\end{displayquote}
     
    2222Applied to \ats, this model states that every ready \at immediately runs in parallel with all other ready \ats. While a strict implementation of this model is not feasible, programmers still have expectations about scheduling that come from this model.
    2323
    24 In general, the expectation at the center of this model is that ready \ats do not interfere with each other but simply share the hardware.
     24In general, the expectation at the centre of this model is that ready \ats do not interfere with each other but simply share the hardware.
    2525This assumption makes it easier to reason about threading because ready \ats can be thought of in isolation and the effect of the scheduler can be virtually ignored.
    2626This expectation of \at independence means the scheduler is expected to offer two guarantees:
     
    3131
    3232It is important to note that these guarantees are expected only up to a point.
    33 \Glspl{at} that are ready to run should not be prevented to do so, but they still share the limited hardware resources.
     33\Glspl{at} that are ready to run should not be prevented from doing so, but they still share the limited hardware resources.
    3434Therefore, the guarantee is considered respected if a \at gets access to a \emph{fair share} of the hardware resources, even if that share is very small.
    3535
     
    9292\subsubsection{Scalability}
    9393The most basic performance challenge of a scheduler is scalability.
    94 Given a large number of \procs and an even larger number of \ats, scalability measures how fast \procs can enqueue and dequeues \ats.
    95 One could expect that doubling the number of \procs would double the rate at which \ats are dequeued, but contention on the internal data structure of the scheduler can lead to worst improvements.
     94Given a large number of \procs and an even larger number of \ats, scalability measures how fast \procs can enqueue and dequeue \ats.
     95One could expect that doubling the number of \procs would double the rate at which \ats are dequeued, but contention on the internal data structure of the scheduler can diminish the improvements.
    9696While the ready queue itself can be sharded to alleviate the main source of contention, auxiliary scheduling features, \eg counting ready \ats, can also be sources of contention.
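For example, maintaining the count of ready \ats in a single global counter forces every \proc to contend on one cache line, while a sharded counter makes updates contention-free at the cost of approximate reads. The following C sketch illustrates the idea; all names, \eg \texttt{NSHARD} and \texttt{ready\_count}, are hypothetical and not part of the \CFA runtime:

\begin{lstlisting}[language=C]
#include <stdatomic.h>

#define NSHARD 64
struct shard {
	_Atomic long count;
	char pad[64 - sizeof(_Atomic long)];   // pad to a cache line to avoid false sharing
};
static struct shard ready_count[NSHARD];

// Each proc updates only its own shard, so updates never contend.
void ready_inc( unsigned proc_id ) {
	atomic_fetch_add_explicit( &ready_count[proc_id % NSHARD].count, 1, memory_order_relaxed );
}

// Readers pay the aggregation cost instead; the total is approximate by design.
long ready_total( void ) {
	long sum = 0;
	for ( unsigned i = 0; i < NSHARD; i += 1 )
		sum += atomic_load_explicit( &ready_count[i].count, memory_order_relaxed );
	return sum;
}
\end{lstlisting}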
    9797
     
    108108The problem is a single point of contention when adding/removing \ats.
    109109As shown in the evaluation sections, most production schedulers do scale when adding \glspl{hthrd}.
    110 The solution to this problem is to shard the ready queue: create multiple \emph{sub-queues} forming the logical ready queue and the sub-queues are accessed by multiple \glspl{hthrd} without interfering.
     110The solution to this problem is to shard the ready queue: create multiple \emph{sub-queues} that together form the logical ready queue and can be accessed by multiple \glspl{hthrd} without interfering.
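A minimal C sketch of this shape, with hypothetical names and a plain POSIX spinlock standing in for whatever synchronization the real scheduler uses:

\begin{lstlisting}[language=C]
#include <pthread.h>

struct thread_desc;                        // a ready thread (opaque here)

// One sub-queue: an intrusive FIFO protected by its own lock.
struct subqueue {
	pthread_spinlock_t lock;
	struct thread_desc * head, * tail;
};

// The logical ready queue is an array of sub-queues; hardware threads
// operating on different sub-queues never touch shared state.
struct ready_queue {
	struct subqueue * queues;
	unsigned nqueues;
};
\end{lstlisting}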
    111111
    112112Before going into the design of \CFA's scheduler, it is relevant to discuss two sharding solutions that served as inspiration for the scheduler in this thesis.
     
    114114\subsection{Work-Stealing}
    115115
    116 As mentioned in \ref{existing:workstealing}, a popular sharding approach for the ready-queue is work-stealing.
     116As mentioned in \ref{existing:workstealing}, a popular sharding approach for the ready queue is work-stealing.
    117117In this approach, each \gls{proc} has its own local sub-queue, and \glspl{proc} only access each other's sub-queues if they run out of work on their own sub-queue.
    118118The interesting aspect of work stealing happens in the steady-state scheduling case, \ie all \glspl{proc} have work and no load balancing is needed.
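This split between the steady state and the stealing path can be expressed as a short sketch; the helper routines (\texttt{pop\_front}, \texttt{pop\_back}, \texttt{random\_victim}) are assumed rather than given:

\begin{lstlisting}[language=C]
#include <stddef.h>

struct subqueue;
struct thread_desc;
struct processor { struct subqueue * local; /* ... */ };

extern struct thread_desc * pop_front( struct subqueue * );  // owner end
extern struct thread_desc * pop_back( struct subqueue * );   // thief end
extern struct processor * random_victim( void );

struct thread_desc * next_thread( struct processor * this ) {
	// Steady state: only the local sub-queue is touched.
	struct thread_desc * t = pop_front( this->local );
	// Out of work: repeatedly pick a victim and try to steal.
	while ( t == NULL ) {
		struct processor * victim = random_victim();
		t = pop_back( victim->local );   // steal from the opposite end
	}
	return t;
}
\end{lstlisting}

In the steady-state case the stealing loop is never entered, which is what makes work-stealing cheap when every \gls{proc} has work.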
     
    134134Timestamps are added to each element of a sub-queue.
    135135\item
    136 A \gls{proc} randomly tests ready-queues until it has acquired one or two queues.
     136A \gls{proc} randomly tests ready queues until it has acquired one or two queues.
    137137\item
    138138If two queues are acquired, the older of the two \ats is dequeued from the front of the acquired queues.
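The steps above can be summarized in a C sketch; locking helpers such as \texttt{try\_lock\_random} are hypothetical, and a real implementation must also avoid acquiring the same sub-queue twice:

\begin{lstlisting}[language=C]
#include <stddef.h>
#include <stdint.h>

struct thread_desc { uint64_t timestamp; /* enqueue time */ };
struct subqueue { struct thread_desc * head; /* lock, tail, ... */ };
struct ready_queue { struct subqueue * queues; unsigned nqueues; };

extern struct subqueue * try_lock_random( struct ready_queue * );
extern struct thread_desc * pop_front( struct subqueue * );
extern void unlock( struct subqueue * );

struct thread_desc * dequeue( struct ready_queue * rq ) {
	struct subqueue * q1 = NULL, * q2 = NULL;
	while ( q1 == NULL ) q1 = try_lock_random( rq );   // acquire at least one queue
	q2 = try_lock_random( rq );                        // opportunistically try a second
	// With two queues, dequeue from the one with the older (smaller timestamp) head.
	if ( q2 && q2->head && ( q1->head == NULL || q2->head->timestamp < q1->head->timestamp ) ) {
		struct subqueue * tmp = q1; q1 = q2; q2 = tmp;
	}
	struct thread_desc * t = pop_front( q1 );
	if ( q2 ) unlock( q2 );
	unlock( q1 );
	return t;
}
\end{lstlisting}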
     
    246246
    247247A simple solution to this problem is to use an exponential moving average\cite{wiki:ma} (MA) instead of a raw timestamp, as shown in Figure~\ref{fig:base-ma}.
    248 Note, that this is more complex because the \at at the head of a sub-queue is still waiting, so its wait time has not ended.
     248Note that this is more complex because the \at at the head of a sub-queue is still waiting, so its wait time has not ended.
    249249Therefore, the exponential moving average is an average of how long each dequeued \at has waited.
    250250To compare sub-queues, the timestamp at the head must be compared to the current time, yielding the best-case wait time for the \at at the head of the queue.
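A sketch of this bookkeeping follows; the smoothing weight \texttt{ALPHA} is an arbitrary illustrative constant, and combining the head's best-case wait with the average via a maximum is only one plausible choice:

\begin{lstlisting}[language=C]
#include <stddef.h>
#include <stdint.h>

struct thread_desc { uint64_t timestamp; /* enqueue time */ };
struct subqueue { struct thread_desc * head; double avg_wait; /* ... */ };

#define ALPHA 0.125   // hypothetical smoothing weight

// On each dequeue, fold the completed wait into the moving average:
//   avg = ALPHA * waited + (1 - ALPHA) * avg
void update_average( struct subqueue * q, struct thread_desc * t, uint64_t now ) {
	uint64_t waited = now - t->timestamp;
	q->avg_wait = ALPHA * (double)waited + (1.0 - ALPHA) * q->avg_wait;
}

// To compare sub-queues, the head's best-case wait (it is still
// waiting, so its wait time has not ended) supplements the average.
double current_wait( struct subqueue * q, uint64_t now ) {
	if ( q->head == NULL ) return 0.0;
	double head_wait = (double)( now - q->head->timestamp );
	return head_wait > q->avg_wait ? head_wait : q->avg_wait;
}
\end{lstlisting}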
     
    259259Conversely, the active sub-queues do not benefit much from helping since starvation is already a non-issue.
    260260This puts this algorithm in the awkward situation of paying for a largely unnecessary cost.
    261 The good news is that this problem can be mitigated
     261The good news is that this problem can be mitigated.
    262262
    263263\subsection{Redundant Timestamps}\label{relaxedtimes}
     
    279279        \input{base_ts2.pstex_t}
    280280        \caption[\CFA design with Redundant Timestamps]{\CFA design with Redundant Timestamps \smallskip\newline An array is added containing a copy of the timestamps.
    281         These timestamps are written to with relaxed atomics, so there is no order among concurrent memory accesses, leading to fewer cache invalidations.}
     281        These timestamps are written-to with relaxed atomics, so there is no order among concurrent memory accesses, leading to fewer cache invalidations.}
    282282        \label{fig:base-ts2}
    283283\end{figure}
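The publish/poll pattern behind this array can be shown with C11 relaxed atomics; the bound and names are hypothetical:

\begin{lstlisting}[language=C]
#include <stdatomic.h>
#include <stdint.h>

#define MAX_QUEUES 256                       // hypothetical bound

// Flat array of timestamp copies, separate from the sub-queues.
static _Atomic(uint64_t) ts_copy[MAX_QUEUES];

// Owners publish with a relaxed store: no ordering is enforced
// among concurrent accesses, so it compiles to a plain write.
void publish_ts( unsigned qid, uint64_t ts ) {
	atomic_store_explicit( &ts_copy[qid], ts, memory_order_relaxed );
}

// Helpers poll with a relaxed load; a slightly stale value is
// acceptable because helping only needs to be correct often enough.
uint64_t peek_ts( unsigned qid ) {
	return atomic_load_explicit( &ts_copy[qid], memory_order_relaxed );
}
\end{lstlisting}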
     
    292292With redundant timestamps, this scheduling algorithm achieves both the fairness and performance requirements on most machines.
    293293The problem is that the cost of polling and helping is not necessarily consistent across each \gls{hthrd}.
    294 For example, on machines with a CPU containing multiple hyper threads and cores and multiple CPU sockets, cache misses can be satisfied from the caches on the same (local) CPU, or by a CPU on a different (remote) socket.
     294For example, on machines with multiple CPU sockets, where each CPU contains multiple hyper-threads and cores, cache misses can be satisfied from the caches on the same (local) CPU, or by a CPU on a different (remote) socket.
    295295Cache misses satisfied by a remote CPU have significantly higher latency than from the local CPU.
    296296However, these delays are not specific to systems with multiple CPUs.
     
    344344Therefore, the approach used in the \CFA scheduler is to have per-\proc sub-queues, but have an explicit data structure to track which cache substructure each sub-queue is tied to.
    345345This tracking requires some finesse because reading this data structure must lead to fewer cache misses than not having the data structure in the first place.
    346 A key element however is that, like the timestamps for helping, reading the cache instance mapping only needs to give the correct result \emph{often enough}.
     346A key element, however, is that, like the timestamps for helping, reading the cache instance mapping only needs to give the correct result \emph{often enough}.
    347347Therefore, the algorithm can be built as follows: before enqueuing or dequeuing a \at, each \proc queries the CPU id and the corresponding cache instance.
    348348Since sub-queues are tied to \procs, each \proc can then update the cache instance mapped to the local sub-queue(s).
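A sketch of this query-and-update step, assuming Linux's \texttt{sched\_getcpu} and a CPU-to-cache-instance table built at startup (all other names are hypothetical):

\begin{lstlisting}[language=C]
#define _GNU_SOURCE
#include <sched.h>        // sched_getcpu(), Linux-specific
#include <stdatomic.h>

#define MAX_CPUS   512    // hypothetical bounds
#define MAX_QUEUES 256

static unsigned cache_of[MAX_CPUS];               // CPU id -> cache instance, filled at startup
static _Atomic(unsigned) subq_cache[MAX_QUEUES];  // cache instance last seen by each sub-queue's owner

struct processor { unsigned qid; /* ... */ };

// Before enqueuing or dequeuing, query the current CPU and refresh
// the mapping for the local sub-queue; because the result only needs
// to be correct often enough, stale reads by other procs are tolerated.
void refresh_cache_mapping( struct processor * this ) {
	int cpu = sched_getcpu();
	if ( cpu >= 0 )
		atomic_store_explicit( &subq_cache[this->qid], cache_of[cpu], memory_order_relaxed );
}
\end{lstlisting}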