Changeset fc96890 for doc/theses/thierry_delisle_PhD/thesis/text/core.tex
- Timestamp: Sep 13, 2022, 3:07:25 PM
- Branches: ADT, ast-experimental, master, pthread-emulation
- Children: 3606fe4
- Parents: 1c334d1
- File: 1 edited
doc/theses/thierry_delisle_PhD/thesis/text/core.tex (r1c334d1 → rfc96890)

@@ 13-20 @@
  13  To match expectations, the design must offer the programmer sufficient guarantees so that, as long as they respect the execution mental model, the system also respects this model.
  14  
- 15  For threading, a simple and common execution mental model is the ``Ideal multi-tasking CPU'':
+ 15  For threading, a simple and common execution mental model is the ``ideal multitasking CPU'':
  16  
  17  \begin{displayquote}[Linux CFS\cite{MAN:linux/cfs}]
- 18  {[The]} ``Ideal multi-tasking CPU'' is a (non-existent :-)) CPU that has 100\% physical power and which can run each task at precise equal speed, in parallel, each at [an equal fraction of the] speed. For example: if there are 2 running tasks, then it runs each at 50\% physical power --- i.e., actually in parallel.
+ 18  {[The]} ``ideal multi-tasking CPU'' is a (non-existent :-)) CPU that has 100\% physical power and which can run each task at precise equal speed, in parallel, each at [an equal fraction of the] speed. For example: if there are 2 running tasks, then it runs each at 50\% physical power --- i.e., actually in parallel.
  19  \label{q:LinuxCFS}
  20  \end{displayquote}

@@ 22-25 @@
  22  Applied to \ats, this model states that every ready \at immediately runs in parallel with all other ready \ats. While a strict implementation of this model is not feasible, programmers still have expectations about scheduling that come from this model.
  23  
- 24  In general, the expectation at the cent erof this model is that ready \ats do not interfere with each other but simply share the hardware.
+ 24  In general, the expectation at the centre of this model is that ready \ats do not interfere with each other but simply share the hardware.
  25  This assumption makes it easier to reason about threading because ready \ats can be thought of in isolation and the effect of the scheduler can be virtually ignored.
@@ 26 @@
  26  This expectation of \at independence means the scheduler is expected to offer two guarantees:

@@ 31-35 @@
  31  
  32  It is important to note that these guarantees are expected only up to a point.
- 33  \Glspl{at} that are ready to run should not be prevented to doso, but they still share the limited hardware resources.
+ 33  \Glspl{at} that are ready to run should not be prevented from doing so, but they still share the limited hardware resources.
  34  Therefore, the guarantee is considered respected if a \at gets access to a \emph{fair share} of the hardware resources, even if that share is very small.
  35  

@@ 92-97 @@
  92  \subsubsection{Scalability}
  93  The most basic performance challenge of a scheduler is scalability.
- 94  Given a large number of \procs and an even larger number of \ats, scalability measures how fast \procs can enqueue and dequeue s\ats.
+ 94  Given a large number of \procs and an even larger number of \ats, scalability measures how fast \procs can enqueue and dequeue \ats.
- 95  One could expect that doubling the number of \procs would double the rate at which \ats are dequeued, but contention on the internal data structure of the scheduler can lead to worst improvements.
+ 95  One could expect that doubling the number of \procs would double the rate at which \ats are dequeued, but contention on the internal data structure of the scheduler can diminish the improvements.
  96  While the ready queue itself can be sharded to alleviate the main source of contention, auxiliary scheduling features, \eg counting ready \ats, can also be sources of contention.
  97  

@@ 108-118 @@
  108  The problem is a single point of contention when adding/removing \ats.
  109  As shown in the evaluation sections, most production schedulers do scale when adding \glspl{hthrd}.
- 110  The solution to this problem is to shard the ready queue: create multiple \emph{sub-queues} forming the logical ready …
+ 110  The solution to this problem is to shard the ready queue: create multiple \emph{sub-queues} forming the logical ready-queue and the sub-queues are accessed by multiple \glspl{hthrd} without interfering.
  111  
  112  Before going into the design of \CFA's scheduler, it is relevant to discuss two sharding solutions that served as the inspiration scheduler in this thesis.
  113  
  114  \subsection{Work-Stealing}
  115  
- 116  As mentioned in \ref{existing:workstealing}, a popular sharding approach for the ready-queue is work-stealing.
+ 116  As mentioned in \ref{existing:workstealing}, a popular sharding approach for the ready queue is work-stealing.
  117  In this approach, each \gls{proc} has its own local sub-queue and \glspl{proc} only access each other's sub-queue if they run out of work on their local ready-queue.
  118  The interesting aspect of work stealing happens in the steady-state scheduling case, \ie all \glspl{proc} have work and no load balancing is needed.

@@ 134-138 @@
  134  Timestamps are added to each element of a sub-queue.
  135  \item
- 136  A \gls{proc} randomly tests ready-queues until it has acquired one or two queues.
+ 136  A \gls{proc} randomly tests ready queues until it has acquired one or two queues.
  137  \item
  138  If two queues are acquired, the older of the two \ats is dequeued from the front of the acquired queues.

@@ 246-248 @@
  246  
  247  A simple solution to this problem is to use an exponential moving average\cite{wiki:ma} (MA) instead of a raw timestamp, as shown in Figure~\ref{fig:base-ma}.
- 248  Note, that this is more complex because the \at at the head of a sub-queue is still waiting, so its wait time has not ended.
+ 248  Note that this is more complex because the \at at the head of a sub-queue is still waiting, so its wait time has not ended.
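The dequeue step of this timestamped work-stealing scheme (randomly test sub-queues, then take the older of the two heads) can be sketched in C roughly as follows. This is a minimal single-threaded illustration under assumed names and a fixed sub-queue count, not the thesis's implementation; locking and the acquire step are elided.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NQUEUES 8   /* illustrative sub-queue count */

/* Each queue element records the time it became ready. */
typedef struct elem {
    uint64_t ts;
    struct elem * next;
} elem_t;

typedef struct {
    elem_t * head;
} subqueue_t;

static subqueue_t queues[NQUEUES];

static elem_t * pop_head( subqueue_t * q ) {
    elem_t * e = q->head;
    if ( e ) q->head = e->next;
    return e;
}

/* Randomly test two sub-queues and dequeue the task that has waited
   longest, i.e. the head carrying the smaller (older) timestamp. */
elem_t * dequeue_oldest( void ) {
    subqueue_t * a = &queues[ rand() % NQUEUES ];
    subqueue_t * b = &queues[ rand() % NQUEUES ];
    if ( ! a->head ) return pop_head( b );   /* at most b has work */
    if ( ! b->head ) return pop_head( a );   /* only a has work */
    return pop_head( a->head->ts <= b->head->ts ? a : b );
}
```

Because the smaller timestamp marks the longer wait, preferring the older of two randomly chosen heads limits starvation without scanning every sub-queue on each dequeue.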
@@ 249-250 @@
  249  Therefore, the exponential moving average is an average of how long each dequeued \at has waited.
  250  To compare sub-queues, the timestamp at the head must be compared to the current time, yielding the best-case wait time for the \at at the head of the queue.

@@ 259-263 @@
  259  Conversely, the active sub-queues do not benefit much from helping since starvation is already a non-issue.
  260  This puts this algorithm in the awkward situation of paying for a largely unnecessary cost.
- 261  The good news is that this problem can be mitigated
+ 261  The good news is that this problem can be mitigated.
  262  
  263  \subsection{Redundant Timestamps}\label{relaxedtimes}

@@ 279-283 @@
  279  \input{base_ts2.pstex_t}
  280  \caption[\CFA design with Redundant Timestamps]{\CFA design with Redundant Timestamps \smallskip\newline An array is added containing a copy of the timestamps.
- 281  These timestamps are written …
+ 281  These timestamps are written-to with relaxed atomics, so there is no order among concurrent memory accesses, leading to fewer cache invalidations.}
  282  \label{fig:base-ts2}
  283  \end{figure}

@@ 292-295 @@
  292  With redundant timestamps, this scheduling algorithm achieves both the fairness and performance requirements on most machines.
  293  The problem is that the cost of polling and helping is not necessarily consistent across each \gls{hthrd}.
- 294  For example, on machines with a CPU containing multiple hyper threads and cores and multiple CPU sockets, cache misses can be satisfied from the caches on the same (local) CPU, or by a CPU on a different (remote) socket.
+ 294  For example on machines with a CPU containing multiple hyper threads and cores and multiple CPU sockets, cache misses can be satisfied from the caches on the same (local) CPU, or by a CPU on a different (remote) socket.
  295  Cache misses satisfied by a remote CPU have significantly higher latency than from the local CPU.
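The redundant-timestamp array described in the caption above can be sketched with C11 relaxed atomics roughly as follows. The array size, names, and the full scan in the helper are illustrative assumptions, not the \CFA runtime's code; the point is that relaxed accesses impose no ordering, so a stale read only delays helping and never corrupts state.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NQUEUES 8   /* illustrative sub-queue count */

/* Redundant copies of the per-sub-queue head timestamps: owners
   publish with relaxed stores, helpers poll with relaxed loads, so
   no memory ordering (and little cache traffic) is added. */
static _Atomic uint64_t ts_copy[NQUEUES];

/* The sub-queue owner refreshes its copy after enqueue/dequeue. */
void publish_ts( int queue, uint64_t head_ts ) {
    atomic_store_explicit( &ts_copy[queue], head_ts, memory_order_relaxed );
}

/* A helper scans the copies for the oldest (smallest) timestamp;
   the answer only needs to be correct often enough. */
int oldest_queue( void ) {
    int best = 0;
    uint64_t best_ts = atomic_load_explicit( &ts_copy[0], memory_order_relaxed );
    for ( int i = 1; i < NQUEUES; i++ ) {
        uint64_t ts = atomic_load_explicit( &ts_copy[i], memory_order_relaxed );
        if ( ts < best_ts ) { best_ts = ts; best = i; }
    }
    return best;
}
```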
@@ 296 @@
  296  However, these delays are not specific to systems with multiple CPUs.

@@ 344-348 @@
  344  Therefore, the approach used in the \CFA scheduler is to have per-\proc sub-queues, but have an explicit data structure to track which cache substructure each sub-queue is tied to.
  345  This tracking requires some finesse because reading this data structure must lead to fewer cache misses than not having the data structure in the first place.
- 346  A key element however is that, like the timestamps for helping, reading the cache instance mapping only needs to give the correct result \emph{often enough}.
+ 346  A key element, however, is that, like the timestamps for helping, reading the cache instance mapping only needs to give the correct result \emph{often enough}.
  347  Therefore the algorithm can be built as follows: before enqueueing or dequeuing a \at, each \proc queries the CPU id and the corresponding cache instance.
  348  Since sub-queues are tied to \procs, each \proc can then update the cache instance mapped to the local sub-queue(s).
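The cache-instance tracking described above (query the CPU id, map it to a cache instance, update the mapping for the local sub-queue) might look roughly like this in C11. The topology table, sizes, and names are hypothetical; a real implementation would read the topology from the OS (for example, sysfs on Linux) and obtain the current CPU id at enqueue/dequeue time.

```c
#include <assert.h>
#include <stdatomic.h>

#define NQUEUES 8
#define NCPUS   8

/* Hypothetical topology table: which cache instance (e.g. an LLC)
   each CPU belongs to; read once at startup in a real system. */
static int cpu_to_cache[NCPUS] = { 0, 0, 0, 0, 1, 1, 1, 1 };

/* Cache instance currently associated with each sub-queue. Written
   with relaxed atomics: a stale read only degrades locality, never
   correctness, so the mapping only needs to be right often enough. */
static _Atomic int queue_cache[NQUEUES];

/* Called by the owning processor before enqueueing or dequeuing:
   record which cache instance its sub-queue is currently tied to. */
void update_queue_cache( int queue, int cpu ) {
    atomic_store_explicit( &queue_cache[queue], cpu_to_cache[cpu],
                           memory_order_relaxed );
}

/* A helping processor prefers sub-queues on its own cache instance. */
int is_local( int queue, int cpu ) {
    int c = atomic_load_explicit( &queue_cache[queue], memory_order_relaxed );
    return c == cpu_to_cache[cpu];
}
```

Keeping the mapping approximate is the design point: the read must cost less than the remote cache misses it avoids, which rules out any synchronization stronger than relaxed accesses.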