Changeset fc96890
- Timestamp: Sep 13, 2022, 3:07:25 PM (2 years ago)
- Branches: ADT, ast-experimental, master, pthread-emulation
- Children: 3606fe4
- Parents: 1c334d1
- Location: doc/theses/thierry_delisle_PhD/thesis/text
- Files: 9 edited
doc/theses/thierry_delisle_PhD/thesis/text/conclusion.tex
r1c334d1 rfc96890 2 2 3 3 Building the \CFA runtime has been a challenging project. 4 The work was divided between high-level concurrency design and a user-level threading runtime (Masters' s thesis), and low-level support of the user-level runtime using OS kernel-threading and its (multiple) I/O subsystems (Ph.D. thesis).4 The work was divided between high-level concurrency design and a user-level threading runtime (Masters' thesis), and low-level support of the user-level runtime using OS kernel threading and its (multiple) I/O subsystems (Ph.D. thesis). 5 5 Because I am the main developer for both components of this project, there is strong continuity across the design and implementation. 6 6 This continuity provides a consistent approach to advanced control flow and concurrency, with easier development, management and maintenance of the runtime in the future. 7 7 8 I believed my Masters' swork would provide the background to make the Ph.D. work reasonably straightforward.8 I believed my Masters' work would provide the background to make the Ph.D. work reasonably straightforward. 9 9 However, I discovered two significant challenges. 10 10 … … 20 20 Second, the kernel locking, threading, and I/O in the Linux operating system offer very little flexibility and are not designed to facilitate user-level threading. 21 21 There are multiple concurrency aspects in Linux that require carefully following a strict procedure to achieve acceptable performance. 22 To be fair, many of these concurrency aspects were designed 30-40 years ago, when there were few multi -processor computers and concurrency knowledge was just developing.22 To be fair, many of these concurrency aspects were designed 30-40 years ago, when there were few multiprocessor computers and concurrency knowledge was just developing. 23 23 Unfortunately, little has changed in the intervening years. 24 24 … … 26 26 The positive is that @io_uring@ supports the panoply of I/O mechanisms in Linux; 27 27 hence, the \CFA runtime uses one I/O mechanism to provide non-blocking I/O, rather than using @select@ to handle TTY I/O, @epoll@ to handle network I/O, and managing a thread pool to handle disk I/O. 28 Merging all these different I/Omechanisms into a coherent scheduling implementation would require much more work than what is present in this thesis, as well as detailed knowledge of multiple I/O mechanisms.28 Merging all these different \io mechanisms into a coherent scheduling implementation would require much more work than what is present in this thesis, as well as detailed knowledge of multiple I/O mechanisms. 29 29 The negative is that @io_uring@ is new and developing. 30 30 As a result, there is limited documentation, few places to find usage examples, and multiple errors that required workarounds. … … 75 75 76 76 \section{Future Work} 77 While the \CFA runtime achieves a better compromise ,than other schedulers, in terms of performance and fairness, I believe further improvements can be made to reduce or eliminate the few cases where performance does deteriorate.77 While the \CFA runtime achieves a better compromise than other schedulers, in terms of performance and fairness, I believe further improvements can be made to reduce or eliminate the few cases where performance does deteriorate. 78 78 Fundamentally, achieving performance and starvation freedom will always be goals with opposing needs even outside of scheduling algorithms. 
79 79 80 80 \subsection{Idle Sleep} 81 A difficult challenge, not fully addressed in this thesis, is idle -sleep.81 A difficult challenge, not fully addressed in this thesis, is idle sleep. 82 82 While a correct and somewhat low-cost idle-sleep mechanism is presented, several of the benchmarks show notable performance degradation when too few \ats are present in the system. 83 83 The idle sleep mechanism could therefore benefit from a reduction of spurious cases of sleeping. … … 88 88 \begin{itemize} 89 89 \item 90 The mechanism uses a hand -shake between notification and sleep to ensure that no \at is missed.90 The mechanism uses a handshake between notification and sleep to ensure that no \at is missed. 91 91 \item 92 The hand -shake correctness is critical when the last \proc goes to sleep but could be relaxed when several \procs are awake.92 The handshake correctness is critical when the last \proc goes to sleep but could be relaxed when several \procs are awake. 93 93 \item 94 94 Furthermore, organizing the sleeping \procs as a LIFO stack makes sense to keep cold \procs as cold as possible, but it might be more appropriate to attempt to keep cold CPU sockets instead. -
doc/theses/thierry_delisle_PhD/thesis/text/core.tex
r1c334d1 rfc96890 13 13 To match expectations, the design must offer the programmer sufficient guarantees so that, as long as they respect the execution mental model, the system also respects this model. 14 14 15 For threading, a simple and common execution mental model is the `` Ideal multi-tasking CPU'' :15 For threading, a simple and common execution mental model is the ``ideal multitasking CPU'' : 16 16 17 17 \begin{displayquote}[Linux CFS\cite{MAN:linux/cfs}] 18 {[The]} `` Ideal multi-tasking CPU'' is a (non-existent :-)) CPU that has 100\% physical power and which can run each task at precise equal speed, in parallel, each at [an equal fraction of the] speed. For example: if there are 2 running tasks, then it runs each at 50\% physical power --- i.e., actually in parallel.18 {[The]} ``ideal multi-tasking CPU'' is a (non-existent :-)) CPU that has 100\% physical power and which can run each task at precise equal speed, in parallel, each at [an equal fraction of the] speed. For example: if there are 2 running tasks, then it runs each at 50\% physical power --- i.e., actually in parallel. 19 19 \label{q:LinuxCFS} 20 20 \end{displayquote} … … 22 22 Applied to \ats, this model states that every ready \at immediately runs in parallel with all other ready \ats. While a strict implementation of this model is not feasible, programmers still have expectations about scheduling that come from this model. 23 23 24 In general, the expectation at the cent erof this model is that ready \ats do not interfere with each other but simply share the hardware.24 In general, the expectation at the centre of this model is that ready \ats do not interfere with each other but simply share the hardware. 25 25 This assumption makes it easier to reason about threading because ready \ats can be thought of in isolation and the effect of the scheduler can be virtually ignored. 26 26 This expectation of \at independence means the scheduler is expected to offer two guarantees: … … 31 31 32 32 It is important to note that these guarantees are expected only up to a point. 33 \Glspl{at} that are ready to run should not be prevented to doso, but they still share the limited hardware resources.33 \Glspl{at} that are ready to run should not be prevented from doing so, but they still share the limited hardware resources. 34 34 Therefore, the guarantee is considered respected if a \at gets access to a \emph{fair share} of the hardware resources, even if that share is very small. 35 35 … … 92 92 \subsubsection{Scalability} 93 93 The most basic performance challenge of a scheduler is scalability. 94 Given a large number of \procs and an even larger number of \ats, scalability measures how fast \procs can enqueue and dequeue s\ats.95 One could expect that doubling the number of \procs would double the rate at which \ats are dequeued, but contention on the internal data structure of the scheduler can lead to worstimprovements.94 Given a large number of \procs and an even larger number of \ats, scalability measures how fast \procs can enqueue and dequeue \ats. 95 One could expect that doubling the number of \procs would double the rate at which \ats are dequeued, but contention on the internal data structure of the scheduler can diminish the improvements. 96 96 While the ready queue itself can be sharded to alleviate the main source of contention, auxiliary scheduling features, \eg counting ready \ats, can also be sources of contention. 97 97 … … 108 108 The problem is a single point of contention when adding/removing \ats. 
109 109 As shown in the evaluation sections, most production schedulers do scale when adding \glspl{hthrd}. 110 The solution to this problem is to shard the ready queue: create multiple \emph{sub-queues} forming the logical ready 110 The solution to this problem is to shard the ready queue: create multiple \emph{sub-queues} forming the logical ready-queue and the sub-queues are accessed by multiple \glspl{hthrd} without interfering. 111 111 112 112 Before going into the design of \CFA's scheduler, it is relevant to discuss two sharding solutions that served as the inspiration scheduler in this thesis. … … 114 114 \subsection{Work-Stealing} 115 115 116 As mentioned in \ref{existing:workstealing}, a popular sharding approach for the ready -queue is work-stealing.116 As mentioned in \ref{existing:workstealing}, a popular sharding approach for the ready queue is work-stealing. 117 117 In this approach, each \gls{proc} has its own local sub-queue and \glspl{proc} only access each other's sub-queue if they run out of work on their local ready-queue. 118 118 The interesting aspect of work stealing happens in the steady-state scheduling case, \ie all \glspl{proc} have work and no load balancing is needed. … … 134 134 Timestamps are added to each element of a sub-queue. 135 135 \item 136 A \gls{proc} randomly tests ready -queues until it has acquired one or two queues.136 A \gls{proc} randomly tests ready queues until it has acquired one or two queues. 137 137 \item 138 138 If two queues are acquired, the older of the two \ats is dequeued from the front of the acquired queues. … … 246 246 247 247 A simple solution to this problem is to use an exponential moving average\cite{wiki:ma} (MA) instead of a raw timestamp, as shown in Figure~\ref{fig:base-ma}. 248 Note ,that this is more complex because the \at at the head of a sub-queue is still waiting, so its wait time has not ended.248 Note that this is more complex because the \at at the head of a sub-queue is still waiting, so its wait time has not ended. 249 249 Therefore, the exponential moving average is an average of how long each dequeued \at has waited. 250 250 To compare sub-queues, the timestamp at the head must be compared to the current time, yielding the best-case wait time for the \at at the head of the queue. … … 259 259 Conversely, the active sub-queues do not benefit much from helping since starvation is already a non-issue. 260 260 This puts this algorithm in the awkward situation of paying for a largely unnecessary cost. 261 The good news is that this problem can be mitigated 261 The good news is that this problem can be mitigated. 262 262 263 263 \subsection{Redundant Timestamps}\label{relaxedtimes} … … 279 279 \input{base_ts2.pstex_t} 280 280 \caption[\CFA design with Redundant Timestamps]{\CFA design with Redundant Timestamps \smallskip\newline An array is added containing a copy of the timestamps. 281 These timestamps are written 281 These timestamps are written-to with relaxed atomics, so there is no order among concurrent memory accesses, leading to fewer cache invalidations.} 282 282 \label{fig:base-ts2} 283 283 \end{figure} … … 292 292 With redundant timestamps, this scheduling algorithm achieves both the fairness and performance requirements on most machines. 293 293 The problem is that the cost of polling and helping is not necessarily consistent across each \gls{hthrd}. 
294 For example ,on machines with a CPU containing multiple hyper threads and cores and multiple CPU sockets, cache misses can be satisfied from the caches on the same (local) CPU, or by a CPU on a different (remote) socket.294 For example on machines with a CPU containing multiple hyper threads and cores and multiple CPU sockets, cache misses can be satisfied from the caches on the same (local) CPU, or by a CPU on a different (remote) socket. 295 295 Cache misses satisfied by a remote CPU have significantly higher latency than from the local CPU. 296 296 However, these delays are not specific to systems with multiple CPUs. … … 344 344 Therefore, the approach used in the \CFA scheduler is to have per-\proc sub-queues, but have an explicit data structure to track which cache substructure each sub-queue is tied to. 345 345 This tracking requires some finesse because reading this data structure must lead to fewer cache misses than not having the data structure in the first place. 346 A key element howeveris that, like the timestamps for helping, reading the cache instance mapping only needs to give the correct result \emph{often enough}.346 A key element, however, is that, like the timestamps for helping, reading the cache instance mapping only needs to give the correct result \emph{often enough}. 347 347 Therefore the algorithm can be built as follows: before enqueueing or dequeuing a \at, each \proc queries the CPU id and the corresponding cache instance. 348 348 Since sub-queues are tied to \procs, each \proc can then update the cache instance mapped to the local sub-queue(s). -
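To make the redundant-timestamp and moving-average ideas from core.tex above concrete, here is a minimal C11 sketch of how a per-sub-queue timestamp copy could be published with relaxed atomics and how the wait times of dequeued tasks could be folded into an exponential moving average. The names (ts_copy, ema_wait, ALPHA, NQUEUES) and the averaging weight are assumptions for the illustration, not the \CFA runtime's actual code.

#include <stdatomic.h>
#include <stdint.h>

#define NQUEUES 64      // number of sub-queues (placeholder)
#define ALPHA   0.5     // weight of the newest sample in the moving average (assumed)

// Redundant copy of each sub-queue's head timestamp.  It is written and read
// with relaxed atomics: a stale value only delays helping slightly, which is
// acceptable because helping must be right only "often enough".
static _Atomic uint64_t ts_copy[NQUEUES];

// Exponential moving average of how long dequeued tasks waited, one per
// sub-queue, written only by the owning processor.
static double ema_wait[NQUEUES];

// Owner publishes the enqueue timestamp of the new head whenever it changes.
static void publish_head_timestamp(int q, uint64_t head_enqueue_ts) {
    atomic_store_explicit(&ts_copy[q], head_enqueue_ts, memory_order_relaxed);
}

// Owner folds the wait time of each dequeued task into the moving average.
static void record_dequeue(int q, uint64_t enqueue_ts, uint64_t now) {
    uint64_t waited = now - enqueue_ts;
    ema_wait[q] = ALPHA * (double)waited + (1.0 - ALPHA) * ema_wait[q];
}

// A helping processor estimates the best-case wait time of queue q's head by
// comparing the (possibly stale) copied timestamp to the current time.
static uint64_t head_wait_estimate(int q, uint64_t now) {
    uint64_t ts = atomic_load_explicit(&ts_copy[q], memory_order_relaxed);
    return now > ts ? now - ts : 0;
}

Because the copies are consulted only to decide whether helping is warranted, relaxed ordering avoids the cache invalidations that stronger atomics would cause, matching the rationale given in the hunk above.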
doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex
r1c334d1 rfc96890 13 13 Memcached~\cite{memcached} is an in-memory key-value store used in many production environments, \eg \cite{atikoglu2012workload}. 14 14 The Memcached server is so popular there exists a full-featured front-end for performance testing, called @mutilate@~\cite{GITHUB:mutilate}. 15 Experimenting on Memcached allows for a simple test of the \CFA runtime as a whole, exercising the scheduler, the idle-sleep mechanism, as well the \io subsystem for sockets.15 Experimenting on Memcached allows for a simple test of the \CFA runtime as a whole, exercising the scheduler, the idle-sleep mechanism, as well as the \io subsystem for sockets. 16 16 Note that this experiment does not exercise the \io subsystem with regard to disk operations because Memcached is an in-memory server. 17 17 … … 48 48 For UDP connections, all the threads listen to a single UDP socket for incoming requests. 49 49 Threads that are not currently dealing with another request ignore the incoming packet. 50 One of the remaining, non busy, threads reads the request and sends the response.50 One of the remaining, non-busy, threads reads the request and sends the response. 51 51 This implementation can lead to increased CPU \gls{load} as threads wake from sleep to potentially process the request. 52 52 \end{itemize} … … 110 110 Again, each experiment is run 15 times with the median, maximum and minimum plotted with different lines. 111 111 As expected, the latency starts low and increases as the server gets close to saturation, at which point, the latency increases dramatically because the web servers cannot keep up with the connection rate so client requests are disproportionally delayed. 112 Because of this dramatic increase, the Y 112 Because of this dramatic increase, the Y-axis is presented using a log scale. 113 113 Note that the graph shows the \emph{target} query rate, the actual response rate is given in Figure~\ref{fig:memcd:rate:qps} as this is the same underlying experiment. 114 114 … … 117 117 Vanilla Memcached achieves the lowest latency until 600K, after which all the web servers are struggling to respond to client requests. 118 118 \CFA begins to decline at 600K, indicating some bottleneck after saturation. 119 Overall, all three web servers achieve micro -second latencies and the increases in latency mostly follow each other.119 Overall, all three web servers achieve microsecond latencies and the increases in latency mostly follow each other. 120 120 121 121 \subsection{Update rate} … … 269 269 \begin{itemize} 270 270 \item 271 A client runs a 2.6.11-1 SMP Linux kernel, which permits each client load -generator to run on a separate CPU.271 A client runs a 2.6.11-1 SMP Linux kernel, which permits each client load generator to run on a separate CPU. 272 272 \item 273 273 It has two 2.8 GHz Xeon CPUs, and four one-gigabit Ethernet cards. … … 285 285 To measure web server throughput, the server computer is loaded with 21,600 files, sharded across 650 directories, occupying about 2.2GB of disk, distributed over the server's RAID-5 4-drives to achieve high throughput for disk I/O. 286 286 The clients run httperf~\cite{httperf} to request a set of static files. 287 The httperf load -generator is used with session files to simulate a large number of users and to implement a partially open-loop system.287 The httperf load generator is used with session files to simulate a large number of users and to implement a partially open-loop system. 
288 288 This permits httperf to produce overload conditions, generate multiple requests from persistent HTTP/1.1 connections, and include both active and inactive off periods to model browser processing times and user think times~\cite{Barford98}. 289 289 290 290 The experiments are run with 16 clients, each running a copy of httperf (one copy per CPU), requiring a set of 16 log files with requests conforming to a Zipf distribution. 291 291 This distribution is representative of users accessing static data through a web browser. 292 Each request reads a file name from its trace, establishes a connection, performs an HTTP get-request for the file name, receives the file data, closes the connection, and repeats the process.292 Each request reads a file name from its trace, establishes a connection, performs an HTTP GET request for the file name, receives the file data, closes the connection, and repeats the process. 293 293 Some trace elements have multiple file names that are read across a persistent connection. 294 294 A client times out if the server does not complete a request within 10 seconds. -
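The UDP handling described above for memcached (every worker blocks on the one shared UDP socket, and whichever idle thread the kernel wakes services the request) can be sketched with plain pthreads. This is an illustration of the pattern only, not memcached's code; the port, buffer size and thread count are placeholder values.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <sys/socket.h>

static int udp_fd;   // single UDP socket shared by all workers

static void *worker(void *arg) {
    (void)arg;
    char buf[1500];
    for (;;) {
        struct sockaddr_in client;
        socklen_t len = sizeof(client);
        // busy workers are simply not parked here, so they never see the packet
        ssize_t n = recvfrom(udp_fd, buf, sizeof(buf), 0,
                             (struct sockaddr *)&client, &len);
        if (n < 0) continue;
        // ... parse the request, look up the key, build the reply in buf ...
        sendto(udp_fd, buf, (size_t)n, 0, (struct sockaddr *)&client, len);
    }
    return NULL;
}

int main(void) {
    udp_fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(11211),   // memcached's default port
                                .sin_addr.s_addr = htonl(INADDR_ANY) };
    bind(udp_fd, (struct sockaddr *)&addr, sizeof(addr));
    pthread_t tid[4];
    for (int i = 0; i < 4; i++) pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(tid[i], NULL);
    return 0;
}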
doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex
r1c334d1 rfc96890 4 4 This chapter presents five different experimental setups for evaluating the basic features of the \CFA, libfibre~\cite{libfibre}, Go, and Tokio~\cite{Tokio} schedulers. 5 5 All of these systems have a \gls{uthrding} model. 6 The goal of this chapter is show that the \CFA scheduler obtains equivalent performance to other less fairschedulers through the different experiments.6 The goal of this chapter is to show that the \CFA scheduler obtains equivalent performance to other, less fair, schedulers through the different experiments. 7 7 Note that only the code of the \CFA tests is shown; 8 8 all tests in the other systems are functionally identical and available online~\cite{GITHUB:SchedulingBenchmarks}. … … 48 48 49 49 For each experiment, four graphs are generated showing traditional throughput on the top row and \newterm{scalability} or \newterm{latency} on the bottom row (peek ahead to Figure~\ref{fig:cycle:jax}). 50 Scalability uses the same data as throughput but the Y 50 Scalability uses the same data as throughput but the Y-axis is calculated as the number of \procs over the throughput. 51 51 In this representation, perfect scalability should appear as a horizontal line, \eg, if doubling the number of \procs doubles the throughput, then the relation stays the same. 52 52 … … 171 171 \CFA, Go and Tokio all obtain effectively the same throughput performance. 172 172 Libfibre is slightly behind in this case but still scales decently. 173 As a result of the \gls{kthrd} placement, additional \procs from 25 to 48 offer less performance improvement for all runtimes, which can be seen as a flatt ing of the line.173 As a result of the \gls{kthrd} placement, additional \procs from 25 to 48 offer less performance improvement for all runtimes, which can be seen as a flattening of the line. 174 174 This effect even causes a decrease in throughput in libfibre's case. 175 175 As expected, this pattern repeats between \proc count 72 and 96. … … 180 180 This decrease in performance is likely due to the additional overhead of the idle-sleep mechanism. 181 181 This can either be the result of \procs actually running out of work or simply additional overhead from tracking whether or not there is work available. 182 Indeed, unlike the left column, it is likely that the ready -queue is transiently empty, which likely triggers additional synchronization steps.182 Indeed, unlike the left column, it is likely that the ready queue is transiently empty, which likely triggers additional synchronization steps. 183 183 Interestingly, libfibre achieves better performance with 1 cycle. 184 184 … … 196 196 Clearly, the pathological case with 1 cycle per \proc can affect fairness algorithms managing mostly idle processors, \eg \CFA, but only at high core counts. 197 197 In this case, \emph{any} helping is likely to cause a cascade of \procs running out of work and attempting to steal. 198 For this experiment, the \CFA scheduler has achieved the goal of obtaining equivalent performance to other less fairschedulers.198 For this experiment, the \CFA scheduler has achieved the goal of obtaining equivalent performance to other, less fair, schedulers. 199 199 200 200 \section{Yield} … … 258 258 Figures~\ref{fig:yield:jax} and \ref{fig:yield:nasus} show the results for the yield experiment on Intel and AMD, respectively. 259 259 Looking at the left column on Intel, Figures~\ref{fig:yield:jax:ops} and \ref{fig:yield:jax:ns} show the results for 100 \ats for each \proc. 
260 Note , taht the Y-axis on this graph is twice as large as the Intel cycle graph.260 Note that the Y-axis on this graph is twice as large as the Intel cycle graph. 261 261 A visual glance between the left columns of the cycle and yield graphs confirms my claim that the yield benchmark is unreliable. 262 262 \CFA has no special handling for @yield@, but this experiment requires less synchronization than the @cycle@ experiment. 263 263 Hence, the @yield@ throughput and scalability graphs have similar shapes to the corresponding @cycle@ graphs. 264 264 The only difference is slightly better performance for @yield@ because of less synchronization. 265 Libfibre has special handling for @yield@ using the fact that the number of ready fibres does not change, and therefore, by -passing the idle-sleep mechanism entirely.265 Libfibre has special handling for @yield@ using the fact that the number of ready fibres does not change, and therefore, bypassing the idle-sleep mechanism entirely. 266 266 Hence, libfibre behaves very differently in the cycle and yield benchmarks, with a 4 times increase in performance on the left column. 267 267 Go has special handling for @yield@ by putting a yielding goroutine on a secondary global ready-queue, giving it a lower priority. … … 330 330 331 331 The Cycle and Yield benchmarks represent an \emph{easy} scenario for a scheduler, \eg an embarrassingly parallel application. 332 In these benchmarks ,\ats can be easily partitioned over the different \procs upfront and none of the \ats communicate with each other.332 In these benchmarks \ats can be easily partitioned over the different \procs upfront and none of the \ats communicate with each other. 333 333 334 334 The Churn benchmark represents more chaotic executions, where there is more communication among \ats but no relationship between the last \proc on which a \at ran and blocked, and the \proc that subsequently unblocks it. … … 486 486 487 487 In conclusion, the objective of this benchmark is to demonstrate that \glslink{atsched}{unparking} \ats from remote \procs does not cause too much contention on the local queues. 488 Indeed, the fact that most runtimes achieve some scaling between various \proc count demonstratemigrations do not need to be serialized.488 Indeed, the fact that most runtimes achieve some scaling between various \proc counts demonstrates migrations do not need to be serialized. 489 489 Again these results demonstrate that \CFA achieves satisfactory performance compared to the other runtimes. 490 490 … … 497 497 In the noshare variation, the array is not passed on and each thread continuously accesses its private array. 498 498 In the share variation, the array is passed to another thread via the semaphore's shadow queue (each blocking thread can save a word of user data in its blocking node), transferring ownership of the array to the woken thread. 499 Figure~\ref{fig:locality:code} shows the pseudo -code for this benchmark.499 Figure~\ref{fig:locality:code} shows the pseudo code for this benchmark. 500 500 501 501 The objective here is to highlight the different decisions made by the runtime when \glslink{atsched}{unparking}. … … 567 567 \CFA and Tokio slightly outperform libfibre, as expected, based on their \ats placement approach. 568 568 \CFA and Tokio both \unpark locally and do not suffer cache misses on the transferred array. 
569 Libfibre on the other handunparks remotely, and as such the unparked \at is likely to miss on the shared data.569 Libfibre, on the other hand, unparks remotely, and as such the unparked \at is likely to miss on the shared data. 570 570 Go trails behind in this experiment, presumably for the same reasons that were observable in the churn benchmark. 571 571 Otherwise, the results are similar to the churn benchmark, with lower throughput due to the array processing. … … 645 645 Again the overall performance is higher and slightly more variation is visible. 646 646 Looking at the left column first, Figures~\ref{fig:locality:nasus:share:ops} and \ref{fig:locality:nasus:share:ns}, \CFA and Tokio still outperform libfibre, this time more significantly. 647 This advantage is expected from the AMD server with its smaller and more narrowcaches that magnify the costs of processing the array.647 This advantage is expected from the AMD server with its smaller and narrower caches that magnify the costs of processing the array. 648 648 Go still has the same poor performance as on Intel. 649 649 … … 725 725 The semaphore version is an approximation of strictly FIFO scheduling, where none of the \ats \emph{attempt} to run more than once. 726 726 The benchmark effectively provides the fairness guarantee in this case. 727 In the yielding version, however the benchmark provides no such guarantee, which means the scheduler has full responsibility and any unfairness is measurable.727 In the yielding version, however, the benchmark provides no such guarantee, which means the scheduler has full responsibility and any unfairness is measurable. 728 728 729 729 While this is an artificial scenario, in real life it requires only a few simple pieces. -
doc/theses/thierry_delisle_PhD/thesis/text/existing.tex
r1c334d1 rfc96890 5 5 6 6 In general, \emph{selecting} a scheduling algorithm depends on how much information is available to the scheduler. 7 Workloads that are well -known, consistent, and homogeneous can benefit from a scheduler that is optimized to use this information, while ill-defined, inconsistent, heterogeneous workloads require general non-optimal algorithms.7 Workloads that are well known, consistent, and homogeneous can benefit from a scheduler that is optimized to use this information, while ill-defined, inconsistent, heterogeneous workloads require general non-optimal algorithms. 8 8 A secondary aspect is how much information can be gathered versus how much information must be given as part of the scheduler input. 9 This information adds to the spectrum of scheduling algorithms, going from static schedulers that are well -informed from the start, to schedulers that gather most of the information needed, to schedulers that can only rely on very limited information.9 This information adds to the spectrum of scheduling algorithms, going from static schedulers that are well informed from the start, to schedulers that gather most of the information needed, to schedulers that can only rely on very limited information. 10 10 Note, this description includes both information about each request, \eg time to complete or resources needed, and information about the relationships among requests, \eg whether some requests must be completed before another request starts. 11 11 … … 77 77 In general, fine granularity is better for load balancing and coarse granularity reduces communication overhead. 78 78 The best performance generally means finding a middle ground between the two. 79 Several methods can be employed, but I believe these are less relevant for threads, which are generally explicit and more coarse 79 Several methods can be employed, but I believe these are less relevant for threads, which are generally explicit and more coarse-grained. 80 80 81 81 \paragraph{Task Placement} Another aspect of work stealing that has been studied extensively is the mapping between \at and \proc. … … 97 97 \cite{DBLP:conf/ipps/ColeR13} also studied how randomized work-stealing affects false sharing among \ats. 98 98 99 However, as \cite{DBLP:journals/ijpp/YangH18} highlights, it is worth mentioning that this theoretical research has mainly focused on ``fully -strict'' computations, \ie workloads that can be fully represented with a direct acyclic graph.100 It is unclear how well these distributions represent workloads in real 99 However, as \cite{DBLP:journals/ijpp/YangH18} highlights, it is worth mentioning that this theoretical research has mainly focused on ``fully strict'' computations, \ie workloads that can be fully represented with a direct acyclic graph. 100 It is unclear how well these distributions represent workloads in real-world scenarios. 101 101 102 102 \section{Preemption} … … 115 115 116 116 \subsection{Operating System Schedulers} 117 Operating System Schedulers tend to be fairly complex as they generally support some amount of real -time, aim to balance interactive and non-interactive \ats and support multiple users sharing hardware without requiring these users to cooperate.117 Operating System Schedulers tend to be fairly complex as they generally support some amount of real time, aim to balance interactive and non-interactive \ats and support multiple users sharing hardware without requiring these users to cooperate. 
118 118 Here are more details on a few schedulers used in the common operating systems: Linux, FreeBSD, Microsoft Windows and Apple's OS X. 119 119 The information is less complete for closed source operating systems. … … 130 130 The issues highlighted stem from Linux's need to support fairness across \ats \emph{and} across users\footnote{Enforcing fairness across users means that given two users, one with a single \at and the other with one thousand \ats, the user with a single \at does not receive one-thousandth of the CPU time.}, increasing the complexity. 131 131 132 Linux also offers a FIFO scheduler, a real-time scheduler, which runs the highest-priority \ats, and a round-robin scheduler, which is an extension of the FIFO -scheduler that adds fixed time slices. \cite{MAN:linux/sched}132 Linux also offers a FIFO scheduler, a real-time scheduler, which runs the highest-priority \ats, and a round-robin scheduler, which is an extension of the FIFO scheduler that adds fixed time slices. \cite{MAN:linux/sched} 133 133 134 134 \paragraph{FreeBSD} … … 144 144 145 145 In~\cite{russinovich2009windows}, Chapter 1 section 2.3 ``Processes, Threads, and Jobs'' discusses the scheduling policy more in-depth. 146 Multicore scheduling is based on a combination of priorities and \proc prefer ance.146 Multicore scheduling is based on a combination of priorities and \proc preference. 147 147 Each \at is assigned an initial processor using a round-robin policy, called the \at's \newterm{ideal} \proc. 148 148 \Glspl{at} are distributed among the \procs according to their priority, preferring to match \ats to their ideal \proc and then to the last \proc they ran on. … … 195 195 \item The task returned by \textit{t}@.execute()@ 196 196 \item The successor of t if \textit{t} was its last completed predecessor. 197 \item A task popped from the end of the thread's own deque.197 \item A task popped from the end of the thread's own queue. 198 198 \item A task with an affinity for the thread. 199 199 \item A task popped from approximately the beginning of the shared queue. 200 \item A task popped from the beginning of another randomly chosen thread's deque.200 \item A task popped from the beginning of another randomly chosen thread's queue. 201 201 \end{enumerate} 202 202 -
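The randomized victim selection at the heart of the work-stealing schedulers surveyed above can be summarized in a short C sketch. The helpers task_t, pop_local, steal_from and the constant NPROCS are assumptions for the illustration and do not correspond to any particular runtime's API.

#include <stdlib.h>

typedef struct task task_t;
extern task_t *pop_local(int self);      // pop from this processor's own queue (assumed helper)
extern task_t *steal_from(int victim);   // try to steal from another queue (assumed helper)
#define NPROCS 16                        // number of processors (placeholder)

// A processor first drains its own queue, and only when that is empty does it
// probe randomly chosen victims, as in the steady-state case described above.
task_t *next_task(int self) {
    task_t *t = pop_local(self);
    if (t) return t;
    for (int tries = 0; tries < 4 * NPROCS; tries++) {
        int victim = rand() % NPROCS;
        if (victim == self) continue;
        t = steal_from(victim);
        if (t) return t;
    }
    return NULL;   // no work found; the caller may go to idle sleep
}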
doc/theses/thierry_delisle_PhD/thesis/text/intro.tex
r1c334d1 rfc96890 3 3 \Gls{uthrding} (M:N) is gaining popularity over kernel-level threading (1:1) in many programming languages. 4 4 The user threading approach is often a better mechanism to express complex concurrent applications by efficiently running 10,000+ threads on multicore systems. 5 Indeed, over-partitioning into small work -units with user threading significantly eases load bal\-ancing, while simultaneously providing advanced synchronization and mutual exclusion capabilities.5 Indeed, over-partitioning into small work units with user threading significantly eases load bal\-ancing, while simultaneously providing advanced synchronization and mutual exclusion capabilities. 6 6 To manage these high levels of concurrency, the underlying runtime must efficiently schedule many user threads across a few kernel threads; 7 7 which begs the question of how many kernel threads are needed and should the number be dynamically reevaluated. … … 11 11 otherwise, other user threads can experience short/long term starvation or kernel threads can deadlock waiting for events to occur on busy kernel threads. 12 12 13 This thesis analy ses multiple scheduler systems, where each system attempts to fulfill the requirements for \gls{uthrding}.14 The predominant technique for managing high levels of concurrency is sharding the ready queue with one queue per kernel -thread and using some form of work stealing/sharing to dynamically rebalance workload shifts.13 This thesis analyzes multiple scheduler systems, where each system attempts to fulfill the requirements for \gls{uthrding}. 14 The predominant technique for managing high levels of concurrency is sharding the ready queue with one queue per kernel thread and using some form of work stealing/sharing to dynamically rebalance workload shifts. 15 15 Preventing kernel blocking is accomplished by transforming kernel locks and I/O operations into user-level operations that do not block the kernel thread or spin up new kernel threads to manage the blocking. 16 Fairness is handled through preemption and/or ad -hoc solutions, which leads to coarse-grained fairness with some pathological cases.16 Fairness is handled through preemption and/or ad hoc solutions, which leads to coarse-grained fairness with some pathological cases. 17 17 18 After examining, testing and selecting specific approaches to these scheduling issues, a completely new scheduler was created and tested in the \CFA (C-for-all) user-threading runtime -system.18 After examining, testing and selecting specific approaches to these scheduling issues, a completely new scheduler was created and tested in the \CFA (C-for-all) user-threading runtime system. 19 19 The goal of the new scheduler is to offer increased safety and productivity without sacrificing performance. 20 20 The quality of the new scheduler is demonstrated by comparing it with other user-threading work-stealing schedulers with, the aim of showing equivalent or better performance while offering better fairness. … … 35 35 In general, work units without threads, like routines and coroutines, are self-scheduling, while work units with threads, like tasks and programs, are scheduled. 36 36 For scheduled work-units, a scheduler takes a sequence of threads and attempts to run them to completion, subject to shared resource restrictions and utilization. 
37 A general-purpose dynamic-scheduler for an open systemcannot anticipate work requests, so its performance is rarely optimal.37 In an open system, a general-purpose dynamic scheduler cannot anticipate work requests, so its performance is rarely optimal. 38 38 Even with complete knowledge of arrival order and work, creating an optimal solution is a bin packing problem~\cite{wiki:binpak}. 39 39 However, optimal solutions are often not required: schedulers often produce excellent solutions, without needing optimality, by taking advantage of regularities in work patterns. … … 53 53 timer alarm for preemption (running $\rightarrow$ ready) 54 54 \item 55 long 55 long-term delay versus spinning (running $\rightarrow$ blocked) 56 56 \item 57 57 completion of delay, \eg network or I/O completion (blocked $\rightarrow$ ready) … … 84 84 85 85 \noindent 86 Essentially, all multi -processor computers have non-uniform memory access (NUMA), with one or more quantized steps to access data at different levels in the memory hierarchy.86 Essentially, all multiprocessor computers have non-uniform memory access (NUMA), with one or more quantized steps to access data at different levels in the memory hierarchy. 87 87 When a system has a large number of independently executing threads, affinity becomes difficult because of \newterm{thread churn}. 88 88 That is, threads must be scheduled on different processors to obtain high processor utilization because the number of threads $\ggg$ processors. … … 120 120 To complete the scheduler, an idle sleep mechanism is implemented that significantly reduces wasted CPU cycles, which are then available outside the application. 121 121 122 As a research project, this work builds exclusively on newer versions of the Linux operating -system and gcc/clang compilers.122 As a research project, this work builds exclusively on newer versions of the Linux operating system and gcc/clang compilers. 123 123 The new scheduler implementation uses several optimizations to successfully balance the cost of fairness against performance; 124 124 some of these optimizations rely on interesting hardware optimizations only present on modern CPUs. 125 The \io implementation is based on the @io_uring@ kernel -interface, a recent addition to the Linux kernel, because it purports to handle nonblocking \emph{file} and network \io.125 The \io implementation is based on the @io_uring@ kernel interface, a recent addition to the Linux kernel, because it purports to handle nonblocking \emph{file} and network \io. 126 126 This decision allowed an interesting performance and fairness comparison with other threading systems using @select@, @epoll@, \etc. 127 127 While the current \CFA release supports older versions of Linux ($\ge$~Ubuntu 16.04) and gcc/clang compilers ($\ge$~gcc 6.0), it is not the purpose of this project to find workarounds in these older systems to provide backwards compatibility. … … 129 129 130 130 \section{Contributions}\label{s:Contributions} 131 This work provides the following scheduling contributions for advanced \gls{uthrding} runtime -systems:131 This work provides the following scheduling contributions for advanced \gls{uthrding} runtime systems: 132 132 \begin{enumerate}[leftmargin=*] 133 133 \item … … 140 140 A mechanism for adding fairness on top of MQMS algorithm through helping, used both for scalable scheduling algorithm and the user-level \glsxtrshort{io}. 
141 141 \item 142 An optimization of the helping -mechanism for load balancing to reduce scheduling costs.142 An optimization of the helping mechanism for load balancing to reduce scheduling costs. 143 143 \item 144 144 An optimization for the alternative relaxed-list for load balancing to reduce scheduling costs in embarrassingly parallel cases. -
doc/theses/thierry_delisle_PhD/thesis/text/io.tex
r1c334d1 rfc96890 53 53 Its interface lets programmers enqueue operations to be performed asynchronously by the kernel. 54 54 Completions of these operations can be communicated in various ways: either by spawning a new \gls{kthrd}, sending a Linux signal, or polling for completion of one or more operations. 55 For this work, spawning a new \gls{kthrd} is counter -productive but a related solution is discussed in Section~\ref{io:morethreads}.55 For this work, spawning a new \gls{kthrd} is counterproductive but a related solution is discussed in Section~\ref{io:morethreads}. 56 56 Using interrupt handlers can also lead to fairly complicated interactions between subsystems and has a non-trivial cost. 57 57 Leaving polling for completion, which is similar to the previous system calls. … … 100 100 101 101 \subsection{Extra Kernel Threads}\label{io:morethreads} 102 Finally, if the operating system does not offer a satisfactory form of asynchronous \io operations, an ad -hoc solution is to create a pool of \glspl{kthrd} and delegate operations to it to avoid blocking \glspl{proc}, which is a compromise for multiplexing.102 Finally, if the operating system does not offer a satisfactory form of asynchronous \io operations, an ad hoc solution is to create a pool of \glspl{kthrd} and delegate operations to it to avoid blocking \glspl{proc}, which is a compromise for multiplexing. 103 103 In the worst case, where all \ats are consistently blocking on \io, it devolves into 1-to-1 threading. 104 104 However, regardless of the frequency of \io operations, it achieves the fundamental goal of not blocking \glspl{proc} when \ats are ready to run. … … 290 290 291 291 Allocation failures need to be pushed to a routing algorithm: \ats attempting \io operations must not be directed to @io_uring@ instances without sufficient SQEs available. 292 Furthermore, the routing algorithm should block operations up -front,if none of the instances have available SQEs.292 Furthermore, the routing algorithm should block operations upfront if none of the instances have available SQEs. 293 293 294 294 Once an SQE is allocated, \ats insert the \io request information and keep track of the SQE index and the instance it belongs to. 295 295 296 Once an SQE is filled in, it is added to the submission ring buffer, an operation that is not thread -safe, and then the kernel must be notified using the @io_uring_enter@ system call.296 Once an SQE is filled in, it is added to the submission ring buffer, an operation that is not thread safe, and then the kernel must be notified using the @io_uring_enter@ system call. 297 297 The submission ring buffer is the same size as the pre-allocated SQE buffer, therefore pushing to the ring buffer cannot fail because it would mean an SQE multiple times in the ring buffer, which is undefined behaviour. 298 298 However, as mentioned, the system call itself can fail with the expectation that it can be retried once some submitted operations are complete. … … 307 307 Once the system call is done, the submitter must also free SQEs so that the allocator can reuse them. 308 308 309 Finally, the completion side is much simpler since the @io_uring@ system -call enforces a natural synchronization point.309 Finally, the completion side is much simpler since the @io_uring@ system call enforces a natural synchronization point. 310 310 Polling simply needs to regularly do the system call, go through the produced CQEs and communicate the result back to the originating \ats. 
311 311 Since CQEs only own a signed 32-bit result, in addition to the copy of the @user_data@ field, all that is needed to communicate the result is a simple future~\cite{wiki:future}. … … 368 368 369 369 \paragraph{Pending Allocations} are handled by the arbiter when it has available instances and can directly hand over the instance and satisfy the request. 370 Otherwise, it must hold on to the list of threads until SQEs are made available again.370 Otherwise, it must hold on to the list of threads until SQEs are made available again. 371 371 This handling is more complex when an allocation requires multiple SQEs, since the arbiter must make a decision between satisfying requests in FIFO ordering or for fewer SQEs. 372 372 … … 383 383 The last important part of the \io subsystem is its interface. 384 384 Multiple approaches can be offered to programmers, each with advantages and disadvantages. 385 The new \io subsystem can replace the C runtime API or extend it, and in the lat er case, the interface can go from very similar to vastly different.385 The new \io subsystem can replace the C runtime API or extend it, and in the latter case, the interface can go from very similar to vastly different. 386 386 The following sections discuss some useful options using @read@ as an example. 387 The standard Linux interface for C is 387 The standard Linux interface for C is: 388 388 \begin{cfa} 389 389 ssize_t read(int fd, void *buf, size_t count); … … 394 394 The goal is to convince the compiler and linker to replace any calls to @read@ to direct them to the \CFA implementation instead of glibc's. 395 395 This rerouting has the advantage of working transparently and supporting existing binaries without needing recompilation. 396 It also offers a, presumably, well -known and familiar API that C programmers can simply continue to work with.396 It also offers a, presumably, well known and familiar API that C programmers can simply continue to work with. 397 397 However, this approach also entails a plethora of subtle technical challenges, which generally boils down to making a perfect replacement. 398 398 If the \CFA interface replaces only \emph{some} of the calls to glibc, then this can easily lead to esoteric concurrency bugs. -
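The allocate/fill/submit/complete cycle described in io.tex maps closely onto liburing's wrappers around the raw @io_uring@ interface. The sketch below performs a single read in that style; the \CFA runtime instead multiplexes many user-level threads over shared instances and hands results back through futures, so this is only a one-operation illustration, and the file name is a placeholder.

#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    struct io_uring ring;
    if (io_uring_queue_init(32, &ring, 0) < 0) return 1;    // pre-allocates 32 SQEs

    int fd = open("/etc/hostname", O_RDONLY);                // placeholder file
    if (fd < 0) return 1;
    char buf[256];

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);      // NULL here is the SQE "allocation failure" case above
    if (!sqe) return 1;
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
    io_uring_sqe_set_data(sqe, buf);                         // user_data routed back on completion
    io_uring_submit(&ring);                                  // the io_uring_enter(2) notification

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);                          // the natural synchronization point
    printf("read returned %d\n", cqe->res);                  // signed 32-bit result, as noted above
    io_uring_cqe_seen(&ring, cqe);                           // release the CQE slot

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}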
doc/theses/thierry_delisle_PhD/thesis/text/practice.tex
r1c334d1 rfc96890 55 55 A simpler approach is to use a \newterm{Readers-Writer Lock}~\cite{wiki:rwlock}, where the resizing requires acquiring the lock as a writer while simply enqueueing/dequeuing \ats requires acquiring the lock as a reader. 56 56 Using a Readers-Writer lock solves the problem of dynamically resizing and leaves the challenge of finding or building a lock with sufficient good read-side performance. 57 Since this approach is not a very complex challenge and an ad -hoc solution is perfectly acceptable, building a Readers-Writer lock was the path taken.57 Since this approach is not a very complex challenge and an ad hoc solution is perfectly acceptable, building a Readers-Writer lock was the path taken. 58 58 59 59 To maximize reader scalability, readers should not contend with each other when attempting to acquire and release a critical section. … … 61 61 The read-acquire possibly waits for a writer to finish the critical section and then acquires a reader's local spinlock. 62 62 The writer acquires the global lock, guaranteeing mutual exclusion among writers, and then acquires each of the local reader locks. 63 Acquiring all the local read 63 Acquiring all the local read-locks guarantees mutual exclusion among the readers and the writer, while the wait on the read side prevents readers from continuously starving the writer. 64 64 Figure~\ref{f:SpecializedReadersWriterLock} shows the outline for this specialized readers-writer lock. 65 65 The lock in nonblocking, so both readers and writers spin while the lock is held. … … 113 113 Therefore, the \CFA scheduler simply follows the ``Race-to-Idle''~\cite{Albers12} approach where a sleeping \proc is woken any time a \at becomes ready and \procs go to idle sleep anytime they run out of work. 114 114 115 An interesting sub -part of this heuristic is what to do with bursts of \ats that become ready.115 An interesting subpart of this heuristic is what to do with bursts of \ats that become ready. 116 116 Since waking up a sleeping \proc can have notable latency, multiple \ats may become ready while a single \proc is waking up. 117 117 This fact begs the question, if many \procs are available, how many should be woken? 118 118 If the ready \ats will run longer than the wake-up latency, waking one \proc per \at will offer maximum parallelization. 119 If the ready \ats will run for a shortvery short time, waking many \procs may be wasteful.119 If the ready \ats will run for a very short time, waking many \procs may be wasteful. 120 120 As mentioned, a heuristic to handle these complex cases is outside the scope of this thesis, the behaviour of the scheduler in this particular case is left unspecified. 121 121 … … 173 173 174 174 This approach also simplifies notification. 175 Indeed, \procs not only need to be notified when a new \at is readied, but also mustbe notified during manual resizing, so the \gls{kthrd} can be joined.175 Indeed, \procs not only need to be notified when a new \at is readied, but must also be notified during manual resizing, so the \gls{kthrd} can be joined. 176 176 These requirements mean whichever entity removes idle \procs from the sleeper list must be able to do so in any order. 177 177 Using a simple lock over this data structure makes the removal much simpler than using a lock-free data structure. -
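A minimal C11 sketch of the specialized readers-writer lock described in practice.tex: each reader spins only on its own cache-line-padded lock word, while a writer takes a global flag and then every per-reader word. The field names, reader-slot count and padding are assumptions for the illustration, not the actual \CFA layout.

#include <stdatomic.h>
#include <stdbool.h>

#define MAX_READERS 128   // upper bound on processors (placeholder)
#define CACHE_LINE  64

struct reader_slot {
    _Alignas(CACHE_LINE) atomic_bool held;   // padded so readers do not share cache lines
};

static struct reader_slot readers[MAX_READERS];
static atomic_bool writer_flag;              // mutual exclusion among writers

static void spin_lock(atomic_bool *l) {
    while (atomic_exchange_explicit(l, true, memory_order_acquire)) { /* spin */ }
}
static void spin_unlock(atomic_bool *l) {
    atomic_store_explicit(l, false, memory_order_release);
}

static void read_lock(int id) {
    for (;;) {
        // wait for any writer first, so readers cannot starve a waiting writer
        while (atomic_load_explicit(&writer_flag, memory_order_acquire)) { /* spin */ }
        spin_lock(&readers[id].held);
        if (!atomic_load_explicit(&writer_flag, memory_order_acquire)) return;
        spin_unlock(&readers[id].held);      // a writer slipped in: back off and retry
    }
}
static void read_unlock(int id) { spin_unlock(&readers[id].held); }

static void write_lock(void) {
    spin_lock(&writer_flag);                 // writers serialize on the global flag
    for (int i = 0; i < MAX_READERS; i++)
        spin_lock(&readers[i].held);         // then take every reader's local lock
}
static void write_unlock(void) {
    for (int i = 0; i < MAX_READERS; i++)
        spin_unlock(&readers[i].held);
    spin_unlock(&writer_flag);
}

In the uncontended case a reader touches only the shared writer flag (read-only) and its own lock word, which is what gives the read side the scalability the hunk above asks for.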
doc/theses/thierry_delisle_PhD/thesis/text/runtime.tex
r1c334d1 rfc96890 17 17 The \CFA M:N threading model is implemented using many user-level threads mapped onto fewer \glspl{kthrd}. 18 18 The user-level threads have the same semantic meaning as a \glspl{kthrd} in the 1:1 model: they represent an independent thread of execution with its own stack. 19 The difference is that user-level threads do not have a corresponding object in the kernel; they are handled by the runtime in user space and scheduled onto \glspl{kthrd}, referred to as \glspl{proc} in this document. \Glspl{proc} run a \at until it context -switches out, it then chooses a different \at to run.19 The difference is that user-level threads do not have a corresponding object in the kernel; they are handled by the runtime in user space and scheduled onto \glspl{kthrd}, referred to as \glspl{proc} in this document. \Glspl{proc} run a \at until it context switches out, it then chooses a different \at to run. 20 20 21 21 \section{Clusters} … … 60 60 All functions defined by this volume of POSIX.1-2017 shall be thread-safe, except that the following functions need not be thread-safe. ... (list of 70+ excluded functions) 61 61 \end{quote} 62 Only UNIX @man@ pages identify whether or nota library function is thread-safe, and hence, may block on a pthreads lock or system call; hence interoperability with UNIX library functions is a challenge for an M:N threading model.62 Only UNIX @man@ pages identify whether a library function is thread-safe, and hence, may block on a pthreads lock or system call; hence interoperability with UNIX library functions is a challenge for an M:N threading model. 63 63 64 64 Languages like Go and Java, which have strict interoperability with C\cite{wiki:jni,go:cgo}, can control operations in C by ``sandboxing'' them, \eg a blocking function may be delegated to a \gls{kthrd}. Sandboxing may help towards guaranteeing that the kind of deadlock mentioned above does not occur.