Changeset fc96890


Timestamp: Sep 13, 2022, 3:07:25 PM (2 years ago)
Author: Thierry Delisle <tdelisle@…>
Branches: ADT, ast-experimental, master, pthread-emulation
Children: 3606fe4
Parents: 1c334d1
Message: Ran second spell/grammar checker and now the cows have come home
Location: doc/theses/thierry_delisle_PhD/thesis/text
Files: 9 edited

  • doc/theses/thierry_delisle_PhD/thesis/text/conclusion.tex

    r1c334d1 rfc96890  
    22
    33Building the \CFA runtime has been a challenging project.
    4 The work was divided between high-level concurrency design and a user-level threading runtime (Masters's thesis), and low-level support of the user-level runtime using OS kernel-threading and its (multiple) I/O subsystems (Ph.D. thesis).
     4The work was divided between high-level concurrency design and a user-level threading runtime (Masters' thesis), and low-level support of the user-level runtime using OS kernel threading and its (multiple) I/O subsystems (Ph.D. thesis).
    55Because I am the main developer for both components of this project, there is strong continuity across the design and implementation.
    66This continuity provides a consistent approach to advanced control flow and concurrency, with easier development, management and maintenance of the runtime in the future.
    77
    8 I believed my Masters's work would provide the background to make the Ph.D. work reasonably straightforward.
     8I believed my Masters' work would provide the background to make the Ph.D. work reasonably straightforward.
    99However, I discovered two significant challenges.
    1010
     
    2020Second, the kernel locking, threading, and I/O in the Linux operating system offer very little flexibility and are not designed to facilitate user-level threading.
    2121There are multiple concurrency aspects in Linux that require carefully following a strict procedure to achieve acceptable performance.
    22 To be fair, many of these concurrency aspects were designed 30-40 years ago, when there were few multi-processor computers and concurrency knowledge was just developing.
     22To be fair, many of these concurrency aspects were designed 30-40 years ago, when there were few multiprocessor computers and concurrency knowledge was just developing.
    2323Unfortunately, little has changed in the intervening years.
    2424
     
    2626The positive is that @io_uring@ supports the panoply of I/O mechanisms in Linux;
    2727hence, the \CFA runtime uses one I/O mechanism to provide non-blocking I/O, rather than using @select@ to handle TTY I/O, @epoll@ to handle network I/O, and managing a thread pool to handle disk I/O.
    28 Merging all these different I/O mechanisms into a coherent scheduling implementation would require much more work than what is present in this thesis, as well as detailed knowledge of multiple I/O mechanisms.
     28Merging all these different \io mechanisms into a coherent scheduling implementation would require much more work than what is present in this thesis, as well as detailed knowledge of multiple I/O mechanisms.
    2929The negative is that @io_uring@ is new and developing.
    3030As a result, there is limited documentation, few places to find usage examples, and multiple errors that required workarounds.
     
    7575
    7676\section{Future Work}
    77 While the \CFA runtime achieves a better compromise, than other schedulers, in terms of performance and fairness, I believe further improvements can be made to reduce or eliminate the few cases where performance does deteriorate.
     77While the \CFA runtime achieves a better compromise than other schedulers, in terms of performance and fairness, I believe further improvements can be made to reduce or eliminate the few cases where performance does deteriorate.
    7878Fundamentally, achieving performance and starvation freedom will always be goals with opposing needs even outside of scheduling algorithms.
    7979
    8080\subsection{Idle Sleep}
    81 A difficult challenge, not fully addressed in this thesis, is idle-sleep.
     81A difficult challenge, not fully addressed in this thesis, is idle sleep.
    8282While a correct and somewhat low-cost idle-sleep mechanism is presented, several of the benchmarks show notable performance degradation when too few \ats are present in the system.
    8383The idle sleep mechanism could therefore benefit from a reduction of spurious cases of sleeping.
     
    8888\begin{itemize}
    8989\item
    90 The mechanism uses a hand-shake between notification and sleep to ensure that no \at is missed.
     90The mechanism uses a handshake between notification and sleep to ensure that no \at is missed.
    9191\item
    92 The hand-shake correctness is critical when the last \proc goes to sleep but could be relaxed when several \procs are awake.
     92The handshake correctness is critical when the last \proc goes to sleep but could be relaxed when several \procs are awake.
    9393\item
    9494Furthermore, organizing the sleeping \procs as a LIFO stack makes sense to keep cold \procs as cold as possible, but it might be more appropriate to attempt to keep cold CPU sockets instead.
  • doc/theses/thierry_delisle_PhD/thesis/text/core.tex

    r1c334d1 rfc96890  
    1313To match expectations, the design must offer the programmer sufficient guarantees so that, as long as they respect the execution mental model, the system also respects this model.
    1414
    15 For threading, a simple and common execution mental model is the ``Ideal multi-tasking CPU'' :
     15For threading, a simple and common execution mental model is the ``ideal multitasking CPU'' :
    1616
    1717\begin{displayquote}[Linux CFS\cite{MAN:linux/cfs}]
    18         {[The]} ``Ideal multi-tasking CPU'' is a (non-existent  :-)) CPU that has 100\% physical power and which can run each task at precise equal speed, in parallel, each at [an equal fraction of the] speed.  For example: if there are 2 running tasks, then it runs each at 50\% physical power --- i.e., actually in parallel.
     18        {[The]} ``ideal multi-tasking CPU'' is a (non-existent  :-)) CPU that has 100\% physical power and which can run each task at precise equal speed, in parallel, each at [an equal fraction of the] speed.  For example: if there are 2 running tasks, then it runs each at 50\% physical power --- i.e., actually in parallel.
    1919        \label{q:LinuxCFS}
    2020\end{displayquote}
     
    2222Applied to \ats, this model states that every ready \at immediately runs in parallel with all other ready \ats. While a strict implementation of this model is not feasible, programmers still have expectations about scheduling that come from this model.
    2323
    24 In general, the expectation at the center of this model is that ready \ats do not interfere with each other but simply share the hardware.
     24In general, the expectation at the centre of this model is that ready \ats do not interfere with each other but simply share the hardware.
    2525This assumption makes it easier to reason about threading because ready \ats can be thought of in isolation and the effect of the scheduler can be virtually ignored.
    2626This expectation of \at independence means the scheduler is expected to offer two guarantees:
     
    3131
    3232It is important to note that these guarantees are expected only up to a point.
    33 \Glspl{at} that are ready to run should not be prevented to do so, but they still share the limited hardware resources.
     33\Glspl{at} that are ready to run should not be prevented from doing so, but they still share the limited hardware resources.
    3434Therefore, the guarantee is considered respected if a \at gets access to a \emph{fair share} of the hardware resources, even if that share is very small.
    3535
     
    9292\subsubsection{Scalability}
    9393The most basic performance challenge of a scheduler is scalability.
    94 Given a large number of \procs and an even larger number of \ats, scalability measures how fast \procs can enqueue and dequeues \ats.
    95 One could expect that doubling the number of \procs would double the rate at which \ats are dequeued, but contention on the internal data structure of the scheduler can lead to worst improvements.
     94Given a large number of \procs and an even larger number of \ats, scalability measures how fast \procs can enqueue and dequeue \ats.
     95One could expect that doubling the number of \procs would double the rate at which \ats are dequeued, but contention on the internal data structure of the scheduler can diminish the improvements.
    9696While the ready queue itself can be sharded to alleviate the main source of contention, auxiliary scheduling features, \eg counting ready \ats, can also be sources of contention.
    9797
     
    108108The problem is a single point of contention when adding/removing \ats.
    109109As shown in the evaluation sections, most production schedulers do scale when adding \glspl{hthrd}.
    110 The solution to this problem is to shard the ready queue: create multiple \emph{sub-queues} forming the logical ready queue and the sub-queues are accessed by multiple \glspl{hthrd} without interfering.
     110The solution to this problem is to shard the ready queue: create multiple \emph{sub-queues} forming the logical ready-queue and the sub-queues are accessed by multiple \glspl{hthrd} without interfering.
    111111
    112112Before going into the design of \CFA's scheduler, it is relevant to discuss two sharding solutions that served as inspiration for the scheduler in this thesis.
     
    114114\subsection{Work-Stealing}
    115115
    116 As mentioned in \ref{existing:workstealing}, a popular sharding approach for the ready-queue is work-stealing.
     116As mentioned in \ref{existing:workstealing}, a popular sharding approach for the ready queue is work-stealing.
    117117In this approach, each \gls{proc} has its own local sub-queue and \glspl{proc} only access each other's sub-queue if they run out of work on their local ready-queue.
    118118The interesting aspect of work stealing happens in the steady-state scheduling case, \ie all \glspl{proc} have work and no load balancing is needed.
     
    134134Timestamps are added to each element of a sub-queue.
    135135\item
    136 A \gls{proc} randomly tests ready-queues until it has acquired one or two queues.
     136A \gls{proc} randomly tests ready queues until it has acquired one or two queues.
    137137\item
    138138If two queues are acquired, the older of the two \ats is dequeued from the front of the acquired queues.
     
    246246
    247247A simple solution to this problem is to use an exponential moving average\cite{wiki:ma} (MA) instead of a raw timestamp, as shown in Figure~\ref{fig:base-ma}.
    248 Note, that this is more complex because the \at at the head of a sub-queue is still waiting, so its wait time has not ended.
     248Note that this is more complex because the \at at the head of a sub-queue is still waiting, so its wait time has not ended.
    249249Therefore, the exponential moving average is an average of how long each dequeued \at has waited.
    250250To compare sub-queues, the timestamp at the head must be compared to the current time, yielding the best-case wait time for the \at at the head of the queue.
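To make the moving-average idea concrete, the following is a minimal C sketch of how a per-sub-queue exponential moving average could be maintained and combined with the head timestamp; the field names, the $1/8$ weight, and the @queue_age@ helper are illustrative assumptions, not the actual \CFA runtime code.
\begin{cfa}
#include <stdint.h>
struct subqueue {
	uint64_t head_ts;	// timestamp of the thread currently at the head
	uint64_t avg_wait;	// exponential moving average of completed waits
};
// called when a thread is dequeued, so its wait has ended
static inline void update_avg( struct subqueue * q, uint64_t now ) {
	uint64_t wait = now - q->head_ts;
	q->avg_wait = (7 * q->avg_wait + wait) / 8;	// EMA with assumed weight 1/8
}
// estimate used when comparing sub-queues; the head's wait is still ongoing
static inline uint64_t queue_age( struct subqueue * q, uint64_t now ) {
	uint64_t head_wait = now - q->head_ts;	// best-case wait of the current head
	return head_wait > q->avg_wait ? head_wait : q->avg_wait;
}
\end{cfa}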
     
    259259Conversely, the active sub-queues do not benefit much from helping since starvation is already a non-issue.
    260260This puts this algorithm in the awkward situation of paying for a largely unnecessary cost.
    261 The good news is that this problem can be mitigated
     261The good news is that this problem can be mitigated.
    262262
    263263\subsection{Redundant Timestamps}\label{relaxedtimes}
     
    279279        \input{base_ts2.pstex_t}
    280280        \caption[\CFA design with Redundant Timestamps]{\CFA design with Redundant Timestamps \smallskip\newline An array is added containing a copy of the timestamps.
    281         These timestamps are written to with relaxed atomics, so there is no order among concurrent memory accesses, leading to fewer cache invalidations.}
     281        These timestamps are written-to with relaxed atomics, so there is no order among concurrent memory accesses, leading to fewer cache invalidations.}
    282282        \label{fig:base-ts2}
    283283\end{figure}
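As a rough illustration of the relaxed-atomic copies mentioned in the caption, the C snippet below publishes and reads the redundant timestamps with C11 relaxed ordering; the array name and indexing are assumptions for the sketch, and stale reads are tolerated because the values only guide helping decisions.
\begin{cfa}
#include <stdatomic.h>
#include <stdint.h>
extern _Atomic uint64_t copy_ts[];	// one redundant timestamp slot per sub-queue (assumed layout)
// the owning proc publishes its sub-queue timestamp; no ordering is enforced
static inline void publish_ts( unsigned idx, uint64_t ts ) {
	atomic_store_explicit( &copy_ts[idx], ts, memory_order_relaxed );
}
// helpers read possibly stale copies; relaxed ordering avoids fences
static inline uint64_t peek_ts( unsigned idx ) {
	return atomic_load_explicit( &copy_ts[idx], memory_order_relaxed );
}
\end{cfa}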
     
    292292With redundant timestamps, this scheduling algorithm achieves both the fairness and performance requirements on most machines.
    293293The problem is that the cost of polling and helping is not necessarily consistent across each \gls{hthrd}.
    294 For example, on machines with a CPU containing multiple hyper threads and cores and multiple CPU sockets, cache misses can be satisfied from the caches on the same (local) CPU, or by a CPU on a different (remote) socket.
     294For example on machines with a CPU containing multiple hyper threads and cores and multiple CPU sockets, cache misses can be satisfied from the caches on the same (local) CPU, or by a CPU on a different (remote) socket.
    295295Cache misses satisfied by a remote CPU have significantly higher latency than from the local CPU.
    296296However, these delays are not specific to systems with multiple CPUs.
     
    344344Therefore, the approach used in the \CFA scheduler is to have per-\proc sub-queues, but have an explicit data structure to track which cache substructure each sub-queue is tied to.
    345345This tracking requires some finesse because reading this data structure must lead to fewer cache misses than not having the data structure in the first place.
    346 A key element however is that, like the timestamps for helping, reading the cache instance mapping only needs to give the correct result \emph{often enough}.
     346A key element, however, is that, like the timestamps for helping, reading the cache instance mapping only needs to give the correct result \emph{often enough}.
    347347Therefore the algorithm can be built as follows: before enqueueing or dequeuing a \at, each \proc queries the CPU id and the corresponding cache instance.
    348348Since sub-queues are tied to \procs, each \proc can then update the cache instance mapped to the local sub-queue(s).
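A sketch of this query-and-update step might look as follows; @sched_getcpu@ is the Linux call returning the current CPU id, while the @cpu_to_cache@ table and @queue_cache@ mapping are assumed data structures standing in for the runtime's actual ones.
\begin{cfa}
#define _GNU_SOURCE
#include <sched.h>
extern const unsigned cpu_to_cache[];	// precomputed CPU id -> cache-instance id (assumed)
extern unsigned queue_cache[];	// per-sub-queue cache-instance mapping (assumed)
// called before enqueue/dequeue; an occasionally stale mapping is acceptable
static inline void refresh_cache_map( unsigned my_queue ) {
	int cpu = sched_getcpu();	// query which CPU this proc is running on
	if ( cpu >= 0 ) queue_cache[my_queue] = cpu_to_cache[cpu];
}
\end{cfa}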
  • doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex

    r1c334d1 rfc96890  
    1313Memcached~\cite{memcached} is an in-memory key-value store used in many production environments, \eg \cite{atikoglu2012workload}.
    1414The Memcached server is so popular there exists a full-featured front-end for performance testing, called @mutilate@~\cite{GITHUB:mutilate}.
    15 Experimenting on Memcached allows for a simple test of the \CFA runtime as a whole, exercising the scheduler, the idle-sleep mechanism, as well the \io subsystem for sockets.
     15Experimenting on Memcached allows for a simple test of the \CFA runtime as a whole, exercising the scheduler, the idle-sleep mechanism, as well as the \io subsystem for sockets.
    1616Note that this experiment does not exercise the \io subsystem with regard to disk operations because Memcached is an in-memory server.
    1717
     
    4848For UDP connections, all the threads listen to a single UDP socket for incoming requests.
    4949Threads that are not currently dealing with another request ignore the incoming packet.
    50 One of the remaining, nonbusy, threads reads the request and sends the response.
     50One of the remaining, non-busy, threads reads the request and sends the response.
    5151This implementation can lead to increased CPU \gls{load} as threads wake from sleep to potentially process the request.
    5252\end{itemize}
     
    110110Again, each experiment is run 15 times with the median, maximum and minimum plotted with different lines.
    111111As expected, the latency starts low and increases as the server gets close to saturation, at which point, the latency increases dramatically because the web servers cannot keep up with the connection rate so client requests are disproportionally delayed.
    112 Because of this dramatic increase, the Y axis is presented using a log scale.
     112Because of this dramatic increase, the Y-axis is presented using a log scale.
    113113Note that the graph shows the \emph{target} query rate, the actual response rate is given in Figure~\ref{fig:memcd:rate:qps} as this is the same underlying experiment.
    114114
     
    117117Vanilla Memcached achieves the lowest latency until 600K, after which all the web servers are struggling to respond to client requests.
    118118\CFA begins to decline at 600K, indicating some bottleneck after saturation.
    119 Overall, all three web servers achieve micro-second latencies and the increases in latency mostly follow each other.
     119Overall, all three web servers achieve microsecond latencies and the increases in latency mostly follow each other.
    120120
    121121\subsection{Update rate}
     
    269269\begin{itemize}
    270270\item
    271 A client runs a 2.6.11-1 SMP Linux kernel, which permits each client load-generator to run on a separate CPU.
     271A client runs a 2.6.11-1 SMP Linux kernel, which permits each client load generator to run on a separate CPU.
    272272\item
    273273It has two 2.8 GHz Xeon CPUs, and four one-gigabit Ethernet cards.
     
    285285To measure web server throughput, the server computer is loaded with 21,600 files, sharded across 650 directories, occupying about 2.2GB of disk, distributed over the server's RAID-5 4-drives to achieve high throughput for disk I/O.
    286286The clients run httperf~\cite{httperf} to request a set of static files.
    287 The httperf load-generator is used with session files to simulate a large number of users and to implement a partially open-loop system.
     287The httperf load generator is used with session files to simulate a large number of users and to implement a partially open-loop system.
    288288This permits httperf to produce overload conditions, generate multiple requests from persistent HTTP/1.1 connections, and include both active and inactive off periods to model browser processing times and user think times~\cite{Barford98}.
    289289
    290290The experiments are run with 16 clients, each running a copy of httperf (one copy per CPU), requiring a set of 16 log files with requests conforming to a Zipf distribution.
    291291This distribution is representative of users accessing static data through a web browser.
    292 Each request reads a file name from its trace, establishes a connection, performs an HTTP get-request for the file name, receives the file data, closes the connection, and repeats the process.
     292Each request reads a file name from its trace, establishes a connection, performs an HTTP GET request for the file name, receives the file data, closes the connection, and repeats the process.
    293293Some trace elements have multiple file names that are read across a persistent connection.
    294294A client times out if the server does not complete a request within 10 seconds.
  • doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex

    r1c334d1 rfc96890  
    44This chapter presents five different experimental setups for evaluating the basic features of the \CFA, libfibre~\cite{libfibre}, Go, and Tokio~\cite{Tokio} schedulers.
    55All of these systems have a \gls{uthrding} model.
    6 The goal of this chapter is show that the \CFA scheduler obtains equivalent performance to other less fair schedulers through the different experiments.
     6The goal of this chapter is to show that the \CFA scheduler obtains equivalent performance to other, less fair, schedulers through the different experiments.
    77Note that only the code of the \CFA tests is shown;
    88all tests in the other systems are functionally identical and available online~\cite{GITHUB:SchedulingBenchmarks}.
     
    4848
    4949For each experiment, four graphs are generated showing traditional throughput on the top row and \newterm{scalability} or \newterm{latency} on the bottom row (peek ahead to Figure~\ref{fig:cycle:jax}).
    50 Scalability uses the same data as throughput but the Y axis is calculated as the number of \procs over the throughput.
     50Scalability uses the same data as throughput but the Y-axis is calculated as the number of \procs over the throughput.
    5151In this representation, perfect scalability should appear as a horizontal line, \eg, if doubling the number of \procs doubles the throughput, then the relation stays the same.
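Written out, the scalability metric plotted on the bottom row is, for $p$ \procs,
\begin{displaymath}
	\textrm{scalability}(p) = \frac{p}{\textrm{throughput}(p)}
\end{displaymath}
so doubling both $p$ and the throughput leaves the value unchanged, which is why perfect scaling appears as a horizontal line.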
    5252
     
    171171\CFA, Go and Tokio all obtain effectively the same throughput performance.
    172172Libfibre is slightly behind in this case but still scales decently.
    173 As a result of the \gls{kthrd} placement, additional \procs from 25 to 48 offer less performance improvement for all runtimes, which can be seen as a flatting of the line.
     173As a result of the \gls{kthrd} placement, additional \procs from 25 to 48 offer less performance improvement for all runtimes, which can be seen as a flattening of the line.
    174174This effect even causes a decrease in throughput in libfibre's case.
    175175As expected, this pattern repeats between \proc count 72 and 96.
     
    180180This decrease in performance is likely due to the additional overhead of the idle-sleep mechanism.
    181181This can either be the result of \procs actually running out of work or simply additional overhead from tracking whether or not there is work available.
    182 Indeed, unlike the left column, it is likely that the ready-queue is transiently empty, which likely triggers additional synchronization steps.
     182Indeed, unlike the left column, it is likely that the ready queue is transiently empty, which likely triggers additional synchronization steps.
    183183Interestingly, libfibre achieves better performance with 1 cycle.
    184184
     
    196196Clearly, the pathological case with 1 cycle per \proc can affect fairness algorithms managing mostly idle processors, \eg \CFA, but only at high core counts.
    197197In this case, \emph{any} helping is likely to cause a cascade of \procs running out of work and attempting to steal.
    198 For this experiment, the \CFA scheduler has achieved the goal of obtaining equivalent performance to other less fair schedulers.
     198For this experiment, the \CFA scheduler has achieved the goal of obtaining equivalent performance to other, less fair, schedulers.
    199199
    200200\section{Yield}
     
    258258Figures~\ref{fig:yield:jax} and \ref{fig:yield:nasus} show the results for the yield experiment on Intel and AMD, respectively.
    259259Looking at the left column on Intel, Figures~\ref{fig:yield:jax:ops} and \ref{fig:yield:jax:ns} show the results for 100 \ats for each \proc.
    260 Note, taht the Y-axis on this graph is twice as large as the Intel cycle graph.
     260Note that the Y-axis on this graph is twice as large as the Intel cycle graph.
    261261A visual glance between the left columns of the cycle and yield graphs confirms my claim that the yield benchmark is unreliable.
    262262\CFA has no special handling for @yield@, but this experiment requires less synchronization than the @cycle@ experiment.
    263263Hence, the @yield@ throughput and scalability graphs have similar shapes to the corresponding @cycle@ graphs.
    264264The only difference is slightly better performance for @yield@ because of less synchronization.
    265 Libfibre has special handling for @yield@ using the fact that the number of ready fibres does not change, and therefore, by-passing the idle-sleep mechanism entirely.
     265Libfibre has special handling for @yield@ using the fact that the number of ready fibres does not change, and therefore, bypassing the idle-sleep mechanism entirely.
    266266Hence, libfibre behaves very differently in the cycle and yield benchmarks, with a 4 times increase in performance on the left column.
    267267Go has special handling for @yield@ by putting a yielding goroutine on a secondary global ready-queue, giving it a lower priority.
     
    330330
    331331The Cycle and Yield benchmarks represent an \emph{easy} scenario for a scheduler, \eg an embarrassingly parallel application.
    332 In these benchmarks, \ats can be easily partitioned over the different \procs upfront and none of the \ats communicate with each other.
     332In these benchmarks \ats can be easily partitioned over the different \procs upfront and none of the \ats communicate with each other.
    333333
    334334The Churn benchmark represents more chaotic executions, where there is more communication among \ats but no relationship between the last \proc on which a \at ran and blocked, and the \proc that subsequently unblocks it.
     
    486486
    487487In conclusion, the objective of this benchmark is to demonstrate that \glslink{atsched}{unparking} \ats from remote \procs does not cause too much contention on the local queues.
    488 Indeed, the fact that most runtimes achieve some scaling between various \proc count demonstrate migrations do not need to be serialized.
     488Indeed, the fact that most runtimes achieve some scaling between various \proc counts demonstrates migrations do not need to be serialized.
    489489Again these results demonstrate that \CFA achieves satisfactory performance compared to the other runtimes.
    490490
     
    497497In the noshare variation, the array is not passed on and each thread continuously accesses its private array.
    498498In the share variation, the array is passed to another thread via the semaphore's shadow queue (each blocking thread can save a word of user data in its blocking node), transferring ownership of the array to the woken thread.
    499 Figure~\ref{fig:locality:code} shows the pseudo-code for this benchmark.
     499Figure~\ref{fig:locality:code} shows the pseudo code for this benchmark.
    500500
    501501The objective here is to highlight the different decisions made by the runtime when \glslink{atsched}{unparking}.
     
    567567\CFA and Tokio slightly outperform libfibre, as expected, based on their \ats placement approach.
    568568\CFA and Tokio both \unpark locally and do not suffer cache misses on the transferred array.
    569 Libfibre on the other hand unparks remotely, and as such the unparked \at is likely to miss on the shared data.
     569Libfibre, on the other hand, unparks remotely, and as such the unparked \at is likely to miss on the shared data.
    570570Go trails behind in this experiment, presumably for the same reasons that were observable in the churn benchmark.
    571571Otherwise, the results are similar to the churn benchmark, with lower throughput due to the array processing.
     
    645645Again the overall performance is higher and slightly more variation is visible.
    646646Looking at the left column first, Figures~\ref{fig:locality:nasus:share:ops} and \ref{fig:locality:nasus:share:ns}, \CFA and Tokio still outperform libfibre, this time more significantly.
    647 This advantage is expected from the AMD server with its smaller and more narrow caches that magnify the costs of processing the array.
     647This advantage is expected from the AMD server with its smaller and narrower caches that magnify the costs of processing the array.
    648648Go still has the same poor performance as on Intel.
    649649
     
    725725The semaphore version is an approximation of strictly FIFO scheduling, where none of the \ats \emph{attempt} to run more than once.
    726726The benchmark effectively provides the fairness guarantee in this case.
    727 In the yielding version, however the benchmark provides no such guarantee, which means the scheduler has full responsibility and any unfairness is measurable.
     727In the yielding version, however, the benchmark provides no such guarantee, which means the scheduler has full responsibility and any unfairness is measurable.
    728728
    729729While this is an artificial scenario, in real life it requires only a few simple pieces.
  • doc/theses/thierry_delisle_PhD/thesis/text/existing.tex

    r1c334d1 rfc96890  
    55
    66In general, \emph{selecting} a scheduling algorithm depends on how much information is available to the scheduler.
    7 Workloads that are well-known, consistent, and homogeneous can benefit from a scheduler that is optimized to use this information, while ill-defined, inconsistent, heterogeneous workloads require general non-optimal algorithms.
     7Workloads that are well known, consistent, and homogeneous can benefit from a scheduler that is optimized to use this information, while ill-defined, inconsistent, heterogeneous workloads require general non-optimal algorithms.
    88A secondary aspect is how much information can be gathered versus how much information must be given as part of the scheduler input.
    9 This information adds to the spectrum of scheduling algorithms, going from static schedulers that are well-informed from the start, to schedulers that gather most of the information needed, to schedulers that can only rely on very limited information.
     9This information adds to the spectrum of scheduling algorithms, going from static schedulers that are well informed from the start, to schedulers that gather most of the information needed, to schedulers that can only rely on very limited information.
    1010Note, this description includes both information about each request, \eg time to complete or resources needed, and information about the relationships among requests, \eg whether some requests must be completed before another request starts.
    1111
     
    7777In general, fine granularity is better for load balancing and coarse granularity reduces communication overhead.
    7878The best performance generally means finding a middle ground between the two.
    79 Several methods can be employed, but I believe these are less relevant for threads, which are generally explicit and more coarse grained.
     79Several methods can be employed, but I believe these are less relevant for threads, which are generally explicit and more coarse-grained.
    8080
    8181\paragraph{Task Placement} Another aspect of work stealing that has been studied extensively is the mapping between \at and \proc.
     
    9797\cite{DBLP:conf/ipps/ColeR13} also studied how randomized work-stealing affects false sharing among \ats.
    9898
    99 However, as \cite{DBLP:journals/ijpp/YangH18} highlights, it is worth mentioning that this theoretical research has mainly focused on ``fully-strict'' computations, \ie workloads that can be fully represented with a direct acyclic graph.
    100 It is unclear how well these distributions represent workloads in real world scenarios.
     99However, as \cite{DBLP:journals/ijpp/YangH18} highlights, it is worth mentioning that this theoretical research has mainly focused on ``fully strict'' computations, \ie workloads that can be fully represented with a directed acyclic graph.
     100It is unclear how well these distributions represent workloads in real-world scenarios.
    101101
    102102\section{Preemption}
     
    115115
    116116\subsection{Operating System Schedulers}
    117 Operating System Schedulers tend to be fairly complex as they generally support some amount of real-time, aim to balance interactive and non-interactive \ats and support multiple users sharing hardware without requiring these users to cooperate.
     117Operating System Schedulers tend to be fairly complex as they generally support some amount of real time, aim to balance interactive and non-interactive \ats and support multiple users sharing hardware without requiring these users to cooperate.
    118118Here are more details on a few schedulers used in the common operating systems: Linux, FreeBSD, Microsoft Windows and Apple's OS X.
    119119The information is less complete for closed source operating systems.
     
    130130The issues highlighted stem from Linux's need to support fairness across \ats \emph{and} across users\footnote{Enforcing fairness across users means that given two users, one with a single \at and the other with one thousand \ats, the user with a single \at does not receive one-thousandth of the CPU time.}, increasing the complexity.
    131131
    132 Linux also offers a FIFO scheduler, a real-time scheduler, which runs the highest-priority \ats, and a round-robin scheduler, which is an extension of the FIFO-scheduler that adds fixed time slices. \cite{MAN:linux/sched}
     132Linux also offers a FIFO scheduler, a real-time scheduler, which runs the highest-priority \ats, and a round-robin scheduler, which is an extension of the FIFO scheduler that adds fixed time slices. \cite{MAN:linux/sched}
    133133
    134134\paragraph{FreeBSD}
     
    144144
    145145In~\cite{russinovich2009windows}, Chapter 1 section 2.3 ``Processes, Threads, and Jobs'' discusses the scheduling policy more in-depth.
    146 Multicore scheduling is based on a combination of priorities and \proc preferance.
     146Multicore scheduling is based on a combination of priorities and \proc preference.
    147147Each \at is assigned an initial processor using a round-robin policy, called the \at's \newterm{ideal} \proc.
    148148\Glspl{at} are distributed among the \procs according to their priority, preferring to match \ats to their ideal \proc and then to the last \proc they ran on.
     
    195195                \item The task returned by \textit{t}@.execute()@
    196196                \item The successor of t if \textit{t} was its last completed predecessor.
    197                 \item A task popped from the end of the thread's own deque.
     197                \item A task popped from the end of the thread's own queue.
    198198                \item A task with an affinity for the thread.
    199199                \item A task popped from approximately the beginning of the shared queue.
    200                 \item A task popped from the beginning of another randomly chosen thread's deque.
     200                \item A task popped from the beginning of another randomly chosen thread's queue.
    201201        \end{enumerate}
    202202
  • doc/theses/thierry_delisle_PhD/thesis/text/intro.tex

    r1c334d1 rfc96890  
    33\Gls{uthrding} (M:N) is gaining popularity over kernel-level threading (1:1) in many programming languages.
    44The user threading approach is often a better mechanism to express complex concurrent applications by efficiently running 10,000+ threads on multicore systems.
    5 Indeed, over-partitioning into small work-units with user threading significantly eases load bal\-ancing, while simultaneously providing advanced synchronization and mutual exclusion capabilities.
     5Indeed, over-partitioning into small work units with user threading significantly eases load bal\-ancing, while simultaneously providing advanced synchronization and mutual exclusion capabilities.
    66To manage these high levels of concurrency, the underlying runtime must efficiently schedule many user threads across a few kernel threads;
    77which begs the question of how many kernel threads are needed and should the number be dynamically reevaluated.
     
    1111otherwise, other user threads can experience short/long term starvation or kernel threads can deadlock waiting for events to occur on busy kernel threads.
    1212
    13 This thesis analyses multiple scheduler systems, where each system attempts to fulfill the requirements for \gls{uthrding}.
    14 The predominant technique for managing high levels of concurrency is sharding the ready queue with one queue per kernel-thread and using some form of work stealing/sharing to dynamically rebalance workload shifts.
     13This thesis analyzes multiple scheduler systems, where each system attempts to fulfill the requirements for \gls{uthrding}.
     14The predominant technique for managing high levels of concurrency is sharding the ready queue with one queue per kernel thread and using some form of work stealing/sharing to dynamically rebalance workload shifts.
    1515Preventing kernel blocking is accomplished by transforming kernel locks and I/O operations into user-level operations that do not block the kernel thread or spin up new kernel threads to manage the blocking.
    16 Fairness is handled through preemption and/or ad-hoc solutions, which leads to coarse-grained fairness with some pathological cases.
     16Fairness is handled through preemption and/or ad hoc solutions, which leads to coarse-grained fairness with some pathological cases.
    1717
    18 After examining, testing and selecting specific approaches to these scheduling issues, a completely new scheduler was created and tested in the \CFA (C-for-all) user-threading runtime-system.
     18After examining, testing and selecting specific approaches to these scheduling issues, a completely new scheduler was created and tested in the \CFA (C-for-all) user-threading runtime system.
    1919The goal of the new scheduler is to offer increased safety and productivity without sacrificing performance.
    2020The quality of the new scheduler is demonstrated by comparing it with other user-threading work-stealing schedulers, with the aim of showing equivalent or better performance while offering better fairness.
     
    3535In general, work units without threads, like routines and coroutines, are self-scheduling, while work units with threads, like tasks and programs, are scheduled.
    3636For scheduled work-units, a scheduler takes a sequence of threads and attempts to run them to completion, subject to shared resource restrictions and utilization.
    37 A general-purpose dynamic-scheduler for an open system cannot anticipate work requests, so its performance is rarely optimal.
     37In an open system, a general-purpose dynamic scheduler cannot anticipate work requests, so its performance is rarely optimal.
    3838Even with complete knowledge of arrival order and work, creating an optimal solution is a bin packing problem~\cite{wiki:binpak}.
    3939However, optimal solutions are often not required: schedulers often produce excellent solutions, without needing optimality, by taking advantage of regularities in work patterns.
     
    5353timer alarm for preemption (running $\rightarrow$ ready)
    5454\item
    55 long term delay versus spinning (running $\rightarrow$ blocked)
     55long-term delay versus spinning (running $\rightarrow$ blocked)
    5656\item
    5757completion of delay, \eg network or I/O completion (blocked $\rightarrow$ ready)
     
    8484
    8585\noindent
    86 Essentially, all multi-processor computers have non-uniform memory access (NUMA), with one or more quantized steps to access data at different levels in the memory hierarchy.
     86Essentially, all multiprocessor computers have non-uniform memory access (NUMA), with one or more quantized steps to access data at different levels in the memory hierarchy.
    8787When a system has a large number of independently executing threads, affinity becomes difficult because of \newterm{thread churn}.
    8888That is, threads must be scheduled on different processors to obtain high processor utilization because the number of threads $\ggg$ processors.
     
    120120To complete the scheduler, an idle sleep mechanism is implemented that significantly reduces wasted CPU cycles, which are then available outside the application.
    121121
    122 As a research project, this work builds exclusively on newer versions of the Linux operating-system and gcc/clang compilers.
     122As a research project, this work builds exclusively on newer versions of the Linux operating system and gcc/clang compilers.
    123123The new scheduler implementation uses several optimizations to successfully balance the cost of fairness against performance;
    124124some of these optimizations rely on interesting hardware optimizations only present on modern CPUs.
    125 The \io implementation is based on the @io_uring@ kernel-interface, a recent addition to the Linux kernel, because it purports to handle nonblocking \emph{file} and network \io.
     125The \io implementation is based on the @io_uring@ kernel interface, a recent addition to the Linux kernel, because it purports to handle nonblocking \emph{file} and network \io.
    126126This decision allowed an interesting performance and fairness comparison with other threading systems using @select@, @epoll@, \etc.
    127127While the current \CFA release supports older versions of Linux ($\ge$~Ubuntu 16.04) and gcc/clang compilers ($\ge$~gcc 6.0), it is not the purpose of this project to find workarounds in these older systems to provide backwards compatibility.
     
    129129
    130130\section{Contributions}\label{s:Contributions}
    131 This work provides the following scheduling contributions for advanced \gls{uthrding} runtime-systems:
     131This work provides the following scheduling contributions for advanced \gls{uthrding} runtime systems:
    132132\begin{enumerate}[leftmargin=*]
    133133\item
     
    140140A mechanism for adding fairness on top of the MQMS algorithm through helping, used both for the scalable scheduling algorithm and the user-level \glsxtrshort{io}.
    141141\item
    142 An optimization of the helping-mechanism for load balancing to reduce scheduling costs.
     142An optimization of the helping mechanism for load balancing to reduce scheduling costs.
    143143\item
    144144An optimization for the alternative relaxed-list for load balancing to reduce scheduling costs in embarrassingly parallel cases.
  • doc/theses/thierry_delisle_PhD/thesis/text/io.tex

    r1c334d1 rfc96890  
    5353Its interface lets programmers enqueue operations to be performed asynchronously by the kernel.
    5454Completions of these operations can be communicated in various ways: either by spawning a new \gls{kthrd}, sending a Linux signal, or polling for completion of one or more operations.
    55 For this work, spawning a new \gls{kthrd} is counter-productive but a related solution is discussed in Section~\ref{io:morethreads}.
     55For this work, spawning a new \gls{kthrd} is counterproductive but a related solution is discussed in Section~\ref{io:morethreads}.
    5656Using interrupt handlers can also lead to fairly complicated interactions between subsystems and has a non-trivial cost.
    5757Leaving polling for completion, which is similar to the previous system calls.
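For readers unfamiliar with the interface, the following liburing-based C sketch shows the enqueue-then-poll pattern referred to here: one read operation is placed on the submission queue, the kernel is notified, and the completion is then polled for. This is an illustrative use of the public liburing API, not the \CFA runtime's internal code, and it assumes a ring already initialized with @io_uring_queue_init@.
\begin{cfa}
#include <liburing.h>
// submit one read and poll for its completion on an already-initialized ring
ssize_t demo_read( struct io_uring * ring, int fd, void * buf, unsigned len ) {
	struct io_uring_sqe * sqe = io_uring_get_sqe( ring );	// allocate an SQE
	if ( ! sqe ) return -1;	// submission queue currently full
	io_uring_prep_read( sqe, fd, buf, len, 0 );	// fill in the operation
	io_uring_submit( ring );	// performs the io_uring_enter system call

	struct io_uring_cqe * cqe;
	if ( io_uring_wait_cqe( ring, &cqe ) < 0 ) return -1;	// poll/wait for a completion
	ssize_t res = cqe->res;	// signed result, as with the read system call
	io_uring_cqe_seen( ring, cqe );	// mark the CQE as consumed
	return res;
}
\end{cfa}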
     
    100100
    101101\subsection{Extra Kernel Threads}\label{io:morethreads}
    102 Finally, if the operating system does not offer a satisfactory form of asynchronous \io operations, an ad-hoc solution is to create a pool of \glspl{kthrd} and delegate operations to it to avoid blocking \glspl{proc}, which is a compromise for multiplexing.
     102Finally, if the operating system does not offer a satisfactory form of asynchronous \io operations, an ad hoc solution is to create a pool of \glspl{kthrd} and delegate operations to it to avoid blocking \glspl{proc}, which is a compromise for multiplexing.
    103103In the worst case, where all \ats are consistently blocking on \io, it devolves into 1-to-1 threading.
    104104However, regardless of the frequency of \io operations, it achieves the fundamental goal of not blocking \glspl{proc} when \ats are ready to run.
     
    290290
    291291Allocation failures need to be pushed to a routing algorithm: \ats attempting \io operations must not be directed to @io_uring@ instances without sufficient SQEs available.
    292 Furthermore, the routing algorithm should block operations up-front, if none of the instances have available SQEs.
     292Furthermore, the routing algorithm should block operations upfront if none of the instances have available SQEs.
    293293
    294294Once an SQE is allocated, \ats insert the \io request information and keep track of the SQE index and the instance it belongs to.
    295295
    296 Once an SQE is filled in, it is added to the submission ring buffer, an operation that is not thread-safe, and then the kernel must be notified using the @io_uring_enter@ system call.
     296Once an SQE is filled in, it is added to the submission ring buffer, an operation that is not thread safe, and then the kernel must be notified using the @io_uring_enter@ system call.
    297297The submission ring buffer is the same size as the pre-allocated SQE buffer; therefore, pushing to the ring buffer cannot fail because it would mean an SQE appears multiple times in the ring buffer, which is undefined behaviour.
    298298However, as mentioned, the system call itself can fail with the expectation that it can be retried once some submitted operations are complete.
     
    307307Once the system call is done, the submitter must also free SQEs so that the allocator can reuse them.
    308308
    309 Finally, the completion side is much simpler since the @io_uring@ system-call enforces a natural synchronization point.
     309Finally, the completion side is much simpler since the @io_uring@ system call enforces a natural synchronization point.
    310310Polling simply needs to regularly do the system call, go through the produced CQEs and communicate the result back to the originating \ats.
    311311Since CQEs only own a signed 32-bit result, in addition to the copy of the @user_data@ field, all that is needed to communicate the result is a simple future~\cite{wiki:future}.
     
    368368
    369369\paragraph{Pending Allocations} are handled by the arbiter when it has available instances and can directly hand over the instance and satisfy the request.
    370 Otherwise, it must hold onto the list of threads until SQEs are made available again.
     370Otherwise, it must hold on to the list of threads until SQEs are made available again.
    371371This handling is more complex when an allocation requires multiple SQEs, since the arbiter must make a decision between satisfying requests in FIFO ordering or for fewer SQEs.
    372372
     
    383383The last important part of the \io subsystem is its interface.
    384384Multiple approaches can be offered to programmers, each with advantages and disadvantages.
    385 The new \io subsystem can replace the C runtime API or extend it, and in the later case, the interface can go from very similar to vastly different.
     385The new \io subsystem can replace the C runtime API or extend it, and in the latter case, the interface can go from very similar to vastly different.
    386386The following sections discuss some useful options using @read@ as an example.
    387 The standard Linux interface for C is :
     387The standard Linux interface for C is:
    388388\begin{cfa}
    389389ssize_t read(int fd, void *buf, size_t count);
     
    394394The goal is to convince the compiler and linker to replace any calls to @read@ to direct them to the \CFA implementation instead of glibc's.
    395395This rerouting has the advantage of working transparently and supporting existing binaries without needing recompilation.
    396 It also offers a, presumably, well-known and familiar API that C programmers can simply continue to work with.
     396It also offers a, presumably, well known and familiar API that C programmers can simply continue to work with.
    397397However, this approach also entails a plethora of subtle technical challenges, which generally boil down to making a perfect replacement.
    398398If the \CFA interface replaces only \emph{some} of the calls to glibc, then this can easily lead to esoteric concurrency bugs.
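As an illustration of the rerouting idea (not the actual \CFA implementation), a runtime library can export its own strong @read@ symbol so that existing calls resolve to it instead of glibc's; @cfa_io_read@ below is an assumed internal entry point.
\begin{cfa}
#include <stddef.h>
#include <sys/types.h>
// assumed internal entry point into the user-level I/O subsystem
extern ssize_t cfa_io_read( int fd, void * buf, size_t count );
// when the runtime library is resolved ahead of glibc, calls to read land here
ssize_t read( int fd, void * buf, size_t count ) {
	return cfa_io_read( fd, buf, count );
}
\end{cfa}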
  • doc/theses/thierry_delisle_PhD/thesis/text/practice.tex

    r1c334d1 rfc96890  
    5555A simpler approach is to use a \newterm{Readers-Writer Lock}~\cite{wiki:rwlock}, where the resizing requires acquiring the lock as a writer while simply enqueueing/dequeuing \ats requires acquiring the lock as a reader.
    5656Using a Readers-Writer lock solves the problem of dynamically resizing and leaves the challenge of finding or building a lock with sufficiently good read-side performance.
    57 Since this approach is not a very complex challenge and an ad-hoc solution is perfectly acceptable, building a Readers-Writer lock was the path taken.
     57Since this approach is not a very complex challenge and an ad hoc solution is perfectly acceptable, building a Readers-Writer lock was the path taken.
    5858
    5959To maximize reader scalability, readers should not contend with each other when attempting to acquire and release a critical section.
     
    6161The read-acquire possibly waits for a writer to finish the critical section and then acquires a reader's local spinlock.
    6262The writer acquires the global lock, guaranteeing mutual exclusion among writers, and then acquires each of the local reader locks.
    63 Acquiring all the local read locks guarantees mutual exclusion among the readers and the writer, while the wait on the read side prevents readers from continuously starving the writer.
     63Acquiring all the local read-locks guarantees mutual exclusion among the readers and the writer, while the wait on the read side prevents readers from continuously starving the writer.
    6464Figure~\ref{f:SpecializedReadersWriterLock} shows the outline for this specialized readers-writer lock.
    6565The lock is nonblocking, so both readers and writers spin while the lock is held.
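For concreteness, the outline in Figure~\ref{f:SpecializedReadersWriterLock} roughly corresponds to the following C sketch, written here with C11 atomics and a fixed reader count; the names and sizes are assumptions, not the runtime's exact code.
\begin{cfa}
#include <stdatomic.h>
#include <stdbool.h>
#define NREADERS 128	// assumed bound on readers (one per proc)
static _Atomic bool writer_lock;	// global lock serializing writers
static _Atomic bool reader_lock[NREADERS];	// one local spinlock per reader
static inline void spin_lock( _Atomic bool * l ) { while ( atomic_exchange( l, true ) ) {} }
static inline void spin_unlock( _Atomic bool * l ) { atomic_store( l, false ); }

void read_acquire( unsigned id ) {
	while ( atomic_load( &writer_lock ) ) {}	// wait so readers cannot starve a writer
	spin_lock( &reader_lock[id] );	// then take only the local reader lock
}
void read_release( unsigned id ) { spin_unlock( &reader_lock[id] ); }

void write_acquire( void ) {
	spin_lock( &writer_lock );	// mutual exclusion among writers
	for ( unsigned i = 0; i < NREADERS; i += 1 ) spin_lock( &reader_lock[i] );	// exclude every reader
}
void write_release( void ) {
	for ( unsigned i = 0; i < NREADERS; i += 1 ) spin_unlock( &reader_lock[i] );
	spin_unlock( &writer_lock );
}
\end{cfa}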
     
    113113Therefore, the \CFA scheduler simply follows the ``Race-to-Idle''~\cite{Albers12} approach where a sleeping \proc is woken any time a \at becomes ready and \procs go to idle sleep anytime they run out of work.
    114114
    115 An interesting sub-part of this heuristic is what to do with bursts of \ats that become ready.
     115An interesting subpart of this heuristic is what to do with bursts of \ats that become ready.
    116116Since waking up a sleeping \proc can have notable latency, multiple \ats may become ready while a single \proc is waking up.
    117117This fact begs the question, if many \procs are available, how many should be woken?
    118118If the ready \ats will run longer than the wake-up latency, waking one \proc per \at will offer maximum parallelization.
    119 If the ready \ats will run for a short very short time, waking many \procs may be wasteful.
     119If the ready \ats will run for a very short time, waking many \procs may be wasteful.
    120120As mentioned, a heuristic to handle these complex cases is outside the scope of this thesis, so the behaviour of the scheduler in this particular case is left unspecified.
    121121
     
    173173
    174174This approach also simplifies notification.
    175 Indeed, \procs not only need to be notified when a new \at is readied, but also must be notified during manual resizing, so the \gls{kthrd} can be joined.
     175Indeed, \procs not only need to be notified when a new \at is readied, but must also be notified during manual resizing, so the \gls{kthrd} can be joined.
    176176These requirements mean whichever entity removes idle \procs from the sleeper list must be able to do so in any order.
    177177Using a simple lock over this data structure makes the removal much simpler than using a lock-free data structure.
  • doc/theses/thierry_delisle_PhD/thesis/text/runtime.tex

    r1c334d1 rfc96890  
    1717The \CFA M:N threading model is implemented using many user-level threads mapped onto fewer \glspl{kthrd}.
    1818The user-level threads have the same semantic meaning as a \glspl{kthrd} in the 1:1 model: they represent an independent thread of execution with its own stack.
    19 The difference is that user-level threads do not have a corresponding object in the kernel; they are handled by the runtime in user space and scheduled onto \glspl{kthrd}, referred to as \glspl{proc} in this document. \Glspl{proc} run a \at until it context-switches out, it then chooses a different \at to run.
     19The difference is that user-level threads do not have a corresponding object in the kernel; they are handled by the runtime in user space and scheduled onto \glspl{kthrd}, referred to as \glspl{proc} in this document. \Glspl{proc} run a \at until it context switches out, it then chooses a different \at to run.
    2020
    2121\section{Clusters}
     
    6060All functions defined by this volume of POSIX.1-2017 shall be thread-safe, except that the following functions need not be thread-safe. ... (list of 70+ excluded functions)
    6161\end{quote}
    62 Only UNIX @man@ pages identify whether or not a library function is thread-safe, and hence, may block on a pthreads lock or system call; hence interoperability with UNIX library functions is a challenge for an M:N threading model.
     62Only UNIX @man@ pages identify whether a library function is thread-safe, and hence, may block on a pthreads lock or system call; hence interoperability with UNIX library functions is a challenge for an M:N threading model.
    6363
    6464Languages like Go and Java, which have strict interoperability with C\cite{wiki:jni,go:cgo}, can control operations in C by ``sandboxing'' them, \eg a blocking function may be delegated to a \gls{kthrd}. Sandboxing may help towards guaranteeing that the kind of deadlock mentioned above does not occur.