Changeset ddcaff6


Timestamp:
Nov 24, 2022, 3:41:44 PM (2 months ago)
Author:
Thierry Delisle <tdelisle@…>
Branches:
master
Children:
dacd8e6
Parents:
82a90d4
Message:

Last corrections to my thesis... hopefully

Location:
doc/theses/thierry_delisle_PhD/thesis
Files:
12 edited

  • doc/theses/thierry_delisle_PhD/thesis/fig/base_ts2.fig

    r82a90d4 rddcaff6  
    1191192 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
    120120         2400 4950 3000 4950
    121 4 2 -1 50 -1 0 12 0.0000 2 135 645 2100 3075 Threads\001
    122 4 2 -1 50 -1 0 12 0.0000 2 180 525 2100 2850 Ready\001
    123 4 2 -1 50 -1 0 12 0.0000 2 180 660 2100 4200 Array of\001
    124 4 2 -1 50 -1 0 12 0.0000 2 165 600 2100 4425 Queues\001
    125 4 1 -1 50 -1 0 11 0.0000 2 120 210 2700 3550 TS\001
    126 4 2 -1 50 -1 0 12 0.0000 2 180 660 2100 5025 Array of\001
    127 4 1 -1 50 -1 0 11 0.0000 2 120 300 2700 4450 MA\001
    128 4 1 -1 50 -1 0 11 0.0000 2 120 210 2700 4225 TS\001
    129 4 1 -1 50 -1 0 11 0.0000 2 120 300 2700 5350 MA\001
    130 4 1 -1 50 -1 0 11 0.0000 2 120 210 2700 5125 TS\001
    131 4 2 -1 50 -1 0 12 0.0000 2 180 1590 2100 5250 Timestamps Copies\001
    132 4 2 -1 50 -1 0 12 0.0000 2 135 840 2100 6075 Processors\001
     1214 2 -1 50 -1 0 12 0.0000 2 135 630 2100 3075 Threads\001
     1224 2 -1 50 -1 0 12 0.0000 2 165 450 2100 2850 Ready\001
     1234 2 -1 50 -1 0 12 0.0000 2 165 720 2100 4200 Array of\001
     1244 2 -1 50 -1 0 12 0.0000 2 150 540 2100 4425 Queues\001
     1254 1 -1 50 -1 0 11 0.0000 2 135 180 2700 3550 TS\001
     1264 2 -1 50 -1 0 12 0.0000 2 165 720 2100 5025 Array of\001
     1274 1 -1 50 -1 0 11 0.0000 2 135 180 2700 4450 MA\001
     1284 1 -1 50 -1 0 11 0.0000 2 135 180 2700 4225 TS\001
     1294 1 -1 50 -1 0 11 0.0000 2 135 180 2700 5350 MA\001
     1304 1 -1 50 -1 0 11 0.0000 2 135 180 2700 5125 TS\001
     1314 2 -1 50 -1 0 12 0.0000 2 135 900 2100 6075 Processors\001
     1324 2 -1 50 -1 0 12 0.0000 2 165 1440 2100 5250 Timestamp Copies\001
  • doc/theses/thierry_delisle_PhD/thesis/local.bib

    r82a90d4 rddcaff6  
    615615}
    616616
     617@book{MAN:inteldev,
     618  key = {Intel 64 and IA-32 Architectures Software Developer’s Manual},
     619  title = {Intel® 64 and IA-32 Architectures Software Developer’s Manual},
     620  publisher = {Intel{\textregistered}},
     621  year = {2016},
     622  volume = {3B: System Programming Guide, Part 2},
     623  url = {\href{https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html}{https://\-www.intel.com/\-content/\-www/\-us/\-en/\-architecture\--and\--technology/\-64\--ia\--32\--architectures\--software\--developer\--vol\--3b\--part\--2\--manual\-.html}},
     624}
     625
    617626@misc{MemcachedThreading,
    618627  author = {Oracle},
     
    673682}
    674683
     684@misc{MAN:kotlink,
     685  howpublished = {\href{https://kotlinlang.org/docs/multiplatform-mobile-concurrency-and-coroutines.html}{https://\-kotlinlang.org\-/docs\-/multiplatform\--mobile\--concurrency\--and\--coroutines.html}}
     686}
     687
    675688@misc{MAN:java/fork-join,
    676689  howpublished = {\href{https://www.baeldung.com/java-fork-join}{https://\-www.baeldung.com/\-java-fork-join}}
     
    965978  howpublished = "\href{https://en.wikipedia.org/wiki/Zipf%27s_law}{https://\-en.wikipedia.org/\-wiki/\-Zipf\%27s\-\_law}",
    966979  note = "[Online; accessed 7-September-2022]"
     980}
     981
     982@misc{wiki:rdtsc,
     983  author = "{Wikipedia contributors}",
     984  title = "Time Stamp Counter --- {W}ikipedia{,} The Free Encyclopedia",
     985  year = "2022",
     986  howpublished = "\href{https://en.wikipedia.org/wiki/Time\_Stamp\_Counter}{https://\-en.wikipedia.org/\-wiki/\-Time\-\_Stamp\-\_Counter}",
     987  note = "[Online; accessed 14-November-2022]"
     988}
     989
     990@misc{wiki:lockfree,
     991  author = "{Wikipedia contributors}",
     992  title = "Non-blocking algorithm --- {W}ikipedia{,} The Free Encyclopedia",
     993  year = "2022",
     994  howpublished = "\href{https://en.wikipedia.org/wiki/Non-blocking_algorithm}{https://en.wikipedia.org/\-wiki/Non\--blocking\-\_algorithm}",
     995  note = "[Online; accessed 22-November-2022]"
     996}
     997
     998@misc{wiki:expected,
     999  author = "{Wikipedia contributors}",
     1000  title = "Expected value --- {W}ikipedia{,} The Free Encyclopedia",
     1001  year = "2022",
     1002  howpublished = "\href{https://en.wikipedia.org/wiki/Expected_value}{https://en.wikipedia.org/\-wiki/\-Expected\-\_value}",
     1003  note = "[Online; accessed 22-November-2022]"
     1004}
     1005
     1006@misc{wiki:softirq,
     1007  author = "{Wikipedia contributors}",
     1008  title = "Interrupt --- {W}ikipedia{,} The Free Encyclopedia",
     1009  year = "2022",
     1010  howpublished = "\href{https://en.wikipedia.org/wiki/Interrupt}{https://en.wikipedia.org/\-wiki/\-Interrupt}",
     1011  note = "[Online; accessed 24-November-2022]"
    9671012}
    9681013
  • doc/theses/thierry_delisle_PhD/thesis/text/conclusion.tex

    r82a90d4 rddcaff6  
    3737Spinning up internal kernel threads to handle blocking scenarios is what developers already do outside of the kernel, and managing these threads adds a significant burden to the system.
    3838Nonblocking I/O should not be handled in this way.
     39Presumably, this is better handled by Windows's ``overlapped I/O''; however, porting \CFA to Windows is far beyond the scope of this work.
    3940
    4041\section{Goals}
     
    5859
    5960As \CFA aims to increase productivity and safety of C, while maintaining its performance, this places a huge burden on the \CFA runtime to achieve these goals.
    60 Productivity and safety manifest in removing scheduling pitfalls in the efficient usage of the threading runtime.
     61Productivity and safety manifest in removing scheduling pitfalls from the efficient usage of the threading runtime.
    6162Performance manifests in making efficient use of the underlying kernel threads that provide indirect access to the CPUs.
    6263
     
    99100I am aware there is a host of low-power research that could be tapped here.
    100101
     102\subsection{CPU Workloads}
     103A performance consideration related to idle sleep is CPU utilization, \ie, how easily the scheduler keeps every \proc busy when work is available.
     104CPU utilization generally becomes an issue for workloads that are compute bound, but where the dependencies among \ats can prevent the scheduler from easily utilizing every \proc.
     105Examining such workloads in the context of scheduling would be interesting.
     106However, such workloads are inherently more complex than the applications examined in this thesis, and as such warrant their own work.
     107
    101108\subsection{Hardware}
    102109One challenge that needed to be overcome for this thesis is that the modern x86-64 processors have very few tools to implement fairness.
  • doc/theses/thierry_delisle_PhD/thesis/text/core.tex

    r82a90d4 rddcaff6  
    2424In general, the expectation at the centre of this model is that ready \ats do not interfere with each other but simply share the hardware.
    2525This assumption makes it easier to reason about threading because ready \ats can be thought of in isolation and the effect of the scheduler can be virtually ignored.
    26 This expectation of \at independence means the scheduler is expected to offer two guarantees:
     26This expectation of \at independence means the scheduler is expected to offer two features:
    2727\begin{enumerate}
    28         \item A fairness guarantee: a \at that is ready to run is not prevented by another thread.
    29         \item A performance guarantee: a \at that wants to start or stop running is not prevented by other threads wanting to do the same.
     28        \item A fairness guarantee: a \at that is ready to run is not indefinitely prevented from running by another thread, \ie, starvation freedom. This is discussed further in the next section.
     29        \item A performance goal: given a \at that wants to start running, other threads wanting to do the same do not interfere with it.
    3030\end{enumerate}
    3131
    32 It is important to note that these guarantees are expected only up to a point.
    33 \Glspl{at} that are ready to run should not be prevented from doing so, but they still share the limited hardware resources.
    34 Therefore, the guarantee is considered respected if a \at gets access to a \emph{fair share} of the hardware resources, even if that share is very small.
    35 
    36 Similar to the performance guarantee, the lack of interference among threads is only relevant up to a point.
    37 Ideally, the cost of running and blocking should be constant regardless of contention, but the guarantee is considered satisfied if the cost is not \emph{too high} with or without contention.
     32The performance goal, the lack of interference among threads, is only desired up to a point.
     33Ideally, the cost of running and blocking should be constant regardless of contention, but the goal is considered satisfied if the cost is not \emph{too high} with or without contention.
    3834How much is an acceptable cost is obviously highly variable.
    3935For this document, the performance experimentation attempts to show the cost of scheduling is at worst equivalent to existing algorithms used in popular languages.
    4036This demonstration can be made by comparing applications built in \CFA to applications built with other languages or other models.
    4137Recall programmer expectation is that the impact of the scheduler can be ignored.
    42 Therefore, if the cost of scheduling is competitive with other popular languages, the guarantee is considered achieved.
     38Therefore, if the cost of scheduling is competitive with other popular languages, the goal is considered satisfied.
    4339More precisely the scheduler should be:
    4440\begin{itemize}
    45         \item As fast as other schedulers that are less fair.
    46         \item Faster than other schedulers that have equal or better fairness.
     41        \item As fast as other schedulers without any fairness guarantee.
     42        \item Faster than other schedulers that have equal or stronger fairness guarantees.
    4743\end{itemize}
    4844
    4945\subsection{Fairness Goals}
    50 For this work, fairness is considered to have two strongly related requirements: true starvation freedom and ``fast'' load balancing.
    51 
    52 \paragraph{True starvation freedom} means as long as at least one \proc continues to dequeue \ats, all ready \ats should be able to run eventually, \ie, eventual progress.
     46For this work, fairness is considered to have two strongly related requirements:
     47
     48\paragraph{Starvation freedom} means as long as at least one \proc continues to dequeue \ats, all ready \ats should be able to run eventually, \ie, eventual progress.
     49Starvation freedom can be bounded or unbounded.
     50In the bounded case, all \ats should be able to run within a fixed bound, relative to their own enqueue.
     51In contrast, unbounded starvation freedom only requires that a \at eventually runs.
     52The \CFA scheduler aims to guarantee unbounded starvation freedom.
    5353In any running system, a \proc can stop dequeuing \ats if it starts running a \at that never blocks.
    54 Without preemption, traditional work-stealing schedulers do not have starvation freedom in this case.
    55 Now, this requirement begs the question, what about preemption?
     54Without preemption, traditional work-stealing schedulers do not have starvation freedom, bounded or unbounded.
     55Now, this requirement raises the question, what about preemption?
    5656Generally speaking, preemption happens on the timescale of several milliseconds, which brings us to the next requirement: ``fast'' load balancing.
    5757
    58 \paragraph{Fast load balancing} means that load balancing should happen faster than preemption would normally allow.
    59 For interactive applications that need to run at 60, 90 or 120 frames per second, \ats having to wait for several milliseconds to run are effectively starved.
    60 Therefore load-balancing should be done at a faster pace, one that can detect starvation at the microsecond scale.
    61 With that said, this is a much fuzzier requirement since it depends on the number of \procs, the number of \ats and the general \gls{load} of the system.
     58\paragraph{Fast load balancing} means that, while eventual progress is guaranteed, the timescale on which this progress happens is also important.
     59Indeed, while a scheduler with bounded starvation freedom is beyond the scope of this work, offering a good expected bound in the mathematical sense~\cite{wiki:expected} is desirable.
     60The expected bound on starvation freedom should be tighter than what preemption normally allows.
     61For interactive applications that need to run at 60, 90 or 120 frames per second, \ats having to wait milliseconds to run are effectively starved.
     62Therefore load-balancing should be done at a faster pace: one that is expected to detect starvation at the microsecond scale.
    6263
    6364\subsection{Fairness vs Scheduler Locality} \label{fairnessvlocal}
     
    6869
    6970For a scheduler, having good locality, \ie, having the data local to each \gls{hthrd}, generally conflicts with fairness.
    70 Indeed, good locality often requires avoiding the movement of cache lines, while fairness requires dynamically moving a \at, and as consequence cache lines, to a \gls{hthrd} that is currently available.
    71 Note that this section discusses \emph{internal locality}, \ie, the locality of the data used by the scheduler versus \emph{external locality}, \ie, how scheduling affects the locality of the application's data.
     71Indeed, good locality often requires avoiding the movement of cache lines, while fairness requires dynamically moving a \at, and as a consequence cache lines, to a \gls{hthrd} that is currently available.
     72Note that this section discusses \emph{internal locality}, \ie, the locality of the data used by the scheduler, versus \emph{external locality}, \ie, how scheduling affects the locality of the application's data.
    7273External locality is a much more complicated subject and is discussed in the next section.
    7374
    74 However, I claim that in practice it is possible to strike a balance between fairness and performance because these goals do not necessarily overlap temporally.
    75 Figure~\ref{fig:fair} shows a visual representation of this behaviour.
    76 As mentioned, some unfairness is acceptable; therefore it is desirable to have an algorithm that prioritizes cache locality as long as thread delay does not exceed the execution mental model.
     75However, I claim that in practice it is possible to strike a balance between fairness and performance because these requirements do not necessarily overlap temporally.
     76Figure~\ref{fig:fair} shows a visual representation of this effect.
     77As mentioned, some unfairness is acceptable; for example, once the bounded starvation guarantee is met, additional fairness will not satisfy it \emph{more}.
     78Inversely, once a \at's data is evicted from cache, its locality cannot worsen.
     79Therefore it is desirable to have an algorithm that prioritizes cache locality as long as the fairness guarantee is also satisfied.
    7780
    7881\begin{figure}
     
    8891\subsection{Performance Challenges}\label{pref:challenge}
    8992While there exists a multitude of potential scheduling algorithms, they generally always have to contend with the same performance challenges.
    90 Since these challenges are recurring themes in the design of a scheduler it is relevant to describe the central ones here before looking at the design.
     93Since these challenges are recurring themes in the design of a scheduler it is relevant to describe them here before looking at the scheduler's design.
     94
     95\subsubsection{Latency}
     96The most basic performance metric of a scheduler is scheduling latency.
     97This measures how long it takes for a \at to run once scheduled, including the cost of scheduling itself.
     98This measure includes both the sequential cost of the operation itself and its scalability.
    9199
    92100\subsubsection{Scalability}
    93 The most basic performance challenge of a scheduler is scalability.
    94 Given a large number of \procs and an even larger number of \ats, scalability measures how fast \procs can enqueue and dequeue \ats.
     101Given a large number of \procs and an even larger number of \ats, scalability measures how fast \procs can enqueue and dequeue \ats relative to the available parallelism.
    95102One could expect that doubling the number of \procs would double the rate at which \ats are dequeued, but contention on the internal data structure of the scheduler can diminish the improvements.
    96103While the ready queue itself can be sharded to alleviate the main source of contention, auxiliary scheduling features, \eg counting ready \ats, can also be sources of contention.
     104In Chapter~\ref{microbench}, scalability is measured as $\# procs \times \frac{ns}{ops}$, \ie, the number of \procs times the total time over the total number of operations.
     105Since the total number of operations should scale with the number of \procs, this gives a measure of how much each additional \proc affects the other \procs.
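As a concrete illustration, the metric reduces to a one-line computation (the function name is illustrative, not from the \CFA benchmark code):

```c
// Illustrative computation of the scalability metric described above:
// procs * (total time in ns / total operations). Under perfect scaling,
// the number of operations grows linearly with procs, so this
// per-operation cost stays flat as procs increases.
static double scalability_metric(unsigned procs, double total_ns, double total_ops) {
        return (double)procs * (total_ns / total_ops);
}
```

For example, one \proc completing 100 operations in 1000 ns scores the same as two \procs completing 200 operations in 1000 ns, reflecting perfect scaling; two \procs completing only 100 operations score twice as high, exposing the contention.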
    97106
    98107\subsubsection{Migration Cost}
     
    107116In general, a na\"{i}ve \glsxtrshort{fifo} ready-queue does not scale with increased parallelism from \glspl{hthrd}, resulting in decreased performance.
    108117The problem is a single point of contention when adding/removing \ats.
    109 As shown in the evaluation sections, most production schedulers do scale when adding \glspl{hthrd}.
    110 The solution to this problem is to shard the ready queue: create multiple \emph{sub-queues} forming the logical ready-queue and the sub-queues are accessed by multiple \glspl{hthrd} without interfering.
    111 
    112 Before going into the design of \CFA's scheduler, it is relevant to discuss two sharding solutions that served as the inspiration scheduler in this thesis.
     118As is shown in the evaluation sections, most production schedulers do scale when adding \glspl{hthrd}.
     119The solution to this problem is to shard the ready queue: create multiple \emph{sub-queues} forming the logical ready-queue.
     120The sub-queues are accessed by multiple \glspl{hthrd} without the need for communication.
     121
     122Before going into the design of \CFA's scheduler, it is relevant to discuss two sharding solutions that served as the inspiration for the scheduler in this thesis.
    113123
    114124\subsection{Work-Stealing}
     
    116126As mentioned in \ref{existing:workstealing}, a popular sharding approach for the ready queue is work-stealing.
    117127In this approach, each \gls{proc} has its own local sub-queue and \glspl{proc} only access each other's sub-queue if they run out of work on their local ready-queue.
    118 The interesting aspect of work stealing happens in the steady-state scheduling case, \ie all \glspl{proc} have work and no load balancing is needed.
    119 In this case, work stealing is close to optimal scheduling: it can achieve perfect locality and have no contention.
    120 On the other hand, work-stealing schedulers only attempt to do load-balancing when a \gls{proc} runs out of work.
     128The interesting aspect of work stealing manifests itself in the steady-state scheduling case, \ie all \glspl{proc} have work and no load balancing is needed.
     129In this case, work stealing is close to optimal scheduling latency: it can achieve perfect locality and have no contention.
     130On the other hand, work-stealing only attempts to do load-balancing when a \gls{proc} runs out of work.
    121131This means that the scheduler never balances unfair loads unless they result in a \gls{proc} running out of work.
    122 Chapter~\ref{microbench} shows that, in pathological cases, work stealing can lead to indefinite starvation.
    123 
    124 Based on these observations, the conclusion is that a \emph{perfect} scheduler should behave similarly to work-stealing in the steady-state case, but load balance proactively when the need arises.
     132Chapter~\ref{microbench} shows that, in pathological cases, work stealing can lead to unbounded starvation.
     133
     134Based on these observations, the conclusion is that a \emph{perfect} scheduler should behave similarly to work-stealing in the steady-state case, \ie, avoid migrations in well balanced workloads, but load balance proactively when the need arises.
    125135
    126136\subsection{Relaxed-FIFO}
    127 A different scheduling approach is to create a ``relaxed-FIFO'' queue, as in \cite{alistarh2018relaxed}.
     137A different scheduling approach is the ``relaxed-FIFO'' queue, as in \cite{alistarh2018relaxed}.
    128138This approach forgoes any ownership between \gls{proc} and sub-queue, and simply creates a pool of sub-queues from which \glspl{proc} pick.
    129139Scheduling is performed as follows:
     
    134144Timestamps are added to each element of a sub-queue.
    135145\item
    136 A \gls{proc} randomly tests sub-queues until it has acquired one or two queues.
     146A \gls{proc} randomly tests sub-queues until it has acquired one or two queues, referred to as \newterm{randomly picking} or \newterm{randomly helping}.
    137147\item
    138148If two queues are acquired, the older of the two \ats is dequeued from the front of the acquired queues.
     
    148158However, \glspl{proc} eagerly search for these older elements instead of focusing on specific queues, which negatively affects locality.
    149159
    150 While this scheme has good fairness, its performance suffers.
    151 It requires wide sharding, \eg at least 4 queues per \gls{hthrd}, and finding non-empty queues is difficult when there are few ready \ats.
     160While this scheme has good fairness, its performance can be improved.
     161Wide sharding is generally desired, \eg at least 4 queues per \proc, and randomly picking non-empty queues is difficult when there are few ready \ats.
     162The next sections describe improvements I made to this existing algorithm.
     163However, ultimately the ``relaxed-FIFO'' queue is not used as the basis of the \CFA scheduler.
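The relaxed-FIFO dequeue step described above can be sketched as follows; all names and the stand-in random number generator are illustrative, not the actual implementation from the cited work:

```c
#include <stdint.h>

// Hypothetical sketch of the relaxed-FIFO dequeue step.
typedef struct {
        uint64_t front_ts;  // timestamp of the element at the front
        int size;           // number of queued elements
} subqueue;

// Tiny LCG stand-in for the proc-local pseudo-random number generator.
static unsigned next_rand(unsigned * s) {
        *s = *s * 1103515245u + 12345u;
        return *s >> 16;
}

// Randomly probe the pool until two nonempty sub-queues are found, then
// return the index holding the older (smaller timestamp) front element.
// Assumes at least one sub-queue is nonempty; otherwise this sketch spins.
static int pick_older(const subqueue pool[], int n, unsigned * seed) {
        int a, b;
        do { a = next_rand(seed) % n; } while (pool[a].size == 0);
        do { b = next_rand(seed) % n; } while (pool[b].size == 0);
        return pool[a].front_ts <= pool[b].front_ts ? a : b;
}
```

Dequeuing the older of two randomly chosen fronts is what yields the probabilistic FIFO guarantee: any element left waiting long enough eventually has the oldest timestamp among any sampled pair.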
    152164
    153165\section{Relaxed-FIFO++}
    154 The inherent fairness and good performance with many \ats make the relaxed-FIFO queue a good candidate to form the basis of a new scheduler.
     166The inherent fairness and decent performance with many \ats make the relaxed-FIFO queue a good candidate to form the basis of a new scheduler.
    155167The problem case is workloads where the number of \ats is barely greater than the number of \procs.
    156168In these situations, the wide sharding of the ready queue means most of its sub-queues are empty.
     
    162174
    163175As this is the most obvious challenge, it is worth addressing first.
    164 The obvious solution is to supplement each sharded sub-queue with data that indicates if the queue is empty/nonempty to simplify finding nonempty queues, \ie ready \glspl{at}.
    165 This sharded data can be organized in different forms, \eg a bitmask or a binary tree that tracks the nonempty sub-queues.
    166 Specifically, many modern architectures have powerful bitmask manipulation instructions or searching a binary tree has good Big-O complexity.
     176The seemingly obvious solution is to supplement each sharded sub-queue with data that indicates whether the queue is empty/nonempty.
     177This simplifies finding nonempty queues, \ie ready \glspl{at}.
     178The sharded data can be organized in different forms, \eg a bitmask or a binary tree that tracks the nonempty sub-queues, using a bit or a node per sub-queue, respectively.
     179Specifically, many modern architectures have powerful bitmask manipulation instructions, and searching a binary tree has good Big-O complexity.
    167180However, precisely tracking nonempty sub-queues is problematic.
    168181The reason is that the sub-queues are initially sharded with a width presumably chosen to avoid contention.
    169 However, tracking which ready queue is nonempty is only useful if the tracking data is dense, \ie denser than the sharded sub-queues.
     182However, tracking which ready queue is nonempty is only useful if the tracking data is dense, \ie tracks whether multiple sub-queues are empty.
    170183Otherwise, it does not provide useful information because reading this new data structure risks being as costly as simply picking a sub-queue at random.
    171184But if the tracking mechanism \emph{is} denser than the shared sub-queues, then constant updates invariably create a new source of contention.
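For illustration only, a bitmask variant for up to 64 sub-queues might look as follows (names are hypothetical); note how every emptiness transition touches the single shared word, which is exactly the contention problem described above:

```c
#include <stdatomic.h>
#include <stdint.h>

// Illustrative sketch of dense emptiness tracking with a bitmask.
// Bit i set <=> sub-queue i is nonempty. Every enqueue/dequeue that
// changes emptiness contends on this one shared word.
static _Atomic uint64_t mask;

static void mark_nonempty(int i) { atomic_fetch_or (&mask,  UINT64_C(1) << i); }
static void mark_empty  (int i) { atomic_fetch_and(&mask, ~(UINT64_C(1) << i)); }

// Find some nonempty sub-queue in O(1) with a count-trailing-zeros
// instruction, or -1 if all tracked sub-queues are empty.
static int find_nonempty(void) {
        uint64_t v = atomic_load(&mask);
        return v != 0 ? __builtin_ctzll(v) : -1;
}
```

The lookup is a single `tzcnt`-style instruction, but the `atomic_fetch_or`/`atomic_fetch_and` updates serialize all \procs on one cache line, illustrating why dense tracking trades one bottleneck for another.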
     
    184197
    185198\subsection{Dynamic Entropy}\cite{xkcd:dynamicentropy}
    186 The Relaxed-FIFO approach can be made to handle the case of mostly empty sub-queues by tweaking the \glsxtrlong{prng}.
     199The Relaxed-FIFO approach can be made to handle the case of mostly empty sub-queues by tweaking the \glsxtrlong{prng} that drives the random picking of sub-queues.
    187200The \glsxtrshort{prng} state can be seen as containing a list of all the future sub-queues that will be accessed.
    188201While this concept is not particularly useful on its own, the consequence is that if the \glsxtrshort{prng} algorithm can be run \emph{backwards}, then the state also contains a list of all the sub-queues that were accessed.
    189 Luckily, bidirectional \glsxtrshort{prng} algorithms do exist, \eg some Linear Congruential Generators\cite{wiki:lcg} support running the algorithm backwards while offering good quality and performance.
     202Luckily, bidirectional \glsxtrshort{prng} algorithms do exist, \eg some Linear Congruential Generators~\cite{wiki:lcg} support running the algorithm backwards while offering good quality and performance.
    190203This particular \glsxtrshort{prng} can be used as follows:
    191204\begin{itemize}
     
    193206Each \proc maintains two \glsxtrshort{prng} states, referred to as $F$ and $B$.
    194207\item
    195 When a \proc attempts to dequeue a \at, it picks a sub-queue by running $B$ backwards.
    196 \item
    197 When a \proc attempts to enqueue a \at, it runs $F$ forward picking a sub-queue to enqueue to.
    198 If the enqueue is successful, state $B$ is overwritten with the content of $F$.
     208When a \proc attempts to dequeue a \at, it picks a sub-queue by running its $B$ backwards.
     209\item
     210When a \proc attempts to enqueue a \at, it runs its $F$ forward picking a sub-queue to enqueue to.
     211If the enqueue is successful, the state of its $B$ is overwritten with the content of its $F$.
    199212\end{itemize}
    200213The result is that each \proc tends to dequeue \ats that it has itself enqueued.
    201214When most sub-queues are empty, this technique increases the odds of finding \ats at a very low cost, while also offering an improvement on locality in many cases.
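The scheme above can be sketched with a bidirectional linear congruential generator; the constants here are Knuth's MMIX parameters, chosen for illustration and not necessarily those used in \CFA:

```c
#include <stdint.h>

// Hypothetical sketch of a bidirectional LCG.
// Forward step:  s' = A*s + C          (mod 2^64)
// Backward step: s  = A_inv*(s' - C)   (mod 2^64)
// where A_inv is the modular inverse of A modulo 2^64 (exists since A is odd).
static const uint64_t A = 6364136223846793005ULL;  // Knuth's MMIX multiplier
static const uint64_t C = 1442695040888963407ULL;  // Knuth's MMIX increment

// Newton's iteration for the inverse of an odd number modulo 2^64:
// each step doubles the number of correct low-order bits.
static uint64_t mod_inverse(uint64_t a) {
        uint64_t x = a;                        // a*a == 1 (mod 8): 3 bits correct
        for (int i = 0; i < 6; i += 1) x *= 2 - a * x;  // 3 -> 6 -> ... -> 192 bits
        return x;
}

static uint64_t lcg_forward (uint64_t * s) { return *s = A * *s + C; }
static uint64_t lcg_backward(uint64_t * s) { return *s = mod_inverse(A) * (*s - C); }
```

Running $F$ with `lcg_forward` and $B$ with `lcg_backward` means $B$ replays, in reverse order, exactly the sub-queue indices that $F$ produced, which is what lets a \proc revisit the sub-queues it enqueued to.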
    202215
    203 Tests showed this approach performs better than relaxed-FIFO in many cases.
     216My own tests showed this approach performs better than relaxed-FIFO in many cases.
    204217However, it is still not competitive with work-stealing algorithms.
    205 The fundamental problem is that the constant randomness limits how much locality the scheduler offers.
     218The fundamental problem is that the randomness limits how much locality the scheduler offers.
    206219This becomes problematic both because the scheduler is likely to get cache misses on internal data structures and because migrations become frequent.
    207220Therefore, the attempt to modify the relaxed-FIFO algorithm to behave more like work stealing did not pan out.
     
    214227Before attempting to dequeue from a \proc's sub-queue, the \proc must make some effort to ensure other sub-queues are not being neglected.
    215228To make this possible, \procs must be able to determine which \at has been on the ready queue the longest.
    216 Second, the relaxed-FIFO approach needs timestamps for each \at to make this possible.
     229Second, the relaxed-FIFO approach uses timestamps, denoted TS, for each \at to make this possible.
     230These timestamps can be added to work stealing.
    217231
    218232\begin{figure}
    219233        \centering
    220234        \input{base.pstex_t}
    221         \caption[Base \CFA design]{Base \CFA design \smallskip\newline A pool of sub-queues offers the sharding, two per \proc.
     235        \caption[Base \CFA design]{Base \CFA design \smallskip\newline It uses a pool of sub-queues, with a sharding of two sub-queues per \proc.
    222236        Each \gls{proc} can access all of the sub-queues.
    223237        Each \at is timestamped when enqueued.}
     
    227241Figure~\ref{fig:base} shows the algorithm structure.
    228242This structure is similar to classic work-stealing except the sub-queues are placed in an array so \procs can access them in constant time.
    229 Sharding width can be adjusted based on contention.
    230 Note, as an optimization, the TS of a \at is stored in the \at in front of it, so the first TS is in the array and the last \at has no TS.
     243Sharding can be adjusted based on contention.
     244As an optimization, the timestamp of a \at is stored in the \at in front of it, so the first TS is in the array and the last \at has no TS.
    231245This organization keeps the highly accessed front TSs directly in the array.
    232246When a \proc attempts to dequeue a \at, it first picks a random remote sub-queue and compares its timestamp to the timestamps of its local sub-queue(s).
    233 The oldest waiting \at is dequeued to provide global fairness.
     247The oldest waiting of the compared \ats is dequeued.
     248In this document, picking from a remote sub-queue in this fashion is referred to as ``helping''.
     249
     250The timestamps are measured using the CPU's hardware timestamp counters~\cite{wiki:rdtsc}.
     251These provide a 64-bit counter that tracks the number of cycles since the CPU was powered on.
     252Assuming the CPU runs at less than 5 GHz, this means the 64-bit counter takes over a century to overflow.
     253This is true even on 32-bit CPUs, where the counter is generally still 64-bit.
     254However, on many architectures, the instructions to read the counter do not have any particular ordering guarantees.
     255Since the counter does not depend on any data in the CPU pipeline, there is significant flexibility for the instruction to be executed out of order, which limits the accuracy to a window of code.
     256Finally, another issue that can come up with timestamp counters is synchronization between \glspl{hthrd}.
     257This appears to be mostly a historical concern, as recent CPUs offer stronger synchronization guarantees.
     258For example, Intel supports ``Invariant TSC''~\cite[\S~17.15.1]{MAN:inteldev}, which is guaranteed to be synchronized across \glspl{hthrd}.
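For reference, reading the counter is a single instruction, exposed as an intrinsic by GCC and Clang on x86-64; this sketch is illustrative, not the \CFA runtime's actual code:

```c
#include <stdint.h>
#include <x86intrin.h>  // __rdtsc intrinsic (x86-64, GCC/Clang)

// Minimal sketch of reading the hardware timestamp counter.
// Plain rdtsc has no ordering guarantee, so the read can float within a
// window of surrounding instructions; __rdtscp or a serializing fence
// narrows that window when more precision is needed.
static inline uint64_t timestamp(void) {
        return __rdtsc();
}
```

On a CPU with Invariant TSC, two reads on the same \gls{hthrd} are monotonically nondecreasing, which is the property the scheduler's timestamps rely on.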
    234259
    235260However, this na\"ive implementation has performance problems.
    236 First, it is necessary to have some damping effect on helping.
    237 Random effects like cache misses and preemption can add spurious but short bursts of latency negating the attempt to help.
    238 These bursts can cause increased migrations and make this work-stealing approach slow down to the level of relaxed-FIFO.
     261First, it is necessary to avoid helping when it does not improve fairness.
      262Random effects like cache misses and preemption can add unpredictable but short bursts of latency that do not warrant the cost of helping.
      263These bursts can cause increased migrations, at which point the same locality problems as in the relaxed-FIFO approach start to appear.
    239264
    240265\begin{figure}
     
    246271
    247272A simple solution to this problem is to use an exponential moving average\cite{wiki:ma} (MA) instead of a raw timestamp, as shown in Figure~\ref{fig:base-ma}.
    248 Note that this is more complex because the \at at the head of a sub-queue is still waiting, so its wait time has not ended.
      273Note that this is more complex than it first appears, because the \at at the head of a sub-queue is still waiting, so its wait time has not ended.
    249274Therefore, the exponential moving average is an average of how long each dequeued \at has waited.
    250275To compare sub-queues, the timestamp at the head must be compared to the current time, yielding the best-case wait time for the \at at the head of the queue.
     251276This new wait time is averaged with the stored average.
    252277To further limit \glslink{atmig}{migrations}, a bias can be added to a local sub-queue, where a remote sub-queue is helped only if its moving average is more than $X$ times the local sub-queue's average.
    253 Tests for this approach indicate the choice of the weight for the moving average or the bias is not important, \ie weights and biases of similar \emph{magnitudes} have similar effects.
    254 
    255 With these additions to work stealing, scheduling can be made as fair as the relaxed-FIFO approach, avoiding the majority of unnecessary migrations.
      278Tests for this approach indicate that the precise values for the weight of the moving average and the bias are not important, \ie weights and biases of similar \emph{magnitudes} have similar effects.
     279
     280With these additions to work stealing, scheduling can satisfy the starvation freedom guarantee while suffering much less from unnecessary migrations than the relaxed-FIFO approach.
    256281Unfortunately, the work to achieve fairness has a performance cost, especially when the workload is inherently fair, and hence, there is only short-term unfairness or no starvation.
    257 The problem is that the constant polling, \ie reads, of remote sub-queues generally entails cache misses because the TSs are constantly being updated, \ie, writes.
    258 To make things worst, remote sub-queues that are very active, \ie \ats are frequently enqueued and dequeued from them, lead to higher chances that polling will incur a cache-miss.
     282The problem is that the constant polling, \ie reads, of remote sub-queues generally entails cache misses because the TSs are constantly being updated.
      283To make things worse, remote sub-queues that are very active, \ie where \ats are frequently enqueued and dequeued, lead to higher chances that polling incurs a cache miss.
    259284Conversely, the active sub-queues do not benefit much from helping since starvation is already a non-issue.
     260285This puts the algorithm in the awkward situation of paying a largely unnecessary cost.
     
    264289The problem with polling remote sub-queues is that correctness is critical.
    265290There must be a consensus among \procs on which sub-queues hold which \ats, as the \ats are in constant motion.
    266 Furthermore, since timestamps are used for fairness, it is critical to have a consensus on which \at is the oldest.
     291Furthermore, since timestamps are used for fairness, it is critical that the oldest \ats eventually be recognized as such.
    267292However, when deciding if a remote sub-queue is worth polling, correctness is less of a problem.
    268293Since the only requirement is that a sub-queue is eventually polled, some data staleness is acceptable.
     
    278303        \centering
    279304        \input{base_ts2.pstex_t}
    280         \caption[\CFA design with Redundant Timestamps]{\CFA design with Redundant Timestamps \smallskip\newline An array is added containing a copy of the timestamps.
     305        \caption[\CFA design with Redundant Timestamps]{\CFA design with Redundant Timestamps \smallskip\newline This design uses an array containing a copy of the timestamps.
    281306        These timestamps are written-to with relaxed atomics, so there is no order among concurrent memory accesses, leading to fewer cache invalidations.}
    282307        \label{fig:base-ts2}
     
    285310The correctness argument is somewhat subtle.
    286311The data used for deciding whether or not to poll a queue can be stale as long as it does not cause starvation.
    287 Therefore, it is acceptable if stale data makes queues appear older than they are but appearing fresher can be a problem.
     312Therefore, it is acceptable if stale data makes queues appear older than they are, but appearing fresher can be a problem.
    288313For the timestamps, this means it is acceptable to miss writes to the timestamp since they make the head \at look older.
    289314For the moving average, as long as the operations are just atomic reads/writes, the average is guaranteed to yield a value that is between the oldest and newest values written.
     
    292317With redundant timestamps, this scheduling algorithm achieves both the fairness and performance requirements on most machines.
    293318The problem is that the cost of polling and helping is not necessarily consistent across each \gls{hthrd}.
    294 For example on machines with a CPU containing multiple hyper threads and cores and multiple CPU sockets, cache misses can be satisfied from the caches on the same (local) CPU, or by a CPU on a different (remote) socket.
      319For example, on machines with multiple CPUs, cache misses can be satisfied from the caches on the same (local) CPU, or by the caches on a different (remote) CPU.
     295320Cache misses satisfied by a remote CPU have significantly higher latency than those satisfied by the local CPU.
    296321However, these delays are not specific to systems with multiple CPUs.
     
    312337Figures~\ref{fig:cache-share} and~\ref{fig:cache-noshare} show two different cache topologies that highlight this difference.
    313338In Figure~\ref{fig:cache-share}, all cache misses are either private to a CPU or shared with another CPU.
    314 This means latency due to cache misses is fairly consistent.
    315 In contrast, in Figure~\ref{fig:cache-noshare} misses in the L2 cache can be satisfied by either instance of the L3 cache.
     339This means that latency due to cache misses is fairly consistent.
     340In contrast, in Figure~\ref{fig:cache-noshare}, misses in the L2 cache can be satisfied by either instance of the L3 cache.
    316341However, the memory-access latency to the remote L3 is higher than the memory-access latency to the local L3.
    317 The impact of these different designs on this algorithm is that scheduling only scales well on architectures with a wide L3 cache, similar to Figure~\ref{fig:cache-share}, and less well on architectures with many narrower L3 cache instances, similar to Figure~\ref{fig:cache-noshare}.
     342The impact of these different designs on this algorithm is that scheduling only scales well on architectures with the L3 cache shared across many \glspl{hthrd}, similar to Figure~\ref{fig:cache-share}, and less well on architectures with many L3 cache instances and less sharing, similar to Figure~\ref{fig:cache-noshare}.
    318343Hence, as the number of L3 instances grows, so too does the chance that the random helping causes significant cache latency.
    319344The solution is for the scheduler to be aware of the cache topology.
     
    323348Unfortunately, there is no portable way to discover cache topology, and it is outside the scope of this thesis to solve this problem.
    324349This work uses the cache topology information from Linux's @/sys/devices/system/cpu@ directory.
    325 This leaves the challenge of matching \procs to cache structure, or more precisely identifying which sub-queues of the ready queue are local to which subcomponents of the cache structure.
     350This leaves the challenge of matching \procs to cache structure, or more precisely, identifying which sub-queues of the ready queue are local to which subcomponents of the cache structure.
    326351Once a match is generated, the helping algorithm is changed to add bias so that \procs more often help sub-queues local to the same cache substructure.\footnote{
    327 Note that like other biases mentioned in this section, the actual bias value does not appear to need precise tuning.}
     352Note that like other biases mentioned in this section, the actual bias value does not appear to need precise tuning beyond the order of magnitude.}
    328353
    329354The simplest approach for mapping sub-queues to cache structure is to statically tie sub-queues to CPUs.
     
    335360However, it can still cause some subtle fairness problems in systems with few \procs and many \glspl{hthrd}.
    336361In this case, the large number of sub-queues and the bias against sub-queues tied to different cache substructures make it unlikely that every sub-queue is picked.
    337 To make things worst, the small number of \procs means that few helping attempts are made.
     362To make things worse, the small number of \procs means that few helping attempts are made.
    338363This combination of low selection and few helping attempts allows a \at to become stranded on a sub-queue for a long time until it gets randomly helped.
    339 On a system with 2 \procs, 256 \glspl{hthrd} with narrow cache sharing, and a 100:1 bias, it can take multiple seconds for a \at to get dequeued from a remote queue.
     364On a system with 2 \procs, 256 \glspl{hthrd}, and a 100:1 bias, it can take multiple seconds for a \at to get dequeued from a remote queue.
      365In this scenario, where each \proc attempts to help on 50\% of dequeues, the probability that a given remote sub-queue gets help on any particular dequeue is $\frac{1}{51200}$, so the number of dequeues until it is helped follows a geometric distribution.
      366Therefore, the probability that the remote sub-queue gets help within the next 100,000 dequeues is only 85\%.
      367Assuming dequeues happen every 100ns, there is still a 15\% chance a \at starves for more than 10ms and a 1\% chance it starves for 33.33ms, the maximum latency tolerated for interactive applications.
     368If few \glspl{hthrd} share each cache instance, the probability that a \at is on a remote sub-queue becomes high.
    340369Therefore, a more dynamic match of sub-queues to cache instances is needed.
    341370
     
    343372\label{s:TopologicalWorkStealing}
    344373The approach used in the \CFA scheduler is to have per-\proc sub-queues, but have an explicit data structure to track which cache substructure each sub-queue is tied to.
    345 This tracking requires some finesse because reading this data structure must lead to fewer cache misses than not having the data structure in the first place.
     374This tracking requires some finesse, because reading this data structure must lead to fewer cache misses than not having the data structure in the first place.
    346375A key element, however, is that, like the timestamps for helping, reading the cache instance mapping only needs to give the correct result \emph{often enough}.
    347 Therefore the algorithm can be built as follows: before enqueueing or dequeuing a \at, each \proc queries the CPU id and the corresponding cache instance.
    348 Since sub-queues are tied to \procs, each \proc can then update the cache instance mapped to the local sub-queue(s).
     376Therefore the algorithm can be built as follows: before enqueueing or dequeuing a \at, a \proc queries the CPU id and the corresponding cache instance.
     377Since sub-queues are tied to \procs, a \proc can then update the cache instance mapped to the local sub-queue(s).
    349378To avoid unnecessary cache line invalidation, the map is only written-to if the mapping changes.
    350379
  • doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex

    r82a90d4 rddcaff6  
    11\chapter{Macro-Benchmarks}\label{macrobench}
    2 The previous chapter demonstrated the \CFA scheduler achieves its equivalent performance goal in small and controlled \at-scheduling scenarios.
     2The previous chapter demonstrated that the \CFA scheduler achieves its equivalent performance goal in small and controlled \at-scheduling scenarios.
    33The next step is to demonstrate performance stays true in more realistic and complete scenarios.
    4 Therefore, this chapter exercises both \at and I/O scheduling using two flavours of web servers that demonstrate \CFA performs competitively compared to web servers used in production environments.
     4Therefore, this chapter exercises both \at and I/O scheduling using two flavours of web servers that demonstrate that \CFA performs competitively compared to web servers used in production environments.
    55
    66Web servers are chosen because they offer fairly simple applications that perform complex I/O, both network and disk, and are useful as standalone products.
     
    1010As such, these experiments should highlight the overhead due to any \CFA fairness cost in realistic scenarios.
    1111
     12The most obvious performance metric for web servers is throughput.
      13This metric measures the speed at which the server answers requests and, relatedly, how fast clients can send requests before the server can no longer keep up.
     14Another popular performance metric is \newterm{tail} latency, which indicates some notion of fairness among requests across the experiment, \ie do some requests wait longer than other requests for service?
     15Since many web applications rely on a combination of different queries made in parallel, the latency of the slowest response, \ie tail latency, can dictate a performance perception.
     16
    1217\section{Memcached}
    1318Memcached~\cite{memcached} is an in-memory key-value store used in many production environments, \eg \cite{atikoglu2012workload}.
     
    2631Each CPU has 6 cores and 2 \glspl{hthrd} per core, for a total of 24 \glspl{hthrd}.
    2732\item
      33The machine is configured to run each server on 12 dedicated \glspl{hthrd} and uses 6 of the remaining \glspl{hthrd} for software interrupt handling~\cite{wiki:softirq}, resulting in a maximum CPU utilization of 75\% (18 / 24 \glspl{hthrd}).
     34\item
    2835A CPU has 384 KB, 3 MB and 30 MB of L1, L2 and L3 caches, respectively.
    2936\item
     
    4754\item
    4855For UDP connections, all the threads listen to a single UDP socket for incoming requests.
    49 Threads that are not currently dealing with another request ignore the incoming packet.
     56Threads that are currently dealing with another request ignore the incoming packet.
    5057One of the remaining, non-busy, threads reads the request and sends the response.
    5158This implementation can lead to increased CPU \gls{load} as threads wake from sleep to potentially process the request.
     
    7986\subsection{Throughput} \label{memcd:tput}
    8087This experiment is done by having the clients establish 15,360 total connections, which persist for the duration of the experiment.
    81 The clients then send read and write queries with only 3\% writes (updates), attempting to follow a desired query rate, and the server responds to the desired rate as best as possible.
     88The clients then send read and write queries with 3\% writes (updates), attempting to follow a desired query rate, and the server responds to the desired rate as best as possible.
    8289Figure~\ref{fig:memcd:rate:qps} shows the 3 server versions at different client rates, ``Target \underline{Q}ueries \underline{P}er \underline{S}econd'', and the actual rate, ``Actual QPS'', for all three web servers.
    8390
     
    104111
    105112\subsection{Tail Latency}
    106 Another popular performance metric is \newterm{tail} latency, which indicates some notion of fairness among requests across the experiment, \ie do some requests wait longer than other requests for service?
    107 Since many web applications rely on a combination of different queries made in parallel, the latency of the slowest response, \ie tail latency, can dictate a performance perception.
    108113Figure~\ref{fig:memcd:rate:tail} shows the 99th percentile latency results for the same Memcached experiment.
    109114
    110115Again, each experiment is run 15 times with the median, maximum and minimum plotted with different lines.
    111 As expected, the latency starts low and increases as the server gets close to saturation, at which point, the latency increases dramatically because the web servers cannot keep up with the connection rate so client requests are disproportionally delayed.
     116As expected, the latency starts low and increases as the server gets close to saturation, at which point the latency increases dramatically because the web servers cannot keep up with the connection rate, so client requests are disproportionally delayed.
    112117Because of this dramatic increase, the Y-axis is presented using a log scale.
     113118Note that the graph shows the \emph{target} query rate; the actual response rate is given in Figure~\ref{fig:memcd:rate:qps}, as this is the same underlying experiment.
     
    186191web servers servicing dynamic requests, which read from multiple locations and construct a response, are not as interesting since creating the response takes more time and does not exercise the runtime in a meaningfully different way.}
    187192The static web server experiment compares NGINX~\cite{nginx} with a custom \CFA-based web server developed for this experiment.
    188 
    189 \subsection{NGINX threading}
    190193NGINX is a high-performance, \emph{full-service}, event-driven web server.
    191194It can handle both static and dynamic web content, as well as serve as a reverse proxy and a load balancer~\cite{reese2008nginx}.
    192195This wealth of capabilities comes with a variety of potential configurations, dictating available features and performance.
    193196The NGINX server runs a master process that performs operations such as reading configuration files, binding to ports, and controlling worker processes.
    194 When running as a static web server, it uses an event-driven architecture to service incoming requests.
     197In comparison, the custom \CFA web server was developed specifically with this experiment in mind.
      198However, nothing indicates that NGINX suffers from this increased flexibility.
      199When tuned for performance, NGINX appears to reach the performance limits of the underlying hardware.
     200
     201\subsection{NGINX threading}
     202When running as a static web server, NGINX uses an event-driven architecture to service incoming requests.
    195203Incoming connections are assigned a \emph{stackless} HTTP state machine and worker processes can handle thousands of these state machines.
    196204For the following experiment, NGINX is configured to use @epoll@ to listen for events on these state machines and have each worker process independently accept new connections.
    197 Because of the realities of Linux, see Subsection~\ref{ononblock}, NGINX also maintains a pool of auxiliary threads to handle blocking \io.
      205Because of the realities of Linux (Subsection~\ref{ononblock}), NGINX also maintains a pool of auxiliary threads to handle blocking \io.
    198206The configuration can set the number of worker processes desired, as well as the size of the auxiliary pool.
    199207However, for the following experiments, NGINX is configured to let the master process decide the appropriate number of threads.
     
    262270The computer is booted with only 8 CPUs enabled, which is sufficient to achieve line rate.
    263271\item
      272Both servers are set up with enough parallelism to achieve 100\% CPU utilization, which happens at higher request rates.
     273\item
    264274Each CPU has 64 KB, 256 KiB and 8 MB of L1, L2 and L3 caches respectively.
    265275\item
  • doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex

    r82a90d4 rddcaff6  
    44This chapter presents five different experimental setups for evaluating the basic features of the \CFA, libfibre~\cite{libfibre}, Go, and Tokio~\cite{Tokio} schedulers.
    55All of these systems have a \gls{uthrding} model.
    6 The goal of this chapter is to show that the \CFA scheduler obtains equivalent performance to other, less fair, schedulers through the different experiments.
     6The goal of this chapter is to show, through the different experiments, that the \CFA scheduler obtains equivalent performance to other schedulers with lesser fairness guarantees.
    77Note that only the code of the \CFA tests is shown;
    8 all tests in the other systems are functionally identical and available online~\cite{GITHUB:SchedulingBenchmarks}.
      8all tests in the other systems are functionally identical, available online~\cite{GITHUB:SchedulingBenchmarks}, and submitted to UWSpace with the thesis itself.
    99
    1010\section{Benchmark Environment}\label{microenv}
     
    129129        \caption[Cycle Benchmark on Intel]{Cycle Benchmark on Intel\smallskip\newline Throughput and scalability as a function of \proc count, 5 \ats per cycle, and different cycle counts.
    130130        For throughput, higher is better, for scalability, lower is better.
    131         Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
     131        Each series represents 15 independent runs.
     132        The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
    132133        \label{fig:cycle:jax}
    133134\end{figure}
     
    161162        \caption[Cycle Benchmark on AMD]{Cycle Benchmark on AMD\smallskip\newline Throughput and scalability as a function of \proc count, 5 \ats per cycle, and different cycle counts.
    162163        For throughput, higher is better, for scalability, lower is better.
    163         Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
     164        Each series represents 15 independent runs.
     165        The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
    164166        \label{fig:cycle:nasus}
    165167\end{figure}
     
    177179Looking next at the right column on Intel, Figures~\ref{fig:cycle:jax:low:ops} and \ref{fig:cycle:jax:low:ns} show the results for 1 cycle of 5 \ats for each \proc.
    178180\CFA and Tokio obtain very similar results overall, but Tokio shows more variations in the results.
    179 Go achieves slightly better performance than \CFA and Tokio, but all three display significantly worst performance compared to the left column.
     181Go achieves slightly better performance than \CFA and Tokio, but all three display significantly worse performance compared to the left column.
    180182This decrease in performance is likely due to the additional overhead of the idle-sleep mechanism.
    181183This can either be the result of \procs actually running out of work or simply additional overhead from tracking whether or not there is work available.
     
    185187Looking now at the results for the AMD architecture, Figure~\ref{fig:cycle:nasus}, the results are overall similar to the Intel results, but with close to double the performance, slightly increased variation, and some differences in the details.
    186188Note the maximum of the Y-axis on Intel and AMD differ significantly.
    187 Looking at the left column on AMD, Figures~\ref{fig:cycle:nasus:ops} and \ref{fig:cycle:nasus:ns} all 4 runtimes achieve very similar throughput and scalability.
     189Looking at the left column on AMD, Figures~\ref{fig:cycle:nasus:ops} and \ref{fig:cycle:nasus:ns}, all 4 runtimes achieve very similar throughput and scalability.
    188190However, as the number of \procs grows higher, the results on AMD show notably more variability than on Intel.
    189191The different performance improvements and plateaus are due to cache topology and appear at the expected \proc counts of 64, 128 and 192, for the same reasons as on Intel.
     
    191193This result is different than on Intel, where Tokio behaved like \CFA rather than behaving like Go.
    192194Again, the same performance increase for libfibre is visible when running fewer \ats.
    193 Note, I did not investigate the libfibre performance boost for 1 cycle in this experiment.
     195I did not investigate the libfibre performance boost for 1 cycle in this experiment.
    194196
    195197The conclusion from both architectures is that all of the compared runtimes have fairly equivalent performance for this micro-benchmark.
    196198Clearly, the pathological case with 1 cycle per \proc can affect fairness algorithms managing mostly idle processors, \eg \CFA, but only at high core counts.
    197199In this case, \emph{any} helping is likely to cause a cascade of \procs running out of work and attempting to steal.
    198 For this experiment, the \CFA scheduler has achieved the goal of obtaining equivalent performance to other, less fair, schedulers.
     200For this experiment, the \CFA scheduler has achieved the goal of obtaining equivalent performance to other schedulers with lesser fairness guarantees.
    199201
    200202\section{Yield}
     
    250252        \caption[Yield Benchmark on Intel]{Yield Benchmark on Intel\smallskip\newline Throughput and scalability as a function of \proc count.
    251253        For throughput, higher is better, for scalability, lower is better.
    252         Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
     254        Each series represents 15 independent runs.
     255        The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
    253256        \label{fig:yield:jax}
    254257\end{figure}
     
    309312        \caption[Yield Benchmark on AMD]{Yield Benchmark on AMD\smallskip\newline Throughput and scalability as a function of \proc count.
    310313        For throughput, higher is better, for scalability, lower is better.
    311         Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
     314        Each series represents 15 independent runs.
     315        The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
    312316        \label{fig:yield:nasus}
    313317\end{figure}
     
    317321Looking at the left column first, Figures~\ref{fig:yield:nasus:ops} and \ref{fig:yield:nasus:ns}, \CFA achieves very similar throughput and scaling.
    318322Libfibre still outpaces all other runtimes, but it encounters a performance hit at 64 \procs.
    319 This anomaly suggests some amount of communication between the \procs that the Intel machine is able to mask where the AMD is not once hyperthreading is needed.
     323This anomaly suggests some amount of communication between the \procs that the Intel machine is able to mask where the AMD is not, once hyperthreading is needed.
    320324Go and Tokio still display the same performance collapse as on Intel.
    321325Looking next at the right column on AMD, Figures~\ref{fig:yield:nasus:low:ops} and \ref{fig:yield:nasus:low:ns}, all runtime systems effectively behave the same as they did on the Intel machine.
     
    324328
    325329It is difficult to draw conclusions for this benchmark when runtime systems treat @yield@ so differently.
    326 The win for \CFA is its consistency between the cycle and yield benchmarks making it simpler for programmers to use and understand, \ie the \CFA semantics match with programmer intuition.
     330The win for \CFA is its consistency between the cycle and yield benchmarks, making it simpler for programmers to use and understand, \ie the \CFA semantics match with programmer intuition.
    327331
    328332
     
    333337
    334338The Churn benchmark represents more chaotic executions, where there is more communication among \ats but no relationship between the last \proc on which a \at ran and blocked, and the \proc that subsequently unblocks it.
    335 With processor-specific ready-queues, when a \at is unblocked by a different \proc that means the unblocking \proc must either ``steal'' the \at from another processor or find it on a remote queue.
      339With processor-specific ready-queues, when a \at is unblocked by a different \proc, the unblocking \proc must either ``steal'' the \at from another processor or find it on a remote queue.
    336340This dequeuing results in either contention on the remote queue and/or \glspl{rmr} on the \at data structure.
    337 Hence, this benchmark has performance dominated by the cache traffic as \procs are constantly accessing each other's data.
      341Hence, this benchmark has performance dominated by the cache traffic as \procs are constantly accessing each other's data.
    338342In either case, this benchmark aims to measure how well a scheduler handles these cases since both cases can lead to performance degradation if not handled correctly.
    339343
     
    392396        \caption[Churn Benchmark on Intel]{Churn Benchmark on Intel\smallskip\newline Throughput and scalability as a function of \proc count.
    393397        For throughput, higher is better, for scalability, lower is better.
    394         Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
     398        Each series represents 15 independent runs.
     399        The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
    395400        \label{fig:churn:jax}
    396401\end{figure}
     
    404409Tokio achieves very similar performance to \CFA, with the starting boost, scaling decently until 48 \procs, drops from 48 to 72 \procs, and starts increasing again to 192 \procs.
    405410Libfibre obtains effectively the same results as Tokio with slightly less scaling, \ie the scaling curve is the same but with slightly lower values.
    406 Finally, Go gets the most peculiar results, scaling worst than other runtimes until 48 \procs.
     411Finally, Go gets the most peculiar results, scaling worse than other runtimes until 48 \procs.
    407412At 72 \procs, the results of the Go runtime vary significantly, sometimes scaling, sometimes plateauing.
    408413However, beyond this point Go keeps this level of variation but does not scale further in any of the runs.
    409414
    410 Throughput and scalability are notably worst for all runtimes than the previous benchmarks since there is inherently more communication between processors.
     415Throughput and scalability are notably worse for all runtimes than the previous benchmarks since there is inherently more communication between processors.
    411416Indeed, none of the runtimes reach 40 million operations per second, while in the cycle benchmark all but libfibre reached 400 million operations per second.
    412417Figures~\ref{fig:churn:jax:ns} and \ref{fig:churn:jax:low:ns} show that for all \proc counts, all runtimes produce poor scaling.
    413 However, once the number of \glspl{hthrd} goes beyond a single socket, at 48 \procs, scaling goes from bad to worst and performance completely ceases to improve.
     418However, once the number of \glspl{hthrd} goes beyond a single socket, at 48 \procs, scaling goes from bad to worse and performance completely ceases to improve.
    414419At this point, the benchmark is dominated by inter-socket communication costs for all runtimes.
    415420
     
    457462        \caption[Churn Benchmark on AMD]{Churn Benchmark on AMD\smallskip\newline Throughput and scalability as a function of \proc count.
    458463        For throughput, higher is better, for scalability, lower is better.
    459         Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
     464        Each series represents 15 independent runs.
     465        The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
    460466        \label{fig:churn:nasus}
    461467\end{figure}
     
    600606        \caption[Locality Benchmark on Intel]{Locality Benchmark on Intel\smallskip\newline Throughput and scalability as a function of \proc count.
    601607        For throughput, higher is better, for scalability, lower is better.
    602         Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
     608        Each series represents 15 independent runs.
     609        The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
    603610        \label{fig:locality:jax}
    604611\end{figure}
     
    632639        \caption[Locality Benchmark on AMD]{Locality Benchmark on AMD\smallskip\newline Throughput and scalability as a function of \proc count.
    633640        For throughput, higher is better, for scalability, lower is better.
    634         Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
     641        Each series represents 15 independent runs.
     642        The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
    635643        \label{fig:locality:nasus}
    636644\end{figure}
     
    648656Go still has the same poor performance as on Intel.
    649657
    650 Finally looking at the right column, Figures~\ref{fig:locality:nasus:noshare:ops} and \ref{fig:locality:nasus:noshare:ns}, like on Intel, the same performance inversion is present between libfibre and \CFA/Tokio.
     658Finally, looking at the right column, Figures~\ref{fig:locality:nasus:noshare:ops} and \ref{fig:locality:nasus:noshare:ns}, like on Intel, the same performance inversion is present between libfibre and \CFA/Tokio.
    651659Go still has the same poor performance.
    652660
     
    751759\end{centering}
    752760\caption[Transfer Benchmark on Intel and AMD]{Transfer Benchmark on Intel and AMD\smallskip\newline Average measurement of how long it takes for all \ats to acknowledge the leader \at.
      761For each runtime, the average is calculated over 100'000 transfers, except for Go, which only has 1000 transfers (due to the difference in transfer time).
    753762DNC stands for ``did not complete'', meaning that after 5 seconds of a new leader being decided, some \ats still had not acknowledged the new leader.}
    754763\label{fig:transfer:res}
     
    764773
    765774The first two columns show the results for the semaphore variation on Intel.
    766 While there are some differences in latencies, \CFA is consistently the fastest and Tokio the slowest, all runtimes achieve fairly close results.
    767 Again, this experiment is meant to highlight major differences so latencies within $10\times$ of each other are considered equal.
     775While there are some differences in latencies, with \CFA consistently the fastest and Tokio the slowest, all runtimes achieve fairly close results.
     776Again, this experiment is meant to highlight major differences, so latencies within $10\times$ of each other are considered equal.
    768777
    769778Looking at the next two columns, the results for the yield variation on Intel, the story is very different.
     
    780789Neither Libfibre nor Tokio complete the experiment.
    781790
    782 This experiment clearly demonstrates that \CFA achieves significantly better fairness.
     791This experiment clearly demonstrates that \CFA achieves a stronger fairness guarantee.
    783792The semaphore variation serves as a control, where all runtimes are expected to transfer leadership fairly quickly.
    784793Since \ats block after acknowledging the leader, this experiment effectively measures how quickly \procs can steal \ats from the \proc running the leader.
     
    790799Without \procs stealing from the \proc running the leader, the experiment cannot terminate.
    791800Go manages to complete the experiment because it adds preemption on top of classic work-stealing.
    792 However, since preemption is fairly infrequent, it achieves significantly worst performance.
     801However, since preemption is fairly infrequent, it achieves significantly worse performance.
    793802In contrast, \CFA achieves equivalent performance in both variations, demonstrating very good fairness.
    794803Interestingly, \CFA achieves better delays in the yielding version than in the semaphore version; however, that is likely due to fairness being equivalent while removing the cost of the semaphores and idle sleep.
  • doc/theses/thierry_delisle_PhD/thesis/text/existing.tex

    r82a90d4 rddcaff6  
    22As stated, scheduling is the process of assigning resources to incoming requests, where the common example is assigning available workers to work requests or vice versa.
    33Common scheduling examples in Computer Science are: operating systems and hypervisors schedule available CPUs, NICs schedule available bandwidth, virtual memory and memory allocator schedule available storage, \etc.
    4 Scheduling is also common in most other fields, \eg in assembly lines, assigning parts to line workers is a form of scheduling.
     4Scheduling is also common in most other fields; \eg in assembly lines, assigning parts to line workers is a form of scheduling.
    55
    66In general, \emph{selecting} a scheduling algorithm depends on how much information is available to the scheduler.
     
    88A secondary aspect is how much information can be gathered versus how much information must be given as part of the scheduler input.
    99This information adds to the spectrum of scheduling algorithms, going from static schedulers that are well informed from the start, to schedulers that gather most of the information needed, to schedulers that can only rely on very limited information.
    10 Note, this description includes both information about each request, \eg time to complete or resources needed, and information about the relationships among requests, \eg whether some requests must be completed before another request starts.
    11 
    12 Scheduling physical resources, \eg in an assembly line, is generally amenable to using well-informed scheduling since information can be gathered much faster than the physical resources can be assigned and workloads are likely to stay stable for long periods.
    13 When a faster pace is needed and changes are much more frequent gathering information on workloads, up-front or live, can become much more limiting and more general schedulers are needed.
     10This description includes both information about each request, \eg time to complete or resources needed, and information about the relationships among requests, \eg whether some requests must be completed before another request starts.
     11
     12Scheduling physical resources, \eg in an assembly line, is generally amenable to using well-informed scheduling since information can be gathered much faster than the physical resources can be assigned, and workloads are likely to stay stable for long periods.
      13When a faster pace is needed and changes are much more frequent, gathering information on workloads, up-front or live, can become much more limiting, and more general schedulers are needed.
    1414
    1515\section{Naming Convention}
    16 Scheduling has been studied by various communities concentrating on different incarnations of the same problems.
     16Scheduling has been studied by various communities, concentrating on different incarnations of the same problems.
    1717As a result, there are no standard naming conventions for scheduling that are respected across these communities.
    1818This document uses the term \newterm{\Gls{at}} to refer to the abstract objects being scheduled and the term \newterm{\Gls{proc}} to refer to the concrete objects executing these \ats.
    1919
    2020\section{Static Scheduling}
    21 \newterm{Static schedulers} require \ats dependencies and costs to be explicitly and exhaustively specified prior to scheduling.
     21\newterm{Static schedulers} require \at dependencies and costs to be explicitly and exhaustively specified prior to scheduling.
    2222The scheduler then processes this input ahead of time and produces a \newterm{schedule} the system follows during execution.
    2323This approach is popular in real-time systems since the need for strong guarantees justifies the cost of determining and supplying this information.
     
    2727
    2828\section{Dynamic Scheduling}
    29 \newterm{Dynamic schedulers} determine \at dependencies and costs during scheduling, if at all.
    30 Hence, unlike static scheduling, \at dependencies are conditional and detected at runtime.
    31 This detection takes the form of observing new \ats in the system and determining dependencies from their behaviour, including suspending or halting a \at that dynamically detects unfulfilled dependencies.
     29\newterm{Dynamic schedulers} detect \at dependencies and costs during scheduling, if at all.
     30This detection takes the form of observing new \ats in the system and determining dependencies from their behaviour, where a \at suspends or halts dynamically when it detects unfulfilled dependencies.
    3231Furthermore, each \at has the responsibility of adding dependent \ats back into the system once dependencies are fulfilled.
    3332As a consequence, the scheduler often has an incomplete view of the system, seeing only \ats with no pending dependencies.
     
    3534\subsection{Explicitly Informed Dynamic Schedulers}
    3635While dynamic schedulers may not have an exhaustive list of dependencies for a \at, some information may be available about each \at, \eg expected duration, required resources, relative importance, \etc.
    37 When available, a scheduler can then use this information to direct the scheduling decisions.
     36When available, a scheduler can then use this information to direct scheduling decisions.
    3837For example, when scheduling in a cloud computing context, \ats will commonly have extra information that was manually entered, \eg caps on compute time or \io usage.
    3938However, in the context of user-level threading, most programmers do not determine or even \emph{predict} this information;
     
    4544
    4645\subsubsection{Priority Scheduling}
    47 Common information used by schedulers to direct their algorithm is priorities.
     46A common approach to direct the scheduling algorithm is to add information about \at priority.
    4847Each \at is given a priority, and higher-priority \ats are preferred to lower-priority ones.
    4948The simplest priority scheduling algorithm is to require that every \at have a distinct pre-established priority and always run the available \ats with the highest priority.
     
    8180\paragraph{Task Placement} Another aspect of work stealing that has been studied extensively is the mapping between \at and \proc.
    8281In its simplest form, work stealing assumes that all \procs are interchangeable and therefore the mapping between \at and \proc is not interesting.
    83 However, in real-life architectures there are contexts where different \procs can have different characteristics, which makes some mapping more interesting than others.
     82However, in real-life architectures, there are contexts where different \procs can have different characteristics, which makes some mappings more interesting than others.
    8483A common example where this is statically true is architectures with \glsxtrshort{numa}.
    8584In these cases, it can be relevant to change the scheduler to be cognizant of the topology~\cite{vikranth2013topology,min2011hierarchical}.
     
    8887\paragraph{Complex Machine Architecture} Another aspect that has been examined is how applicable work stealing is to different machine architectures.
    8988This is arguably strongly related to Task Placement but extends into more heterogeneous architectures.
    90 As \CFA offers no particular support for heterogeneous architecture, this is also an area that is less relevant to this thesis.
    91 Although it could be an interesting avenue for future work.
     89As \CFA offers no particular support for heterogeneous architectures, this is also an area that is not examined in this thesis.
      90However, support for concurrency across heterogeneous architectures is an interesting avenue for future work, at which point the literature on this topic and how it relates to scheduling will become relevant.
    9291
    9392\subsection{Theoretical Results}
    9493There is also a large body of research on the theoretical aspects of work stealing. These evaluate, for example, the cost of \glslink{atmig}{migration}~\cite{DBLP:conf/sigmetrics/SquillanteN91,DBLP:journals/pe/EagerLZ86}, how affinity affects performance~\cite{DBLP:journals/tpds/SquillanteL93,DBLP:journals/mst/AcarBB02,DBLP:journals/ipl/SuksompongLS16} and theoretical models for heterogeneous systems~\cite{DBLP:journals/jpdc/MirchandaneyTS90,DBLP:journals/mst/BenderR02,DBLP:conf/sigmetrics/GastG10}.
    95 \cite{DBLP:journals/jacm/BlellochGM99} examines the space bounds of work stealing and \cite{DBLP:journals/siamcomp/BerenbrinkFG03} shows that for under-loaded systems, the scheduler completes its computations in finite time, \ie is \newterm{stable}.
      94Blelloch et al.~\cite{DBLP:journals/jacm/BlellochGM99} examine the space bounds of work stealing and \cite{DBLP:journals/siamcomp/BerenbrinkFG03} shows that for under-loaded systems, the scheduler completes its computations in finite time, \ie is \newterm{stable}.
    9695Others show that work stealing applies to various scheduling contexts~\cite{DBLP:journals/mst/AroraBP01,DBLP:journals/anor/TchiboukdjianGT13,DBLP:conf/isaac/TchiboukdjianGTRB10,DBLP:conf/ppopp/AgrawalLS10,DBLP:conf/spaa/AgrawalFLSSU14}.
    9796\cite{DBLP:conf/ipps/ColeR13} also studied how randomized work-stealing affects false sharing among \ats.
     
    104103Preemption is the idea of interrupting \ats that have been running too long, effectively injecting suspend points into the application.
    105104There are multiple techniques to achieve this effect, but they all aim to guarantee that the suspend points in a \at are never further apart than some fixed duration.
    106 While this helps schedulers guarantee that no \ats unfairly monopolize a worker, preemption can effectively be added to any scheduler.
     105This helps schedulers guarantee that no \ats unfairly monopolize a worker.
     106Preemption can effectively be added to any scheduler.
    107107Therefore, the only interesting aspect of preemption for the design of scheduling is whether to require it.
    108108
     
    110110This section presents a quick overview of several current schedulers.
    111111While these schedulers do not necessarily represent the most recent advances in scheduling, they are what is generally accessible to programmers.
    112 As such, I believe these schedulers are at least as relevant as those presented in published work.
    113 Both Schedulers that operate in kernel space and user space are considered, as both can offer relevant insight for this project.
    114 However, real-time schedulers are not considered, as these have constraints that are much stricter than what is needed for this project.
     112As such, I believe these schedulers are as relevant as those presented in published work.
     113Both schedulers that operate in kernel space and user space are considered, as both can offer relevant insight for this project.
     114However, real-time schedulers aim to guarantee bounded compute time in order to meet deadlines.
     115These deadlines lead to constraints much stricter than the starvation freedom that is needed for this project.
     116As such real-time schedulers are not considered for this work.
    115117
    116118\subsection{Operating System Schedulers}
    117 Operating System Schedulers tend to be fairly complex as they generally support some amount of real time, aim to balance interactive and non-interactive \ats and support multiple users sharing hardware without requiring these users to cooperate.
     119Operating System Schedulers tend to be fairly complex, as they generally support some amount of real time, aim to balance interactive and non-interactive \ats and support multiple users sharing hardware without requiring these users to cooperate.
    118120Here are more details on a few schedulers used in the common operating systems: Linux, FreeBSD, Microsoft Windows and Apple's OS X.
    119121The information is less complete for closed source operating systems.
     
    137139It also periodically balances the load of the system (according to a different heuristic) but uses a simpler work stealing approach.
    138140
    139 \paragraph{Windows(OS)}
     141\paragraph{Windows (OS)}
    140142Microsoft's Operating System's Scheduler~\cite{MAN:windows/scheduler} is a feedback scheduler with priorities.
    141143It supports 32 levels of priorities, some of which are reserved for real-time and privileged applications.
     
    143145The scheduler may also temporarily adjust priorities after certain effects like the completion of I/O requests.
    144146
    145 In~\cite{russinovich2009windows}, Chapter 1 section 2.3 ``Processes, Threads, and Jobs'' discusses the scheduling policy more in-depth.
      147The scheduling policy is discussed in more depth in~\cite{russinovich2009windows}, Chapter 1 section 2.3 ``Processes, Threads, and Jobs''.
    146148Multicore scheduling is based on a combination of priorities and \proc preference.
    147149Each \at is assigned an initial processor using a round-robin policy, called the \at's \newterm{ideal} \proc.
     
    168170\paragraph{Go}\label{GoSafePoint}
    169171Go's scheduler uses a randomized work-stealing algorithm that has a global run-queue (\emph{GRQ}) and each processor (\emph{P}) has both a fixed-size run-queue (\emph{LRQ}) and a high-priority next ``chair'' holding a single element~\cite{GITHUB:go,YTUBE:go}.
    170 Preemption is present, but only at safe points,~\cite{go:safepoints} which are detection code inserted at various frequent access boundaries.
      172Preemption is present, but only at safe points~\cite{go:safepoints}, which is detection code inserted at various frequent access boundaries.
    171173
    172174The algorithm is as follows:
    173175\begin{enumerate}
    174176        \item Once out of 61 times, pick 1 element from the \emph{GRQ}.
    175         \item If there is an item in the ``chair'' pick it.
      177        Otherwise, if there is an item in the ``chair'', pick it.
    176178        \item Else pick an item from the \emph{LRQ}.
    177179        \begin{itemize}
     
    180182        \end{itemize}
    181183\end{enumerate}
     184
      185Chapter~\ref{microbench} uses Go as one of its comparison points in this thesis's performance evaluation.
    182186
    183187\paragraph{Erlang}
     
    224228\paragraph{LibFibre}
    225229LibFibre~\cite{DBLP:journals/pomacs/KarstenB20} is a lightweight user-level threading framework developed at the University of Waterloo.
    226 Similarly to Go, it uses a variation of work stealing with a global queue that has a higher priority than stealing.
    227 Unlike Go, it does not have the high-priority next ``chair'' and does not use randomized work-stealing.
      230It bears a strong resemblance to Go: it uses a variation of work stealing with a global queue that has a higher priority than stealing.
     231Unlike Go, it does not have the high-priority next ``chair'' and its work-stealing is not randomized.
     232
      233Chapter~\ref{microbench} uses LibFibre as one of its comparison points in this thesis's performance evaluation.
  • doc/theses/thierry_delisle_PhD/thesis/text/front.tex

    r82a90d4 rddcaff6  
    8888        \\
    8989        Internal Member: \> Martin Karsten \\
    90         \> Associate Professor, School of Computer Science \\
     90        \> Professor, School of Computer Science \\
    9191        \> University of Waterloo \\
    9292\end{tabbing}
     
    127127Indeed, over-partitioning into small work-units with user threading significantly eases load bal\-ancing, while simultaneously providing advanced synchronization and mutual exclusion capabilities.
    128128To manage these high levels of concurrency, the underlying runtime must efficiently schedule many user threads across a few kernel threads;
    129 which begs the question of how many kernel threads are needed and should the number be dynamically reevaluated.
     129which raises the questions of how many kernel threads are needed and whether the number should be dynamically reevaluated.
    130130Furthermore, scheduling must prevent kernel threads from blocking, otherwise user-thread parallelism drops.
    131 When user-threading parallelism does drop, how and when should idle \glspl{kthrd} be put to sleep to avoid wasting CPU resources.
     131When user-threading parallelism does drop, how and when should idle \glspl{kthrd} be put to sleep to avoid wasting CPU resources?
    132132Finally, the scheduling system must provide fairness to prevent a user thread from monopolizing a kernel thread;
    133133otherwise, other user threads can experience short/long term starvation or kernel threads can deadlock waiting for events to occur on busy kernel threads.
  • doc/theses/thierry_delisle_PhD/thesis/text/intro.tex

    r82a90d4 rddcaff6  
    11\chapter{Introduction}\label{intro}
    22
    3 \Gls{uthrding} (M:N) is gaining popularity over kernel-level threading (1:1) in many programming languages.
     3\Gls{uthrding} (M:N) is gaining popularity over kernel-level threading (1:1) in many programming languages, like Go~\cite{GITHUB:go}, Java's Project Loom~\cite{MAN:project-loom} and Kotlin~\cite{MAN:kotlin}.
    44The user threading approach is often a better mechanism to express complex concurrent applications by efficiently running 10,000+ threads on multicore systems.
    55Indeed, over-partitioning into small work units with user threading significantly eases load bal\-ancing, while simultaneously providing advanced synchronization and mutual exclusion capabilities.
    66To manage these high levels of concurrency, the underlying runtime must efficiently schedule many user threads across a few kernel threads;
    7 which begs the question of how many kernel threads are needed and should the number be dynamically reevaluated.
      7which raises the questions of how many kernel threads are needed and whether the number should be dynamically reevaluated.
    88Furthermore, scheduling must prevent kernel threads from blocking, otherwise user-thread parallelism drops.
    9 When user-threading parallelism does drop, how and when should idle kernel-threads be put to sleep to avoid wasting CPU resources.
     9When user-threading parallelism does drop, how and when should idle kernel-threads be put to sleep to avoid wasting CPU resources?
    1010Finally, the scheduling system must provide fairness to prevent a user thread from monopolizing a kernel thread;
    1111otherwise, other user threads can experience short/long term starvation or kernel threads can deadlock waiting for events to occur on busy kernel threads.
     
    1616Fairness is handled through preemption and/or ad hoc solutions, which leads to coarse-grained fairness with some pathological cases.
    1717
    18 After examining, testing and selecting specific approaches to these scheduling issues, a completely new scheduler was created and tested in the \CFA (C-for-all) user-threading runtime system.
     18After examining, testing and selecting specific approaches to these scheduling issues, a new scheduler was created and tested in the \CFA (C-for-all) user-threading runtime system.
    1919The goal of the new scheduler is to offer increased safety and productivity without sacrificing performance.
    20 The quality of the new scheduler is demonstrated by comparing it with other user-threading work-stealing schedulers with, the aim of showing equivalent or better performance while offering better fairness.
     20The quality of the new scheduler is demonstrated by comparing it with other user-threading work-stealing schedulers with the aim of showing equivalent or better performance while offering better fairness.
    2121
    2222Chapter~\ref{intro} defines scheduling and its general goals.
     
    3232Computer systems share multiple resources across many threads of execution, even on single-user computers like laptops or smartphones.
    3333On a computer system with multiple processors and work units (routines, coroutines, threads, programs, \etc), there exists the problem of mapping many different kinds of work units onto many different kinds of processors efficiently, called \newterm{scheduling}.
    34 Scheduling systems are normally \newterm{open}, meaning new work arrives from an external source or is randomly spawned from an existing work unit.
     34Scheduling systems are normally \newterm{open}, meaning new work arrives from an external source or is randomly spawned from an existing work unit\footnote{
      35Open systems contrast with \newterm{closed} systems, where work never arrives from external sources.
      36This definition is an extension of open/closed systems in the field of thermodynamics.
     37}.
     38
    3539In general, work units without threads, like routines and coroutines, are self-scheduling, while work units with threads, like tasks and programs, are scheduled.
    3640For scheduled work-units, a scheduler takes a sequence of threads and attempts to run them to completion, subject to shared resource restrictions and utilization.
     
    9498\end{enumerate}
    9599
    96 Scheduling is a zero-sum game as computer processors normally have a fixed, maximum number of cycles per unit time.\footnote{
      100At a high level, scheduling is considered a zero-sum game, as computer processors normally have a fixed, maximum number of cycles per unit time.\footnote{
    97101Frequency scaling and turbo-boost add a degree of complexity that can be ignored in this discussion without loss of generality.}
    98 Hence, schedulers are a series of compromises, occasionally with some static or dynamic tuning parameters to enhance specific workload patterns.
     102This almost invariably leads to schedulers needing to pick some \ats over others, opening the door to fairness problems.
      103However, at a lower level, schedulers can make inefficient or incorrect decisions, leading to strictly worse outcomes than what the theoretical zero-sum game suggests.
     104Since it can be difficult to avoid these poor decisions, schedulers are generally a series of compromises, occasionally with some static or dynamic tuning parameters to enhance specific workload patterns.
    99105For example, SQMS has perfect load-balancing but poor affinity and high contention by the processors, because of the single queue.
    100106In contrast, MQMS has poor load-balancing but perfect affinity and no contention, because each processor has its own queue.
     
    113119Specifically, this work concentrates on advanced thread and \glsxtrshort{io} scheduling.
    114120Prior to this work, the \CFA runtime used a strict SQMS \gls{rQ} and provided no nonblocking \glsxtrshort{io} capabilities at the user-thread level.\footnote{
    115 C/\CC only support \glsxtrshort{io} capabilities at the kernel level, which means many \io operations block \glspl{kthrd} reducing parallelism at the user level.}
     121C/\CC only support \glsxtrshort{io} capabilities at the kernel level, which means many \io operations block \glspl{kthrd}, reducing parallelism at the user level.}
    116122
    117123Since \CFA attempts to improve the safety and productivity of C, the new scheduler presented in this thesis attempts to achieve the same goals.
    118 More specifically, safety and productivity for scheduling mean supporting a wide range of workloads so that programmers can rely on progress guarantees (safety) and more easily achieve acceptable performance (productivity).
     124More specifically, safety and productivity for scheduling mean supporting a wide range of workloads, so that programmers can rely on progress guarantees (safety) and more easily achieve acceptable performance (productivity).
    119125The new scheduler also includes support for implicit nonblocking \io, allowing applications to have more user-threads blocking on \io operations than there are \glspl{kthrd}.
    120126To complete the scheduler, an idle sleep mechanism is implemented that significantly reduces wasted CPU cycles, which are then available outside the application.
    121127
    122 As a research project, this work builds exclusively on newer versions of the Linux operating system and gcc/clang compilers.
     128As a research project, this work runs exclusively on newer versions of the Linux operating system and gcc/clang compilers.
    123129The new scheduler implementation uses several optimizations to successfully balance the cost of fairness against performance;
    124130some of these optimizations rely on interesting hardware optimizations only present on modern CPUs.
     
    138144An algorithm for load-balancing and idle sleep of processors, including NUMA awareness.
    139145\item
    140 A mechanism for adding fairness on top of MQMS algorithm through helping, used both for scalable scheduling algorithm and the user-level \glsxtrshort{io}.
      146A mechanism for adding fairness on top of the MQMS algorithm through helping, used both for the scalable scheduling algorithm and the user-level \glsxtrshort{io}.
    141147\item
    142148An optimization of the helping mechanism for load balancing to reduce scheduling costs.
  • doc/theses/thierry_delisle_PhD/thesis/text/io.tex

    r82a90d4 rddcaff6  
    11\chapter{User Level \io}\label{userio}
    22As mentioned in Section~\ref{prev:io}, user-level \io requires multiplexing the \io operations of many \ats onto fewer \glspl{proc} using asynchronous \io operations.
     3I/O operations, among others, generally block the \gls{kthrd} when the operation needs to wait for unavailable resources.
     4When using \gls{uthrding}, this results in the \proc blocking rather than the \at, hindering parallelism and potentially causing deadlocks (see Chapter~\ref{prev:io}).
    35Different operating systems offer various forms of asynchronous operations and, as mentioned in Chapter~\ref{intro}, this work is exclusively focused on the Linux operating system.
    46
     
    1618This mechanism is also crucial in determining when all \ats are blocked and the application \glspl{kthrd} can now block.
    1719
    18 There are three options to monitor file descriptors in Linux:\footnote{
     20There are three options to monitor file descriptors (FD) in Linux:\footnote{
    1921For simplicity, this section omits \lstinline{pselect} and \lstinline{ppoll}.
    2022The difference between these system calls and \lstinline{select} and \lstinline{poll}, respectively, is not relevant for this discussion.}
     
    2527\paragraph{\lstinline{select}} is the oldest of these options, and takes as input a contiguous array of bits, where each bit represents a file descriptor of interest.
    2628Hence, the array length must be as long as the largest FD currently of interest.
    27 On return, it outputs the set motified in place to identify which of the file descriptors changed state.
     29On return, it outputs the set modified in-place to identify which of the file descriptors changed state.
    2830This destructive change means selecting in a loop requires re-initializing the array for each iteration.
    29 Another limit of @select@ is that calls from different \glspl{kthrd} sharing FDs are independent.
     31Another limitation of @select@ is that calls from different \glspl{kthrd} sharing FDs are independent.
    3032Hence, if one \gls{kthrd} is managing the select calls, other threads can only add/remove to/from the manager's interest set through synchronized calls to update the interest set.
    3133However, these changes are only reflected when the manager makes its next call to @select@.
     
    4648However, all three of these I/O systems have limitations.
    4749The @man@ page for @O_NONBLOCK@ mentions that ``[@O_NONBLOCK@] has no effect for regular files and block devices'', which means none of these three system calls are viable multiplexing strategies for these types of \io operations.
    48 Furthermore, TTYs can also be tricky to use since they can take different forms based on how the command is executed.
      50Furthermore, TTYs (FDs connected to a terminal, \eg standard input and output) can also be tricky to use since they can take different forms based on how the command is executed.
    4951For example, @epoll@ rejects FDs pointing to regular files or block devices, which includes @stdin@ when using shell redirections~\cite[\S~3.6]{MAN:bash}, but does not reject shell pipelines~\cite[\S~3.2.3]{MAN:bash}, which includes pipelines into @stdin@.
    5052Finally, none of these are useful solutions for multiplexing \io operations that do not have a corresponding file descriptor and can be awkward for operations using multiple file descriptors.
     
    5254\subsection{POSIX asynchronous I/O (AIO)}
    5355An alternative to @O_NONBLOCK@ is the AIO interface.
    54 Its interface lets programmers enqueue operations to be performed asynchronously by the kernel.
    55 Completions of these operations can be communicated in various ways: either by spawning a new \gls{kthrd}, sending a Linux signal, or polling for completion of one or more operations.
    56 For this work, spawning a new \gls{kthrd} is counterproductive but a related solution is discussed in Section~\ref{io:morethreads}.
    57 Using interrupt handlers can also lead to fairly complicated interactions between subsystems and has a non-trivial cost.
    58 Leaving polling for completion, which is similar to the previous system calls.
    59 AIO only supports read and write operations to file descriptors, it does not have the same limitation as @O_NONBLOCK@, \ie, the file descriptors can be regular files and blocked devices.
    60 It also supports batching multiple operations in a single system call.
    61 
    62 AIO offers two different approaches to polling: @aio_error@ can be used as a spinning form of polling, returning @EINPROGRESS@ until the operation is completed, and @aio_suspend@ can be used similarly to @select@, @poll@ or @epoll@, to wait until one or more requests have been completed.
    63 For \io multiplexing, @aio_suspend@ is the best interface.
    64 However, even if AIO requests can be submitted concurrently, @aio_suspend@ suffers from the same limitation as @select@ and @poll@, \ie, the interest set cannot be dynamically changed while a call to @aio_suspend@ is in progress.
     56Using AIO, programmers can enqueue operations which are to be performed
     57asynchronously by the kernel.
     58The kernel can communicate
     59completions of these operations in three ways:
     60it can spawn a new \gls{kthrd}; send a Linux signal; or
     61userspace can poll for completion of one or more operations.
     62Spawning a new \gls{kthrd} is not consistent with working at the user-level thread level, but Section~\ref{io:morethreads} discusses a related solution.
     63Signals and their associated interrupt handlers can also lead to fairly complicated
     64interactions between subsystems, and they have a non-trivial cost.
     65This leaves a single option: polling for completion---this is similar to the previously discussed
     66system calls.
      67AIO only supports read and write operations to file descriptors; however, it does not have the same limitations as @O_NONBLOCK@, \ie, the file
      68descriptors can be regular files or block devices.
     69AIO also supports batching multiple operations in a single system call.
     70
     71AIO offers two different approaches to polling: @aio_error@ can be used as a spinning form of polling, returning @EINPROGRESS@ until the operation is completed, while @aio_suspend@ can be used similarly to @select@, @poll@ or @epoll@, to wait until one or more requests have been completed.
     72Asynchronous interfaces normally handle more of the complexity than retry-based interfaces, which is convenient for \io multiplexing.
     73However, even if AIO requests can be submitted concurrently, @aio_suspend@ suffers from the same limitation as @select@ and @poll@: the interest set cannot be dynamically changed while a call to @aio_suspend@ is in progress.
    6574AIO also suffers from the limitation of specifying which requests have been completed, \ie programmers have to poll each request in the interest set using @aio_error@ to identify the completed requests.
    6675This limitation means that, like @select@ and @poll@ but not @epoll@, the time needed to examine polling results increases based on the total number of requests monitored, not the number of completed requests.
     
    92101A very recent addition to Linux, @io_uring@~\cite{MAN:io_uring}, is a framework that aims to solve many of the problems listed in the above interfaces.
    93102Like AIO, it represents \io operations as entries added to a queue.
    94 But like @epoll@, new requests can be submitted, while a blocking call waiting for requests to complete, is already in progress.
    95 The @io_uring@ interface uses two ring buffers (referred to simply as rings) at its core: a submit ring to which programmers push \io requests and a completion ring from which programmers poll for completion.
     103But like @epoll@, new requests can be submitted while a blocking call waiting for requests to complete is already in progress.
     104The @io_uring@ interface uses two ring buffers (referred to simply as rings) at its core: a submit ring, to which programmers push \io requests, and a completion ring, from which programmers poll for completion.
    96105
    97106One of the big advantages over the prior interfaces is that @io_uring@ also supports a much wider range of operations.
    98 In addition to supporting reads and writes to any file descriptor like AIO, it supports other operations like @open@, @close@, @fsync@, @accept@, @connect@, @send@, @recv@, @splice@, \etc.
     107In addition to supporting reads and writes to any file descriptor like AIO, it also supports other operations, like @open@, @close@, @fsync@, @accept@, @connect@, @send@, @recv@, @splice@, \etc.
    99108
    100109On top of these, @io_uring@ adds many extras like avoiding copies between the kernel and user space using shared memory, allowing different mechanisms to communicate with device drivers, and supporting chains of requests, \ie, requests that automatically trigger follow-up requests on completion.
     
    106115This approach is used by languages like Go~\cite{GITHUB:go}, frameworks like libuv~\cite{libuv}, and web servers like Apache~\cite{apache} and NGINX~\cite{nginx}, since it has the advantage that it can easily be used across multiple operating systems.
    107116This advantage is especially relevant for languages like Go, which offer a homogeneous \glsxtrshort{api} across all platforms.
    108 As opposed to C, which has a very limited standard \glsxtrshort{api} for \io, \eg, the C standard library has no networking.
     117Contrast this to C, which has a very limited standard \glsxtrshort{api} for \io, \eg, the C standard library has no networking.
    109118
    110119\subsection{Discussion}
    111 These options effectively fall into two broad camps: waiting for \io to be ready versus waiting for \io to complete.
    112 All operating systems that support asynchronous \io must offer an interface along one of these lines, but the details vary drastically.
    113 For example, Free BSD offers @kqueue@~\cite{MAN:bsd/kqueue}, which behaves similarly to @epoll@, but with some small quality of use improvements, while Windows (Win32)~\cite{win:overlap} offers ``overlapped I/O'', which handles submissions similarly to @O_NONBLOCK@ with extra flags on the synchronous system call, but waits for completion events, similarly to @io_uring@.
    114 
    115 For this project, I selected @io_uring@, in large parts because of its generality.
     120These options effectively fall into two broad camps: waiting for \io to be ready, versus waiting for \io to complete.
     121All operating systems that support asynchronous \io must offer an interface along at least one of these lines, but the details vary drastically.
     122For example, FreeBSD offers @kqueue@~\cite{MAN:bsd/kqueue}, which behaves similarly to @epoll@, but with some small quality of life improvements, while Windows (Win32)~\cite{win:overlap} offers ``overlapped I/O'', which handles submissions similarly to @O_NONBLOCK@ with extra flags on the synchronous system call, but waits for completion events, similarly to @io_uring@.
     123
     124For this project, I selected @io_uring@, in large part because of its generality.
    116125While @epoll@ has been shown to be a good solution for socket \io (\cite{Karsten20}), @io_uring@'s transparent support for files, pipes, and more complex operations, like @splice@ and @tee@, make it a better choice as the foundation for a general \io subsystem.
    117126
    118127\section{Event-Engine}
    119128An event engine's responsibility is to use the kernel interface to multiplex many \io operations onto few \glspl{kthrd}.
    120 In concrete terms, this means \ats enter the engine through an interface, the event engine then starts an operation and parks the calling \ats, returning control to the \gls{proc}.
      129In concrete terms, this means \ats enter the engine through an interface; the event engine then starts an operation, parks the calling \ats, and returns control to the \gls{proc}.
    121130The parked \ats are then rescheduled by the event engine once the desired operation has been completed.
    122131
     
    125134Figure~\ref{fig:iouring} shows an overview of an @io_uring@ instance.
    126135Two ring buffers are used to communicate with the kernel: one for submissions~(left) and one for completions~(right).
    127 The submission ring contains entries, \newterm{Submit Queue Entries} (SQE), produced (appended) by the application when an operation starts and then consumed by the kernel.
    128 The completion ring contains entries, \newterm{Completion Queue Entries} (CQE), produced (appended) by the kernel when an operation completes and then consumed by the application.
     136The submission ring contains \newterm{Submit Queue Entries} (SQE), produced (appended) by the application when an operation starts and then consumed by the kernel.
     137The completion ring contains \newterm{Completion Queue Entries} (CQE), produced (appended) by the kernel when an operation completes and then consumed by the application.
    129138The submission ring contains indexes into the SQE array (denoted \emph{S} in the figure) containing entries describing the I/O operation to start;
    130139the completion ring contains entries for the completed I/O operation.
     
    134143        \centering
    135144        \input{io_uring.pstex_t}
    136         \caption[Overview of \lstinline{io_uring}]{Overview of \lstinline{io_uring} \smallskip\newline Two ring buffers are used to communicate with the kernel, one for completions~(right) and one for submissions~(left). The submission ring indexes into a pre-allocated array (denoted \emph{S}) instead.}
     145        \caption[Overview of \lstinline{io_uring}]{Overview of \lstinline{io_uring} \smallskip\newline Two ring buffers are used to communicate with the kernel, one for completions~(right) and one for submissions~(left).
     146        While the completion ring contains plain data, the submission ring contains only references.
     147        These references are indexes into an array (denoted \emph{S}), which is created at the same time as the two rings and is also readable by the kernel.}
    137148        \label{fig:iouring}
    138149\end{figure}
     
    153164Since the head is visible to the kernel, some memory barriers may be required to prevent the compiler from reordering these operations.
    154165Since the submission ring is a regular ring buffer, more than one SQE can be added at once and the head is updated only after all entries are updated.
    155 Note, SQE can be filled and submitted in any order, \eg in Figure~\ref{fig:iouring} the submission order is S0, S3, S2 and S1 has not been submitted.
     166Note, SQE can be filled and submitted in any order, \eg in Figure~\ref{fig:iouring} the submission order is S0, S3, S2. S1 has not been submitted.
    156167\item
    157168The kernel is notified of the change to the ring using the system call @io_uring_enter@.
    158169The number of elements appended to the submission ring is passed as a parameter and the number of elements consumed is returned.
    159 The @io_uring@ instance can be constructed so this step is not required, but this requires elevated privilege.% and an early version of @io_uring@ had additional restrictions.
     170The @io_uring@ instance can be constructed so this step is not required, but this feature requires that the process have elevated privilege.% and an early version of @io_uring@ had additional restrictions.
    160171\end{enumerate}
    161172
     
    165176When operations do complete, the kernel appends a CQE to the completion ring and advances the head of the ring.
    166177Each CQE contains the result of the operation as well as a copy of the @user_data@ field of the SQE that triggered the operation.
    167 It is not necessary to call @io_uring_enter@ to get new events because the kernel can directly modify the completion ring.
    168 The system call is only needed if the application wants to block waiting for operations to complete.
     178The @io_uring_enter@ system call is only needed if the application wants to block waiting for operations to complete or to flush the submission ring.
     179@io_uring@ supports option @IORING_SETUP_SQPOLL@ at creation, which can remove the need for the system call for submissions.
    169180\end{sloppypar}
    170181
     
    179190This restriction means \io request bursts may have to be subdivided and submitted in chunks at a later time.
    180191
    181 An important detail to keep in mind is that just like ``The cloud is just someone else's computer''\cite{xkcd:cloud}, asynchronous operations are just operations using someone else's threads.
     192An important detail to keep in mind is that just like ``The cloud is just someone else's computer''~\cite{xkcd:cloud}, asynchronous operations are just operations using someone else's threads.
    182193Indeed, asynchronous operations can require computation time to complete, which means that if this time is not taken from the thread that triggered the asynchronous operation, it must be taken from some other threads.
    183194In this case, the @io_uring@ operations that cannot be handled directly in the system call must be delegated to some other \gls{kthrd}.
    184195To this end, @io_uring@ maintains multiple \glspl{kthrd} inside the kernel that are not exposed to the user.
    185 Three kinds of operations that can need the \glspl{kthrd}:
     196Three kinds of operations that can need the \glspl{kthrd} are:
    186197
    187198\paragraph{Operations using} @IOSQE_ASYNC@.
     
    190201\paragraph{Bounded operations.}
    191202This is also a fairly simple case. As mentioned earlier in this chapter, [@O_NONBLOCK@] has no effect for regular files and block devices.
    192 @io_uring@ must also take this reality into account by delegating operations on regular files and block devices.
     203Therefore, @io_uring@ handles this case by delegating operations on regular files and block devices.
    193204In fact, @io_uring@ maintains a pool of \glspl{kthrd} dedicated to these operations, which are referred to as \newterm{bounded workers}.
    194205
    195206\paragraph{Unbounded operations that must be retried.}
    196207While operations like reads on sockets can return @EAGAIN@ instead of blocking the \gls{kthrd}, in the case these operations return @EAGAIN@ they must be retried by @io_uring@ once the data is available on the socket.
    197 Since this retry cannot necessarily be done in the system call, @io_uring@ must delegate these calls to a \gls{kthrd}.
     208Since this retry cannot necessarily be done in the system call, \ie, using the application's \gls{kthrd}, @io_uring@ must delegate these calls to \glspl{kthrd} in the kernel.
    198209@io_uring@ maintains a separate pool for these operations.
    199210The \glspl{kthrd} in this pool are referred to as \newterm{unbounded workers}.
      211Once unbounded operations are ready to be retried, one of the workers is woken up and handles the retry inside the kernel.
    200212Unbounded workers are also responsible for handling operations using @IOSQE_ASYNC@.
    201213
     
    212224however, the duration of the system call scales with the number of entries submitted.
    213225The consequence is that the amount of parallelism used to prepare submissions for the next system call is limited.
    214 Beyond this limit, the length of the system call is the throughput limiting factor.
     226Beyond this limit, the length of the system call is the throughput-limiting factor.
    215227I concluded from early experiments that preparing submissions seems to take almost as long as the system call itself, which means that with a single @io_uring@ instance, there is no benefit in terms of \io throughput to having more than two \glspl{hthrd}.
    216228Therefore, the design of the submission engine must manage multiple instances of @io_uring@ running in parallel, effectively sharding @io_uring@ instances.
    217229Since completions are sent to the instance where requests were submitted, all instances with pending operations must be polled continuously\footnote{
    218 As described in Chapter~\ref{practice}, this does not translate into constant CPU usage.}.
    219 Note that once an operation completes, there is nothing that ties it to the @io_uring@ instance that handled it.
    220 Nothing preventing a new operation, with for example the same file descriptor, to use a different @io_uring@ instance.
     230As described in Chapter~\ref{practice}, this does not translate into high CPU usage.}.
     231Note that once an operation completes, there is nothing that ties it to the @io_uring@ instance that handled it --- nothing prevents a new operation, with for example the same file descriptor, from using a different @io_uring@ instance.
    221232
    222233A complicating aspect of submission is @io_uring@'s support for chains of operations, where the completion of an operation triggers the submission of the next operation on the link.
    223234SQEs forming a chain must be allocated from the same instance and must be contiguous in the Submission Ring (see Figure~\ref{fig:iouring}).
    224235The consequence of this feature is that filling SQEs can be arbitrarily complex, and therefore, users may need to run arbitrary code between allocation and submission.
    225 Supporting chains is not a requirement of the \io subsystem, but it is still valuable.
     236For this work, supporting chains is not a requirement of the \CFA \io subsystem, but it is still valuable.
    226237Support for this feature can be fulfilled simply by supporting arbitrary user code between allocation and submission.
    227238
     
    237248To remove this requirement, a \at needs the ability to ``yield to a specific \gls{proc}'', \ie, \park with the guarantee it unparks on a specific \gls{proc}, \ie the \gls{proc} attached to the correct ring.}
    238249From the subsystem's point of view, the allocation and submission are sequential, greatly simplifying both.
    239 In this design, allocation and submission form a partitioned ring buffer as shown in Figure~\ref{fig:pring}.
     250In this design, allocation and submission form a partitioned ring buffer, as shown in Figure~\ref{fig:pring}.
    240251Once added to the ring buffer, the attached \gls{proc} has a significant amount of flexibility with regard to when to perform the system call.
    241252Possible options are: when the \gls{proc} runs out of \ats to run, after running a given number of \ats, \etc.
     
    244255        \centering
    245256        \input{pivot_ring.pstex_t}
    246         \caption[Partitioned ring buffer]{Partitioned ring buffer \smallskip\newline Allocated sqes are appended to the first partition.
     257        \caption[Partitioned ring buffer]{Partitioned ring buffer \smallskip\newline Allocated SQEs are appended to the first partition.
    247258        When submitting, the partition is advanced.
    248259        The kernel considers the partition as the head of the ring.}
     
    253264However, this benefit means \ats submitting \io operations have less flexibility: they cannot \park or yield, and several exceptional cases are handled poorly.
    254265Instances running out of SQEs cannot run \ats wanting to do \io operations.
    255 In this case, the \io \at needs to be moved to a different \gls{proc}, and the only current way of achieving this is to @yield()@ hoping to be scheduled on a different \gls{proc} with free SQEs, which is not guaranteed.
     266In this case, the \io \at needs to be moved to a different \gls{proc}, and the only current way of achieving this is to @yield()@ hoping to be scheduled on a different \gls{proc} with free SQEs, which is not guaranteed to ever occur.
    256267
    257268A more involved version of this approach tries to solve these problems using a pattern called \newterm{helping}.
    258 \ats that cannot submit \io operations, either because of an allocation failure or \glslink{atmig}{migration} to a different \gls{proc} between allocation and submission, create an \io object and add it to a list of pending submissions per \gls{proc} and a list of pending allocations, probably per cluster.
     269\Glspl{at} that cannot submit \io operations, either because of an allocation failure or \glslink{atmig}{migration} to a different \gls{proc} between allocation and submission, create an \io object and add it to a list of pending submissions per \gls{proc} and a list of pending allocations, probably per cluster.
    259270While there is still a strong coupling between \glspl{proc} and @io_uring@ instances, these data structures allow moving \ats to a specific \gls{proc}, when the current \gls{proc} cannot fulfill the \io request.
    260271
     
    263274In this case, the helping solution has the \io \at append an \io object to the submission list of the first \gls{proc}, where the allocation was made.
    264275No other \gls{proc} can help the \at since @io_uring@ instances are strongly coupled to \glspl{proc}.
    265 However, the \io \gls{proc} is unable to help because it is executing the spinning \at resulting in a deadlock.
     276However, the \io \gls{proc} is unable to help because it is executing the spinning \at.
     277This results in a deadlock.
    266278While this example is artificial, in the presence of many \ats, this problem can arise ``in the wild''.
    267279Furthermore, this pattern is difficult to reliably detect and avoid.
     
    274286\subsubsection{Public Instances}
    275287The public approach creates decoupled pools of @io_uring@ instances and processors, \ie without one-to-one coupling.
    276 \ats attempting an \io operation pick one of the available instances and submit the operation to that instance.
     288\Glspl{at} attempting an \io operation pick one of the available instances and submit the operation to that instance.
    277289Since there is no coupling between @io_uring@ instances and \glspl{proc} in this approach, \ats running on more than one \gls{proc} can attempt to submit to the same instance concurrently.
    278290Because @io_uring@ effectively sets the amount of sharding needed to avoid contention on its internal locks, performance in this approach is based on two aspects:
     
    282294\item
    283295The scheme to route \io requests to specific @io_uring@ instances does not introduce contention.
    284 This aspect has oversized importance because it comes into play before the sharding of instances, and as such, all \glspl{hthrd} can contend on the routing algorithm.
     296This aspect is very important because it comes into play before the sharding of instances, and as such, all \glspl{hthrd} can contend on the routing algorithm.
    285297\end{itemize}
    286298
    287299Allocation in this scheme is fairly easy.
    288 Free SQEs, \ie, SQEs that are not currently being used to represent a request, can be written-to safely and have a field called @user_data@ that the kernel only reads to copy to CQEs.
     300Free SQEs, \ie, SQEs that are not currently being used to represent a request, can be written-to safely, and have a field called @user_data@ that the kernel only reads to copy to CQEs.
    289301Allocation also does not require ordering guarantees as all free SQEs are interchangeable.
    290302The only added complexity is that the number of SQEs is fixed, which means allocation can fail.
     
    312324Since CQEs only own a signed 32-bit result, in addition to the copy of the @user_data@ field, all that is needed to communicate the result is a simple future~\cite{wiki:future}.
    313325If the submission side does not designate submitters, polling can also submit all SQEs as it is polling events.
    314 A simple approach to polling is to allocate a \at per @io_uring@ instance and simply let the poller \ats poll their respective instances when scheduled.
    315 
    316 With the pool of SQE instances approach, the big advantage is that it is fairly flexible.
     326A simple approach to polling is to allocate a user-level \at per @io_uring@ instance and simply let the poller \ats poll their respective instances when scheduled.
     327
     328The big advantage of the pool of SQE instances approach is that it is fairly flexible.
    317329It does not impose restrictions on what \ats submitting \io operations can and cannot do between allocations and submissions.
    318330It also can gracefully handle running out of resources, SQEs or the kernel returning @EBUSY@.
     
    320332The routing and allocation algorithm needs to keep track of which ring instances have available SQEs, block incoming requests if no instance is available, prevent barging if \ats are already queued up waiting for SQEs and handle SQEs being freed.
    321333The submission side needs to safely append SQEs to the ring buffer, correctly handle chains, make sure no SQE is dropped or left pending forever, notify the allocation side when SQEs can be reused, and handle the kernel returning @EBUSY@.
    322 Compared to the private-instance approach, all this synchronization has a significant cost and this synchronization is entirely overhead.
     334All this synchronization has a significant cost, compared to the private-instance approach which does not have synchronization costs in most cases.
    323335
    324336\subsubsection{Instance borrowing}
    325337Both of the prior approaches have undesirable aspects that stem from tight or loose coupling between @io_uring@ and \glspl{proc}.
    326 The first approach suffers from tight coupling causing problems when a \gls{proc} does not benefit from the coupling.
    327 The second approach suffers from loose couplings causing operations to have synchronization overhead, which tighter coupling avoids.
     338The first approach suffers from tight coupling, causing problems when a \gls{proc} does not benefit from the coupling.
     339The second approach suffers from loose couplings, causing operations to have synchronization overhead, which tighter coupling avoids.
    328340When \glspl{proc} are continuously issuing \io operations, tight coupling is valuable since it avoids synchronization costs.
    329341However, in unlikely failure cases or when \glspl{proc} are not using their instances, tight coupling is no longer advantageous.
     
    332344While instance borrowing looks similar to work sharing and stealing, I think it is different enough to warrant a different verb to avoid confusion.}
    333345
      346As mentioned later in this section, this approach is not ultimately used, but here is still a high-level outline of the algorithm.
    334347In this approach, each cluster, see Figure~\ref{fig:system}, owns a pool of @io_uring@ instances managed by an \newterm{arbiter}.
    335 When a \at attempts to issue an \io operation, it ask for an instance from the arbiter and issues requests to that instance.
     348When a \at attempts to issue an \io operation, it asks for an instance from the arbiter, and issues requests to that instance.
    336349This instance is now bound to the \gls{proc} the \at is running on.
    337350This binding is kept until the arbiter decides to revoke it, taking back the instance and reverting the \gls{proc} to its initial \io state.
     
    343356        \item The current \gls{proc} does not hold an instance.
    344357        \item The current instance does not have sufficient SQEs to satisfy the request.
    345         \item The current \gls{proc} has a wrong instance, this happens if the submitting \at context-switched between allocation and submission, called \newterm{external submissions}.
     358        \item The current \gls{proc} has a wrong instance.
     359        This happens if the submitting \at context-switched between allocation and submission: \newterm{external submissions}.
    346360\end{enumerate}
    347361However, even when the arbiter is not directly needed, \glspl{proc} need to make sure that their instance ownership is not being revoked, which is accomplished by a lock-\emph{less} handshake.\footnote{
    348 Note the handshake is not lock-\emph{free} since it lacks the proper progress guarantee.}
     362Note the handshake is not lock-\emph{free}~\cite{wiki:lockfree} since it lacks the proper progress guarantee.}
    349363A \gls{proc} raises a local flag before using its borrowed instance and checks if the instance is marked as revoked or if the arbiter has raised its flag.
    350364If not, it proceeds, otherwise it delegates the operation to the arbiter.
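This handshake can be sketched as follows, using hypothetical names and omitting the memory fences an implementation would require:
\begin{cfa}
// raised while this processor uses its borrowed instance
static thread_local volatile bool in_use = false;

bool try_use( instance_t * inst ) {
	in_use = true;                            // 1. raise local flag
	if( inst->revoked || inst->arbiter_flag ) {
		in_use = false;                       // 2. back off ...
		return false;                         //    ... and delegate to the arbiter
	}
	return true;                              // safe to submit to this instance
}
\end{cfa}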
     
    365379However, there is no need to immediately revoke the instance.
    366380External submissions must simply be added to the ring before the next system call, \ie, when the submission ring is flushed.
    367 This means whoever is responsible for the system call, first checks if the instance has any external submissions.
     381This means whoever is responsible for the system call first checks whether the instance has any external submissions.
    368382If so, it asks the arbiter to revoke the instance and add the external submissions to the ring.
    369383
     
    382396
    383397\section{Interface}
    384 The last important part of the \io subsystem is its interface.
     398The final part of the \io subsystem is its interface.
    385399Multiple approaches can be offered to programmers, each with advantages and disadvantages.
    386 The new \io subsystem can replace the C runtime API or extend it, and in the latter case, the interface can go from very similar to vastly different.
    387 The following sections discuss some useful options using @read@ as an example.
     400The new \CFA \io subsystem can replace the C runtime API or extend it, and in the latter case, the interface can go from very similar to vastly different.
     401The following sections discuss some useful options, using @read@ as an example.
    388402The standard Linux interface for C is:
    389403\begin{cfa}
     
    392406
    393407\subsection{Replacement}
    394 Replacing the C \glsxtrshort{api} is the more intrusive and draconian approach.
    395 The goal is to convince the compiler and linker to replace any calls to @read@ to direct them to the \CFA implementation instead of glibc's.
    396 This rerouting has the advantage of working transparently and supporting existing binaries without needing recompilation.
    397 It also offers a, presumably, well known and familiar API that C programmers can simply continue to work with.
    398 However, this approach also entails a plethora of subtle technical challenges, which generally boils down to making a perfect replacement.
    399 If the \CFA interface replaces only \emph{some} of the calls to glibc, then this can easily lead to esoteric concurrency bugs.
    400 Since the gcc ecosystem does not offer a scheme for perfect replacement, this approach was rejected as being laudable but infeasible.
     408Replacing the C \io subsystem is the more intrusive and draconian approach.
     409The goal is to convince the compiler and linker to replace any calls to @read@ by calls to the \CFA implementation instead of glibc's.
     410This rerouting has the advantage of working transparently and supporting existing binaries without necessarily needing recompilation.
     411It also offers a presumably well-known and familiar API that C programmers can simply continue to work with.
     412%However, this approach also entails a plethora of subtle technical challenges, which generally boil down to making a perfect replacement.
     413However, when using this approach, any and all calls to the C \io subsystem must be replaced, since using a mix of the C and \CFA \io subsystems can easily lead to esoteric concurrency bugs.
     414This approach was rejected as being laudable but infeasible.
    401415
    402416\subsection{Synchronous Extension}
    403417Another interface option is to offer an interface different in name only.
     418In this approach, an alternative call is created for each supported system call.
    404419For example:
    405420\begin{cfa}
    406421ssize_t cfa_read(int fd, void *buf, size_t count);
    407422\end{cfa}
     423The new @cfa_read@ would have the same interface, behaviour, and guarantees as the @read@ system call, but allow the runtime system to use user-level blocking instead of kernel-level blocking.
     424
    408425This approach is feasible and still familiar to C programmers.
    409 It comes with the caveat that any code attempting to use it must be recompiled, which is a problem considering the amount of existing legacy C binaries.
     426It comes with the caveat that any code attempting to use it must be modified, which is a problem considering the amount of existing legacy C code.
    410427However, it has the advantage of implementation simplicity.
    411428Finally, there is a certain irony to using a blocking synchronous interface for a feature often referred to as ``non-blocking'' \io.
     
    416433future(ssize_t) read(int fd, void *buf, size_t count);
    417434\end{cfa}
    418 where the generic @future@ is fulfilled when the read completes and it contains the number of bytes read, which may be less than the number of bytes requested.
     435where the generic @future@ is fulfilled when the read completes, with the count of bytes actually read, which may be less than the number of bytes requested.
    419436The data read is placed in @buf@.
    420 The problem is that both the bytes read and data form the synchronization object, not just the bytes read.
    421 Hence, the buffer cannot be reused until the operation completes but the synchronization does not cover the buffer.
     437The problem is that both the byte count and the data form the synchronization object, not just the byte count.
     438Hence, the buffer cannot be reused until the operation completes, but the synchronization on the future does not enforce this.
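To illustrate, assuming a hypothetical @get@ operation that waits on the future, correct use of this interface must not touch the buffer until the future is fulfilled, even though nothing in the types enforces it:
\begin{cfa}
char buf[256];
future(ssize_t) f = read( fd, buf, 256 );
// unsafe to read or reuse buf here: the operation may still be writing into it
ssize_t n = get( f );        // waits for completion and returns the byte count
// only now does buf safely hold the n bytes read
\end{cfa}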
    422439A classical asynchronous API is:
    423440\begin{cfa}
     
    438455However, it is not the most user-friendly option.
    439456It obviously imposes a strong dependency between user code and @io_uring@ but at the same time restricts users to usages that are compatible with how \CFA internally uses @io_uring@.
     457
     458As of writing this document, \CFA offers both a synchronous extension and the first approach to the asynchronous extension:
     459\begin{cfa}
     460ssize_t cfa_read(int fd, void *buf, size_t count);
     461future(ssize_t) async_read(int fd, void *buf, size_t count);
     462\end{cfa}
  • doc/theses/thierry_delisle_PhD/thesis/text/practice.tex

    r82a90d4 rddcaff6  
    22The scheduling algorithm described in Chapter~\ref{core} addresses scheduling in a stable state.
    33This chapter addresses problems that occur when the system state changes.
    4 Indeed the \CFA runtime, supports expanding and shrinking the number of \procs, both manually and, to some extent, automatically.
     4Indeed the \CFA runtime supports expanding and shrinking the number of \procs, both manually and, to some extent, automatically.
    55These changes affect the scheduling algorithm, which must dynamically alter its behaviour.
    66
    7 In detail, \CFA supports adding \procs using the type @processor@, in both RAII and heap coding scenarios.
     7Specifically, \CFA supports adding \procs using the type @processor@, in both RAII and heap coding scenarios.
    88\begin{cfa}
    99{
     
    2626This requirement also means any references into these arrays, \eg pointers or indexes, may need to be updated if elements are moved for compaction or any other reason.
    2727
    28 There are no performance requirements, within reason, for resizing since it is expected to be rare.
    29 However, this operation has strict correctness requirements since updating and idle sleep can easily lead to deadlocks.
    30 It should also avoid as much as possible any effect on performance when the number of \procs remains constant.
    31 This later requirement prohibits naive solutions, like simply adding a global lock to the ready-queue arrays.
     28There are no performance requirements, within reason, for the act of resizing itself, since it is expected to be rare.
     29However, this operation has strict correctness requirements, since updating and idle sleep can easily lead to deadlocks.
     30The resizing mechanism should also avoid, as much as possible, any effect on performance when the number of \procs remains constant.
     31This last requirement prohibits naive solutions, like simply adding a global lock to the ready-queue arrays.
    3232
    3333\subsection{Read-Copy-Update}
    3434One solution is to use the Read-Copy-Update pattern~\cite{wiki:rcu}.
     35This is a very common pattern that avoids large critical sections, which is why it is worth mentioning.
    3536In this pattern, resizing is done by creating a copy of the internal data structures, \eg see Figure~\ref{fig:base-ts2}, updating the copy with the desired changes, and then attempting an Indiana Jones Switch to replace the original with the copy.
    36 This approach has the advantage that it may not need any synchronization to do the switch.
     37This approach has the advantage that it may not need any synchronization to do the switch, depending on how reclamation of the original copy is handled.
    3738However, there is a race where \procs still use the original data structure after the copy is switched.
    3839This race not only requires adding a memory-reclamation scheme, but it also requires that operations made on the stale original version are eventually moved to the copy.
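The switch itself can be sketched as follows, with hypothetical names; the sketch makes the reclamation problem visible, since readers may still hold the old pointer after the swap:
\begin{cfa}
queues_t * cur = queues;             // shared pointer to the current arrays
queues_t * copy = duplicate( cur );
resize( copy, new_count );
// atomic pointer swap; retry if another resize won the race
while( ! CAS( &queues, cur, copy ) ) { /* refresh cur and re-apply changes */ }
// cur can only be reclaimed once no reader can still reference it
\end{cfa}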
     
    6364Acquiring all the local read-locks guarantees mutual exclusion among the readers and the writer, while the wait on the read side prevents readers from continuously starving the writer.
    6465Figure~\ref{f:SpecializedReadersWriterLock} shows the outline for this specialized readers-writer lock.
    65 The lock in nonblocking, so both readers and writers spin while the lock is held.
     66The lock is nonblocking, so both readers and writers spin while the lock is held.
    6667This very wide sharding strategy means that readers have very good locality since they only ever need to access two memory locations.
    6768
     
    9899\section{Idle-Sleep}\label{idlesleep}
    99100While manual resizing of \procs is expected to be rare, the number of \ats can vary significantly over an application's lifetime, which means there are times when there are too few or too many \procs.
    100 For this work, it is the programmer's responsibility to manually create \procs, so if there are too few \procs, the application must address this issue.
     101For this work, it is the application programmer's responsibility to manually create \procs, so if there are too few \procs, the application must address this issue.
    101102This leaves too many \procs when there are not enough \ats for all the \procs to be useful.
    102103These idle \procs cannot be removed because their lifetime is controlled by the application, and only the application knows when the number of \ats may increase or decrease.
     
    108109Because idle sleep is spurious, this data structure has strict performance requirements, in addition to strict correctness requirements.
    109110Next, some mechanism is needed to block \glspl{kthrd}, \eg @pthread_cond_wait@ or a pthread semaphore.
    110 The complexity here is to support \at \glslink{atblock}{parking} and \glslink{atsched}{unparking}, user-level locking, timers, \io operations, and all other \CFA features with minimal complexity.
     111The challenge here is to support user-level locking, timers, \io operations, and all other \CFA features with minimal complexity.
    111112Finally, the scheduler needs a heuristic to determine when to block and unblock an appropriate number of \procs.
    112113However, this third challenge is outside the scope of this thesis because developing a general heuristic is complex enough to justify its own work.
     
    115116An interesting subpart of this heuristic is what to do with bursts of \ats that become ready.
    116117Since waking up a sleeping \proc can have notable latency, multiple \ats may become ready while a single \proc is waking up.
    117 This fact begs the question, if many \procs are available, how many should be woken?
     118This fact raises the question: if many \procs are available, how many should be woken?
    118119If the ready \ats will run longer than the wake-up latency, waking one \proc per \at will offer maximum parallelization.
    119120If the ready \ats will run for a very short time, waking many \procs may be wasteful.
    120 As mentioned, a heuristic to handle these complex cases is outside the scope of this thesis, the behaviour of the scheduler in this particular case is left unspecified.
     121As mentioned, since a heuristic to handle these complex cases is outside the scope of this thesis, the behaviour of the scheduler in this particular case is left unspecified.
    121122
    122123\section{Sleeping}
     
    156157The notifier first makes sure the newly ready \at is visible to \procs searching for \ats, and then attempts to notify an idle \proc.
    157158On the other side, \procs make themselves visible as idle \procs and then search for any \ats they may have missed.
    158 Unlike regular work-stealing, this search must be exhaustive to make sure that pre-existing \at is missed.
     159Unlike regular work-stealing, this search must be exhaustive to make sure that no pre-existing \at is missed.
    159160These steps from both sides guarantee that if the search misses a newly ready \at, then the notifier is guaranteed to see at least one idle \proc.
    160161Conversely, if the notifier does not see any idle \proc, then a \proc is guaranteed to find the new \at in its exhaustive search.
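The ordering on both sides can be sketched as follows, with hypothetical names and the required memory fences elided:
\begin{cfa}
// notifier side
push( ready_queue, thrd );                 // 1. make the new thread visible
if( ! empty( idle_list ) )                 // 2. then look for an idle processor
	wake( pop( idle_list ) );

// idle processor side
push( idle_list, this );                   // 1. make this processor visible as idle
if( search_all_queues() ) cancel_sleep();  // 2. then search exhaustively
else sleep( event_fd );
\end{cfa}
Either the notifier's step 2 sees the idle processor, or the processor's step 2 sees the new thread; the failure case requires both sides to miss, which the ordering precludes.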
     
    188189        \centering
    189190        \input{idle1.pstex_t}
    190         \caption[Basic Idle Sleep Data Structure]{Basic Idle Sleep Data Structure \smallskip\newline Each idle \proc is put unto a doubly-linked stack protected by a lock.
     191        \caption[Basic Idle Sleep Data Structure]{Basic Idle Sleep Data Structure \smallskip\newline Each idle \proc is put onto a doubly-linked stack protected by a lock.
    191192        Each \proc has a private event \lstinline{fd}.}
    192193        \label{fig:idle1}
  • doc/theses/thierry_delisle_PhD/thesis/text/runtime.tex

    r82a90d4 rddcaff6  
    66\Celeven introduced threading features, such as the @_Thread_local@ storage class, and libraries @stdatomic.h@ and @threads.h@.
    77Interestingly, almost a decade after the \Celeven standard, the most recent versions of gcc, clang, and msvc do not support the \Celeven include @threads.h@, indicating no interest in the C11 concurrency approach (possibly because of the recent effort to add concurrency to \CC).
    8 While the \Celeven standard does not state a threading model, the historical association with pthreads suggests implementations would adopt kernel-level threading (1:1)~\cite{ThreadModel}, as for \CC.
     8While the \Celeven standard does not state a threading model, the historical association with pthreads suggests implementations would adopt kernel-level threading (1:1)~\cite{ThreadModel}, as \CC does.
    99This model uses \glspl{kthrd} to achieve parallelism and concurrency. In this model, every thread of computation maps to an object in the kernel.
    1010The kernel then has the responsibility of managing these threads, \eg creating, scheduling, blocking.
     
    3838
    3939\section{\glsxtrshort{io}}\label{prev:io}
    40 Prior to this work, the \CFA runtime did not add any particular support for \glsxtrshort{io} operations. While all \glsxtrshort{io} operations available in C are available in \CFA, \glsxtrshort{io} operations are designed for the POSIX threading model~\cite{pthreads}. Using these 1:1 threading operations in an M:N threading model means \glsxtrshort{io} operations block \glspl{proc} instead of \ats. While this can work in certain cases, it limits the number of concurrent operations to the number of \glspl{proc} rather than \ats. It also means deadlock can occur because all \glspl{proc} are blocked even if at least one \at is ready to run. A simple example of this type of deadlock would be as follows:
     40Prior to this work, the \CFA runtime did not have any particular support for \glsxtrshort{io} operations. While all \glsxtrshort{io} operations available in C are available in \CFA, \glsxtrshort{io} operations are designed for the POSIX threading model~\cite{pthreads}. Using these 1:1 threading operations in an M:N threading model means \glsxtrshort{io} operations block \glspl{proc} instead of \ats. While this can work in certain cases, it limits the number of concurrent operations to the number of \glspl{proc} rather than \ats. It also means deadlock can occur because all \glspl{proc} are blocked even if at least one \at is ready to run. A simple example of this type of deadlock would be as follows:
    4141
    4242\begin{quote}
     
    4949\end{quote}
    5050
    51 Therefore, one of the objectives of this work is to introduce \emph{User-Level \glsxtrshort{io}}, which like \glslink{uthrding}{User-Level \emph{Threading}}, blocks \ats rather than \glspl{proc} when doing \glsxtrshort{io} operations.
     51Therefore, one of the objectives of this work is to introduce \emph{User-Level \glsxtrshort{io}}, which, like \glslink{uthrding}{User-Level \emph{Threading}}, blocks \ats rather than \glspl{proc} when doing \glsxtrshort{io} operations.
    5252This feature entails multiplexing the \glsxtrshort{io} operations of many \ats onto fewer \glspl{proc}.
    5353The multiplexing requires a single \gls{proc} to execute multiple \glsxtrshort{io} operations in parallel.
     
    5656
    5757\section{Interoperating with C}
    58 While \glsxtrshort{io} operations are the classical example of operations that block \glspl{kthrd}, the non-blocking challenge extends to all blocking system-calls. The POSIX standard states~\cite[\S~2.9.1]{POSIX17}:
     58While \glsxtrshort{io} operations are the classical example of operations that block \glspl{kthrd}, the non-blocking challenge extends to all blocking system-calls.
     59The POSIX standard states~\cite[\S~2.9.1]{POSIX17}:
    5960\begin{quote}
    6061All functions defined by this volume of POSIX.1-2017 shall be thread-safe, except that the following functions need not be thread-safe. ... (list of 70+ excluded functions)
    6162\end{quote}
    62 Only UNIX @man@ pages identify whether a library function is thread-safe, and hence, may block on a pthreads lock or system call; hence interoperability with UNIX library functions is a challenge for an M:N threading model.
     63Only UNIX @man@ pages identify whether a library function is thread-safe, and hence, may block on a pthreads lock or system call; hence, interoperability with UNIX library functions is a challenge for an M:N threading model.
    6364
    64 Languages like Go and Java, which have strict interoperability with C\cite{wiki:jni,go:cgo}, can control operations in C by ``sandboxing'' them, \eg a blocking function may be delegated to a \gls{kthrd}. Sandboxing may help towards guaranteeing that the kind of deadlock mentioned above does not occur.
     65Languages like Go and Java, which have strict interoperability with C\cite{wiki:jni,go:cgo}, can control operations in C by ``sandboxing'' them, \eg a blocking function may be delegated to a \gls{kthrd}.
     66Sandboxing may help towards guaranteeing that the kind of deadlock mentioned above does not occur.
    6567
    66 As mentioned in Section~\ref{intro}, \CFA is binary compatible with C and, as such, must support all C library functions. Furthermore, interoperability can happen at the function-call level, inline code, or C and \CFA translation units linked together. This fine-grained interoperability between C and \CFA has two consequences:
     68As mentioned in Section~\ref{intro}, \CFA is binary compatible with C and, as such, must support all C library functions.
     69Furthermore, interoperability can happen at the function-call level, inline code, or C and \CFA translation units linked together.
     70This fine-grained interoperability between C and \CFA has two consequences:
    6771\begin{enumerate}
    6872        \item Precisely identifying blocking C calls is difficult.
     
    7175Because of these consequences, this work does not attempt to ``sandbox'' calls to C.
    7276Therefore, it is possible calls to an unknown library function can block a \gls{kthrd} leading to deadlocks in \CFA's M:N threading model, which would not occur in a traditional 1:1 threading model.
     77Since the blocking call is not known to the runtime, it is not necessarily possible to determine whether or not a deadlock has occurred.
    7378Currently, all M:N thread systems interacting with UNIX without sandboxing suffer from this problem but manage to work very well in the majority of applications.
    74 Therefore, a complete solution to this problem is outside the scope of this thesis.\footnote{\CFA does provide a pthreads emulation, so any library function using embedded pthreads locks is redirected to \CFA user-level locks. This capability further reduces the chances of blocking a \gls{kthrd}.}
     79Therefore, a complete solution to this problem is outside the scope of this thesis.
     80\footnote{\CFA does provide a pthreads emulation, so any library function using embedded pthreads locks is redirected to \CFA user-level locks.
     81This capability further reduces the chances of blocking a \gls{kthrd}.}
     82Chapter~\ref{userio} discusses the interoperability with C chosen and used for the evaluation in Chapter~\ref{macrobench}.