# Changeset ddcaff6

Timestamp:
Nov 24, 2022, 3:41:44 PM (2 months ago)
Branches:
master
Children:
dacd8e6
Parents:
82a90d4
Message:

Last corrections to my thesis... hopefully

Location:
doc/theses/thierry_delisle_PhD/thesis
Files:
12 edited

• ## doc/theses/thierry_delisle_PhD/thesis/fig/base_ts2.fig

r82a90d4 (xfig drawing source; raw drawing commands omitted. The change adjusts label font metrics and renames the label "Timestamps Copies" to "Timestamp Copies"; the remaining labels -- "Ready Threads", "Array of Queues", "Array of Timestamp Copies", "Processors", "TS", "MA" -- are unchanged.)
• ## doc/theses/thierry_delisle_PhD/thesis/local.bib

}
@book{MAN:inteldev,
  key       = {Intel 64 and IA-32 Architectures Software Developer’s Manual},
  title     = {Intel® 64 and IA-32 Architectures Software Developer’s Manual},
  publisher = {Intel{\textregistered}},
  year      = {2016},
  volume    = {3B: System Programming Guide, Part 2},
  url       = {\href{https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html}{https://\-www.intel.com/\-content/\-www/\-us/\-en/\-architecture\--and\--technology/\-64\--ia\--32\--architectures\--software\--developer\--vol\--3b\--part\--2\--manual\-.html}},
}
@misc{MemcachedThreading,
  author = {Oracle},
}
@misc{MAN:kotlink,
  howpublished = {\href{https://kotlinlang.org/docs/multiplatform-mobile-concurrency-and-coroutines.html}{https://\-kotlinlang.org\-/docs\-/multiplatform\--mobile\--concurrency\--and\--coroutines.html}},
}
@misc{MAN:java/fork-join,
  howpublished = {\href{https://www.baeldung.com/java-fork-join}{https://\-www.baeldung.com/\-java-fork-join}},
}
  howpublished = "\href{https://en.wikipedia.org/wiki/Zipf%27s_law}{https://\-en.wikipedia.org/\-wiki/\-Zipf\%27s\-\_law}",
  note = "[Online; accessed 7-September-2022]"
}
@misc{wiki:rdtsc,
  author = "{Wikipedia contributors}",
  title = "Time Stamp Counter --- {W}ikipedia{,} The Free Encyclopedia",
  year = "2022",
  howpublished = "\href{https://en.wikipedia.org/wiki/Time\_Stamp\_Counter}{https://\-en.wikipedia.org/\-wiki/\-Time\-\_Stamp\-\_Counter}",
  note = "[Online; accessed 14-November-2022]"
}
@misc{wiki:lockfree,
  author = "{Wikipedia contributors}",
  title = "Non-blocking algorithm --- {W}ikipedia{,} The Free Encyclopedia",
  year = "2022",
  howpublished = "\href{https://en.wikipedia.org/wiki/Non-blocking_algorithm}{https://en.wikipedia.org/\-wiki/Non\--blocking\-\_algorithm}",
  note = "[Online; accessed 22-November-2022]"
}
@misc{wiki:expected,
  author = "{Wikipedia contributors}",
  title = "Expected value --- {W}ikipedia{,} The Free Encyclopedia",
  year = "2022",
  howpublished = "\href{https://en.wikipedia.org/wiki/Expected_value}{https://en.wikipedia.org/\-wiki/\-Expected\-\_value}",
  note = "[Online; accessed 22-November-2022]"
}
@misc{wiki:softirq,
  author = "{Wikipedia contributors}",
  title = "Interrupt --- {W}ikipedia{,} The Free Encyclopedia",
  year = "2022",
  howpublished = "\href{https://en.wikipedia.org/wiki/Interrupt}{https://en.wikipedia.org/\-wiki/\-Interrupt}",
  note = "[Online; accessed 24-November-2022]"
}
• ## doc/theses/thierry_delisle_PhD/thesis/text/conclusion.tex

Spinning up internal kernel threads to handle blocking scenarios is what developers already do outside of the kernel, and managing these threads adds a significant burden to the system. Nonblocking I/O should not be handled in this way. Presumably, this is better handled by Windows's ``overlapped I/O'', however porting \CFA to Windows is far beyond the scope of this work.

\section{Goals}
As \CFA aims to increase productivity and safety of C, while maintaining its performance, this places a huge burden on the \CFA runtime to achieve these goals. Productivity and safety manifest in removing scheduling pitfalls from the efficient usage of the threading runtime. Performance manifests in making efficient use of the underlying kernel threads that provide indirect access to the CPUs.

I am aware there is a host of low-power research that could be tapped here.

\subsection{CPU Workloads}
A performance consideration related to idle sleep is CPU utilization. CPU utilization generally becomes an issue for workloads that are compute bound but where the dependencies among \ats can prevent the scheduler from easily achieving high utilization. Examining such workloads in the context of scheduling would be interesting. However, such workloads are inherently more complex than the applications examined in this thesis, and as such warrant their own work.

\subsection{Hardware}
One challenge that needed to be overcome for this thesis is that modern x86-64 processors have very few tools to implement fairness.
• ## doc/theses/thierry_delisle_PhD/thesis/text/core.tex

In general, the expectation at the centre of this model is that ready \ats do not interfere with each other but simply share the hardware. This assumption makes it easier to reason about threading because ready \ats can be thought of in isolation and the effect of the scheduler can be virtually ignored. This expectation of \at independence means the scheduler is expected to offer two features:
\begin{enumerate}
\item A fairness guarantee: a \at that is ready to run is not prevented by another thread indefinitely, \ie, starvation freedom. This is discussed further in the next section.
\item A performance goal: given a \at that wants to start running, other threads wanting to do the same do not interfere with it.
\end{enumerate}

It is important to note that these guarantees are expected only up to a point. \Glspl{at} that are ready to run should not be prevented from doing so, but they still share the limited hardware resources. Therefore, the guarantee is considered respected if a \at gets access to a \emph{fair share} of the hardware resources, even if that share is very small.

The performance goal, the lack of interference among threads, is only desired up to a point. Ideally, the cost of running and blocking should be constant regardless of contention, but the goal is considered satisfied if the cost is not \emph{too high} with or without contention. How much is an acceptable cost is obviously highly variable. For this document, the performance experimentation attempts to show the cost of scheduling is at worst equivalent to existing algorithms used in popular languages. This demonstration can be made by comparing applications built in \CFA to applications built with other languages or other models. Recall that the programmer expectation is that the impact of the scheduler can be ignored. Therefore, if the cost of scheduling is competitive with other popular languages, the goal is considered satisfied. More precisely, the scheduler should be:
\begin{itemize}
\item As fast as other schedulers without any fairness guarantee.
\item Faster than other schedulers that have equal or stronger fairness guarantees.
\end{itemize}

\subsection{Fairness Goals}
For this work, fairness is considered to have two strongly related requirements:

\paragraph{Starvation freedom} means as long as at least one \proc continues to dequeue \ats, all ready \ats should be able to run eventually, \ie, eventual progress. Starvation freedom can be bounded or unbounded. In the bounded case, all \ats should be able to run within a fixed bound, relative to their own enqueue, whereas unbounded starvation freedom only requires that a \at eventually run. The \CFA scheduler aims to guarantee unbounded starvation freedom. In any running system, a \proc can stop dequeuing \ats if it starts running a \at that never blocks. Without preemption, traditional work-stealing schedulers do not have starvation freedom, bounded or unbounded. Now, this requirement raises the question, what about preemption? Generally speaking, preemption happens on the timescale of several milliseconds, which brings us to the next requirement: ``fast'' load balancing.

\paragraph{Fast load balancing} means that while eventual progress is guaranteed, it is important to mention on which timescale this progress is expected to happen. Indeed, while a scheduler with bounded starvation freedom is beyond the scope of this work, offering a good expected bound in the mathematical sense~\cite{wiki:expected} is desirable. The expected bound on starvation freedom should be tighter than what preemption normally allows. For interactive applications that need to run at 60, 90 or 120 frames per second, \ats having to wait milliseconds to run are effectively starved. Therefore load-balancing should be done at a faster pace: one that is expected to detect starvation at the microsecond scale.

\subsection{Fairness vs Scheduler Locality}
\label{fairnessvlocal}
For a scheduler, having good locality, \ie, having the data local to each \gls{hthrd}, generally conflicts with fairness. Indeed, good locality often requires avoiding the movement of cache lines, while fairness requires dynamically moving a \at, and as a consequence cache lines, to a \gls{hthrd} that is currently available. Note that this section discusses \emph{internal locality}, \ie, the locality of the data used by the scheduler, versus \emph{external locality}, \ie, how scheduling affects the locality of the application's data. External locality is a much more complicated subject and is discussed in the next section.
However, I claim that in practice it is possible to strike a balance between fairness and performance because these requirements do not necessarily overlap temporally. Figure~\ref{fig:fair} shows a visual representation of this effect. As mentioned, some unfairness is acceptable; for example, once the bounded starvation guarantee is met, additional fairness will not satisfy it \emph{more}. Inversely, once a \at's data is evicted from cache, its locality cannot worsen. Therefore it is desirable to have an algorithm that prioritizes cache locality as long as the fairness guarantee is also satisfied.

\begin{figure}

\subsection{Performance Challenges}\label{pref:challenge}
While there exists a multitude of potential scheduling algorithms, they generally always have to contend with the same performance challenges. Since these challenges are recurring themes in the design of a scheduler, it is relevant to describe them here before looking at the scheduler's design.

\subsubsection{Latency}
The most basic performance metric of a scheduler is scheduling latency. This measures how long it takes for a \at to run once scheduled, including the cost of scheduling itself. This measure includes both the sequential cost of the operation itself, but also its scalability.

\subsubsection{Scalability}
The most basic performance challenge of a scheduler is scalability. Given a large number of \procs and an even larger number of \ats, scalability measures how fast \procs can enqueue and dequeue \ats relative to the available parallelism. One could expect that doubling the number of \procs would double the rate at which \ats are dequeued, but contention on the internal data structure of the scheduler can diminish the improvements. While the ready queue itself can be sharded to alleviate the main source of contention, auxiliary scheduling features, \eg counting ready \ats, can also be sources of contention. In Chapter~\ref{microbench}, scalability is measured as $\# procs \times \frac{ns}{ops}$, \ie, the number of \procs times the total time over the total number of operations. Since the total number of operations should scale with the number of \procs, this gives a measure of how much each additional \proc affects the other \procs.

\subsubsection{Migration Cost}

In general, a na\"{i}ve \glsxtrshort{fifo} ready-queue does not scale with increased parallelism from \glspl{hthrd}, resulting in decreased performance. The problem is a single point of contention when adding/removing \ats.
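To make the contention point concrete, here is a minimal sketch in plain C with a pthread mutex (purely illustrative, not the \CFA runtime's code) of such a na\"{i}ve single-lock FIFO ready queue; every enqueue and dequeue serializes on the one lock, so adding \procs only adds waiting:

```c
#include <pthread.h>
#include <stddef.h>

/* Illustrative only: a naive FIFO ready queue guarded by a single lock.
 * Every processor enqueues and dequeues through the same mutex, which
 * becomes the single point of contention discussed above. */
struct thread_desc { struct thread_desc *next; /* ... scheduler state ... */ };

struct ready_queue {
	pthread_mutex_t lock;
	struct thread_desc *head;
	struct thread_desc **tail;    /* must be initialized to &head */
};

void rq_push(struct ready_queue *q, struct thread_desc *t) {
	pthread_mutex_lock(&q->lock);     /* all processors serialize here ... */
	t->next = NULL;
	*q->tail = t;
	q->tail  = &t->next;
	pthread_mutex_unlock(&q->lock);
}

struct thread_desc *rq_pop(struct ready_queue *q) {
	pthread_mutex_lock(&q->lock);     /* ... and here */
	struct thread_desc *t = q->head;
	if (t) {
		q->head = t->next;
		if (q->head == NULL) q->tail = &q->head;
	}
	pthread_mutex_unlock(&q->lock);
	return t;
}
```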
As is shown in the evaluation sections, most production schedulers do scale when adding \glspl{hthrd}. The solution to this problem is to shard the ready queue: create multiple \emph{sub-queues} forming the logical ready-queue. The sub-queues are accessed by multiple \glspl{hthrd} without the need for communication. Before going into the design of \CFA's scheduler, it is relevant to discuss two sharding solutions that served as the inspiration for the scheduler in this thesis.

\subsection{Work-Stealing}
As mentioned in \ref{existing:workstealing}, a popular sharding approach for the ready queue is work-stealing. In this approach, each \gls{proc} has its own local sub-queue and \glspl{proc} only access each other's sub-queue if they run out of work on their local ready-queue. The interesting aspect of work stealing manifests itself in the steady-state scheduling case, \ie all \glspl{proc} have work and no load balancing is needed. In this case, work stealing is close to optimal scheduling latency: it can achieve perfect locality and have no contention. On the other hand, work-stealing only attempts to do load-balancing when a \gls{proc} runs out of work. This means that the scheduler never balances unfair loads unless they result in a \gls{proc} running out of work. Chapter~\ref{microbench} shows that, in pathological cases, work stealing can lead to unbounded starvation. Based on these observations, the conclusion is that a \emph{perfect} scheduler should behave similarly to work-stealing in the steady-state case, \ie, avoid migrations in well-balanced workloads, but load balance proactively when the need arises.

\subsection{Relaxed-FIFO}
A different scheduling approach is the ``relaxed-FIFO'' queue, as in \cite{alistarh2018relaxed}. This approach forgoes any ownership between \gls{proc} and sub-queue, and simply creates a pool of sub-queues from which \glspl{proc} pick. Scheduling is performed as follows:
\begin{itemize}
\item Timestamps are added to each element of a sub-queue.
\item A \gls{proc} randomly tests sub-queues until it has acquired one or two queues, referred to as \newterm{randomly picking} or \newterm{randomly helping}.
\item If two queues are acquired, the older of the two \ats is dequeued from the front of the acquired queues.
\end{itemize}
However, \glspl{proc} eagerly search for these older elements instead of focusing on specific queues, which negatively affects locality.
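The following sketch outlines the ``pick two, take the older'' dequeue just described. It is a simplified illustration, not the algorithm from \cite{alistarh2018relaxed} nor the \CFA runtime: head timestamps are read without holding the lock, so the choice may be stale, which the scheme tolerates.

```c
#include <stdint.h>
#include <stdlib.h>
#include <pthread.h>

#define NQUEUES 256                    /* illustrative: wide sharding */

struct thread_desc { uint64_t ts; struct thread_desc *next; };
struct subqueue    { pthread_mutex_t lock; struct thread_desc *head; struct thread_desc **tail; };

extern struct subqueue queues[NQUEUES];
struct thread_desc *subqueue_pop(struct subqueue *q);   /* a locked FIFO pop, as sketched earlier */

struct thread_desc *relaxed_fifo_dequeue(void) {
	for (;;) {
		struct subqueue *a = &queues[rand() % NQUEUES];
		struct subqueue *b = &queues[rand() % NQUEUES];
		/* Unlocked peek at the head timestamps; an empty queue counts as "newest". */
		uint64_t ta = a->head ? a->head->ts : UINT64_MAX;
		uint64_t tb = b->head ? b->head->ts : UINT64_MAX;
		struct subqueue *pick = (ta <= tb) ? a : b;        /* older timestamp wins */
		if (pick->head == NULL) continue;                  /* both empty: try again */
		struct thread_desc *t = subqueue_pop(pick);        /* may lose a race and return NULL */
		if (t) return t;
	}
}
```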
While this scheme has good fairness, its performance can be improved. Wide sharding is generally desired, \eg at least 4 queues per \proc, and randomly picking non-empty queues is difficult when there are few ready \ats. The next sections describe improvements I made to this existing algorithm. However, ultimately the ``relaxed-FIFO'' queue is not used as the basis of the \CFA scheduler.

\section{Relaxed-FIFO++}
The inherent fairness and decent performance with many \ats make the relaxed-FIFO queue a good candidate to form the basis of a new scheduler. The problem case is workloads where the number of \ats is barely greater than the number of \procs. In these situations, the wide sharding of the ready queue means most of its sub-queues are empty. As this is the most obvious challenge, it is worth addressing first.

The seemingly obvious solution is to supplement each sharded sub-queue with data that indicates whether the queue is empty/nonempty. This simplifies finding nonempty queues, \ie ready \glspl{at}. The sharded data can be organized in different forms, \eg a bitmask or a binary tree that tracks the nonempty sub-queues, using a bit or a node per sub-queue, respectively. Specifically, many modern architectures have powerful bitmask manipulation instructions, and searching a binary tree has good Big-O complexity. However, precisely tracking nonempty sub-queues is problematic. The reason is that the sub-queues are initially sharded with a width presumably chosen to avoid contention. However, tracking which ready queue is nonempty is only useful if the tracking data is dense, \ie it tracks whether multiple sub-queues are empty. Otherwise, it does not provide useful information because reading this new data structure risks being as costly as simply picking a sub-queue at random. But if the tracking mechanism \emph{is} denser than the sharded sub-queues, then constant updates invariably create a new source of contention.

\subsection{Dynamic Entropy}\cite{xkcd:dynamicentropy}
The Relaxed-FIFO approach can be made to handle the case of mostly empty sub-queues by tweaking the \glsxtrlong{prng} that drives the random picking of sub-queues. The \glsxtrshort{prng} state can be seen as containing a list of all the future sub-queues that will be accessed. While this concept is not particularly useful on its own, the consequence is that if the \glsxtrshort{prng} algorithm can be run \emph{backwards}, then the state also contains a list of all the sub-queues that were accessed.
Luckily, bidirectional \glsxtrshort{prng} algorithms do exist, \eg some Linear Congruential Generators~\cite{wiki:lcg} support running the algorithm backwards while offering good quality and performance. This particular \glsxtrshort{prng} can be used as follows:
\begin{itemize}
\item Each \proc maintains two \glsxtrshort{prng} states, referred to as $F$ and $B$.
\item When a \proc attempts to dequeue a \at, it picks a sub-queue by running its $B$ backwards.
\item When a \proc attempts to enqueue a \at, it runs its $F$ forward, picking a sub-queue to enqueue to. If the enqueue is successful, the state of its $B$ is overwritten with the content of its $F$.
\end{itemize}
The result is that each \proc tends to dequeue \ats that it has itself enqueued. When most sub-queues are empty, this technique increases the odds of finding \ats at a very low cost, while also offering an improvement on locality in many cases.
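A minimal sketch of one way to realize this, assuming a 64-bit LCG with illustrative constants (this is not the \CFA runtime's generator): because the multiplier is odd, it has a multiplicative inverse modulo $2^{64}$, so the generator can be stepped backwards as well as forwards.

```c
#include <stdint.h>

static const uint64_t A = 0xd1342543de82ef95ull;   /* odd multiplier, illustrative */
static const uint64_t C = 0x2545f4914f6cdd1dull;   /* increment, illustrative */

/* Inverse of an odd 'a' modulo 2^64 via Newton's iteration: 5 steps suffice. */
static uint64_t inv_mod64(uint64_t a) {
	uint64_t x = a;                          /* correct to 3 low bits for odd a */
	for (int i = 0; i < 5; i++) x *= 2 - a * x;
	return x;
}

static inline uint64_t lcg_forward(uint64_t *s)  { *s = A * *s + C; return *s; }
static inline uint64_t lcg_backward(uint64_t *s) { *s = inv_mod64(A) * (*s - C); return *s; }

/* Per-processor usage as described above: F picks sub-queues for enqueues,
 * B replays them (newest first) for dequeues, and a successful enqueue copies F into B. */
struct proc_rng { uint64_t F, B; };

static inline unsigned pick_enqueue(struct proc_rng *p, unsigned nqueues) {
	return lcg_forward(&p->F) % nqueues;
}
static inline void enqueue_succeeded(struct proc_rng *p) { p->B = p->F; }
static inline unsigned pick_dequeue(struct proc_rng *p, unsigned nqueues) {
	unsigned q = p->B % nqueues;             /* most recently enqueued-to sub-queue first */
	lcg_backward(&p->B);
	return q;
}
```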
My own tests showed this approach performs better than relaxed-FIFO in many cases. However, it is still not competitive with work-stealing algorithms. The fundamental problem is that the randomness limits how much locality the scheduler offers. This becomes problematic both because the scheduler is likely to get cache misses on internal data structures and because migrations become frequent. Therefore, the attempt to modify the relaxed-FIFO algorithm to behave more like work stealing did not pan out.

Before attempting to dequeue from a \proc's sub-queue, the \proc must make some effort to ensure other sub-queues are not being neglected. To make this possible, \procs must be able to determine which \at has been on the ready queue the longest. Second, the relaxed-FIFO approach uses timestamps, denoted TS, for each \at to make this possible. These timestamps can be added to work stealing.

\begin{figure}
\centering
\input{base.pstex_t}
\caption[Base \CFA design]{Base \CFA design \smallskip\newline It uses a pool of sub-queues, with a sharding of two sub-queues per \proc. Each \gls{proc} can access all of the sub-queues. Each \at is timestamped when enqueued.}

Figure~\ref{fig:base} shows the algorithm structure. This structure is similar to classic work-stealing except the sub-queues are placed in an array so \procs can access them in constant time. Sharding can be adjusted based on contention. As an optimization, the timestamp of a \at is stored in the \at in front of it, so the first TS is in the array and the last \at has no TS. This organization keeps the highly accessed front TSs directly in the array.

When a \proc attempts to dequeue a \at, it first picks a random remote sub-queue and compares its timestamp to the timestamps of its local sub-queue(s). The oldest waiting of the compared \ats is dequeued. In this document, picking from a remote sub-queue in this fashion is referred to as ``helping''.

The timestamps are measured using the CPU's hardware timestamp counters~\cite{wiki:rdtsc}. These provide a 64-bit counter that tracks the number of cycles since the CPU was powered on. Assuming the CPU runs at less than 5 GHz, this means that the 64-bit counter takes over a century before overflowing. This is true even on 32-bit CPUs, where the counter is generally still 64-bit. However, on many architectures, the instructions to read the counter do not have any particular ordering guarantees. Since the counter does not depend on any data in the CPU pipeline, there is significant flexibility for the read to be executed out of order, which limits the accuracy of the timestamp to a window of code. Finally, another issue that can come up with timestamp counters is synchronization between \glspl{hthrd}. This appears to be mostly a historical concern, as recent CPUs offer more synchronization guarantees. For example, Intel supports ``Invariant TSC''~\cite[\S~17.15.1]{MAN:inteldev}, which is guaranteed to be synchronized across \glspl{hthrd}.

However, this na\"ive implementation has performance problems. First, it is necessary to avoid helping when it does not improve fairness. Random effects like cache misses and preemption can add unpredictable but short bursts of latency that do not warrant the cost of helping. These bursts can cause increased migrations, at which point the same locality problems as in the relaxed-FIFO approach start to appear.

\begin{figure}

A simple solution to this problem is to use an exponential moving average~\cite{wiki:ma} (MA) instead of a raw timestamp, as shown in Figure~\ref{fig:base-ma}. Note that this is more complex than it may appear because the \at at the head of a sub-queue is still waiting, so its wait time has not ended. Therefore, the exponential moving average is an average of how long each dequeued \at has waited. To compare sub-queues, the timestamp at the head must be compared to the current time, yielding the best-case wait time for the \at at the head of the queue. This new wait time is averaged with the stored average. To further limit \glslink{atmig}{migrations}, a bias can be added to a local sub-queue, where a remote sub-queue is helped only if its moving average is more than $X$ times the local sub-queue's average.
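The following sketch shows one plausible shape for this heuristic, with illustrative constants and a simplified combination of head wait and average (not the \CFA runtime's code). It assumes x86-64 with GCC or Clang for the __rdtsc() intrinsic used as the cycle counter.

```c
#include <stdint.h>
#include <x86intrin.h>                 /* __rdtsc(), x86-64 with GCC/Clang */

#define MA_WEIGHT 16                   /* illustrative: each new sample contributes 1/16 */
#define HELP_BIAS 4                    /* illustrative: remote must look 4x older */

struct subqueue_stats {
	uint64_t head_ts;              /* timestamp of the thread at the head (0 if empty) */
	uint64_t avg_wait;             /* exponential moving average of past wait times */
};

/* Estimated age of a sub-queue: the best-case wait of the head thread,
 * blended with the history of how long past threads waited. */
static inline uint64_t est_wait(const struct subqueue_stats *q, uint64_t now) {
	uint64_t head_wait = q->head_ts ? now - q->head_ts : 0;
	return (head_wait + q->avg_wait) / 2;
}

/* Called after dequeuing a thread that waited 'waited' cycles. */
static inline void update_avg(struct subqueue_stats *q, uint64_t waited) {
	q->avg_wait += ((int64_t)waited - (int64_t)q->avg_wait) / MA_WEIGHT;
}

/* Help the remote sub-queue only if it looks much older than the local one. */
static inline int should_help(const struct subqueue_stats *local,
                              const struct subqueue_stats *remote) {
	uint64_t now = __rdtsc();
	return est_wait(remote, now) > HELP_BIAS * est_wait(local, now);
}
```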
Tests for this approach indicate the precise values for the weight of the moving average and the bias are not important, \ie weights and biases of similar \emph{magnitudes} have similar effects. With these additions to work stealing, scheduling can satisfy the starvation freedom guarantee while suffering much less from unnecessary migrations than the relaxed-FIFO approach.

Unfortunately, the work to achieve fairness has a performance cost, especially when the workload is inherently fair, and hence, there is only short-term unfairness or no starvation. The problem is that the constant polling, \ie reads, of remote sub-queues generally entails cache misses because the TSs are constantly being updated. To make things worse, remote sub-queues that are very active, \ie \ats are frequently enqueued and dequeued from them, lead to higher chances that polling will incur a cache-miss. Conversely, the active sub-queues do not benefit much from helping since starvation is already a non-issue. This puts this algorithm in the awkward situation of paying for a largely unnecessary cost.

The problem with polling remote sub-queues is that correctness is critical. There must be a consensus among \procs on which sub-queues hold which \ats, as the \ats are in constant motion. Furthermore, since timestamps are used for fairness, it is critical that the oldest \ats eventually be recognized as such. However, when deciding if a remote sub-queue is worth polling, correctness is less of a problem. Since the only requirement is that a sub-queue is eventually polled, some data staleness is acceptable.

\centering
\input{base_ts2.pstex_t}
\caption[\CFA design with Redundant Timestamps]{\CFA design with Redundant Timestamps \smallskip\newline This design uses an array containing a copy of the timestamps. These timestamps are written-to with relaxed atomics, so there is no order among concurrent memory accesses, leading to fewer cache invalidations.}
\label{fig:base-ts2}

The correctness argument is somewhat subtle. The data used for deciding whether or not to poll a queue can be stale as long as it does not cause starvation. Therefore, it is acceptable if stale data makes queues appear older than they are, but appearing fresher can be a problem. For the timestamps, this means it is acceptable to miss writes to the timestamp since they make the head \at look older. For the moving average, as long as the operations are just atomic reads/writes, the average is guaranteed to yield a value that is between the oldest and newest values written.

With redundant timestamps, this scheduling algorithm achieves both the fairness and performance requirements on most machines.
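A minimal sketch of the redundant-timestamp array using C11 relaxed atomics (illustrative; the \CFA runtime has its own atomics and layout). Stale reads are tolerated because, at worst, a missed update makes a queue look older than it is, which never causes starvation:

```c
#include <stdatomic.h>
#include <stdint.h>

#define NQUEUES 256                         /* illustrative sharding width */

static _Atomic uint64_t ts_copy[NQUEUES];   /* dense array of per-sub-queue head timestamps */

/* Owner side: called after the head of sub-queue 'q' changes. */
static inline void publish_ts(unsigned q, uint64_t head_ts) {
	atomic_store_explicit(&ts_copy[q], head_ts, memory_order_relaxed);
}

/* Any processor: cheap, possibly stale peek used only to decide whether helping
 * sub-queue 'q' is worthwhile; the authoritative timestamp stays in the queue. */
static inline uint64_t peek_ts(unsigned q) {
	return atomic_load_explicit(&ts_copy[q], memory_order_relaxed);
}
```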
The problem is that the cost of polling and helping is not necessarily consistent across each \gls{hthrd}. For example, on machines with multiple CPUs, cache misses can be satisfied from the caches on the same (local) CPU, or by the caches on a different (remote) CPU. Cache misses satisfied by a remote CPU have significantly higher latency than those satisfied by the local CPU. However, these delays are not specific to systems with multiple CPUs. Figures~\ref{fig:cache-share} and~\ref{fig:cache-noshare} show two different cache topologies that highlight this difference. In Figure~\ref{fig:cache-share}, all cache misses are either private to a CPU or shared with another CPU. This means that latency due to cache misses is fairly consistent. In contrast, in Figure~\ref{fig:cache-noshare}, misses in the L2 cache can be satisfied by either instance of the L3 cache. However, the memory-access latency to the remote L3 is higher than the memory-access latency to the local L3.

The impact of these different designs on this algorithm is that scheduling only scales well on architectures with the L3 cache shared across many \glspl{hthrd}, similar to Figure~\ref{fig:cache-share}, and less well on architectures with many L3 cache instances and less sharing, similar to Figure~\ref{fig:cache-noshare}. Hence, as the number of L3 instances grows, so too does the chance that the random helping causes significant cache latency. The solution is for the scheduler to be aware of the cache topology. Unfortunately, there is no portable way to discover cache topology, and it is outside the scope of this thesis to solve this problem. This work uses the cache topology information from Linux's @/sys/devices/system/cpu@ directory. This leaves the challenge of matching \procs to cache structure, or more precisely, identifying which sub-queues of the ready queue are local to which subcomponents of the cache structure. Once a match is generated, the helping algorithm is changed to add bias so that \procs more often help sub-queues local to the same cache substructure.\footnote{Note that like other biases mentioned in this section, the actual bias value does not appear to need precise tuning beyond the order of magnitude.}

The simplest approach for mapping sub-queues to cache structure is to statically tie sub-queues to CPUs. However, it can still cause some subtle fairness problems in systems with few \procs and many \glspl{hthrd}. In this case, the large number of sub-queues and the bias against sub-queues tied to different cache substructures make it unlikely that every sub-queue is picked.
To make things worse, the small number of \procs means that few helping attempts are made. This combination of low selection and few helping attempts allows a \at to become stranded on a sub-queue for a long time until it gets randomly helped. On a system with 2 \procs, 256 \glspl{hthrd}, and a 100:1 bias, it can take multiple seconds for a \at to get dequeued from a remote queue. In this scenario, where each \proc attempts to help on 50\% of dequeues, the probability that a remote sub-queue gets help is $\frac{1}{51200}$ and follows a geometric distribution. Therefore the probability that the remote sub-queue gets help within the next 100'000 dequeues is only 85\%. Assuming dequeues happen every 100ns, there is still a 15\% chance a \at could starve for more than 10ms and a 1\% chance the \at starves for 33.33ms, the maximum latency tolerated for interactive applications. If few \glspl{hthrd} share each cache instance, the probability that a \at is on a remote sub-queue becomes high. Therefore, a more dynamic match of sub-queues to cache instances is needed.

\label{s:TopologicalWorkStealing}
The approach used in the \CFA scheduler is to have per-\proc sub-queues, but have an explicit data structure to track which cache substructure each sub-queue is tied to. This tracking requires some finesse, because reading this data structure must lead to fewer cache misses than not having the data structure in the first place. A key element, however, is that, like the timestamps for helping, reading the cache instance mapping only needs to give the correct result \emph{often enough}. Therefore the algorithm can be built as follows: before enqueueing or dequeuing a \at, a \proc queries the CPU id and the corresponding cache instance. Since sub-queues are tied to \procs, a \proc can then update the cache instance mapped to the local sub-queue(s). To avoid unnecessary cache line invalidation, the map is only written-to if the mapping changes.
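As a rough illustration of that last step, the sketch below (plain C, Linux-specific, not the \CFA runtime's code) refreshes a per-sub-queue cache-instance map before each enqueue/dequeue; the cpu_to_cache table is assumed to have been filled once at startup from the /sys/devices/system/cpu hierarchy.

```c
#define _GNU_SOURCE
#include <sched.h>                     /* sched_getcpu() */
#include <stdatomic.h>

#define NCPUS   256                    /* illustrative */
#define NQUEUES 256                    /* illustrative: one sub-queue per processor */

static unsigned cpu_to_cache[NCPUS];            /* filled once at startup from sysfs */
static _Atomic unsigned queue_cache[NQUEUES];   /* cache instance each sub-queue is tied to */

/* Called by a processor on its own sub-queue before enqueueing or dequeuing. */
static inline void refresh_mapping(unsigned my_queue) {
	int cpu = sched_getcpu();                   /* which hardware thread is this? */
	if (cpu < 0 || cpu >= NCPUS) return;        /* on failure, keep the old mapping */
	unsigned cache = cpu_to_cache[cpu];
	/* Only write when the mapping actually changed, avoiding needless invalidations. */
	if (atomic_load_explicit(&queue_cache[my_queue], memory_order_relaxed) != cache)
		atomic_store_explicit(&queue_cache[my_queue], cache, memory_order_relaxed);
}
```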
• ## doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex

\chapter{Macro-Benchmarks}\label{macrobench}
The previous chapter demonstrated that the \CFA scheduler achieves its equivalent performance goal in small and controlled \at-scheduling scenarios. The next step is to demonstrate that performance stays true in more realistic and complete scenarios. Therefore, this chapter exercises both \at and I/O scheduling using two flavours of web servers that demonstrate that \CFA performs competitively compared to web servers used in production environments. Web servers are chosen because they offer fairly simple applications that perform complex I/O, both network and disk, and are useful as standalone products. As such, these experiments should highlight the overhead due to any \CFA fairness cost in realistic scenarios.

The most obvious performance metric for web servers is throughput. This metric generally measures the speed at which the server answers and, relatedly, how fast clients can send requests before the server can no longer keep up.

\section{Memcached}
Memcached~\cite{memcached} is an in-memory key-value store used in many production environments, \eg \cite{atikoglu2012workload}.

\item Each CPU has 6 cores and 2 \glspl{hthrd} per core, for a total of 24 \glspl{hthrd}.
\item The machine is configured to run each server on 12 dedicated \glspl{hthrd} and uses 6 of the remaining \glspl{hthrd} for the software interrupt handling~\cite{wiki:softirq}, resulting in a maximum CPU utilization of 75\% (18 / 24 \glspl{hthrd}).
\item A CPU has 384 KB, 3 MB and 30 MB of L1, L2 and L3 caches, respectively.
\item
\item For UDP connections, all the threads listen to a single UDP socket for incoming requests. Threads that are currently dealing with another request ignore the incoming packet. One of the remaining, non-busy, threads reads the request and sends the response. This implementation can lead to increased CPU \gls{load} as threads wake from sleep to potentially process the request.

\subsection{Throughput} \label{memcd:tput}
This experiment is done by having the clients establish 15,360 total connections, which persist for the duration of the experiment. The clients then send read and write queries with 3\% writes (updates), attempting to follow a desired query rate, and the server responds to the desired rate as best as possible. Figure~\ref{fig:memcd:rate:qps} shows the 3 server versions at different client rates, ``Target \underline{Q}ueries \underline{P}er \underline{S}econd'', and the actual rate, ``Actual QPS'', for all three web servers.
\subsection{Tail Latency}
Another popular performance metric is \newterm{tail} latency, which indicates some notion of fairness among requests across the experiment, \ie do some requests wait longer than other requests for service? Since many web applications rely on a combination of different queries made in parallel, the latency of the slowest response, \ie tail latency, can dictate a performance perception. Figure~\ref{fig:memcd:rate:tail} shows the 99th percentile latency results for the same Memcached experiment. Again, each experiment is run 15 times with the median, maximum and minimum plotted with different lines. As expected, the latency starts low and increases as the server gets close to saturation, at which point the latency increases dramatically because the web servers cannot keep up with the connection rate, so client requests are disproportionally delayed. Because of this dramatic increase, the Y-axis is presented using a log scale. Note that the graph shows the \emph{target} query rate; the actual response rate is given in Figure~\ref{fig:memcd:rate:qps}, as this is the same underlying experiment.

web servers servicing dynamic requests, which read from multiple locations and construct a response, are not as interesting since creating the response takes more time and does not exercise the runtime in a meaningfully different way.} The static web server experiment compares NGINX~\cite{nginx} with a custom \CFA-based web server developed for this experiment.

NGINX is a high-performance, \emph{full-service}, event-driven web server. It can handle both static and dynamic web content, as well as serve as a reverse proxy and a load balancer~\cite{reese2008nginx}. This wealth of capabilities comes with a variety of potential configurations, dictating available features and performance. The NGINX server runs a master process that performs operations such as reading configuration files, binding to ports, and controlling worker processes. In comparison, the custom \CFA web server was developed specifically with this experiment in mind. However, nothing seems to indicate that NGINX suffers from the increased flexibility. When tuned for performance, NGINX appears to achieve the performance that the underlying hardware can achieve.

\subsection{NGINX threading}
When running as a static web server, NGINX uses an event-driven architecture to service incoming requests. Incoming connections are assigned a \emph{stackless} HTTP state machine and worker processes can handle thousands of these state machines. For the following experiment, NGINX is configured to use @epoll@ to listen for events on these state machines and have each worker process independently accept new connections. Because of the realities of Linux (Subsection~\ref{ononblock}), NGINX also maintains a pool of auxiliary threads to handle blocking \io. The configuration can set the number of worker processes desired, as well as the size of the auxiliary pool.
However, for the following experiments, NGINX is configured to let the master process decide the appropriate number of threads.

\item The computer is booted with only 8 CPUs enabled, which is sufficient to achieve line rate.
\item Both servers are set up with enough parallelism to achieve 100\% CPU utilization, which happens at higher request rates.
\item Each CPU has 64 KB, 256 KB and 8 MB of L1, L2 and L3 caches, respectively.
\item
• ## doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex

This chapter presents five different experimental setups for evaluating the basic features of the \CFA, libfibre~\cite{libfibre}, Go, and Tokio~\cite{Tokio} schedulers. All of these systems have a \gls{uthrding} model. The goal of this chapter is to show, through the different experiments, that the \CFA scheduler obtains equivalent performance to other schedulers with lesser fairness guarantees. Note that only the code of the \CFA tests is shown; all tests in the other systems are functionally identical and available both online~\cite{GITHUB:SchedulingBenchmarks} and submitted to UWSpace with the thesis itself.

\section{Benchmark Environment}\label{microenv}

\caption[Cycle Benchmark on Intel]{Cycle Benchmark on Intel\smallskip\newline Throughput and scalability as a function of \proc count, 5 \ats per cycle, and different cycle counts. For throughput, higher is better; for scalability, lower is better. Each series represents 15 independent runs. The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
\label{fig:cycle:jax}
\end{figure}

\caption[Cycle Benchmark on AMD]{Cycle Benchmark on AMD\smallskip\newline Throughput and scalability as a function of \proc count, 5 \ats per cycle, and different cycle counts. For throughput, higher is better; for scalability, lower is better. Each series represents 15 independent runs. The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
\label{fig:cycle:nasus}
\end{figure}

Looking next at the right column on Intel, Figures~\ref{fig:cycle:jax:low:ops} and \ref{fig:cycle:jax:low:ns} show the results for 1 cycle of 5 \ats for each \proc. \CFA and Tokio obtain very similar results overall, but Tokio shows more variations in the results. Go achieves slightly better performance than \CFA and Tokio, but all three display significantly worse performance compared to the left column. This decrease in performance is likely due to the additional overhead of the idle-sleep mechanism. This can either be the result of \procs actually running out of work or simply additional overhead from tracking whether or not there is work available.

Looking now at the results for the AMD architecture, Figure~\ref{fig:cycle:nasus}, the results are overall similar to the Intel results, but with close to double the performance, slightly increased variation, and some differences in the details. Note that the maximums of the Y-axes on Intel and AMD differ significantly.
Looking at the left column on AMD, Figures~\ref{fig:cycle:nasus:ops} and \ref{fig:cycle:nasus:ns}, all 4 runtimes achieve very similar throughput and scalability. However, as the number of \procs grows higher, the results on AMD show notably more variability than on Intel. The different performance improvements and plateaus are due to cache topology and appear at the expected \proc counts of 64, 128 and 192, for the same reasons as on Intel. This result is different than on Intel, where Tokio behaved like \CFA rather than behaving like Go. Again, the same performance increase for libfibre is visible when running fewer \ats. I did not investigate the libfibre performance boost for 1 cycle in this experiment.

The conclusion from both architectures is that all of the compared runtimes have fairly equivalent performance for this micro-benchmark. Clearly, the pathological case with 1 cycle per \proc can affect fairness algorithms managing mostly idle processors, \eg \CFA, but only at high core counts. In this case, \emph{any} helping is likely to cause a cascade of \procs running out of work and attempting to steal. For this experiment, the \CFA scheduler has achieved the goal of obtaining equivalent performance to other schedulers with lesser fairness guarantees.

\section{Yield}

\caption[Yield Benchmark on Intel]{Yield Benchmark on Intel\smallskip\newline Throughput and scalability as a function of \proc count. For throughput, higher is better; for scalability, lower is better. Each series represents 15 independent runs. The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
\label{fig:yield:jax}
\end{figure}

\caption[Yield Benchmark on AMD]{Yield Benchmark on AMD\smallskip\newline Throughput and scalability as a function of \proc count. For throughput, higher is better; for scalability, lower is better. Each series represents 15 independent runs. The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
\label{fig:yield:nasus}
\end{figure}

Looking at the left column first, Figures~\ref{fig:yield:nasus:ops} and \ref{fig:yield:nasus:ns}, \CFA achieves very similar throughput and scaling. Libfibre still outpaces all other runtimes, but it encounters a performance hit at 64 \procs. This anomaly suggests some amount of communication between the \procs that the Intel machine is able to mask where the AMD is not, once hyperthreading is needed. Go and Tokio still display the same performance collapse as on Intel. Looking next at the right column on AMD, Figures~\ref{fig:yield:nasus:low:ops} and \ref{fig:yield:nasus:low:ns}, all runtime systems effectively behave the same as they did on the Intel machine.
It is difficult to draw conclusions for this benchmark when runtime systems treat @yield@ so differently. The win for \CFA is its consistency between the cycle and yield benchmarks, making it simpler for programmers to use and understand, \ie the \CFA semantics match with programmer intuition.

The Churn benchmark represents more chaotic executions, where there is more communication among \ats but no relationship between the last \proc on which a \at ran and blocked, and the \proc that subsequently unblocks it. With processor-specific ready-queues, when a \at is unblocked by a different \proc, that means the unblocking \proc must either ``steal'' the \at from another processor or find it on a remote queue. This dequeuing results in either contention on the remote queue and/or \glspl{rmr} on the \at data structure. Hence, this benchmark has performance dominated by the cache traffic as \procs are constantly accessing each other's data. In either case, this benchmark aims to measure how well a scheduler handles these cases since both cases can lead to performance degradation if not handled correctly.

\caption[Churn Benchmark on Intel]{Churn Benchmark on Intel\smallskip\newline Throughput and scalability as a function of \proc count. For throughput, higher is better; for scalability, lower is better. Each series represents 15 independent runs. The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
\label{fig:churn:jax}
\end{figure}

Tokio achieves very similar performance to \CFA, with the starting boost, scaling decently until 48 \procs, dropping from 48 to 72 \procs, and increasing again to 192 \procs. Libfibre obtains effectively the same results as Tokio with slightly less scaling, \ie the scaling curve is the same but with slightly lower values. Finally, Go gets the most peculiar results, scaling worse than other runtimes until 48 \procs. At 72 \procs, the results of the Go runtime vary significantly, sometimes scaling, sometimes plateauing. However, beyond this point Go keeps this level of variation but does not scale further in any of the runs.

Throughput and scalability are notably worse for all runtimes than in the previous benchmarks since there is inherently more communication between processors. Indeed, none of the runtimes reach 40 million operations per second while in the cycle benchmark all but libfibre reached 400 million operations per second. Figures~\ref{fig:churn:jax:ns} and \ref{fig:churn:jax:low:ns} show that for all \proc counts, all runtimes produce poor scaling.
However, once the number of \glspl{hthrd} goes beyond a single socket, at 48 \procs, scaling goes from bad to worst and performance completely ceases to improve. However, once the number of \glspl{hthrd} goes beyond a single socket, at 48 \procs, scaling goes from bad to worse and performance completely ceases to improve. At this point, the benchmark is dominated by inter-socket communication costs for all runtimes. \caption[Churn Benchmark on AMD]{Churn Benchmark on AMD\smallskip\newline Throughput and scalability as a function of \proc count. For throughput, higher is better, for scalability, lower is better. Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} Each series represents 15 independent runs. The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} \label{fig:churn:nasus} \end{figure} \caption[Locality Benchmark on Intel]{Locality Benchmark on Intel\smallskip\newline Throughput and scalability as a function of \proc count. For throughput, higher is better, for scalability, lower is better. Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} Each series represents 15 independent runs. The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} \label{fig:locality:jax} \end{figure} \caption[Locality Benchmark on AMD]{Locality Benchmark on AMD\smallskip\newline Throughput and scalability as a function of \proc count. For throughput, higher is better, for scalability, lower is better. Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} Each series represents 15 independent runs. The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} \label{fig:locality:nasus} \end{figure} Go still has the same poor performance as on Intel. Finally looking at the right column, Figures~\ref{fig:locality:nasus:noshare:ops} and \ref{fig:locality:nasus:noshare:ns}, like on Intel, the same performance inversion is present between libfibre and \CFA/Tokio. Finally, looking at the right column, Figures~\ref{fig:locality:nasus:noshare:ops} and \ref{fig:locality:nasus:noshare:ns}, like on Intel, the same performance inversion is present between libfibre and \CFA/Tokio. Go still has the same poor performance. \end{centering} \caption[Transfer Benchmark on Intel and AMD]{Transfer Benchmark on Intel and AMD\smallskip\newline Average measurement of how long it takes for all \ats to acknowledge the leader \at. For each runtime, the average is calculated over 100'000 transfers, except for Go which only has 1000 transfer (due to the difference in transfer time). DNC stands for did not complete'', meaning that after 5 seconds of a new leader being decided, some \ats still had not acknowledged the new leader.} \label{fig:transfer:res} The first two columns show the results for the semaphore variation on Intel. While there are some differences in latencies, \CFA is consistently the fastest and Tokio the slowest, all runtimes achieve fairly close results. 
Again, this experiment is meant to highlight major differences so latencies within $10\times$ of each other are considered equal. While there are some differences in latencies, with \CFA consistently the fastest and Tokio the slowest, all runtimes achieve fairly close results. Again, this experiment is meant to highlight major differences, so latencies within $10\times$ of each other are considered equal. Looking at the next two columns, the results for the yield variation on Intel, the story is very different. Neither Libfibre nor Tokio complete the experiment. This experiment clearly demonstrates that \CFA achieves significantly better fairness. This experiment clearly demonstrates that \CFA achieves a stronger fairness guarantee. The semaphore variation serves as a control, where all runtimes are expected to transfer leadership fairly quickly. Since \ats block after acknowledging the leader, this experiment effectively measures how quickly \procs can steal \ats from the \proc running the leader. Without \procs stealing from the \proc running the leader, the experiment cannot terminate. Go manages to complete the experiment because it adds preemption on top of classic work-stealing. However, since preemption is fairly infrequent, it achieves significantly worst performance. However, since preemption is fairly infrequent, it achieves significantly worse performance. In contrast, \CFA achieves equivalent performance in both variations, demonstrating very good fairness. Interestingly \CFA achieves better delays in the yielding version than the semaphore version, however, that is likely due to fairness being equivalent but removing the cost of the semaphores and idle sleep.
• ## doc/theses/thierry_delisle_PhD/thesis/text/existing.tex

 r82a90d4 As stated, scheduling is the process of assigning resources to incoming requests, where the common example is assigning available workers to work requests or vice versa. Common scheduling examples in Computer Science are: operating systems and hypervisors schedule available CPUs, NICs schedule available bandwidth, virtual memory and memory allocator schedule available storage, \etc. Scheduling is also common in most other fields, \eg in assembly lines, assigning parts to line workers is a form of scheduling. Scheduling is also common in most other fields; \eg in assembly lines, assigning parts to line workers is a form of scheduling. In general, \emph{selecting} a scheduling algorithm depends on how much information is available to the scheduler. A secondary aspect is how much information can be gathered versus how much information must be given as part of the scheduler input. This information adds to the spectrum of scheduling algorithms, going from static schedulers that are well informed from the start, to schedulers that gather most of the information needed, to schedulers that can only rely on very limited information. Note, this description includes both information about each request, \eg time to complete or resources needed, and information about the relationships among requests, \eg whether some requests must be completed before another request starts. Scheduling physical resources, \eg in an assembly line, is generally amenable to using well-informed scheduling since information can be gathered much faster than the physical resources can be assigned and workloads are likely to stay stable for long periods. When a faster pace is needed and changes are much more frequent gathering information on workloads, up-front or live, can become much more limiting and more general schedulers are needed. This description includes both information about each request, \eg time to complete or resources needed, and information about the relationships among requests, \eg whether some requests must be completed before another request starts. Scheduling physical resources, \eg in an assembly line, is generally amenable to using well-informed scheduling since information can be gathered much faster than the physical resources can be assigned, and workloads are likely to stay stable for long periods. When a faster pace is needed and changes are much more frequent, then gathering information on workloads, up-front or live, can become much more limiting and more general schedulers are needed. \section{Naming Convention} Scheduling has been studied by various communities concentrating on different incarnations of the same problems. Scheduling has been studied by various communities, concentrating on different incarnations of the same problems. As a result, there are no standard naming conventions for scheduling that are respected across these communities. This document uses the term \newterm{\Gls{at}} to refer to the abstract objects being scheduled and the term \newterm{\Gls{proc}} to refer to the concrete objects executing these \ats. \section{Static Scheduling} \newterm{Static schedulers} require \ats dependencies and costs to be explicitly and exhaustively specified prior to scheduling. \newterm{Static schedulers} require \at dependencies and costs to be explicitly and exhaustively specified prior to scheduling. The scheduler then processes this input ahead of time and produces a \newterm{schedule} the system follows during execution. 
This approach is popular in real-time systems since the need for strong guarantees justifies the cost of determining and supplying this information. \section{Dynamic Scheduling} \newterm{Dynamic schedulers} determine \at dependencies and costs during scheduling, if at all. Hence, unlike static scheduling, \at dependencies are conditional and detected at runtime. This detection takes the form of observing new \ats in the system and determining dependencies from their behaviour, including suspending or halting a \at that dynamically detects unfulfilled dependencies. \newterm{Dynamic schedulers} detect \at dependencies and costs during scheduling, if at all. This detection takes the form of observing new \ats in the system and determining dependencies from their behaviour, where a \at suspends or halts dynamically when it detects unfulfilled dependencies. Furthermore, each \at has the responsibility of adding dependent \ats back into the system once dependencies are fulfilled. As a consequence, the scheduler often has an incomplete view of the system, seeing only \ats with no pending dependencies. \subsection{Explicitly Informed Dynamic Schedulers} While dynamic schedulers may not have an exhaustive list of dependencies for a \at, some information may be available about each \at, \eg expected duration, required resources, relative importance, \etc. When available, a scheduler can then use this information to direct the scheduling decisions. When available, a scheduler can then use this information to direct scheduling decisions. For example, when scheduling in a cloud computing context, \ats will commonly have extra information that was manually entered, \eg caps on compute time or \io usage. However, in the context of user-level threading, most programmers do not determine or even \emph{predict} this information; \subsubsection{Priority Scheduling} Common information used by schedulers to direct their algorithm is priorities. A common approach to direct the scheduling algorithm is to add information about \at priority. Each \at is given a priority, and higher-priority \ats are preferred to lower-priority ones. The simplest priority scheduling algorithm is to require that every \at have a distinct pre-established priority and always run the available \ats with the highest priority. \paragraph{Task Placement} Another aspect of work stealing that has been studied extensively is the mapping between \at and \proc. In its simplest form, work stealing assumes that all \procs are interchangeable and therefore the mapping between \at and \proc is not interesting. However, in real-life architectures there are contexts where different \procs can have different characteristics, which makes some mapping more interesting than others. However, in real-life architectures, there are contexts where different \procs can have different characteristics, which makes some mappings more interesting than others. A common example where this is statically true is architectures with \glsxtrshort{numa}. In these cases, it can be relevant to change the scheduler to be cognizant of the topology~\cite{vikranth2013topology,min2011hierarchical}. \paragraph{Complex Machine Architecture} Another aspect that has been examined is how applicable work stealing is to different machine architectures. This is arguably strongly related to Task Placement but extends into more heterogeneous architectures. As \CFA offers no particular support for heterogeneous architecture, this is also an area that is less relevant to this thesis. 
Although it could be an interesting avenue for future work. As \CFA offers no particular support for heterogeneous architectures, this is also an area that is not examined in this thesis. However, support for concurrency across heterogeneous architectures is interesting avenue for future work, at which point the literature on this topic and how it relates to scheduling will become relevant. \subsection{Theoretical Results} There is also a large body of research on the theoretical aspects of work stealing. These evaluate, for example, the cost of \glslink{atmig}{migration}~\cite{DBLP:conf/sigmetrics/SquillanteN91,DBLP:journals/pe/EagerLZ86}, how affinity affects performance~\cite{DBLP:journals/tpds/SquillanteL93,DBLP:journals/mst/AcarBB02,DBLP:journals/ipl/SuksompongLS16} and theoretical models for heterogeneous systems~\cite{DBLP:journals/jpdc/MirchandaneyTS90,DBLP:journals/mst/BenderR02,DBLP:conf/sigmetrics/GastG10}. \cite{DBLP:journals/jacm/BlellochGM99} examines the space bounds of work stealing and \cite{DBLP:journals/siamcomp/BerenbrinkFG03} shows that for under-loaded systems, the scheduler completes its computations in finite time, \ie is \newterm{stable}. Blelloch et al.~\cite{DBLP:journals/jacm/BlellochGM99} examines the space bounds of work stealing and \cite{DBLP:journals/siamcomp/BerenbrinkFG03} shows that for under-loaded systems, the scheduler completes its computations in finite time, \ie is \newterm{stable}. Others show that work stealing applies to various scheduling contexts~\cite{DBLP:journals/mst/AroraBP01,DBLP:journals/anor/TchiboukdjianGT13,DBLP:conf/isaac/TchiboukdjianGTRB10,DBLP:conf/ppopp/AgrawalLS10,DBLP:conf/spaa/AgrawalFLSSU14}. \cite{DBLP:conf/ipps/ColeR13} also studied how randomized work-stealing affects false sharing among \ats. Preemption is the idea of interrupting \ats that have been running too long, effectively injecting suspend points into the application. There are multiple techniques to achieve this effect, but they all aim to guarantee that the suspend points in a \at are never further apart than some fixed duration. While this helps schedulers guarantee that no \ats unfairly monopolize a worker, preemption can effectively be added to any scheduler. This helps schedulers guarantee that no \ats unfairly monopolize a worker. Preemption can effectively be added to any scheduler. Therefore, the only interesting aspect of preemption for the design of scheduling is whether to require it. This section presents a quick overview of several current schedulers. While these schedulers do not necessarily represent the most recent advances in scheduling, they are what is generally accessible to programmers. As such, I believe these schedulers are at least as relevant as those presented in published work. Both Schedulers that operate in kernel space and user space are considered, as both can offer relevant insight for this project. However, real-time schedulers are not considered, as these have constraints that are much stricter than what is needed for this project. As such, I believe these schedulers are as relevant as those presented in published work. Both schedulers that operate in kernel space and user space are considered, as both can offer relevant insight for this project. However, real-time schedulers aim to guarantee bounded compute time in order to meet deadlines. These deadlines lead to constraints much stricter than the starvation freedom that is needed for this project. As such real-time schedulers are not considered for this work. 
\subsection{Operating System Schedulers} Operating System Schedulers tend to be fairly complex as they generally support some amount of real time, aim to balance interactive and non-interactive \ats and support multiple users sharing hardware without requiring these users to cooperate. Operating System Schedulers tend to be fairly complex, as they generally support some amount of real time, aim to balance interactive and non-interactive \ats and support multiple users sharing hardware without requiring these users to cooperate. Here are more details on a few schedulers used in the common operating systems: Linux, FreeBSD, Microsoft Windows and Apple's OS X. The information is less complete for closed source operating systems. It also periodically balances the load of the system (according to a different heuristic) but uses a simpler work stealing approach. \paragraph{Windows(OS)} \paragraph{Windows (OS)} Microsoft's Operating System's Scheduler~\cite{MAN:windows/scheduler} is a feedback scheduler with priorities. It supports 32 levels of priorities, some of which are reserved for real-time and privileged applications. The scheduler may also temporarily adjust priorities after certain effects like the completion of I/O requests. In~\cite{russinovich2009windows}, Chapter 1 section 2.3 Processes, Threads, and Jobs'' discusses the scheduling policy more in-depth. The scheduling policy is discussed more in-depth in~\cite{russinovich2009windows}, Chapter 1 section 2.3 Processes, Threads, and Jobs''. Multicore scheduling is based on a combination of priorities and \proc preference. Each \at is assigned an initial processor using a round-robin policy, called the \at's \newterm{ideal} \proc. \paragraph{Go}\label{GoSafePoint} Go's scheduler uses a randomized work-stealing algorithm that has a global run-queue (\emph{GRQ}) and each processor (\emph{P}) has both a fixed-size run-queue (\emph{LRQ}) and a high-priority next chair'' holding a single element~\cite{GITHUB:go,YTUBE:go}. Preemption is present, but only at safe points,~\cite{go:safepoints} which are detection code inserted at various frequent access boundaries. Preemption is present, but only at safe points~\cite{go:safepoints}, which are detection code inserted at various frequent access boundaries. The algorithm is as follows : \begin{enumerate} \item Once out of 61 times, pick 1 element from the \emph{GRQ}. \item If there is an item in the chair'' pick it. \item Otherwise, if there is an item in the chair'' pick it. \item Else pick an item from the \emph{LRQ}. \begin{itemize} \end{itemize} \end{enumerate} Chapter~\ref{microbench} uses Go as one of its comparison point in this thesis's performance evaluation. \paragraph{Erlang} \paragraph{LibFibre} LibFibre~\cite{DBLP:journals/pomacs/KarstenB20} is a lightweight user-level threading framework developed at the University of Waterloo. Similarly to Go, it uses a variation of work stealing with a global queue that has a higher priority than stealing. Unlike Go, it does not have the high-priority next chair'' and does not use randomized work-stealing. It shares a very strong resemblance to Go: using a variation of work stealing with a global queue that has a higher priority than stealing. Unlike Go, it does not have the high-priority next chair'' and its work-stealing is not randomized. Chapter~\ref{microbench} uses LibFibre as one of its comparison point in this thesis's performance evaluation.
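To make the structure shared by Go and LibFibre concrete, the following sketch paraphrases the dequeue order enumerated above for Go; every name in it is invented for illustration and the local queue, global queue and stealing logic are reduced to stubs, so it is not Go's (or LibFibre's) actual code. LibFibre's variant simply omits the ``chair'' and does not randomize the steal victim.
\begin{cfa}
#include <stddef.h>

// Illustrative sketch only: all names are invented and the queues are stubs.
typedef struct task task;
typedef struct processor {
	task * chair;       // single-element, high-priority slot (absent in LibFibre)
	unsigned tick;      // counts scheduling decisions on this processor
} processor;

extern task * pop_global( void );         // dequeue from the shared global run-queue (GRQ)
extern task * pop_local( processor * );   // dequeue from this processor's local run-queue (LRQ)
extern task * steal( processor * );       // steal from a randomly chosen victim

task * next_task( processor * p ) {
	if ( ++p->tick % 61 == 0 ) {          // step 1: occasionally poll the GRQ first,
		task * t = pop_global();          //         so local work cannot starve it
		if ( t ) return t;
	}
	if ( p->chair ) {                     // step 2: the chair takes priority over the LRQ
		task * t = p->chair; p->chair = NULL; return t;
	}
	task * t = pop_local( p );            // step 3: the fixed-size LRQ
	if ( t ) return t;
	t = pop_global();                     // otherwise fall back to the GRQ,
	return t ? t : steal( p );            // and finally to work stealing
}
\end{cfa}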
• ## doc/theses/thierry_delisle_PhD/thesis/text/front.tex

 r82a90d4 \chapter{User Level \io}\label{userio} As mentioned in Section~\ref{prev:io}, user-level \io requires multiplexing the \io operations of many \ats onto fewer \glspl{proc} using asynchronous \io operations. I/O operations, among others, generally block the \gls{kthrd} when the operation needs to wait for unavailable resources. When using \gls{uthrding}, this results in the \proc blocking rather than the \at, hindering parallelism and potentially causing deadlocks (see Chapter~\ref{prev:io}). Different operating systems offer various forms of asynchronous operations and, as mentioned in Chapter~\ref{intro}, this work is exclusively focused on the Linux operating system. This mechanism is also crucial in determining when all \ats are blocked and the application \glspl{kthrd} can now block. There are three options to monitor file descriptors in Linux:\footnote{ There are three options to monitor file descriptors (FD) in Linux:\footnote{ For simplicity, this section omits \lstinline{pselect} and \lstinline{ppoll}. The difference between these system calls and \lstinline{select} and \lstinline{poll}, respectively, is not relevant for this discussion.} \paragraph{\lstinline{select}} is the oldest of these options, and takes as input a contiguous array of bits, where each bit represents a file descriptor of interest. Hence, the array length must be as long as the largest FD currently of interest. On return, it outputs the set motified in place to identify which of the file descriptors changed state. On return, it outputs the set modified in-place to identify which of the file descriptors changed state. This destructive change means selecting in a loop requires re-initializing the array for each iteration. Another limit of @select@ is that calls from different \glspl{kthrd} sharing FDs are independent. Another limitation of @select@ is that calls from different \glspl{kthrd} sharing FDs are independent. Hence, if one \gls{kthrd} is managing the select calls, other threads can only add/remove to/from the manager's interest set through synchronized calls to update the interest set. However, these changes are only reflected when the manager makes its next call to @select@. However, all three of these I/O systems have limitations. The @man@ page for @O_NONBLOCK@ mentions that [@O_NONBLOCK@] has no effect for regular files and block devices'', which means none of these three system calls are viable multiplexing strategies for these types of \io operations. Furthermore, TTYs can also be tricky to use since they can take different forms based on how the command is executed. Furthermore, TTYs (FDs connect to a standard input and output) can also be tricky to use since they can take different forms based on how the command is executed. For example, @epoll@ rejects FDs pointing to regular files or block devices, which includes @stdin@ when using shell redirections~\cite[\S~3.6]{MAN:bash}, but does not reject shell pipelines~\cite[\S~3.2.3]{MAN:bash}, which includes pipelines into @stdin@. Finally, none of these are useful solutions for multiplexing \io operations that do not have a corresponding file descriptor and can be awkward for operations using multiple file descriptors. \subsection{POSIX asynchronous I/O (AIO)} An alternative to @O_NONBLOCK@ is the AIO interface. Its interface lets programmers enqueue operations to be performed asynchronously by the kernel. 
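As a concrete illustration of the submission side (the completion options are discussed next), the following sketch enqueues a single read and waits for it to finish; it is not taken from the \CFA runtime, and the file name, buffer size and absence of error handling are arbitrary simplifications.
\begin{cfa}
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main() {
	int fd = open( "data.txt", O_RDONLY );        // arbitrary input file
	char buf[4096];

	struct aiocb cb;
	memset( &cb, 0, sizeof(cb) );
	cb.aio_fildes = fd;                           // file descriptor to read from
	cb.aio_buf    = buf;                          // destination buffer
	cb.aio_nbytes = sizeof(buf);                  // number of bytes requested
	cb.aio_offset = 0;                            // absolute offset within the file
	aio_read( &cb );                              // enqueue the operation; returns immediately

	const struct aiocb * list[1] = { &cb };
	while ( aio_error( &cb ) == EINPROGRESS )     // operation still pending?
		aio_suspend( list, 1, NULL );             // block until a request in the set completes

	ssize_t count = aio_return( &cb );            // harvest the result (at most once per request)
	printf( "read %zd bytes\n", count );
	close( fd );
}
\end{cfa}
Even in this minimal form, the sketch hints at the limitation discussed below: the caller must explicitly name every outstanding @aiocb@ when checking for completions.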
Completions of these operations can be communicated in various ways: either by spawning a new \gls{kthrd}, sending a Linux signal, or polling for completion of one or more operations. For this work, spawning a new \gls{kthrd} is counterproductive but a related solution is discussed in Section~\ref{io:morethreads}. Using interrupt handlers can also lead to fairly complicated interactions between subsystems and has a non-trivial cost. Leaving polling for completion, which is similar to the previous system calls. AIO only supports read and write operations to file descriptors, it does not have the same limitation as @O_NONBLOCK@, \ie, the file descriptors can be regular files and blocked devices. It also supports batching multiple operations in a single system call. AIO offers two different approaches to polling: @aio_error@ can be used as a spinning form of polling, returning @EINPROGRESS@ until the operation is completed, and @aio_suspend@ can be used similarly to @select@, @poll@ or @epoll@, to wait until one or more requests have been completed. For \io multiplexing, @aio_suspend@ is the best interface. However, even if AIO requests can be submitted concurrently, @aio_suspend@ suffers from the same limitation as @select@ and @poll@, \ie, the interest set cannot be dynamically changed while a call to @aio_suspend@ is in progress. Using AIO, programmers can enqueue operations which are to be performed asynchronously by the kernel. The kernel can communicate completions of these operations in three ways: it can spawn a new \gls{kthrd}; send a Linux signal; or userspace can poll for completion of one or more operations. Spawning a new \gls{kthrd} is not consistent with working at the user-level thread level, but Section~\ref{io:morethreads} discusses a related solution. Signals and their associated interrupt handlers can also lead to fairly complicated interactions between subsystems, and they have a non-trivial cost. This leaves a single option: polling for completion---this is similar to the previously discussed system calls. While AIO only supports read and write operations to file descriptors; it does not have the same limitations as @O_NONBLOCK@, \ie, the file descriptors can be regular files or block devices. AIO also supports batching multiple operations in a single system call. AIO offers two different approaches to polling: @aio_error@ can be used as a spinning form of polling, returning @EINPROGRESS@ until the operation is completed, while @aio_suspend@ can be used similarly to @select@, @poll@ or @epoll@, to wait until one or more requests have been completed. Asynchronous interfaces normally handle more of the complexity than retry-based interfaces, which is convenient for \io multiplexing. However, even if AIO requests can be submitted concurrently, @aio_suspend@ suffers from the same limitation as @select@ and @poll@: the interest set cannot be dynamically changed while a call to @aio_suspend@ is in progress. AIO also suffers from the limitation of specifying which requests have been completed, \ie programmers have to poll each request in the interest set using @aio_error@ to identify the completed requests. This limitation means that, like @select@ and @poll@ but not @epoll@, the time needed to examine polling results increases based on the total number of requests monitored, not the number of completed requests. A very recent addition to Linux, @io_uring@~\cite{MAN:io_uring}, is a framework that aims to solve many of the problems listed in the above interfaces. 
Like AIO, it represents \io operations as entries added to a queue. But like @epoll@, new requests can be submitted, while a blocking call waiting for requests to complete, is already in progress. The @io_uring@ interface uses two ring buffers (referred to simply as rings) at its core: a submit ring to which programmers push \io requests and a completion ring from which programmers poll for completion. But like @epoll@, new requests can be submitted while a blocking call waiting for requests to complete is already in progress. The @io_uring@ interface uses two ring buffers (referred to simply as rings) at its core: a submit ring, to which programmers push \io requests, and a completion ring, from which programmers poll for completion. One of the big advantages over the prior interfaces is that @io_uring@ also supports a much wider range of operations. In addition to supporting reads and writes to any file descriptor like AIO, it supports other operations like @open@, @close@, @fsync@, @accept@, @connect@, @send@, @recv@, @splice@, \etc. In addition to supporting reads and writes to any file descriptor like AIO, it also supports other operations, like @open@, @close@, @fsync@, @accept@, @connect@, @send@, @recv@, @splice@, \etc. On top of these, @io_uring@ adds many extras like avoiding copies between the kernel and user space using shared memory, allowing different mechanisms to communicate with device drivers, and supporting chains of requests, \ie, requests that automatically trigger follow-up requests on completion. This approach is used by languages like Go~\cite{GITHUB:go}, frameworks like libuv~\cite{libuv}, and web servers like Apache~\cite{apache} and NGINX~\cite{nginx}, since it has the advantage that it can easily be used across multiple operating systems. This advantage is especially relevant for languages like Go, which offer a homogeneous \glsxtrshort{api} across all platforms. As opposed to C, which has a very limited standard \glsxtrshort{api} for \io, \eg, the C standard library has no networking. Contrast this to C, which has a very limited standard \glsxtrshort{api} for \io, \eg, the C standard library has no networking. \subsection{Discussion} These options effectively fall into two broad camps: waiting for \io to be ready versus waiting for \io to complete. All operating systems that support asynchronous \io must offer an interface along one of these lines, but the details vary drastically. For example, Free BSD offers @kqueue@~\cite{MAN:bsd/kqueue}, which behaves similarly to @epoll@, but with some small quality of use improvements, while Windows (Win32)~\cite{win:overlap} offers overlapped I/O'', which handles submissions similarly to @O_NONBLOCK@ with extra flags on the synchronous system call, but waits for completion events, similarly to @io_uring@. For this project, I selected @io_uring@, in large parts because of its generality. These options effectively fall into two broad camps: waiting for \io to be ready, versus waiting for \io to complete. All operating systems that support asynchronous \io must offer an interface along at least one of these lines, but the details vary drastically. For example, FreeBSD offers @kqueue@~\cite{MAN:bsd/kqueue}, which behaves similarly to @epoll@, but with some small quality of life improvements, while Windows (Win32)~\cite{win:overlap} offers overlapped I/O'', which handles submissions similarly to @O_NONBLOCK@ with extra flags on the synchronous system call, but waits for completion events, similarly to @io_uring@. 
For this project, I selected @io_uring@, in large part because of its generality. While @epoll@ has been shown to be a good solution for socket \io (\cite{Karsten20}), @io_uring@'s transparent support for files, pipes, and more complex operations, like @splice@ and @tee@, make it a better choice as the foundation for a general \io subsystem. \section{Event-Engine} An event engine's responsibility is to use the kernel interface to multiplex many \io operations onto few \glspl{kthrd}. In concrete terms, this means \ats enter the engine through an interface, the event engine then starts an operation and parks the calling \ats, returning control to the \gls{proc}. In concrete terms, this means \ats enter the engine through an interface, the event engine then starts an operation and parks the calling \ats, and then returns control to the \gls{proc}. The parked \ats are then rescheduled by the event engine once the desired operation has been completed. Figure~\ref{fig:iouring} shows an overview of an @io_uring@ instance. Two ring buffers are used to communicate with the kernel: one for submissions~(left) and one for completions~(right). The submission ring contains entries, \newterm{Submit Queue Entries} (SQE), produced (appended) by the application when an operation starts and then consumed by the kernel. The completion ring contains entries, \newterm{Completion Queue Entries} (CQE), produced (appended) by the kernel when an operation completes and then consumed by the application. The submission ring contains \newterm{Submit Queue Entries} (SQE), produced (appended) by the application when an operation starts and then consumed by the kernel. The completion ring contains \newterm{Completion Queue Entries} (CQE), produced (appended) by the kernel when an operation completes and then consumed by the application. The submission ring contains indexes into the SQE array (denoted \emph{S} in the figure) containing entries describing the I/O operation to start; the completion ring contains entries for the completed I/O operation. \centering \input{io_uring.pstex_t} \caption[Overview of \lstinline{io_uring}]{Overview of \lstinline{io_uring} \smallskip\newline Two ring buffers are used to communicate with the kernel, one for completions~(right) and one for submissions~(left). The submission ring indexes into a pre-allocated array (denoted \emph{S}) instead.} \caption[Overview of \lstinline{io_uring}]{Overview of \lstinline{io_uring} \smallskip\newline Two ring buffers are used to communicate with the kernel, one for completions~(right) and one for submissions~(left). While the completion ring contains plain data, the submission ring contains only references. These references are indexes into an array (denoted \emph{S}), which is created at the same time as the two rings and is also readable by the kernel.} \label{fig:iouring} \end{figure} Since the head is visible to the kernel, some memory barriers may be required to prevent the compiler from reordering these operations. Since the submission ring is a regular ring buffer, more than one SQE can be added at once and the head is updated only after all entries are updated. Note, SQE can be filled and submitted in any order, \eg in Figure~\ref{fig:iouring} the submission order is S0, S3, S2 and S1 has not been submitted. Note, SQE can be filled and submitted in any order, \eg in Figure~\ref{fig:iouring} the submission order is S0, S3, S2. S1 has not been submitted. 
\item The kernel is notified of the change to the ring using the system call @io_uring_enter@. The number of elements appended to the submission ring is passed as a parameter and the number of elements consumed is returned. The @io_uring@ instance can be constructed so this step is not required, but this requires elevated privilege.% and an early version of @io_uring@ had additional restrictions. The @io_uring@ instance can be constructed so this step is not required, but this feature requires that the process have elevated privilege.% and an early version of @io_uring@ had additional restrictions. \end{enumerate} When operations do complete, the kernel appends a CQE to the completion ring and advances the head of the ring. Each CQE contains the result of the operation as well as a copy of the @user_data@ field of the SQE that triggered the operation. It is not necessary to call @io_uring_enter@ to get new events because the kernel can directly modify the completion ring. The system call is only needed if the application wants to block waiting for operations to complete. The @io_uring_enter@ system call is only needed if the application wants to block waiting for operations to complete or to flush the submission ring. @io_uring@ supports option @IORING_SETUP_SQPOLL@ at creation, which can remove the need for the system call for submissions. \end{sloppypar} This restriction means \io request bursts may have to be subdivided and submitted in chunks at a later time. An important detail to keep in mind is that just like The cloud is just someone else's computer''\cite{xkcd:cloud}, asynchronous operations are just operations using someone else's threads. An important detail to keep in mind is that just like The cloud is just someone else's computer''~\cite{xkcd:cloud}, asynchronous operations are just operations using someone else's threads. Indeed, asynchronous operations can require computation time to complete, which means that if this time is not taken from the thread that triggered the asynchronous operation, it must be taken from some other threads. In this case, the @io_uring@ operations that cannot be handled directly in the system call must be delegated to some other \gls{kthrd}. To this end, @io_uring@ maintains multiple \glspl{kthrd} inside the kernel that are not exposed to the user. Three kinds of operations that can need the \glspl{kthrd}: Three kinds of operations that can need the \glspl{kthrd} are: \paragraph{Operations using} @IOSQE_ASYNC@. \paragraph{Bounded operations.} This is also a fairly simple case. As mentioned earlier in this chapter, [@O_NONBLOCK@] has no effect for regular files and block devices. @io_uring@ must also take this reality into account by delegating operations on regular files and block devices. Therefore, @io_uring@ handles this case by delegating operations on regular files and block devices. In fact, @io_uring@ maintains a pool of \glspl{kthrd} dedicated to these operations, which are referred to as \newterm{bounded workers}. \paragraph{Unbounded operations that must be retried.} While operations like reads on sockets can return @EAGAIN@ instead of blocking the \gls{kthrd}, in the case these operations return @EAGAIN@ they must be retried by @io_uring@ once the data is available on the socket. Since this retry cannot necessarily be done in the system call, @io_uring@ must delegate these calls to a \gls{kthrd}. 
Since this retry cannot necessarily be done in the system call, \ie, using the application's \gls{kthrd}, @io_uring@ must delegate these calls to \glspl{kthrd} in the kernel. @io_uring@ maintains a separate pool for these operations. The \glspl{kthrd} in this pool are referred to as \newterm{unbounded workers}. Once unbounded operations are ready to be retried, one of the workers is woken up and it will handle the retry inside the kernel. Unbounded workers are also responsible for handling operations using @IOSQE_ASYNC@. however, the duration of the system call scales with the number of entries submitted. The consequence is that the amount of parallelism used to prepare submissions for the next system call is limited. Beyond this limit, the length of the system call is the throughput limiting factor. Beyond this limit, the length of the system call is the throughput-limiting factor. I concluded from early experiments that preparing submissions seems to take almost as long as the system call itself, which means that with a single @io_uring@ instance, there is no benefit in terms of \io throughput to having more than two \glspl{hthrd}. Therefore, the design of the submission engine must manage multiple instances of @io_uring@ running in parallel, effectively sharding @io_uring@ instances. Since completions are sent to the instance where requests were submitted, all instances with pending operations must be polled continuously\footnote{ As described in Chapter~\ref{practice}, this does not translate into constant CPU usage.}. Note that once an operation completes, there is nothing that ties it to the @io_uring@ instance that handled it. Nothing preventing a new operation, with for example the same file descriptor, to use a different @io_uring@ instance. As described in Chapter~\ref{practice}, this does not translate into high CPU usage.}. Note that once an operation completes, there is nothing that ties it to the @io_uring@ instance that handled it --- nothing prevents a new operation, with for example the same file descriptor, from using a different @io_uring@ instance. A complicating aspect of submission is @io_uring@'s support for chains of operations, where the completion of an operation triggers the submission of the next operation on the link. SQEs forming a chain must be allocated from the same instance and must be contiguous in the Submission Ring (see Figure~\ref{fig:iouring}). The consequence of this feature is that filling SQEs can be arbitrarily complex, and therefore, users may need to run arbitrary code between allocation and submission. Supporting chains is not a requirement of the \io subsystem, but it is still valuable. For this work, supporting chains is not a requirement of the \CFA \io subsystem, but it is still valuable. Support for this feature can be fulfilled simply by supporting arbitrary user code between allocation and submission. To remove this requirement, a \at needs the ability to yield to a specific \gls{proc}'', \ie, \park with the guarantee it unparks on a specific \gls{proc}, \ie the \gls{proc} attached to the correct ring.} From the subsystem's point of view, the allocation and submission are sequential, greatly simplifying both. In this design, allocation and submission form a partitioned ring buffer as shown in Figure~\ref{fig:pring}. In this design, allocation and submission form a partitioned ring buffer, as shown in Figure~\ref{fig:pring}. 
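A minimal sketch of this partitioned organization is shown below; the field names and layout are invented for illustration and differ in detail from both Figure~\ref{fig:pring} and the actual \CFA implementation.
\begin{cfa}
#include <stdint.h>

// Sketch only: three indices partition the SQE ring into kernel-owned, submitted,
// and allocated-but-unsubmitted regions.
struct partitioned_ring {
	uint32_t kernel;       // entries below this index have been consumed by the kernel
	uint32_t submitted;    // [kernel, submitted) are published and await io_uring_enter
	uint32_t allocated;    // [submitted, allocated) are filled in but not yet published
	uint32_t capacity;     // total number of SQE slots (a power of two)
};

// Allocation appends to the unsubmitted partition; it can fail when the ring is full.
static inline int ring_alloc( struct partitioned_ring * r ) {
	if ( r->allocated - r->kernel == r->capacity ) return -1;   // no free SQE available
	return (int)( r->allocated++ & ( r->capacity - 1 ) );       // slot index of the new SQE
}

// Submission advances the partition boundary; because the owning processor performs both
// allocation and submission, these counters need no synchronization.
static inline void ring_submit( struct partitioned_ring * r ) {
	r->submitted = r->allocated;   // everything allocated so far joins the next system call
}
\end{cfa}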
Once added to the ring buffer, the attached \gls{proc} has a significant amount of flexibility with regard to when to perform the system call. Possible options are: when the \gls{proc} runs out of \ats to run, after running a given number of \ats, \etc. \centering \input{pivot_ring.pstex_t} \caption[Partitioned ring buffer]{Partitioned ring buffer \smallskip\newline Allocated sqes are appended to the first partition. \caption[Partitioned ring buffer]{Partitioned ring buffer \smallskip\newline Allocated SQEs are appended to the first partition. When submitting, the partition is advanced. The kernel considers the partition as the head of the ring.} However, this benefit means \ats submitting \io operations have less flexibility: they cannot \park or yield, and several exceptional cases are handled poorly. Instances running out of SQEs cannot run \ats wanting to do \io operations. In this case, the \io \at needs to be moved to a different \gls{proc}, and the only current way of achieving this is to @yield()@ hoping to be scheduled on a different \gls{proc} with free SQEs, which is not guaranteed. In this case, the \io \at needs to be moved to a different \gls{proc}, and the only current way of achieving this is to @yield()@ hoping to be scheduled on a different \gls{proc} with free SQEs, which is not guaranteed to ever occur. A more involved version of this approach tries to solve these problems using a pattern called \newterm{helping}. \ats that cannot submit \io operations, either because of an allocation failure or \glslink{atmig}{migration} to a different \gls{proc} between allocation and submission, create an \io object and add it to a list of pending submissions per \gls{proc} and a list of pending allocations, probably per cluster. \Glspl{at} that cannot submit \io operations, either because of an allocation failure or \glslink{atmig}{migration} to a different \gls{proc} between allocation and submission, create an \io object and add it to a list of pending submissions per \gls{proc} and a list of pending allocations, probably per cluster. While there is still a strong coupling between \glspl{proc} and @io_uring@ instances, these data structures allow moving \ats to a specific \gls{proc}, when the current \gls{proc} cannot fulfill the \io request. In this case, the helping solution has the \io \at append an \io object to the submission list of the first \gls{proc}, where the allocation was made. No other \gls{proc} can help the \at since @io_uring@ instances are strongly coupled to \glspl{proc}. However, the \io \gls{proc} is unable to help because it is executing the spinning \at resulting in a deadlock. However, the \io \gls{proc} is unable to help because it is executing the spinning \at. This results in a deadlock. While this example is artificial, in the presence of many \ats, this problem can arise in the wild''. Furthermore, this pattern is difficult to reliably detect and avoid. \subsubsection{Public Instances} The public approach creates decoupled pools of @io_uring@ instances and processors, \ie without one-to-one coupling. \ats attempting an \io operation pick one of the available instances and submit the operation to that instance. \Glspl{at} attempting an \io operation pick one of the available instances and submit the operation to that instance. Since there is no coupling between @io_uring@ instances and \glspl{proc} in this approach, \ats running on more than one \gls{proc} can attempt to submit to the same instance concurrently. 
Because @io_uring@ effectively sets the amount of sharding needed to avoid contention on its internal locks, performance in this approach is based on two aspects: \item The scheme to route \io requests to specific @io_uring@ instances does not introduce contention. This aspect has oversized importance because it comes into play before the sharding of instances, and as such, all \glspl{hthrd} can contend on the routing algorithm. This aspect is very important because it comes into play before the sharding of instances, and as such, all \glspl{hthrd} can contend on the routing algorithm. \end{itemize} Allocation in this scheme is fairly easy. Free SQEs, \ie, SQEs that are not currently being used to represent a request, can be written-to safely and have a field called @user_data@ that the kernel only reads to copy to CQEs. Free SQEs, \ie, SQEs that are not currently being used to represent a request, can be written-to safely, and have a field called @user_data@ that the kernel only reads to copy to CQEs. Allocation also does not require ordering guarantees as all free SQEs are interchangeable. The only added complexity is that the number of SQEs is fixed, which means allocation can fail. Since CQEs only own a signed 32-bit result, in addition to the copy of the @user_data@ field, all that is needed to communicate the result is a simple future~\cite{wiki:future}. If the submission side does not designate submitters, polling can also submit all SQEs as it is polling events. A simple approach to polling is to allocate a \at per @io_uring@ instance and simply let the poller \ats poll their respective instances when scheduled. With the pool of SQE instances approach, the big advantage is that it is fairly flexible. A simple approach to polling is to allocate a user-level \at per @io_uring@ instance and simply let the poller \ats poll their respective instances when scheduled. The big advantage of the pool of SQE instances approach is that it is fairly flexible. It does not impose restrictions on what \ats submitting \io operations can and cannot do between allocations and submissions. It also can gracefully handle running out of resources, SQEs or the kernel returning @EBUSY@. The routing and allocation algorithm needs to keep track of which ring instances have available SQEs, block incoming requests if no instance is available, prevent barging if \ats are already queued up waiting for SQEs and handle SQEs being freed. The submission side needs to safely append SQEs to the ring buffer, correctly handle chains, make sure no SQE is dropped or left pending forever, notify the allocation side when SQEs can be reused, and handle the kernel returning @EBUSY@. Compared to the private-instance approach, all this synchronization has a significant cost and this synchronization is entirely overhead. All this synchronization has a significant cost, compared to the private-instance approach which does not have synchronization costs in most cases. \subsubsection{Instance borrowing} Both of the prior approaches have undesirable aspects that stem from tight or loose coupling between @io_uring@ and \glspl{proc}. The first approach suffers from tight coupling causing problems when a \gls{proc} does not benefit from the coupling. The second approach suffers from loose couplings causing operations to have synchronization overhead, which tighter coupling avoids. The first approach suffers from tight coupling, causing problems when a \gls{proc} does not benefit from the coupling. 
The second approach suffers from loose couplings, causing operations to have synchronization overhead, which tighter coupling avoids. When \glspl{proc} are continuously issuing \io operations, tight coupling is valuable since it avoids synchronization costs. However, in unlikely failure cases or when \glspl{proc} are not using their instances, tight coupling is no longer advantageous. While instance borrowing looks similar to work sharing and stealing, I think it is different enough to warrant a different verb to avoid confusion.} As mentioned later in this section, this approach is not ultimately used, but here is still an high-level outline of the algorithm. In this approach, each cluster, see Figure~\ref{fig:system}, owns a pool of @io_uring@ instances managed by an \newterm{arbiter}. When a \at attempts to issue an \io operation, it ask for an instance from the arbiter and issues requests to that instance. When a \at attempts to issue an \io operation, it asks for an instance from the arbiter, and issues requests to that instance. This instance is now bound to the \gls{proc} the \at is running on. This binding is kept until the arbiter decides to revoke it, taking back the instance and reverting the \gls{proc} to its initial \io state. \item The current \gls{proc} does not hold an instance. \item The current instance does not have sufficient SQEs to satisfy the request. \item The current \gls{proc} has a wrong instance, this happens if the submitting \at context-switched between allocation and submission, called \newterm{external submissions}. \item The current \gls{proc} has a wrong instance. This happens if the submitting \at context-switched between allocation and submission: \newterm{external submissions}. \end{enumerate} However, even when the arbiter is not directly needed, \glspl{proc} need to make sure that their instance ownership is not being revoked, which is accomplished by a lock-\emph{less} handshake.\footnote{ Note the handshake is not lock-\emph{free} since it lacks the proper progress guarantee.} Note the handshake is not lock-\emph{free}~\cite{wiki:lockfree} since it lacks the proper progress guarantee.} A \gls{proc} raises a local flag before using its borrowed instance and checks if the instance is marked as revoked or if the arbiter has raised its flag. If not, it proceeds, otherwise it delegates the operation to the arbiter. However, there is no need to immediately revoke the instance. External submissions must simply be added to the ring before the next system call, \ie, when the submission ring is flushed. This means whoever is responsible for the system call, first checks if the instance has any external submissions. This means whoever is responsible for the system call first checks whether the instance has any external submissions. If so, it asks the arbiter to revoke the instance and add the external submissions to the ring. \section{Interface} The last important part of the \io subsystem is its interface. The final part of the \io subsystem is its interface. Multiple approaches can be offered to programmers, each with advantages and disadvantages. The new \io subsystem can replace the C runtime API or extend it, and in the latter case, the interface can go from very similar to vastly different. The following sections discuss some useful options using @read@ as an example. The new \CFA \io subsystem can replace the C runtime API or extend it, and in the latter case, the interface can go from very similar to vastly different. 
The following sections discuss some useful options, using @read@ as an example. The standard Linux interface for C is: \begin{cfa} \subsection{Replacement} Replacing the C \glsxtrshort{api} is the more intrusive and draconian approach. The goal is to convince the compiler and linker to replace any calls to @read@ to direct them to the \CFA implementation instead of glibc's. This rerouting has the advantage of working transparently and supporting existing binaries without needing recompilation. It also offers a, presumably, well known and familiar API that C programmers can simply continue to work with. However, this approach also entails a plethora of subtle technical challenges, which generally boils down to making a perfect replacement. If the \CFA interface replaces only \emph{some} of the calls to glibc, then this can easily lead to esoteric concurrency bugs. Since the gcc ecosystem does not offer a scheme for perfect replacement, this approach was rejected as being laudable but infeasible. Replacing the C \io subsystem is the more intrusive and draconian approach. The goal is to convince the compiler and linker to replace any calls to @read@ by calls to the \CFA implementation instead of glibc's. This rerouting has the advantage of working transparently and supporting existing binaries without necessarily needing recompilation. It also offers a presumably well known and familiar API that C programmers can simply continue to work with. %However, this approach also entails a plethora of subtle technical challenges, which generally boil down to making a perfect replacement. However, when using this approach, any and all calls to the C \io subsystem, since using a mix of the C and \CFA \io subsystems can easily lead to esoteric concurrency bugs. This approach was rejected as being laudable but infeasible. \subsection{Synchronous Extension} Another interface option is to offer an interface different in name only. In this approach, an alternative call is created for each supported system calls. For example: \begin{cfa} ssize_t cfa_read(int fd, void *buf, size_t count); \end{cfa} The new @cfa_read@ would have the same interface behaviour and guarantee as the @read@ system call, but allow the runtime system to use user-level blocking instead of kernel-level blocking. This approach is feasible and still familiar to C programmers. It comes with the caveat that any code attempting to use it must be recompiled, which is a problem considering the amount of existing legacy C binaries. It comes with the caveat that any code attempting to use it must be modified, which is a problem considering the amount of existing legacy C binaries. However, it has the advantage of implementation simplicity. Finally, there is a certain irony to using a blocking synchronous interface for a feature often referred to as non-blocking'' \io. future(ssize_t) read(int fd, void *buf, size_t count); \end{cfa} where the generic @future@ is fulfilled when the read completes and it contains the number of bytes read, which may be less than the number of bytes requested. where the generic @future@ is fulfilled when the read completes, with the count of bytes actually read, which may be less than the number of bytes requested. The data read is placed in @buf@. The problem is that both the bytes read and data form the synchronization object, not just the bytes read. Hence, the buffer cannot be reused until the operation completes but the synchronization does not cover the buffer. 
The problem is that both the byte count and data form the synchronization object, not just the bytes read. Hence, the buffer cannot be reused until the operation completes, but the synchronization on the future does not enforce this. A classical asynchronous API is: \begin{cfa} However, it is not the most user-friendly option. It obviously imposes a strong dependency between user code and @io_uring@ but at the same time restricts users to usages that are compatible with how \CFA internally uses @io_uring@. As of writing this document, \CFA offers both a synchronous extension and the first approach to the asynchronous extension: \begin{cfa} ssize_t cfa_read(int fd, void *buf, size_t count); future(ssize_t) async_read(int fd, void *buf, size_t count); \end{cfa}
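For comparison with these extensions, the following sketch performs the same read directly against @io_uring@, written here with the liburing helper library for brevity rather than the raw system calls described earlier; the file name and queue depth are arbitrary.
\begin{cfa}
// Compile with -luring.
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <unistd.h>

int main() {
	struct io_uring ring;
	io_uring_queue_init( 32, &ring, 0 );                    // create the submission/completion rings (32 entries)

	int fd = open( "data.txt", O_RDONLY );                  // arbitrary input file
	char buf[4096];

	struct io_uring_sqe * sqe = io_uring_get_sqe( &ring );  // allocate a free SQE
	io_uring_prep_read( sqe, fd, buf, sizeof(buf), 0 );     // fill it in as a read request
	io_uring_sqe_set_data( sqe, buf );                      // user_data is copied verbatim into the CQE
	io_uring_submit( &ring );                               // publish the SQE (wraps io_uring_enter)

	struct io_uring_cqe * cqe;
	io_uring_wait_cqe( &ring, &cqe );                       // block until a completion is available
	printf( "read returned %d\n", cqe->res );               // result: bytes read or a negative errno
	io_uring_cqe_seen( &ring, cqe );                        // mark the CQE as consumed

	close( fd );
	io_uring_queue_exit( &ring );
}
\end{cfa}
Conceptually, @cfa_read@ and @async_read@ wrap this submit/wait cycle, except that the runtime parks the calling \at instead of blocking the \gls{proc} inside @io_uring_wait_cqe@.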
 r82a90d4 \Celeven introduced threading features, such as the @_Thread_local@ storage class, and libraries @stdatomic.h@ and @threads.h@. Interestingly, almost a decade after the \Celeven standard, the most recent versions of gcc, clang, and msvc do not support the \Celeven include @threads.h@, indicating no interest in the C11 concurrency approach (possibly because of the recent effort to add concurrency to \CC). While the \Celeven standard does not state a threading model, the historical association with pthreads suggests implementations would adopt kernel-level threading (1:1)~\cite{ThreadModel}, as for \CC. While the \Celeven standard does not state a threading model, the historical association with pthreads suggests implementations would adopt kernel-level threading (1:1)~\cite{ThreadModel}, as \CC does. This model uses \glspl{kthrd} to achieve parallelism and concurrency. In this model, every thread of computation maps to an object in the kernel. The kernel then has the responsibility of managing these threads, \eg creating, scheduling, blocking. \section{\glsxtrshort{io}}\label{prev:io} Prior to this work, the \CFA runtime did not add any particular support for \glsxtrshort{io} operations. While all \glsxtrshort{io} operations available in C are available in \CFA, \glsxtrshort{io} operations are designed for the POSIX threading model~\cite{pthreads}. Using these 1:1 threading operations in an M:N threading model means \glsxtrshort{io} operations block \glspl{proc} instead of \ats. While this can work in certain cases, it limits the number of concurrent operations to the number of \glspl{proc} rather than \ats. It also means deadlock can occur because all \glspl{proc} are blocked even if at least one \at is ready to run. A simple example of this type of deadlock would be as follows: Prior to this work, the \CFA runtime did not have any particular support for \glsxtrshort{io} operations. While all \glsxtrshort{io} operations available in C are available in \CFA, \glsxtrshort{io} operations are designed for the POSIX threading model~\cite{pthreads}. Using these 1:1 threading operations in an M:N threading model means \glsxtrshort{io} operations block \glspl{proc} instead of \ats. While this can work in certain cases, it limits the number of concurrent operations to the number of \glspl{proc} rather than \ats. It also means deadlock can occur because all \glspl{proc} are blocked even if at least one \at is ready to run. A simple example of this type of deadlock would be as follows: \begin{quote} \end{quote} Therefore, one of the objectives of this work is to introduce \emph{User-Level \glsxtrshort{io}}, which like \glslink{uthrding}{User-Level \emph{Threading}}, blocks \ats rather than \glspl{proc} when doing \glsxtrshort{io} operations. Therefore, one of the objectives of this work is to introduce \emph{User-Level \glsxtrshort{io}}, which, like \glslink{uthrding}{User-Level \emph{Threading}}, blocks \ats rather than \glspl{proc} when doing \glsxtrshort{io} operations. This feature entails multiplexing the \glsxtrshort{io} operations of many \ats onto fewer \glspl{proc}. The multiplexing requires a single \gls{proc} to execute multiple \glsxtrshort{io} operations in parallel. \section{Interoperating with C} While \glsxtrshort{io} operations are the classical example of operations that block \glspl{kthrd}, the non-blocking challenge extends to all blocking system-calls. 
The POSIX standard states~\cite[\S~2.9.1]{POSIX17}: While \glsxtrshort{io} operations are the classical example of operations that block \glspl{kthrd}, the non-blocking challenge extends to all blocking system-calls. The POSIX standard states~\cite[\S~2.9.1]{POSIX17}: \begin{quote} All functions defined by this volume of POSIX.1-2017 shall be thread-safe, except that the following functions need not be thread-safe. ... (list of 70+ excluded functions) \end{quote} Only UNIX @man@ pages identify whether a library function is thread-safe, and hence, may block on a pthreads lock or system call; hence interoperability with UNIX library functions is a challenge for an M:N threading model. Only UNIX @man@ pages identify whether a library function is thread-safe, and hence, may block on a pthreads lock or system call; hence, interoperability with UNIX library functions is a challenge for an M:N threading model. Languages like Go and Java, which have strict interoperability with C\cite{wiki:jni,go:cgo}, can control operations in C by sandboxing'' them, \eg a blocking function may be delegated to a \gls{kthrd}. Sandboxing may help towards guaranteeing that the kind of deadlock mentioned above does not occur. Languages like Go and Java, which have strict interoperability with C\cite{wiki:jni,go:cgo}, can control operations in C by sandboxing'' them, \eg a blocking function may be delegated to a \gls{kthrd}. Sandboxing may help towards guaranteeing that the kind of deadlock mentioned above does not occur. As mentioned in Section~\ref{intro}, \CFA is binary compatible with C and, as such, must support all C library functions. Furthermore, interoperability can happen at the function-call level, inline code, or C and \CFA translation units linked together. This fine-grained interoperability between C and \CFA has two consequences: As mentioned in Section~\ref{intro}, \CFA is binary compatible with C and, as such, must support all C library functions. Furthermore, interoperability can happen at the function-call level, inline code, or C and \CFA translation units linked together. This fine-grained interoperability between C and \CFA has two consequences: \begin{enumerate} \item Precisely identifying blocking C calls is difficult. Because of these consequences, this work does not attempt to sandbox'' calls to C. Therefore, it is possible calls to an unknown library function can block a \gls{kthrd} leading to deadlocks in \CFA's M:N threading model, which would not occur in a traditional 1:1 threading model. Since the blocking call is not known to the runtime, it is not necessarily possible to distinguish whether or not a deadlock occurs. Currently, all M:N thread systems interacting with UNIX without sandboxing suffer from this problem but manage to work very well in the majority of applications. Therefore, a complete solution to this problem is outside the scope of this thesis.\footnote{\CFA does provide a pthreads emulation, so any library function using embedded pthreads locks is redirected to \CFA user-level locks. This capability further reduces the chances of blocking a \gls{kthrd}.} Therefore, a complete solution to this problem is outside the scope of this thesis. \footnote{\CFA does provide a pthreads emulation, so any library function using embedded pthreads locks is redirected to \CFA user-level locks. This capability further reduces the chances of blocking a \gls{kthrd}.} Chapter~\ref{userio} discusses the interoperability with C chosen and used for the evaluation in Chapter~\ref{macrobench}.