Changeset ddcaff6 for doc/theses/thierry_delisle_PhD
- Timestamp:
- Nov 24, 2022, 3:41:44 PM
- Branches:
- ADT, ast-experimental, master
- Children:
- dacd8e6e
- Parents:
- 82a90d4
- Location:
- doc/theses/thierry_delisle_PhD/thesis
- Files:
- 12 edited
doc/theses/thierry_delisle_PhD/thesis/fig/base_ts2.fig
r82a90d4 rddcaff6
Lines 119-132: text-object metrics (font heights and widths) adjusted for the labels ``Threads'', ``Ready'', ``Array of'', ``Queues'', ``TS'', ``MA'' and ``Processors'', and the label ``Timestamps Copies'' renamed to ``Timestamp Copies''. (xfig coordinate data omitted.)
doc/theses/thierry_delisle_PhD/thesis/local.bib
r82a90d4 rddcaff6

Added entries:

@book{MAN:inteldev,
    key       = {Intel 64 and IA-32 Architectures Software Developer’s Manual},
    title     = {Intel® 64 and IA-32 Architectures Software Developer’s Manual},
    publisher = {Intel{\textregistered}},
    year      = {2016},
    volume    = {3B: System Programming Guide, Part 2},
    url       = {\href{https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html}{https://\-www.intel.com/\-content/\-www/\-us/\-en/\-architecture\--and\--technology/\-64\--ia\--32\--architectures\--software\--developer\--vol\--3b\--part\--2\--manual\-.html}},
}

@misc{MAN:kotlink,
    howpublished = {\href{https://kotlinlang.org/docs/multiplatform-mobile-concurrency-and-coroutines.html}{https://\-kotlinlang.org\-/docs\-/multiplatform\--mobile\--concurrency\--and\--coroutines.html}}
}

@misc{wiki:rdtsc,
    author = "{Wikipedia contributors}",
    title = "Time Stamp Counter --- {W}ikipedia{,} The Free Encyclopedia",
    year = "2022",
    howpublished = "\href{https://en.wikipedia.org/wiki/Time\_Stamp\_Counter}{https://\-en.wikipedia.org/\-wiki/\-Time\-\_Stamp\-\_Counter}",
    note = "[Online; accessed 14-November-2022]"
}

@misc{wiki:lockfree,
    author = "{Wikipedia contributors}",
    title = "Non-blocking algorithm --- {W}ikipedia{,} The Free Encyclopedia",
    year = "2022",
    howpublished = "\href{https://en.wikipedia.org/wiki/Non-blocking_algorithm}{https://en.wikipedia.org/\-wiki/Non\--blocking\-\_algorithm}",
    note = "[Online; accessed 22-November-2022]"
}

@misc{wiki:expected,
    author = "{Wikipedia contributors}",
    title = "Expected value --- {W}ikipedia{,} The Free Encyclopedia",
    year = "2022",
    howpublished = "\href{https://en.wikipedia.org/wiki/Expected_value}{https://en.wikipedia.org/\-wiki/\-Expected\-\_value}",
    note = "[Online; accessed 22-November-2022]"
}

@misc{wiki:softirq,
    author = "{Wikipedia contributors}",
    title = "Interrupt --- {W}ikipedia{,} The Free Encyclopedia",
    year = "2022",
    howpublished = "\href{https://en.wikipedia.org/wiki/Interrupt}{https://en.wikipedia.org/\-wiki/\-Interrupt}",
    note = "[Online; accessed 24-November-2022]"
}
doc/theses/thierry_delisle_PhD/thesis/text/conclusion.tex
r82a90d4 rddcaff6

 Spinning up internal kernel threads to handle blocking scenarios is what developers already do outside of the kernel, and managing these threads adds a significant burden to the system.
 Nonblocking I/O should not be handled in this way.
+Presumably, this is better handled by Windows's ``overlapped I/O''; however, porting \CFA to Windows is far beyond the scope of this work.

 \section{Goals}
…
 As \CFA aims to increase productivity and safety of C, while maintaining its performance, this places a huge burden on the \CFA runtime to achieve these goals.
-Productivity and safety manifest in removing scheduling pitfalls in the efficient usage of the threading runtime.
+Productivity and safety manifest in removing scheduling pitfalls from the efficient usage of the threading runtime.
 Performance manifests in making efficient use of the underlying kernel threads that provide indirect access to the CPUs.
…
 I am aware there is a host of low-power research that could be tapped here.

+\subsection{CPU Workloads}
+A performance consideration related to idle sleep is CPU utilization, \ie, how fully the available \procs can be kept busy.
+CPU utilization generally becomes an issue for workloads that are compute bound but where the dependencies among \ats can prevent the scheduler from easily keeping every \proc busy.
+Examining such workloads in the context of scheduling would be interesting.
+However, such workloads are inherently more complex than the applications examined in this thesis, and as such warrant their own work.

 \subsection{Hardware}
 One challenge that needed to be overcome for this thesis is that modern x86-64 processors have very few tools to implement fairness.
doc/theses/thierry_delisle_PhD/thesis/text/core.tex
r82a90d4 rddcaff6

 In general, the expectation at the centre of this model is that ready \ats do not interfere with each other but simply share the hardware.
 This assumption makes it easier to reason about threading because ready \ats can be thought of in isolation and the effect of the scheduler can be virtually ignored.
-This expectation of \at independence means the scheduler is expected to offer two guarantees:
+This expectation of \at independence means the scheduler is expected to offer two features:
 \begin{enumerate}
-\item A fairness guarantee: a \at that is ready to run is not prevented by another thread.
-\item A performance guarantee: a \at that wants to start or stop running is not prevented by other threads wanting to do the same.
+\item A fairness guarantee: a \at that is ready to run is not prevented by another thread indefinitely, \ie, starvation freedom. This is discussed further in the next section.
+\item A performance goal: given a \at that wants to start running, other threads wanting to do the same do not interfere with it.
 \end{enumerate}

-It is important to note that these guarantees are expected only up to a point.
-\Glspl{at} that are ready to run should not be prevented from doing so, but they still share the limited hardware resources.
-Therefore, the guarantee is considered respected if a \at gets access to a \emph{fair share} of the hardware resources, even if that share is very small.
-
-Similar to the performance guarantee, the lack of interference among threads is only relevant up to a point.
-Ideally, the cost of running and blocking should be constant regardless of contention, but the guarantee is considered satisfied if the cost is not \emph{too high} with or without contention.
+The performance goal, the lack of interference among threads, is only desired up to a point.
+Ideally, the cost of running and blocking should be constant regardless of contention, but the goal is considered satisfied if the cost is not \emph{too high} with or without contention.
 How much is an acceptable cost is obviously highly variable.
 For this document, the performance experimentation attempts to show the cost of scheduling is at worst equivalent to existing algorithms used in popular languages.
 This demonstration can be made by comparing applications built in \CFA to applications built with other languages or other models.
 Recall programmer expectation is that the impact of the scheduler can be ignored.
-Therefore, if the cost of scheduling is competitive with other popular languages, the guarantee is considered achieved.
+Therefore, if the cost of scheduling is competitive with other popular languages, the goal is considered satisfied.
 More precisely the scheduler should be:
 \begin{itemize}
-\item As fast as other schedulers that are less fair.
-\item Faster than other schedulers that have equal or better fairness.
+\item As fast as other schedulers without any fairness guarantee.
+\item Faster than other schedulers that have equal or stronger fairness guarantees.
 \end{itemize}

 \subsection{Fairness Goals}
-For this work, fairness is considered to have two strongly related requirements: true starvation freedom and ``fast'' load balancing.
-
-\paragraph{True starvation freedom} means as long as at least one \proc continues to dequeue \ats, all ready \ats should be able to run eventually, \ie, eventual progress.
+For this work, fairness is considered to have two strongly related requirements:
+
+\paragraph{Starvation freedom} means as long as at least one \proc continues to dequeue \ats, all ready \ats should be able to run eventually, \ie, eventual progress.
+Starvation freedom can be bounded or unbounded.
+In the bounded case, every \at should be able to run within a fixed bound, relative to its own enqueue.
+Unbounded starvation freedom only requires that each \at eventually runs.
+The \CFA scheduler aims to guarantee unbounded starvation freedom.
 In any running system, a \proc can stop dequeuing \ats if it starts running a \at that never blocks.
-Without preemption, traditional work-stealing schedulers do not have starvation freedom in this case.
-Now, this requirement begs the question, what about preemption?
+Without preemption, traditional work-stealing schedulers do not have starvation freedom, bounded or unbounded.
+Now, this requirement raises the question, what about preemption?
 Generally speaking, preemption happens on the timescale of several milliseconds, which brings us to the next requirement: ``fast'' load balancing.

-\paragraph{Fast load balancing} means that load balancing should happen faster than preemption would normally allow.
-For interactive applications that need to run at 60, 90 or 120 frames per second, \ats having to wait for several milliseconds to run are effectively starved.
-Therefore load-balancing should be done at a faster pace, one that can detect starvation at the microsecond scale.
-With that said, this is a much fuzzier requirement since it depends on the number of \procs, the number of \ats and the general \gls{load} of the system.
+\paragraph{Fast load balancing} means that while eventual progress is guaranteed, it is important to state on which timescale this progress is expected to happen.
+Indeed, while a scheduler with bounded starvation freedom is beyond the scope of this work, offering a good expected bound in the mathematical sense~\cite{wiki:expected} is desirable.
+The expected bound on starvation freedom should be tighter than what preemption normally allows.
+For interactive applications that need to run at 60, 90 or 120 frames per second, \ats having to wait milliseconds to run are effectively starved.
+Therefore load-balancing should be done at a faster pace: one that is expected to detect starvation at the microsecond scale.

 \subsection{Fairness vs Scheduler Locality} \label{fairnessvlocal}
…
 For a scheduler, having good locality, \ie, having the data local to each \gls{hthrd}, generally conflicts with fairness.
-Indeed, good locality often requires avoiding the movement of cache lines, while fairness requires dynamically moving a \at, and as consequence cache lines, to a \gls{hthrd} that is currently available.
-Note that this section discusses \emph{internal locality}, \ie, the locality of the data used by the scheduler versus \emph{external locality}, \ie, how scheduling affects the locality of the application's data.
+Indeed, good locality often requires avoiding the movement of cache lines, while fairness requires dynamically moving a \at, and as a consequence cache lines, to a \gls{hthrd} that is currently available.
+Note that this section discusses \emph{internal locality}, \ie, the locality of the data used by the scheduler, versus \emph{external locality}, \ie, how scheduling affects the locality of the application's data.
 External locality is a much more complicated subject and is discussed in the next section.

-However, I claim that in practice it is possible to strike a balance between fairness and performance because these goals do not necessarily overlap temporally.
-Figure~\ref{fig:fair} shows a visual representation of this behaviour.
-As mentioned, some unfairness is acceptable; therefore it is desirable to have an algorithm that prioritizes cache locality as long as thread delay does not exceed the execution mental model.
+However, I claim that in practice it is possible to strike a balance between fairness and performance because these requirements do not necessarily overlap temporally.
+Figure~\ref{fig:fair} shows a visual representation of this effect.
+As mentioned, some unfairness is acceptable; for example, once the bounded starvation guarantee is met, additional fairness will not satisfy it \emph{more}.
+Inversely, once a \at's data is evicted from cache, its locality cannot worsen.
+Therefore it is desirable to have an algorithm that prioritizes cache locality as long as the fairness guarantee is also satisfied.

 \begin{figure}
…
 \subsection{Performance Challenges}\label{pref:challenge}
 While there exists a multitude of potential scheduling algorithms, they generally always have to contend with the same performance challenges.
-Since these challenges are recurring themes in the design of a scheduler it is relevant to describe the central ones here before looking at the design.
+Since these challenges are recurring themes in the design of a scheduler, it is relevant to describe them here before looking at the scheduler's design.
+
+\subsubsection{Latency}
+The most basic performance metric of a scheduler is scheduling latency.
+This measures how long it takes for a \at to run once scheduled, including the cost of scheduling itself.
+This measure includes both the sequential cost of the operation itself and also its scalability.

 \subsubsection{Scalability}
-The most basic performance challenge of a scheduler is scalability.
-Given a large number of \procs and an even larger number of \ats, scalability measures how fast \procs can enqueue and dequeue \ats.
+Given a large number of \procs and an even larger number of \ats, scalability measures how fast \procs can enqueue and dequeue \ats relative to the available parallelism.
 One could expect that doubling the number of \procs would double the rate at which \ats are dequeued, but contention on the internal data structure of the scheduler can diminish the improvements.
 While the ready queue itself can be sharded to alleviate the main source of contention, auxiliary scheduling features, \eg counting ready \ats, can also be sources of contention.
+In Chapter~\ref{microbench}, scalability is measured as $\#procs \times \frac{ns}{ops}$, \ie, the number of \procs times the total time over the total number of operations.
+Since the total number of operations should scale with the number of \procs, this gives a measure of how much each additional \proc affects the other \procs.

 \subsubsection{Migration Cost}
…
 In general, a na\"{i}ve \glsxtrshort{fifo} ready-queue does not scale with increased parallelism from \glspl{hthrd}, resulting in decreased performance.
 The problem is a single point of contention when adding/removing \ats.
-As shown in the evaluation sections, most production schedulers do scale when adding \glspl{hthrd}.
-The solution to this problem is to shard the ready queue: create multiple \emph{sub-queues} forming the logical ready-queue and the sub-queues are accessed by multiple \glspl{hthrd} without interfering.
-
-Before going into the design of \CFA's scheduler, it is relevant to discuss two sharding solutions that served as the inspiration scheduler in this thesis.
+As is shown in the evaluation sections, most production schedulers do scale when adding \glspl{hthrd}.
+The solution to this problem is to shard the ready queue: create multiple \emph{sub-queues} forming the logical ready-queue.
+The sub-queues are accessed by multiple \glspl{hthrd} without the need for communication.
+
+Before going into the design of \CFA's scheduler, it is relevant to discuss two sharding solutions that served as the inspiration for the scheduler in this thesis.
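The scalability metric from the Scalability discussion above can be illustrated with a small helper. This is an editorial sketch of the formula in the text, with an invented function name; it is not code from the thesis.

```c
#include <assert.h>

// Scalability metric from the text: #procs * (total time / total operations).
// With perfect scaling, doubling the procs doubles the operations completed in
// the same wall-clock time, so the value stays flat; contention makes it grow.
static double scalability_ns_per_op(unsigned procs, double total_ns, double total_ops) {
    return (double)procs * (total_ns / total_ops);
}
```

If two \procs complete twice the operations of one \proc in the same time, the metric is unchanged; any increase signals that an additional \proc is slowing the others down.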
 \subsection{Work-Stealing}
…
 As mentioned in \ref{existing:workstealing}, a popular sharding approach for the ready queue is work-stealing.
 In this approach, each \gls{proc} has its own local sub-queue and \glspl{proc} only access each other's sub-queue if they run out of work on their local ready-queue.
-The interesting aspect of work stealing happens in the steady-state scheduling case, \ie all \glspl{proc} have work and no load balancing is needed.
-In this case, work stealing is close to optimal scheduling: it can achieve perfect locality and have no contention.
-On the other hand, work-stealing schedulers only attempt to do load-balancing when a \gls{proc} runs out of work.
+The interesting aspect of work stealing manifests itself in the steady-state scheduling case, \ie all \glspl{proc} have work and no load balancing is needed.
+In this case, work stealing is close to optimal scheduling latency: it can achieve perfect locality and have no contention.
+On the other hand, work-stealing only attempts to do load-balancing when a \gls{proc} runs out of work.
 This means that the scheduler never balances unfair loads unless they result in a \gls{proc} running out of work.
-Chapter~\ref{microbench} shows that, in pathological cases, work stealing can lead to indefinite starvation.
-
-Based on these observations, the conclusion is that a \emph{perfect} scheduler should behave similarly to work-stealing in the steady-state case, but load balance proactively when the need arises.
+Chapter~\ref{microbench} shows that, in pathological cases, work stealing can lead to unbounded starvation.
+
+Based on these observations, the conclusion is that a \emph{perfect} scheduler should behave similarly to work-stealing in the steady-state case, \ie, avoid migrations in well-balanced workloads, but load balance proactively when the need arises.

 \subsection{Relaxed-FIFO}
-A different scheduling approach is to create a ``relaxed-FIFO'' queue, as in \cite{alistarh2018relaxed}.
+A different scheduling approach is the ``relaxed-FIFO'' queue, as in \cite{alistarh2018relaxed}.
 This approach forgoes any ownership between \gls{proc} and sub-queue, and simply creates a pool of sub-queues from which \glspl{proc} pick.
 Scheduling is performed as follows:
…
 \item
 Timestamps are added to each element of a sub-queue.
 \item
-A \gls{proc} randomly tests sub-queues until it has acquired one or two queues.
+A \gls{proc} randomly tests sub-queues until it has acquired one or two queues, referred to as \newterm{randomly picking} or \newterm{randomly helping}.
 \item
 If two queues are acquired, the older of the two \ats is dequeued from the front of the acquired queues.
…
 However, \glspl{proc} eagerly search for these older elements instead of focusing on specific queues, which negatively affects locality.

-While this scheme has good fairness, its performance suffers.
-It requires wide sharding, \eg at least 4 queues per \gls{hthrd}, and finding non-empty queues is difficult when there are few ready \ats.
+While this scheme has good fairness, its performance can be improved.
+Wide sharding is generally desired, \eg at least 4 queues per \proc, and randomly picking non-empty queues is difficult when there are few ready \ats.
+The next sections describe improvements I made to this existing algorithm.
+However, ultimately the ``relaxed-FIFO'' queue is not used as the basis of the \CFA scheduler.

 \section{Relaxed-FIFO++}
-The inherent fairness and good performance with many \ats make the relaxed-FIFO queue a good candidate to form the basis of a new scheduler.
+The inherent fairness and decent performance with many \ats make the relaxed-FIFO queue a good candidate to form the basis of a new scheduler.
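The relaxed-FIFO dequeue described above, pick two random sub-queues and take the older of the two head \ats, can be sketched as follows. All names, the list layout, and the absence of per-queue locking are illustrative assumptions; this is not the algorithm's published implementation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NQUEUES 64   // wide sharding, e.g. several sub-queues per processor

typedef struct thread_desc {
    uint64_t ts;                 // timestamp assigned when the thread is enqueued
    struct thread_desc *next;
} thread_desc;

typedef struct {
    thread_desc *head, *tail;    // a real implementation also needs a lock per sub-queue
} subqueue;

static subqueue queues[NQUEUES];

static void push(subqueue *q, thread_desc *t) {
    t->next = NULL;
    if (q->tail) q->tail->next = t; else q->head = t;
    q->tail = t;
}

static thread_desc *pop(subqueue *q) {
    thread_desc *t = q->head;
    if (t) { q->head = t->next; if (!q->head) q->tail = NULL; }
    return t;
}

// Relaxed-FIFO dequeue sketch: test two random sub-queues and dequeue the
// thread with the older (smaller) timestamp, approximating global FIFO order.
static thread_desc *relaxed_fifo_pop(void) {
    subqueue *a = &queues[rand() % NQUEUES];
    subqueue *b = &queues[rand() % NQUEUES];
    if (!a->head) a = b;                      // fall back if the first pick is empty
    if (!b->head || a == b) return pop(a);
    return pop(a->head->ts <= b->head->ts ? a : b);
}
```

The two-random-choice comparison is what trades strict FIFO order for scalability: each dequeue only contends on the one or two sub-queues it touches.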
 The problem case is workloads where the number of \ats is barely greater than the number of \procs.
 In these situations, the wide sharding of the ready queue means most of its sub-queues are empty.
…
 As this is the most obvious challenge, it is worth addressing first.
-The obvious solution is to supplement each sharded sub-queue with data that indicates if the queue is empty/nonempty to simplify finding nonempty queues, \ie ready \glspl{at}.
-This sharded data can be organized in different forms, \eg a bitmask or a binary tree that tracks the nonempty sub-queues.
-Specifically, many modern architectures have powerful bitmask manipulation instructions or searching a binary tree has good Big-O complexity.
+The seemingly obvious solution is to supplement each sharded sub-queue with data that indicates whether the queue is empty/nonempty.
+This simplifies finding nonempty queues, \ie ready \glspl{at}.
+The sharded data can be organized in different forms, \eg a bitmask or a binary tree that tracks the nonempty sub-queues, using a bit or a node per sub-queue, respectively.
+Specifically, many modern architectures have powerful bitmask manipulation instructions, and searching a binary tree has good Big-O complexity.
 However, precisely tracking nonempty sub-queues is problematic.
 The reason is that the sub-queues are initially sharded with a width presumably chosen to avoid contention.
-However, tracking which ready queue is nonempty is only useful if the tracking data is dense, \ie denser than the sharded sub-queues.
+However, tracking which ready queue is nonempty is only useful if the tracking data is dense, \ie tracks whether multiple sub-queues are empty.
 Otherwise, it does not provide useful information because reading this new data structure risks being as costly as simply picking a sub-queue at random.
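The bitmask form of the tracking idea above can be sketched with C11 atomics; the fetch-and-or/fetch-and-and updates are exactly the dense writes that the surrounding text identifies as a new source of contention. Names are invented for illustration, and this is not the \CFA implementation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

// One bit per sub-queue: set when an enqueue makes a queue nonempty, cleared
// when a dequeue empties it. Dense by design: 64 sub-queues share one word,
// which is what makes the constant updates a contention hot spot.
static _Atomic uint64_t nonempty_mask;

static void mark_nonempty(int q) {
    atomic_fetch_or_explicit(&nonempty_mask, UINT64_C(1) << q, memory_order_relaxed);
}

static void mark_empty(int q) {
    atomic_fetch_and_explicit(&nonempty_mask, ~(UINT64_C(1) << q), memory_order_relaxed);
}

// Find some nonempty sub-queue, or -1 if all are empty. __builtin_ctzll
// (count trailing zeros) is an example of the "powerful bitmask manipulation
// instructions" the text mentions.
static int find_nonempty(void) {
    uint64_t m = atomic_load_explicit(&nonempty_mask, memory_order_relaxed);
    return m ? __builtin_ctzll(m) : -1;
}
```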
 But if the tracking mechanism \emph{is} denser than the sharded sub-queues, then constant updates invariably create a new source of contention.
…

 \subsection{Dynamic Entropy}\cite{xkcd:dynamicentropy}
-The Relaxed-FIFO approach can be made to handle the case of mostly empty sub-queues by tweaking the \glsxtrlong{prng}.
+The Relaxed-FIFO approach can be made to handle the case of mostly empty sub-queues by tweaking the \glsxtrlong{prng} that drives the random picking of sub-queues.
 The \glsxtrshort{prng} state can be seen as containing a list of all the future sub-queues that will be accessed.
 While this concept is not particularly useful on its own, the consequence is that if the \glsxtrshort{prng} algorithm can be run \emph{backwards}, then the state also contains a list of all the sub-queues that were accessed.
-Luckily, bidirectional \glsxtrshort{prng} algorithms do exist, \eg some Linear Congruential Generators \cite{wiki:lcg} support running the algorithm backwards while offering good quality and performance.
+Luckily, bidirectional \glsxtrshort{prng} algorithms do exist, \eg some Linear Congruential Generators~\cite{wiki:lcg} support running the algorithm backwards while offering good quality and performance.
 This particular \glsxtrshort{prng} can be used as follows:
 \begin{itemize}
 \item
 Each \proc maintains two \glsxtrshort{prng} states, referred to as $F$ and $B$.
 \item
-When a \proc attempts to dequeue a \at, it picks a sub-queue by running $B$ backwards.
-\item
-When a \proc attempts to enqueue a \at, it runs $F$ forward picking a sub-queue to enqueue to.
-If the enqueue is successful, state $B$ is overwritten with the content of $F$.
+When a \proc attempts to dequeue a \at, it picks a sub-queue by running its $B$ backwards.
+\item
+When a \proc attempts to enqueue a \at, it runs its $F$ forward, picking a sub-queue to enqueue to.
+If the enqueue is successful, the state of its $B$ is overwritten with the content of its $F$.
 \end{itemize}
 The result is that each \proc tends to dequeue \ats that it has itself enqueued.
 When most sub-queues are empty, this technique increases the odds of finding \ats at a very low cost, while also offering an improvement on locality in many cases.

-Tests showed this approach performs better than relaxed-FIFO in many cases.
+My own tests showed this approach performs better than relaxed-FIFO in many cases.
 However, it is still not competitive with work-stealing algorithms.
-The fundamental problem is that the constant randomness limits how much locality the scheduler offers.
+The fundamental problem is that the randomness limits how much locality the scheduler offers.
 This becomes problematic both because the scheduler is likely to get cache misses on internal data structures and because migrations become frequent.
 Therefore, the attempt to modify the relaxed-FIFO algorithm to behave more like work stealing did not pan out.
…
 Before attempting to dequeue from a \proc's sub-queue, the \proc must make some effort to ensure other sub-queues are not being neglected.
 To make this possible, \procs must be able to determine which \at has been on the ready queue the longest.
-Second, the relaxed-FIFO approach needs timestamps for each \at to make this possible.
+Second, the relaxed-FIFO approach uses timestamps, denoted TS, for each \at to make this possible.
+These timestamps can be added to work stealing.

 \begin{figure}
 \centering
 \input{base.pstex_t}
-\caption[Base \CFA design]{Base \CFA design \smallskip\newline A pool of sub-queues offers the sharding, two per \proc.
+\caption[Base \CFA design]{Base \CFA design \smallskip\newline It uses a pool of sub-queues, with a sharding of two sub-queues per \proc.
 Each \gls{proc} can access all of the sub-queues.
 Each \at is timestamped when enqueued.}
…
 Figure~\ref{fig:base} shows the algorithm structure.
 This structure is similar to classic work-stealing except the sub-queues are placed in an array so \procs can access them in constant time.
-Sharding width can be adjusted based on contention.
-Note, as an optimization, the TS of a \at is stored in the \at in front of it, so the first TS is in the array and the last \at has no TS.
+Sharding can be adjusted based on contention.
+As an optimization, the timestamp of a \at is stored in the \at in front of it, so the first TS is in the array and the last \at has no TS.
 This organization keeps the highly accessed front TSs directly in the array.
 When a \proc attempts to dequeue a \at, it first picks a random remote sub-queue and compares its timestamp to the timestamps of its local sub-queue(s).
-The oldest waiting \at is dequeued to provide global fairness.
+The oldest waiting of the compared \ats is dequeued.
+In this document, picking from a remote sub-queue in this fashion is referred to as ``helping''.
+
+The timestamps are measured using the CPU's hardware timestamp counters~\cite{wiki:rdtsc}.
+These provide a 64-bit counter that tracks the number of cycles since the CPU was powered on.
+Assuming the CPU runs at less than 5~GHz, this means the 64-bit counter takes over a century before overflowing.
+This is true even on 32-bit CPUs, where the counter is generally still 64-bit.
+However, on many architectures, the instructions to read the counter do not have any particular ordering guarantees.
+Since the counter does not depend on any data in the CPU pipeline, there is significant flexibility for the instruction to be executed out of order, which limits the accuracy to a window of code.
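On x86-64, the counter described above is read with the rdtsc instruction. The sketch below is an editorial illustration, not the runtime's code; it also shows rdtscp, a variant that waits for preceding instructions to retire before sampling, narrowing the out-of-order window the text describes.

```c
#include <assert.h>
#include <stdint.h>

// Read the time-stamp counter. A plain rdtsc may be reordered by the CPU
// relative to surrounding instructions, so the value is only accurate to a
// window of code, as noted in the text.
static inline uint64_t read_tsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

// rdtscp waits for prior instructions to complete before reading the counter,
// a cheaper partial ordering than a full serializing fence.
static inline uint64_t read_tscp(void) {
    uint32_t lo, hi, aux;
    __asm__ __volatile__("rdtscp" : "=a"(lo), "=d"(hi), "=c"(aux));
    return ((uint64_t)hi << 32) | lo;
}
```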
+Finally, another issue that can come up with timestamp counters is synchronization between \glspl{hthrd}.
+This appears to be mostly a historical concern, as recent CPUs offer more synchronization guarantees.
+For example, Intel supports ``Invariant TSC''~\cite[\S~17.15.1]{MAN:inteldev}, which is guaranteed to be synchronized across \glspl{hthrd}.

 However, this na\"ive implementation has performance problems.
-First, it is necessary to have some damping effect on helping.
-Random effects like cache misses and preemption can add spurious but short bursts of latency negating the attempt to help.
-These bursts can cause increased migrations and make this work-stealing approach slow down to the level of relaxed-FIFO.
+First, it is necessary to avoid helping when it does not improve fairness.
+Random effects like cache misses and preemption can add unpredictable but short bursts of latency but do not warrant the cost of helping.
+These bursts can cause increased migrations, at which point the same locality problems as in the relaxed-FIFO approach start to appear.

 \begin{figure}
…
 A simple solution to this problem is to use an exponential moving average~\cite{wiki:ma} (MA) instead of a raw timestamp, as shown in Figure~\ref{fig:base-ma}.
-Note that this is more complex because the \at at the head of a sub-queue is still waiting, so its wait time has not ended.
+Note that this is more complex than it can appear because the \at at the head of a sub-queue is still waiting, so its wait time has not ended.
 Therefore, the exponential moving average is an average of how long each dequeued \at has waited.
 To compare sub-queues, the timestamp at the head must be compared to the current time, yielding the best-case wait time for the \at at the head of the queue.
 This new waiting is averaged with the stored average.
 To further limit \glslink{atmig}{migrations}, a bias can be added to a local sub-queue, where a remote sub-queue is helped only if its moving average is more than $X$ times the local sub-queue's average.
-Tests for this approach indicate the choice of the weight for the moving average or the bias is not important, \ie weights and biases of similar \emph{magnitudes} have similar effects.
-
-With these additions to work stealing, scheduling can be made as fair as the relaxed-FIFO approach, avoiding the majority of unnecessary migrations.
+Tests for this approach indicate the precise values for the weight of the moving average and the bias are not important, \ie weights and biases of similar \emph{magnitudes} have similar effects.
+
+With these additions to work stealing, scheduling can satisfy the starvation-freedom guarantee while suffering much less from unnecessary migrations than the relaxed-FIFO approach.
 Unfortunately, the work to achieve fairness has a performance cost, especially when the workload is inherently fair, and hence, there is only short-term unfairness or no starvation.
-The problem is that the constant polling, \ie reads, of remote sub-queues generally entails cache misses because the TSs are constantly being updated, \ie, writes.
-To make things worst, remote sub-queues that are very active, \ie \ats are frequently enqueued and dequeued from them, lead to higher chances that polling will incur a cache-miss.
+The problem is that the constant polling, \ie reads, of remote sub-queues generally entails cache misses because the TSs are constantly being updated.
+To make things worse, remote sub-queues that are very active, \ie \ats are frequently enqueued and dequeued from them, lead to higher chances that polling will incur a cache-miss.
 Conversely, the active sub-queues do not benefit much from helping since starvation is already a non-issue.
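The moving-average bookkeeping and the biased helping test above can be sketched as follows. The weight alpha and the bias X are the tunables whose exact values the text reports as unimportant; field and function names are assumptions for illustration, not the thesis code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t head_ts;   // timestamp of the thread at the head (still waiting)
    double   avg_wait;  // exponential moving average of observed waits
} subqueue_stats;

// Best-case wait of the head thread: it is still waiting, so its wait time
// has not ended; compare its enqueue timestamp against the current time.
static double current_wait(const subqueue_stats *q, uint64_t now) {
    return (double)(now - q->head_ts);
}

// Fold the newly observed wait into the stored average with weight alpha.
static void ema_update(subqueue_stats *q, uint64_t now, double alpha) {
    q->avg_wait = alpha * current_wait(q, now) + (1.0 - alpha) * q->avg_wait;
}

// Help a remote sub-queue only if its moving average exceeds X times the
// local average: the bias damps helping triggered by short latency bursts.
static bool should_help(const subqueue_stats *local, const subqueue_stats *remote,
                        double X) {
    return remote->avg_wait > X * local->avg_wait;
}
```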
260 285 This puts this algorithm in the awkward situation of paying for a largely unnecessary cost. … … 264 289 The problem with polling remote sub-queues is that correctness is critical. 265 290 There must be a consensus among \procs on which sub-queues hold which \ats, as the \ats are in constant motion. 266 Furthermore, since timestamps are used for fairness, it is critical t o have a consensus on which \at is the oldest.291 Furthermore, since timestamps are used for fairness, it is critical that the oldest \ats eventually be recognized as such. 267 292 However, when deciding if a remote sub-queue is worth polling, correctness is less of a problem. 268 293 Since the only requirement is that a sub-queue is eventually polled, some data staleness is acceptable. … … 278 303 \centering 279 304 \input{base_ts2.pstex_t} 280 \caption[\CFA design with Redundant Timestamps]{\CFA design with Redundant Timestamps \smallskip\newline An array is addedcontaining a copy of the timestamps.305 \caption[\CFA design with Redundant Timestamps]{\CFA design with Redundant Timestamps \smallskip\newline This design uses an array containing a copy of the timestamps. 281 306 These timestamps are written-to with relaxed atomics, so there is no order among concurrent memory accesses, leading to fewer cache invalidations.} 282 307 \label{fig:base-ts2} … … 285 310 The correctness argument is somewhat subtle. 286 311 The data used for deciding whether or not to poll a queue can be stale as long as it does not cause starvation. 287 Therefore, it is acceptable if stale data makes queues appear older than they are but appearing fresher can be a problem.312 Therefore, it is acceptable if stale data makes queues appear older than they are, but appearing fresher can be a problem. 288 313 For the timestamps, this means it is acceptable to miss writes to the timestamp since they make the head \at look older. 
289 314 For the moving average, as long as the operations are just atomic reads/writes, the average is guaranteed to yield a value that is between the oldest and newest values written. … … 292 317 With redundant timestamps, this scheduling algorithm achieves both the fairness and performance requirements on most machines. 293 318 The problem is that the cost of polling and helping is not necessarily consistent across each \gls{hthrd}. 294 For example on machines with a CPU containing multiple hyper threads and cores and multiple CPU sockets, cache misses can be satisfied from the caches on the same (local) CPU, or by a CPU on a different (remote) socket.319 For example on machines with multiple CPUs, cache misses can be satisfied from the caches on the same (local) CPU, or by the caches on a different (remote) CPU. 295 320 Cache misses satisfied by a remote CPU have significantly higher latency than from the local CPU. 296 321 However, these delays are not specific to systems with multiple CPUs. … … 312 337 Figures~\ref{fig:cache-share} and~\ref{fig:cache-noshare} show two different cache topologies that highlight this difference. 313 338 In Figure~\ref{fig:cache-share}, all cache misses are either private to a CPU or shared with another CPU. 314 This means latency due to cache misses is fairly consistent.315 In contrast, in Figure~\ref{fig:cache-noshare} misses in the L2 cache can be satisfied by either instance of the L3 cache.339 This means that latency due to cache misses is fairly consistent. 340 In contrast, in Figure~\ref{fig:cache-noshare}, misses in the L2 cache can be satisfied by either instance of the L3 cache. 316 341 However, the memory-access latency to the remote L3 is higher than the memory-access latency to the local L3. 
317 The impact of these different designs on this algorithm is that scheduling only scales well on architectures with a wide L3 cache, similar to Figure~\ref{fig:cache-share}, and less well on architectures with many narrower L3 cache instances, similar to Figure~\ref{fig:cache-noshare}.342 The impact of these different designs on this algorithm is that scheduling only scales well on architectures with the L3 cache shared across many \glspl{hthrd}, similar to Figure~\ref{fig:cache-share}, and less well on architectures with many L3 cache instances and less sharing, similar to Figure~\ref{fig:cache-noshare}. 318 343 Hence, as the number of L3 instances grows, so too does the chance that the random helping causes significant cache latency. 319 344 The solution is for the scheduler to be aware of the cache topology. … … 323 348 Unfortunately, there is no portable way to discover cache topology, and it is outside the scope of this thesis to solve this problem. 324 349 This work uses the cache topology information from Linux's @/sys/devices/system/cpu@ directory. 325 This leaves the challenge of matching \procs to cache structure, or more precisely identifying which sub-queues of the ready queue are local to which subcomponents of the cache structure.350 This leaves the challenge of matching \procs to cache structure, or more precisely, identifying which sub-queues of the ready queue are local to which subcomponents of the cache structure. 
326 351 Once a match is generated, the helping algorithm is changed to add bias so that \procs more often help sub-queues local to the same cache substructure.\footnote{ 327 Note that like other biases mentioned in this section, the actual bias value does not appear to need precise tuning.} 352 Note that like other biases mentioned in this section, the actual bias value does not appear to need precise tuning beyond the order of magnitude.} 328 353 329 354 The simplest approach for mapping sub-queues to cache structure is to statically tie sub-queues to CPUs. … … 335 360 However, it can still cause some subtle fairness problems in systems with few \procs and many \glspl{hthrd}. 336 361 In this case, the large number of sub-queues and the bias against sub-queues tied to different cache substructures make it unlikely that every sub-queue is picked. 337 To make things worst, the small number of \procs means that few helping attempts are made. 362 To make things worse, the small number of \procs means that few helping attempts are made. 338 363 This combination of low selection and few helping attempts allows a \at to become stranded on a sub-queue for a long time until it gets randomly helped. 339 On a system with 2 \procs, 256 \glspl{hthrd} with narrow cache sharing, and a 100:1 bias, it can take multiple seconds for a \at to get dequeued from a remote queue. 364 On a system with 2 \procs, 256 \glspl{hthrd}, and a 100:1 bias, it can take multiple seconds for a \at to get dequeued from a remote queue. 365 In this scenario, where each \proc attempts to help on 50\% of dequeues, the probability that a remote sub-queue gets help is $\frac{1}{51200}$ and follows a geometric distribution. 366 Therefore, the probability that the remote sub-queue gets help within the next 100'000 dequeues is only 85\%.
367 Assuming dequeues happen every 100ns, there is still a 15\% chance that a \at could starve for more than 10ms and a 1\% chance that the \at starves for 33.33ms, the maximum latency tolerated for interactive applications. 368 If few \glspl{hthrd} share each cache instance, the probability that a \at is on a remote sub-queue becomes high. 340 369 Therefore, a more dynamic match of sub-queues to cache instances is needed. … … 343 372 \label{s:TopologicalWorkStealing} 344 373 The approach used in the \CFA scheduler is to have per-\proc sub-queues, but have an explicit data structure to track which cache substructure each sub-queue is tied to. 345 This tracking requires some finesse because reading this data structure must lead to fewer cache misses than not having the data structure in the first place. 374 This tracking requires some finesse, because reading this data structure must lead to fewer cache misses than not having the data structure in the first place. 346 375 A key element, however, is that, like the timestamps for helping, reading the cache instance mapping only needs to give the correct result \emph{often enough}. 347 Therefore the algorithm can be built as follows: before enqueueing or dequeuing a \at, each \proc queries the CPU id and the corresponding cache instance. 348 Since sub-queues are tied to \procs, each \proc can then update the cache instance mapped to the local sub-queue(s). 376 Therefore, the algorithm can be built as follows: before enqueueing or dequeuing a \at, a \proc queries the CPU id and the corresponding cache instance. 377 Since sub-queues are tied to \procs, a \proc can then update the cache instance mapped to the local sub-queue(s). 349 378 To avoid unnecessary cache line invalidation, the map is only written-to if the mapping changes. 350 379 -
doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex
r82a90d4 rddcaff6 1 1 \chapter{Macro-Benchmarks}\label{macrobench} 2 The previous chapter demonstrated the \CFA scheduler achieves its equivalent performance goal in small and controlled \at-scheduling scenarios. 2 The previous chapter demonstrated that the \CFA scheduler achieves its equivalent performance goal in small and controlled \at-scheduling scenarios. 3 3 The next step is to demonstrate that this performance holds in more realistic and complete scenarios. 4 Therefore, this chapter exercises both \at and I/O scheduling using two flavours of web servers that demonstrate \CFA performs competitively compared to web servers used in production environments. 4 Therefore, this chapter exercises both \at and I/O scheduling using two flavours of web servers that demonstrate that \CFA performs competitively compared to web servers used in production environments. 5 5 6 6 Web servers are chosen because they offer fairly simple applications that perform complex I/O, both network and disk, and are useful as standalone products. … … 10 10 As such, these experiments should highlight the overhead due to any \CFA fairness cost in realistic scenarios. 11 11 12 The most obvious performance metric for web servers is throughput. 13 This metric generally measures the speed at which the server answers requests and, relatedly, how fast clients can send requests before the server can no longer keep up. 14 Another popular performance metric is \newterm{tail} latency, which indicates some notion of fairness among requests across the experiment, \ie do some requests wait longer than other requests for service? 15 Since many web applications rely on a combination of different queries made in parallel, the latency of the slowest response, \ie tail latency, can dictate a performance perception. 16 12 17 \section{Memcached} 13 18 Memcached~\cite{memcached} is an in-memory key-value store used in many production environments, \eg \cite{atikoglu2012workload}.
… … 26 31 Each CPU has 6 cores and 2 \glspl{hthrd} per core, for a total of 24 \glspl{hthrd}. 27 32 \item 33 The machine is configured to run each server on 12 dedicated \glspl{hthrd} and uses 6 of the remaining \glspl{hthrd} for the software interrupt handling~\cite{wiki:softirq}, resulting in a maximum CPU utilization of 75\% (18 / 24 \glspl{hthrd}). 34 \item 28 35 A CPU has 384 KB, 3 MB and 30 MB of L1, L2 and L3 caches, respectively. 29 36 \item … … 47 54 \item 48 55 For UDP connections, all the threads listen to a single UDP socket for incoming requests. 49 Threads that are not currently dealing with another request ignore the incoming packet. 56 Threads that are currently dealing with another request ignore the incoming packet. 50 57 One of the remaining, non-busy, threads reads the request and sends the response. 51 58 This implementation can lead to increased CPU \gls{load} as threads wake from sleep to potentially process the request. … … 79 86 \subsection{Throughput} \label{memcd:tput} 80 87 This experiment is done by having the clients establish 15,360 total connections, which persist for the duration of the experiment. 81 The clients then send read and write queries with only 3\% writes (updates), attempting to follow a desired query rate, and the server responds to the desired rate as best as possible. 88 The clients then send read and write queries with 3\% writes (updates), attempting to follow a desired query rate, and the server responds to the desired rate as best as possible. 82 89 Figure~\ref{fig:memcd:rate:qps} shows the 3 server versions at different client rates, ``Target \underline{Q}ueries \underline{P}er \underline{S}econd'', and the actual rate, ``Actual QPS'', for all three web servers.
83 90 … … 104 111 105 112 \subsection{Tail Latency} 106 Another popular performance metric is \newterm{tail} latency, which indicates some notion of fairness among requests across the experiment, \ie do some requests wait longer than other requests for service? 107 Since many web applications rely on a combination of different queries made in parallel, the latency of the slowest response, \ie tail latency, can dictate a performance perception. 108 113 Figure~\ref{fig:memcd:rate:tail} shows the 99th percentile latency results for the same Memcached experiment. 109 114 110 115 Again, each experiment is run 15 times with the median, maximum and minimum plotted with different lines. 111 As expected, the latency starts low and increases as the server gets close to saturation, at which point, the latency increases dramatically because the web servers cannot keep up with the connection rate so client requests are disproportionally delayed. 116 As expected, the latency starts low and increases as the server gets close to saturation, at which point the latency increases dramatically because the web servers cannot keep up with the connection rate, so client requests are disproportionally delayed. 112 117 Because of this dramatic increase, the Y-axis is presented using a log scale. 113 118 Note that the graph shows the \emph{target} query rate; the actual response rate is given in Figure~\ref{fig:memcd:rate:qps} as this is the same underlying experiment. … … 186 191 web servers servicing dynamic requests, which read from multiple locations and construct a response, are not as interesting since creating the response takes more time and does not exercise the runtime in a meaningfully different way.} 187 192 The static web server experiment compares NGINX~\cite{nginx} with a custom \CFA-based web server developed for this experiment. 188 189 \subsection{NGINX threading} 190 193 NGINX is a high-performance, \emph{full-service}, event-driven web server.
It can handle both static and dynamic web content, as well as serve as a reverse proxy and a load balancer~\cite{reese2008nginx}. 192 195 This wealth of capabilities comes with a variety of potential configurations, dictating available features and performance. 193 196 The NGINX server runs a master process that performs operations such as reading configuration files, binding to ports, and controlling worker processes. 194 When running as a static web server, it uses an event-driven architecture to service incoming requests. 197 In comparison, the custom \CFA web server was developed specifically with this experiment in mind. 198 However, nothing seems to indicate that NGINX suffers from this increased flexibility. 199 When tuned for performance, NGINX appears to reach the performance limit of the underlying hardware. 200 201 \subsection{NGINX threading} 202 When running as a static web server, NGINX uses an event-driven architecture to service incoming requests. 195 203 Incoming connections are assigned a \emph{stackless} HTTP state machine and worker processes can handle thousands of these state machines. 196 204 For the following experiment, NGINX is configured to use @epoll@ to listen for events on these state machines and have each worker process independently accept new connections. 197 Because of the realities of Linux, see Subsection~\ref{ononblock}, NGINX also maintains a pool of auxiliary threads to handle blocking \io. 205 Because of the realities of Linux (Subsection~\ref{ononblock}), NGINX also maintains a pool of auxiliary threads to handle blocking \io. 198 206 The configuration can set the number of worker processes desired, as well as the size of the auxiliary pool. 199 207 However, for the following experiments, NGINX is configured to let the master process decide the appropriate number of threads. … … 262 270 The computer is booted with only 8 CPUs enabled, which is sufficient to achieve line rate.
263 271 \item 272 Both servers are set up with enough parallelism to achieve 100\% CPU utilization, which happens at higher request rates. 273 \item 264 274 Each CPU has 64 KB, 256 KiB and 8 MB of L1, L2 and L3 caches, respectively. 265 275 \item -
doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex
r82a90d4 rddcaff6 4 4 This chapter presents five different experimental setups for evaluating the basic features of the \CFA, libfibre~\cite{libfibre}, Go, and Tokio~\cite{Tokio} schedulers. 5 5 All of these systems have a \gls{uthrding} model. 6 The goal of this chapter is to show that the \CFA scheduler obtains equivalent performance to other, less fair, schedulers through the different experiments.6 The goal of this chapter is to show, through the different experiments, that the \CFA scheduler obtains equivalent performance to other schedulers with lesser fairness guarantees. 7 7 Note that only the code of the \CFA tests is shown; 8 all tests in the other systems are functionally identical and available online~\cite{GITHUB:SchedulingBenchmarks}.8 all tests in the other systems are functionally identical and available both online~\cite{GITHUB:SchedulingBenchmarks} and submitted to UWSpace with the thesis itself. 9 9 10 10 \section{Benchmark Environment}\label{microenv} … … 129 129 \caption[Cycle Benchmark on Intel]{Cycle Benchmark on Intel\smallskip\newline Throughput and scalability as a function of \proc count, 5 \ats per cycle, and different cycle counts. 130 130 For throughput, higher is better, for scalability, lower is better. 131 Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} 131 Each series represents 15 independent runs. 132 The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} 132 133 \label{fig:cycle:jax} 133 134 \end{figure} … … 161 162 \caption[Cycle Benchmark on AMD]{Cycle Benchmark on AMD\smallskip\newline Throughput and scalability as a function of \proc count, 5 \ats per cycle, and different cycle counts. 162 163 For throughput, higher is better, for scalability, lower is better. 
163 Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} 164 Each series represents 15 independent runs. 165 The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} 164 166 \label{fig:cycle:nasus} 165 167 \end{figure} … … 177 179 Looking next at the right column on Intel, Figures~\ref{fig:cycle:jax:low:ops} and \ref{fig:cycle:jax:low:ns} show the results for 1 cycle of 5 \ats for each \proc. 178 180 \CFA and Tokio obtain very similar results overall, but Tokio shows more variations in the results. 179 Go achieves slightly better performance than \CFA and Tokio, but all three display significantly wors tperformance compared to the left column.181 Go achieves slightly better performance than \CFA and Tokio, but all three display significantly worse performance compared to the left column. 180 182 This decrease in performance is likely due to the additional overhead of the idle-sleep mechanism. 181 183 This can either be the result of \procs actually running out of work or simply additional overhead from tracking whether or not there is work available. … … 185 187 Looking now at the results for the AMD architecture, Figure~\ref{fig:cycle:nasus}, the results are overall similar to the Intel results, but with close to double the performance, slightly increased variation, and some differences in the details. 186 188 Note the maximum of the Y-axis on Intel and AMD differ significantly. 187 Looking at the left column on AMD, Figures~\ref{fig:cycle:nasus:ops} and \ref{fig:cycle:nasus:ns} all 4 runtimes achieve very similar throughput and scalability.189 Looking at the left column on AMD, Figures~\ref{fig:cycle:nasus:ops} and \ref{fig:cycle:nasus:ns}, all 4 runtimes achieve very similar throughput and scalability. 
188 190 However, as the number of \procs grows higher, the results on AMD show notably more variability than on Intel. 189 191 The different performance improvements and plateaus are due to cache topology and appear at the expected \proc counts of 64, 128 and 192, for the same reasons as on Intel. … … 191 193 This result is different than on Intel, where Tokio behaved like \CFA rather than behaving like Go. 192 194 Again, the same performance increase for libfibre is visible when running fewer \ats. 193 Note,I did not investigate the libfibre performance boost for 1 cycle in this experiment.195 I did not investigate the libfibre performance boost for 1 cycle in this experiment. 194 196 195 197 The conclusion from both architectures is that all of the compared runtimes have fairly equivalent performance for this micro-benchmark. 196 198 Clearly, the pathological case with 1 cycle per \proc can affect fairness algorithms managing mostly idle processors, \eg \CFA, but only at high core counts. 197 199 In this case, \emph{any} helping is likely to cause a cascade of \procs running out of work and attempting to steal. 198 For this experiment, the \CFA scheduler has achieved the goal of obtaining equivalent performance to other , less fair, schedulers.200 For this experiment, the \CFA scheduler has achieved the goal of obtaining equivalent performance to other schedulers with lesser fairness guarantees. 199 201 200 202 \section{Yield} … … 250 252 \caption[Yield Benchmark on Intel]{Yield Benchmark on Intel\smallskip\newline Throughput and scalability as a function of \proc count. 251 253 For throughput, higher is better, for scalability, lower is better. 252 Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} 254 Each series represents 15 independent runs. 
255 The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} 253 256 \label{fig:yield:jax} 254 257 \end{figure} … … 309 312 \caption[Yield Benchmark on AMD]{Yield Benchmark on AMD\smallskip\newline Throughput and scalability as a function of \proc count. 310 313 For throughput, higher is better, for scalability, lower is better. 311 Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} 314 Each series represents 15 independent runs. 315 The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} 312 316 \label{fig:yield:nasus} 313 317 \end{figure} … … 317 321 Looking at the left column first, Figures~\ref{fig:yield:nasus:ops} and \ref{fig:yield:nasus:ns}, \CFA achieves very similar throughput and scaling. 318 322 Libfibre still outpaces all other runtimes, but it encounters a performance hit at 64 \procs. 319 This anomaly suggests some amount of communication between the \procs that the Intel machine is able to mask where the AMD is not once hyperthreading is needed.323 This anomaly suggests some amount of communication between the \procs that the Intel machine is able to mask where the AMD is not, once hyperthreading is needed. 320 324 Go and Tokio still display the same performance collapse as on Intel. 321 325 Looking next at the right column on AMD, Figures~\ref{fig:yield:nasus:low:ops} and \ref{fig:yield:nasus:low:ns}, all runtime systems effectively behave the same as they did on the Intel machine. … … 324 328 325 329 It is difficult to draw conclusions for this benchmark when runtime systems treat @yield@ so differently. 
326 The win for \CFA is its consistency between the cycle and yield benchmarks making it simpler for programmers to use and understand, \ie the \CFA semantics match with programmer intuition. 330 The win for \CFA is its consistency between the cycle and yield benchmarks, making it simpler for programmers to use and understand, \ie the \CFA semantics match with programmer intuition. 327 331 328 332 … … 333 337 334 338 The Churn benchmark represents more chaotic executions, where there is more communication among \ats but no relationship between the last \proc on which a \at ran and blocked, and the \proc that subsequently unblocks it. 335 With processor-specific ready-queues, when a \at is unblocked by a different \proc that means the unblocking \proc must either ``steal'' the \at from another processor or find it on a remote queue. 339 With processor-specific ready-queues, when a \at is unblocked by a different \proc, that means the unblocking \proc must either ``steal'' the \at from another processor or find it on a remote queue. 336 340 This dequeuing results in either contention on the remote queue and/or \glspl{rmr} on the \at data structure. 337 Hence, this benchmark has performance dominated by the cache traffic as \procs are constantly accessing each other's data. 341 Hence, this benchmark has performance dominated by the cache traffic as \procs are constantly accessing each other's data. 338 342 In either case, this benchmark aims to measure how well a scheduler handles these cases since both cases can lead to performance degradation if not handled correctly. … … 392 396 \caption[Churn Benchmark on Intel]{Churn Benchmark on Intel\smallskip\newline Throughput and scalability as a function of \proc count. 393 397 For throughput, higher is better, for scalability, lower is better.
394 Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} 398 Each series represents 15 independent runs. 399 The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} 395 400 \label{fig:churn:jax} 396 401 \end{figure} … … 404 409 Tokio achieves very similar performance to \CFA, with the starting boost, scaling decently until 48 \procs, drops from 48 to 72 \procs, and starts increasing again to 192 \procs. 405 410 Libfibre obtains effectively the same results as Tokio with slightly less scaling, \ie the scaling curve is the same but with slightly lower values. 406 Finally, Go gets the most peculiar results, scaling wors tthan other runtimes until 48 \procs.411 Finally, Go gets the most peculiar results, scaling worse than other runtimes until 48 \procs. 407 412 At 72 \procs, the results of the Go runtime vary significantly, sometimes scaling sometimes plateauing. 408 413 However, beyond this point Go keeps this level of variation but does not scale further in any of the runs. 409 414 410 Throughput and scalability are notably wors tfor all runtimes than the previous benchmarks since there is inherently more communication between processors.415 Throughput and scalability are notably worse for all runtimes than the previous benchmarks since there is inherently more communication between processors. 411 416 Indeed, none of the runtimes reach 40 million operations per second while in the cycle benchmark all but libfibre reached 400 million operations per second. 412 417 Figures~\ref{fig:churn:jax:ns} and \ref{fig:churn:jax:low:ns} show that for all \proc counts, all runtimes produce poor scaling. 
413 However, once the number of \glspl{hthrd} goes beyond a single socket, at 48 \procs, scaling goes from bad to wors tand performance completely ceases to improve.418 However, once the number of \glspl{hthrd} goes beyond a single socket, at 48 \procs, scaling goes from bad to worse and performance completely ceases to improve. 414 419 At this point, the benchmark is dominated by inter-socket communication costs for all runtimes. 415 420 … … 457 462 \caption[Churn Benchmark on AMD]{Churn Benchmark on AMD\smallskip\newline Throughput and scalability as a function of \proc count. 458 463 For throughput, higher is better, for scalability, lower is better. 459 Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} 464 Each series represents 15 independent runs. 465 The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} 460 466 \label{fig:churn:nasus} 461 467 \end{figure} … … 600 606 \caption[Locality Benchmark on Intel]{Locality Benchmark on Intel\smallskip\newline Throughput and scalability as a function of \proc count. 601 607 For throughput, higher is better, for scalability, lower is better. 602 Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} 608 Each series represents 15 independent runs. 609 The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} 603 610 \label{fig:locality:jax} 604 611 \end{figure} … … 632 639 \caption[Locality Benchmark on AMD]{Locality Benchmark on AMD\smallskip\newline Throughput and scalability as a function of \proc count. 633 640 For throughput, higher is better, for scalability, lower is better. 
Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} 641 Each series represents 15 independent runs. 642 The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.} 635 643 \label{fig:locality:nasus} 636 644 \end{figure} … … 648 656 Go still has the same poor performance as on Intel. 649 657 650 Finally looking at the right column, Figures~\ref{fig:locality:nasus:noshare:ops} and \ref{fig:locality:nasus:noshare:ns}, like on Intel, the same performance inversion is present between libfibre and \CFA/Tokio. 658 Finally, looking at the right column, Figures~\ref{fig:locality:nasus:noshare:ops} and \ref{fig:locality:nasus:noshare:ns}, like on Intel, the same performance inversion is present between libfibre and \CFA/Tokio. 651 659 Go still has the same poor performance. 652 660 … … 751 759 \end{centering} 752 760 \caption[Transfer Benchmark on Intel and AMD]{Transfer Benchmark on Intel and AMD\smallskip\newline Average measurement of how long it takes for all \ats to acknowledge the leader \at. 761 For each runtime, the average is calculated over 100'000 transfers, except for Go, which only has 1000 transfers (due to the difference in transfer time). 753 762 DNC stands for ``did not complete'', meaning that, 5 seconds after a new leader is decided, some \ats still had not acknowledged the new leader.} 754 763 \label{fig:transfer:res} … … 764 773 765 774 The first two columns show the results for the semaphore variation on Intel.
While there are some differences in latencies, \CFA is consistently the fastest and Tokio the slowest, all runtimes achieve fairly close results. 775 While there are some differences in latencies, with \CFA consistently the fastest and Tokio the slowest, all runtimes achieve fairly close results. 767 Again, this experiment is meant to highlight major differences so latencies within $10\times$ of each other are considered equal. 776 Again, this experiment is meant to highlight major differences, so latencies within $10\times$ of each other are considered equal. 768 777 769 778 Looking at the next two columns, the results for the yield variation on Intel, the story is very different. … … 780 789 Neither Libfibre nor Tokio complete the experiment. 781 790 782 This experiment clearly demonstrates that \CFA achieves significantly better fairness. 791 This experiment clearly demonstrates that \CFA achieves a stronger fairness guarantee. 783 792 The semaphore variation serves as a control, where all runtimes are expected to transfer leadership fairly quickly. 784 793 Since \ats block after acknowledging the leader, this experiment effectively measures how quickly \procs can steal \ats from the \proc running the leader. … … 790 799 Without \procs stealing from the \proc running the leader, the experiment cannot terminate. 791 800 Go manages to complete the experiment because it adds preemption on top of classic work-stealing. 792 However, since preemption is fairly infrequent, it achieves significantly worst performance. 801 However, since preemption is fairly infrequent, it achieves significantly worse performance. 793 802 In contrast, \CFA achieves equivalent performance in both variations, demonstrating very good fairness. 794 803 Interestingly, \CFA achieves better delays in the yielding version than the semaphore version; however, that is likely due to fairness being equivalent but removing the cost of the semaphores and idle sleep. -
doc/theses/thierry_delisle_PhD/thesis/text/existing.tex
r82a90d4 rddcaff6 2 2 As stated, scheduling is the process of assigning resources to incoming requests, where the common example is assigning available workers to work requests or vice versa. 3 3 Common scheduling examples in Computer Science are: operating systems and hypervisors schedule available CPUs, NICs schedule available bandwidth, virtual memory and memory allocator schedule available storage, \etc. 4 Scheduling is also common in most other fields ,\eg in assembly lines, assigning parts to line workers is a form of scheduling.4 Scheduling is also common in most other fields; \eg in assembly lines, assigning parts to line workers is a form of scheduling. 5 5 6 6 In general, \emph{selecting} a scheduling algorithm depends on how much information is available to the scheduler. … … 8 8 A secondary aspect is how much information can be gathered versus how much information must be given as part of the scheduler input. 9 9 This information adds to the spectrum of scheduling algorithms, going from static schedulers that are well informed from the start, to schedulers that gather most of the information needed, to schedulers that can only rely on very limited information. 
10 Note, this description includes both information about each request, \eg time to complete or resources needed, and information about the relationships among requests, \eg whether some requests must be completed before another request starts.11 12 Scheduling physical resources, \eg in an assembly line, is generally amenable to using well-informed scheduling since information can be gathered much faster than the physical resources can be assigned and workloads are likely to stay stable for long periods.13 When a faster pace is needed and changes are much more frequent gathering information on workloads, up-front or live, can become much more limiting and more general schedulers are needed.10 This description includes both information about each request, \eg time to complete or resources needed, and information about the relationships among requests, \eg whether some requests must be completed before another request starts. 11 12 Scheduling physical resources, \eg in an assembly line, is generally amenable to using well-informed scheduling since information can be gathered much faster than the physical resources can be assigned, and workloads are likely to stay stable for long periods. 13 When a faster pace is needed and changes are much more frequent, then gathering information on workloads, up-front or live, can become much more limiting and more general schedulers are needed. 14 14 15 15 \section{Naming Convention} 16 Scheduling has been studied by various communities concentrating on different incarnations of the same problems.16 Scheduling has been studied by various communities, concentrating on different incarnations of the same problems. 17 17 As a result, there are no standard naming conventions for scheduling that are respected across these communities. 18 18 This document uses the term \newterm{\Gls{at}} to refer to the abstract objects being scheduled and the term \newterm{\Gls{proc}} to refer to the concrete objects executing these \ats. 
19 19 20 20 \section{Static Scheduling} 21 \newterm{Static schedulers} require \at sdependencies and costs to be explicitly and exhaustively specified prior to scheduling.21 \newterm{Static schedulers} require \at dependencies and costs to be explicitly and exhaustively specified prior to scheduling. 22 22 The scheduler then processes this input ahead of time and produces a \newterm{schedule} the system follows during execution. 23 23 This approach is popular in real-time systems since the need for strong guarantees justifies the cost of determining and supplying this information. … … 27 27 28 28 \section{Dynamic Scheduling} 29 \newterm{Dynamic schedulers} determine \at dependencies and costs during scheduling, if at all. 30 Hence, unlike static scheduling, \at dependencies are conditional and detected at runtime. 31 This detection takes the form of observing new \ats in the system and determining dependencies from their behaviour, including suspending or halting a \at that dynamically detects unfulfilled dependencies. 29 \newterm{Dynamic schedulers} detect \at dependencies and costs during scheduling, if at all. 30 This detection takes the form of observing new \ats in the system and determining dependencies from their behaviour, where a \at suspends or halts dynamically when it detects unfulfilled dependencies. 32 31 Furthermore, each \at has the responsibility of adding dependent \ats back into the system once dependencies are fulfilled. 33 32 As a consequence, the scheduler often has an incomplete view of the system, seeing only \ats with no pending dependencies. … … 35 34 \subsection{Explicitly Informed Dynamic Schedulers} 36 35 While dynamic schedulers may not have an exhaustive list of dependencies for a \at, some information may be available about each \at, \eg expected duration, required resources, relative importance, \etc. 
37 When available, a scheduler can then use this information to direct thescheduling decisions.36 When available, a scheduler can then use this information to direct scheduling decisions. 38 37 For example, when scheduling in a cloud computing context, \ats will commonly have extra information that was manually entered, \eg caps on compute time or \io usage. 39 38 However, in the context of user-level threading, most programmers do not determine or even \emph{predict} this information; … … 45 44 46 45 \subsubsection{Priority Scheduling} 47 Common information used by schedulers to direct their algorithm is priorities.46 A common approach to direct the scheduling algorithm is to add information about \at priority. 48 47 Each \at is given a priority, and higher-priority \ats are preferred to lower-priority ones. 49 48 The simplest priority scheduling algorithm is to require that every \at have a distinct pre-established priority and always run the available \ats with the highest priority. … … 81 80 \paragraph{Task Placement} Another aspect of work stealing that has been studied extensively is the mapping between \at and \proc. 82 81 In its simplest form, work stealing assumes that all \procs are interchangeable and therefore the mapping between \at and \proc is not interesting. 83 However, in real-life architectures there are contexts where different \procs can have different characteristics, which makes some mappingmore interesting than others.82 However, in real-life architectures, there are contexts where different \procs can have different characteristics, which makes some mappings more interesting than others. 84 83 A common example where this is statically true is architectures with \glsxtrshort{numa}. 85 84 In these cases, it can be relevant to change the scheduler to be cognizant of the topology~\cite{vikranth2013topology,min2011hierarchical}. 
… … 88 87 \paragraph{Complex Machine Architecture} Another aspect that has been examined is how applicable work stealing is to different machine architectures. 89 88 This is arguably strongly related to Task Placement but extends into more heterogeneous architectures. 90 As \CFA offers no particular support for heterogeneous architecture , this is also an area that is less relevant tothis thesis. 91 Although it could be an interesting avenue for future work. 89 As \CFA offers no particular support for heterogeneous architectures, this is also an area that is not examined in this thesis. 90 However, support for concurrency across heterogeneous architectures is an interesting avenue for future work, at which point the literature on this topic and how it relates to scheduling will become relevant. 92 91 93 92 \subsection{Theoretical Results} 94 93 There is also a large body of research on the theoretical aspects of work stealing. These evaluate, for example, the cost of \glslink{atmig}{migration}~\cite{DBLP:conf/sigmetrics/SquillanteN91,DBLP:journals/pe/EagerLZ86}, how affinity affects performance~\cite{DBLP:journals/tpds/SquillanteL93,DBLP:journals/mst/AcarBB02,DBLP:journals/ipl/SuksompongLS16} and theoretical models for heterogeneous systems~\cite{DBLP:journals/jpdc/MirchandaneyTS90,DBLP:journals/mst/BenderR02,DBLP:conf/sigmetrics/GastG10}. 95 \cite{DBLP:journals/jacm/BlellochGM99} examines the space bounds of work stealing and \cite{DBLP:journals/siamcomp/BerenbrinkFG03} shows that for under-loaded systems, the scheduler completes its computations in finite time, \ie is \newterm{stable}. 94 Blelloch et al.~\cite{DBLP:journals/jacm/BlellochGM99} examine the space bounds of work stealing and \cite{DBLP:journals/siamcomp/BerenbrinkFG03} shows that for under-loaded systems, the scheduler completes its computations in finite time, \ie is \newterm{stable}.
96 95 Others show that work stealing applies to various scheduling contexts~\cite{DBLP:journals/mst/AroraBP01,DBLP:journals/anor/TchiboukdjianGT13,DBLP:conf/isaac/TchiboukdjianGTRB10,DBLP:conf/ppopp/AgrawalLS10,DBLP:conf/spaa/AgrawalFLSSU14}. 97 96 \cite{DBLP:conf/ipps/ColeR13} also studied how randomized work-stealing affects false sharing among \ats. … … 104 103 Preemption is the idea of interrupting \ats that have been running too long, effectively injecting suspend points into the application. 105 104 There are multiple techniques to achieve this effect, but they all aim to guarantee that the suspend points in a \at are never further apart than some fixed duration. 106 While this helps schedulers guarantee that no \ats unfairly monopolize a worker, preemption can effectively be added to any scheduler. 105 This helps schedulers guarantee that no \ats unfairly monopolize a worker. 106 Preemption can effectively be added to any scheduler. 107 107 Therefore, the only interesting aspect of preemption for the design of scheduling is whether to require it. 108 108 … … 110 110 This section presents a quick overview of several current schedulers. 111 111 While these schedulers do not necessarily represent the most recent advances in scheduling, they are what is generally accessible to programmers. 112 As such, I believe these schedulers are at least as relevant as those presented in published work. 113 Both Schedulers that operate in kernel space and user space are considered, as both can offer relevant insight for this project. 114 However, real-time schedulers are not considered, as these have constraints that are much stricter than what is needed for this project. 112 As such, I believe these schedulers are as relevant as those presented in published work. 113 Both schedulers that operate in kernel space and user space are considered, as both can offer relevant insight for this project. 
114 However, real-time schedulers aim to guarantee bounded compute time in order to meet deadlines. 115 These deadlines lead to constraints much stricter than the starvation freedom that is needed for this project. 116 As such, real-time schedulers are not considered for this work. 115 117 116 118 \subsection{Operating System Schedulers} 117 119 Operating System Schedulers tend to be fairly complex, as they generally support some amount of real time, aim to balance interactive and non-interactive \ats and support multiple users sharing hardware without requiring these users to cooperate. 118 120 Here are more details on a few schedulers used in the common operating systems: Linux, FreeBSD, Microsoft Windows and Apple's OS X. 119 121 The information is less complete for closed-source operating systems. … … 137 139 It also periodically balances the load of the system (according to a different heuristic) but uses a simpler work stealing approach. 138 140 139 \paragraph{Windows (OS)} 141 \paragraph{Windows (OS)} 140 142 Microsoft's Operating System's Scheduler~\cite{MAN:windows/scheduler} is a feedback scheduler with priorities. 141 143 It supports 32 levels of priorities, some of which are reserved for real-time and privileged applications. … … 143 145 The scheduler may also temporarily adjust priorities after certain effects like the completion of I/O requests. 144 146 145 In~\cite{russinovich2009windows}, Chapter 1 section 2.3 ``Processes, Threads, and Jobs'' discusses the scheduling policy more in-depth. 147 The scheduling policy is discussed in more depth in~\cite{russinovich2009windows}, Chapter 1 section 2.3 ``Processes, Threads, and Jobs''.
146 148 Multicore scheduling is based on a combination of priorities and \proc preference. 147 149 Each \at is assigned an initial processor using a round-robin policy, called the \at's \newterm{ideal} \proc. … … 168 170 \paragraph{Go}\label{GoSafePoint} 169 171 Go's scheduler uses a randomized work-stealing algorithm that has a global run-queue (\emph{GRQ}) and each processor (\emph{P}) has both a fixed-size run-queue (\emph{LRQ}) and a high-priority next ``chair'' holding a single element~\cite{GITHUB:go,YTUBE:go}. 170 Preemption is present, but only at safe points ,~\cite{go:safepoints}which are detection code inserted at various frequent access boundaries. 172 Preemption is present, but only at safe points~\cite{go:safepoints}, which are detection code inserted at various frequent access boundaries. 171 173 172 174 The algorithm is as follows: 173 175 \begin{enumerate} 174 176 \item Once out of 61 times, pick 1 element from the \emph{GRQ}. 175 \item If there is an item in the ``chair'' pick it. 177 \item Otherwise, if there is an item in the ``chair'', pick it. 176 178 \item Else pick an item from the \emph{LRQ}. 177 179 \begin{itemize} … … 180 182 \end{itemize} 181 183 \end{enumerate} 184 185 Chapter~\ref{microbench} uses Go as one of its comparison points in this thesis's performance evaluation. 182 186 183 187 \paragraph{Erlang} … … 224 228 \paragraph{LibFibre} 225 229 LibFibre~\cite{DBLP:journals/pomacs/KarstenB20} is a lightweight user-level threading framework developed at the University of Waterloo. 226 Similarly to Go, it uses a variation of work stealing with a global queue that has a higher priority than stealing. 227 Unlike Go, it does not have the high-priority next ``chair'' and does not use randomized work-stealing. 230 It shares a very strong resemblance to Go: using a variation of work stealing with a global queue that has a higher priority than stealing.
231 Unlike Go, it does not have the high-priority next ``chair'' and its work-stealing is not randomized. 232 233 Chapter~\ref{microbench} uses LibFibre as one of its comparison points in this thesis's performance evaluation. -
doc/theses/thierry_delisle_PhD/thesis/text/front.tex
r82a90d4 rddcaff6 88 88 \\ 89 89 Internal Member: \> Martin Karsten \\ 90 \> AssociateProfessor, School of Computer Science \\90 \> Professor, School of Computer Science \\ 91 91 \> University of Waterloo \\ 92 92 \end{tabbing} … … 127 127 Indeed, over-partitioning into small work-units with user threading significantly eases load bal\-ancing, while simultaneously providing advanced synchronization and mutual exclusion capabilities. 128 128 To manage these high levels of concurrency, the underlying runtime must efficiently schedule many user threads across a few kernel threads; 129 which begs the question of how many kernel threads are needed and should the number be dynamically reevaluated.129 which raises the question of how many kernel threads are needed and should the number be dynamically reevaluated. 130 130 Furthermore, scheduling must prevent kernel threads from blocking, otherwise user-thread parallelism drops. 131 When user-threading parallelism does drop, how and when should idle \glspl{kthrd} be put to sleep to avoid wasting CPU resources .131 When user-threading parallelism does drop, how and when should idle \glspl{kthrd} be put to sleep to avoid wasting CPU resources? 132 132 Finally, the scheduling system must provide fairness to prevent a user thread from monopolizing a kernel thread; 133 133 otherwise, other user threads can experience short/long term starvation or kernel threads can deadlock waiting for events to occur on busy kernel threads. -
doc/theses/thierry_delisle_PhD/thesis/text/intro.tex
r82a90d4 rddcaff6 1 1 \chapter{Introduction}\label{intro} 2 2 3 \Gls{uthrding} (M:N) is gaining popularity over kernel-level threading (1:1) in many programming languages .3 \Gls{uthrding} (M:N) is gaining popularity over kernel-level threading (1:1) in many programming languages, like Go~\cite{GITHUB:go}, Java's Project Loom~\cite{MAN:project-loom} and Kotlin~\cite{MAN:kotlin}. 4 4 The user threading approach is often a better mechanism to express complex concurrent applications by efficiently running 10,000+ threads on multicore systems. 5 5 Indeed, over-partitioning into small work units with user threading significantly eases load bal\-ancing, while simultaneously providing advanced synchronization and mutual exclusion capabilities. 6 6 To manage these high levels of concurrency, the underlying runtime must efficiently schedule many user threads across a few kernel threads; 7 which begs the question of how many kernel threads are needed and should the number be dynamically reevaluated.7 which raises the question of how many kernel threads are needed and should the number be dynamically reevaluated. 8 8 Furthermore, scheduling must prevent kernel threads from blocking, otherwise user-thread parallelism drops. 9 When user-threading parallelism does drop, how and when should idle kernel-threads be put to sleep to avoid wasting CPU resources .9 When user-threading parallelism does drop, how and when should idle kernel-threads be put to sleep to avoid wasting CPU resources? 10 10 Finally, the scheduling system must provide fairness to prevent a user thread from monopolizing a kernel thread; 11 11 otherwise, other user threads can experience short/long term starvation or kernel threads can deadlock waiting for events to occur on busy kernel threads. … … 16 16 Fairness is handled through preemption and/or ad hoc solutions, which leads to coarse-grained fairness with some pathological cases. 
17 17 18 After examining, testing and selecting specific approaches to these scheduling issues, a completely new scheduler was created and tested in the \CFA (C-for-all) user-threading runtime system. 18 After examining, testing and selecting specific approaches to these scheduling issues, a new scheduler was created and tested in the \CFA (C-for-all) user-threading runtime system. 19 19 The goal of the new scheduler is to offer increased safety and productivity without sacrificing performance. 20 The quality of the new scheduler is demonstrated by comparing it with other user-threading work-stealing schedulers with ,the aim of showing equivalent or better performance while offering better fairness. 20 The quality of the new scheduler is demonstrated by comparing it with other user-threading work-stealing schedulers with the aim of showing equivalent or better performance while offering better fairness. 21 21 22 22 Chapter~\ref{intro} defines scheduling and its general goals. … … 32 32 Computer systems share multiple resources across many threads of execution, even on single-user computers like laptops or smartphones. 33 33 On a computer system with multiple processors and work units (routines, coroutines, threads, programs, \etc), there exists the problem of mapping many different kinds of work units onto many different kinds of processors efficiently, called \newterm{scheduling}. 34 Scheduling systems are normally \newterm{open}, meaning new work arrives from an external source or is randomly spawned from an existing work unit. 34 Scheduling systems are normally \newterm{open}, meaning new work arrives from an external source or is randomly spawned from an existing work unit\footnote{ 35 Open systems contrast with \newterm{closed} systems, where work never arrives from external sources. 36 This definition is an extension of open/closed systems in the field of thermodynamics. 37 }.
38 35 39 In general, work units without threads, like routines and coroutines, are self-scheduling, while work units with threads, like tasks and programs, are scheduled. 36 40 For scheduled work-units, a scheduler takes a sequence of threads and attempts to run them to completion, subject to shared resource restrictions and utilization. … … 94 98 \end{enumerate} 95 99 96 Scheduling is a zero-sum game as computer processors normally have a fixed, maximum number of cycles per unit time.\footnote{ 100 At a high level, scheduling is considered a zero-sum game as computer processors normally have a fixed, maximum number of cycles per unit time.\footnote{ 97 101 Frequency scaling and turbo-boost add a degree of complexity that can be ignored in this discussion without loss of generality.} 98 Hence, schedulers are a series of compromises, occasionally with some static or dynamic tuning parameters to enhance specific workload patterns. 102 This almost invariably leads to schedulers needing to pick some \ats over others, opening the door to fairness problems. 103 However, at a lower level, schedulers can make inefficient or incorrect decisions leading to strictly worse outcomes than what the theoretical zero-sum game suggests. 104 Since it can be difficult to avoid these poor decisions, schedulers are generally a series of compromises, occasionally with some static or dynamic tuning parameters to enhance specific workload patterns. 99 105 For example, SQMS has perfect load-balancing but poor affinity and high contention by the processors, because of the single queue. 100 106 In contrast, MQMS has poor load-balancing but perfect affinity and no contention, because each processor has its own queue. … … 113 119 Specifically, this work concentrates on advanced thread and \glsxtrshort{io} scheduling.
114 120 Prior to this work, the \CFA runtime used a strict SQMS \gls{rQ} and provided no nonblocking \glsxtrshort{io} capabilities at the user-thread level.\footnote{ 115 C/\CC only support \glsxtrshort{io} capabilities at the kernel level, which means many \io operations block \glspl{kthrd} reducing parallelism at the user level.}121 C/\CC only support \glsxtrshort{io} capabilities at the kernel level, which means many \io operations block \glspl{kthrd}, reducing parallelism at the user level.} 116 122 117 123 Since \CFA attempts to improve the safety and productivity of C, the new scheduler presented in this thesis attempts to achieve the same goals. 118 More specifically, safety and productivity for scheduling mean supporting a wide range of workloads so that programmers can rely on progress guarantees (safety) and more easily achieve acceptable performance (productivity).124 More specifically, safety and productivity for scheduling mean supporting a wide range of workloads, so that programmers can rely on progress guarantees (safety) and more easily achieve acceptable performance (productivity). 119 125 The new scheduler also includes support for implicit nonblocking \io, allowing applications to have more user-threads blocking on \io operations than there are \glspl{kthrd}. 120 126 To complete the scheduler, an idle sleep mechanism is implemented that significantly reduces wasted CPU cycles, which are then available outside the application. 121 127 122 As a research project, this work builds exclusively on newer versions of the Linux operating system and gcc/clang compilers.128 As a research project, this work runs exclusively on newer versions of the Linux operating system and gcc/clang compilers. 123 129 The new scheduler implementation uses several optimizations to successfully balance the cost of fairness against performance; 124 130 some of these optimizations rely on interesting hardware optimizations only present on modern CPUs. 
… … 138 144 An algorithm for load-balancing and idle sleep of processors, including NUMA awareness. 139 145 \item 140 A mechanism for adding fairness on top of MQMS algorithm through helping, used both for scalable scheduling algorithm and the user-level \glsxtrshort{io}.146 A mechanism for adding fairness on top of the MQMS algorithm through helping, used both for scalable scheduling algorithm and the user-level \glsxtrshort{io}. 141 147 \item 142 148 An optimization of the helping mechanism for load balancing to reduce scheduling costs. -
doc/theses/thierry_delisle_PhD/thesis/text/io.tex
r82a90d4 rddcaff6 1 1 \chapter{User Level \io}\label{userio} 2 2 As mentioned in Section~\ref{prev:io}, user-level \io requires multiplexing the \io operations of many \ats onto fewer \glspl{proc} using asynchronous \io operations. 3 I/O operations, among others, generally block the \gls{kthrd} when the operation needs to wait for unavailable resources. 4 When using \gls{uthrding}, this results in the \proc blocking rather than the \at, hindering parallelism and potentially causing deadlocks (see Chapter~\ref{prev:io}). 3 5 Different operating systems offer various forms of asynchronous operations and, as mentioned in Chapter~\ref{intro}, this work is exclusively focused on the Linux operating system. 4 6 … … 16 18 This mechanism is also crucial in determining when all \ats are blocked and the application \glspl{kthrd} can now block. 17 19 18 There are three options to monitor file descriptors in Linux:\footnote{20 There are three options to monitor file descriptors (FD) in Linux:\footnote{ 19 21 For simplicity, this section omits \lstinline{pselect} and \lstinline{ppoll}. 20 22 The difference between these system calls and \lstinline{select} and \lstinline{poll}, respectively, is not relevant for this discussion.} … … 25 27 \paragraph{\lstinline{select}} is the oldest of these options, and takes as input a contiguous array of bits, where each bit represents a file descriptor of interest. 26 28 Hence, the array length must be as long as the largest FD currently of interest. 27 On return, it outputs the set mo tified inplace to identify which of the file descriptors changed state.29 On return, it outputs the set modified in-place to identify which of the file descriptors changed state. 28 30 This destructive change means selecting in a loop requires re-initializing the array for each iteration. 
29 Another limit of @select@ is that calls from different \glspl{kthrd} sharing FDs are independent. 31 Another limitation of @select@ is that calls from different \glspl{kthrd} sharing FDs are independent. 30 32 Hence, if one \gls{kthrd} is managing the select calls, other threads can only add/remove to/from the manager's interest set through synchronized calls to update the interest set. 31 33 However, these changes are only reflected when the manager makes its next call to @select@. … … 46 48 However, all three of these I/O systems have limitations. 47 49 The @man@ page for @O_NONBLOCK@ mentions that ``[@O_NONBLOCK@] has no effect for regular files and block devices'', which means none of these three system calls are viable multiplexing strategies for these types of \io operations. 48 Furthermore, TTYs can also be tricky to use since they can take different forms based on how the command is executed. 50 Furthermore, TTYs (FDs connected to standard input and output) can also be tricky to use since they can take different forms based on how the command is executed. 49 51 For example, @epoll@ rejects FDs pointing to regular files or block devices, which includes @stdin@ when using shell redirections~\cite[\S~3.6]{MAN:bash}, but does not reject shell pipelines~\cite[\S~3.2.3]{MAN:bash}, which includes pipelines into @stdin@. 50 52 Finally, none of these are useful solutions for multiplexing \io operations that do not have a corresponding file descriptor and can be awkward for operations using multiple file descriptors. … … 52 54 \subsection{POSIX asynchronous I/O (AIO)} 53 55 An alternative to @O_NONBLOCK@ is the AIO interface. 54 Its interface lets programmers enqueue operations to be performed asynchronously by the kernel. 55 Completions of these operations can be communicated in various ways: either by spawning a new \gls{kthrd}, sending a Linux signal, or polling for completion of one or more operations.
56 For this work, spawning a new \gls{kthrd} is counterproductive but a related solution is discussed in Section~\ref{io:morethreads}. 57 Using interrupt handlers can also lead to fairly complicated interactions between subsystems and has a non-trivial cost. 58 Leaving polling for completion, which is similar to the previous system calls. 59 AIO only supports read and write operations to file descriptors, it does not have the same limitation as @O_NONBLOCK@, \ie, the file descriptors can be regular files and blocked devices. 60 It also supports batching multiple operations in a single system call. 61 62 AIO offers two different approaches to polling: @aio_error@ can be used as a spinning form of polling, returning @EINPROGRESS@ until the operation is completed, and @aio_suspend@ can be used similarly to @select@, @poll@ or @epoll@, to wait until one or more requests have been completed. 63 For \io multiplexing, @aio_suspend@ is the best interface. 64 However, even if AIO requests can be submitted concurrently, @aio_suspend@ suffers from the same limitation as @select@ and @poll@, \ie, the interest set cannot be dynamically changed while a call to @aio_suspend@ is in progress. 56 Using AIO, programmers can enqueue operations which are to be performed 57 asynchronously by the kernel. 58 The kernel can communicate 59 completions of these operations in three ways: 60 it can spawn a new \gls{kthrd}; send a Linux signal; or 61 userspace can poll for completion of one or more operations. 62 Spawning a new \gls{kthrd} is not consistent with working at the user-level thread level, but Section~\ref{io:morethreads} discusses a related solution. 63 Signals and their associated interrupt handlers can also lead to fairly complicated 64 interactions between subsystems, and they have a non-trivial cost. 65 This leaves a single option: polling for completion---this is similar to the previously discussed 66 system calls. 
67 While AIO only supports read and write operations to file descriptors, it does not have the same limitations as @O_NONBLOCK@, \ie, the file 68 descriptors can be regular files or block devices. 69 AIO also supports batching multiple operations in a single system call. 70 71 AIO offers two different approaches to polling: @aio_error@ can be used as a spinning form of polling, returning @EINPROGRESS@ until the operation is completed, while @aio_suspend@ can be used similarly to @select@, @poll@ or @epoll@, to wait until one or more requests have been completed. 72 Asynchronous interfaces normally handle more of the complexity than retry-based interfaces, which is convenient for \io multiplexing. 73 However, even if AIO requests can be submitted concurrently, @aio_suspend@ suffers from the same limitation as @select@ and @poll@: the interest set cannot be dynamically changed while a call to @aio_suspend@ is in progress. 65 74 AIO also suffers from a limitation in identifying which requests have completed, \ie programmers have to poll each request in the interest set using @aio_error@ to identify the completed requests. 66 75 This limitation means that, like @select@ and @poll@ but not @epoll@, the time needed to examine polling results increases based on the total number of requests monitored, not the number of completed requests. … … 92 101 A very recent addition to Linux, @io_uring@~\cite{MAN:io_uring}, is a framework that aims to solve many of the problems listed in the above interfaces. 93 102 Like AIO, it represents \io operations as entries added to a queue.
94 But like @epoll@, new requests can be submitted , while a blocking call waiting for requests to complete,is already in progress.95 The @io_uring@ interface uses two ring buffers (referred to simply as rings) at its core: a submit ring to which programmers push \io requests and a completion ringfrom which programmers poll for completion.103 But like @epoll@, new requests can be submitted while a blocking call waiting for requests to complete is already in progress. 104 The @io_uring@ interface uses two ring buffers (referred to simply as rings) at its core: a submit ring, to which programmers push \io requests, and a completion ring, from which programmers poll for completion. 96 105 97 106 One of the big advantages over the prior interfaces is that @io_uring@ also supports a much wider range of operations. 98 In addition to supporting reads and writes to any file descriptor like AIO, it supports other operationslike @open@, @close@, @fsync@, @accept@, @connect@, @send@, @recv@, @splice@, \etc.107 In addition to supporting reads and writes to any file descriptor like AIO, it also supports other operations, like @open@, @close@, @fsync@, @accept@, @connect@, @send@, @recv@, @splice@, \etc. 99 108 100 109 On top of these, @io_uring@ adds many extras like avoiding copies between the kernel and user space using shared memory, allowing different mechanisms to communicate with device drivers, and supporting chains of requests, \ie, requests that automatically trigger follow-up requests on completion. … … 106 115 This approach is used by languages like Go~\cite{GITHUB:go}, frameworks like libuv~\cite{libuv}, and web servers like Apache~\cite{apache} and NGINX~\cite{nginx}, since it has the advantage that it can easily be used across multiple operating systems. 107 116 This advantage is especially relevant for languages like Go, which offer a homogeneous \glsxtrshort{api} across all platforms. 
108 As opposedto C, which has a very limited standard \glsxtrshort{api} for \io, \eg, the C standard library has no networking.117 Contrast this to C, which has a very limited standard \glsxtrshort{api} for \io, \eg, the C standard library has no networking. 109 118 110 119 \subsection{Discussion} 111 These options effectively fall into two broad camps: waiting for \io to be ready versus waiting for \io to complete.112 All operating systems that support asynchronous \io must offer an interface along one of these lines, but the details vary drastically.113 For example, Free BSD offers @kqueue@~\cite{MAN:bsd/kqueue}, which behaves similarly to @epoll@, but with some small quality of use improvements, while Windows (Win32)~\cite{win:overlap} offers ``overlapped I/O'', which handles submissions similarly to @O_NONBLOCK@ with extra flags on the synchronous system call, but waits for completion events, similarly to @io_uring@.114 115 For this project, I selected @io_uring@, in large part sbecause of its generality.120 These options effectively fall into two broad camps: waiting for \io to be ready, versus waiting for \io to complete. 121 All operating systems that support asynchronous \io must offer an interface along at least one of these lines, but the details vary drastically. 122 For example, FreeBSD offers @kqueue@~\cite{MAN:bsd/kqueue}, which behaves similarly to @epoll@, but with some small quality of life improvements, while Windows (Win32)~\cite{win:overlap} offers ``overlapped I/O'', which handles submissions similarly to @O_NONBLOCK@ with extra flags on the synchronous system call, but waits for completion events, similarly to @io_uring@. 123 124 For this project, I selected @io_uring@, in large part because of its generality. 
116 125 While @epoll@ has been shown to be a good solution for socket \io (\cite{Karsten20}), @io_uring@'s transparent support for files, pipes, and more complex operations, like @splice@ and @tee@, make it a better choice as the foundation for a general \io subsystem. 117 126 118 127 \section{Event-Engine} 119 128 An event engine's responsibility is to use the kernel interface to multiplex many \io operations onto few \glspl{kthrd}. 120 In concrete terms, this means \ats enter the engine through an interface, the event engine then starts an operation and parks the calling \ats, returningcontrol to the \gls{proc}.129 In concrete terms, this means \ats enter the engine through an interface, the event engine then starts an operation and parks the calling \ats, and then returns control to the \gls{proc}. 121 130 The parked \ats are then rescheduled by the event engine once the desired operation has been completed. 122 131 … … 125 134 Figure~\ref{fig:iouring} shows an overview of an @io_uring@ instance. 126 135 Two ring buffers are used to communicate with the kernel: one for submissions~(left) and one for completions~(right). 127 The submission ring contains entries,\newterm{Submit Queue Entries} (SQE), produced (appended) by the application when an operation starts and then consumed by the kernel.128 The completion ring contains entries,\newterm{Completion Queue Entries} (CQE), produced (appended) by the kernel when an operation completes and then consumed by the application.136 The submission ring contains \newterm{Submit Queue Entries} (SQE), produced (appended) by the application when an operation starts and then consumed by the kernel. 137 The completion ring contains \newterm{Completion Queue Entries} (CQE), produced (appended) by the kernel when an operation completes and then consumed by the application. 
129 138 The submission ring contains indexes into the SQE array (denoted \emph{S} in the figure) containing entries describing the I/O operation to start; 130 139 the completion ring contains entries for the completed I/O operation. … … 134 143 \centering 135 144 \input{io_uring.pstex_t} 136 \caption[Overview of \lstinline{io_uring}]{Overview of \lstinline{io_uring} \smallskip\newline Two ring buffers are used to communicate with the kernel, one for completions~(right) and one for submissions~(left). The submission ring indexes into a pre-allocated array (denoted \emph{S}) instead.} 145 \caption[Overview of \lstinline{io_uring}]{Overview of \lstinline{io_uring} \smallskip\newline Two ring buffers are used to communicate with the kernel, one for completions~(right) and one for submissions~(left). 146 While the completion ring contains plain data, the submission ring contains only references. 147 These references are indexes into an array (denoted \emph{S}), which is created at the same time as the two rings and is also readable by the kernel.} 137 148 \label{fig:iouring} 138 149 \end{figure} … … 153 164 Since the head is visible to the kernel, some memory barriers may be required to prevent the compiler from reordering these operations. 154 165 Since the submission ring is a regular ring buffer, more than one SQE can be added at once and the head is updated only after all entries are updated. 155 Note, SQE can be filled and submitted in any order, \eg in Figure~\ref{fig:iouring} the submission order is S0, S3, S2 andS1 has not been submitted.166 Note, SQE can be filled and submitted in any order, \eg in Figure~\ref{fig:iouring} the submission order is S0, S3, S2. S1 has not been submitted. 156 167 \item 157 168 The kernel is notified of the change to the ring using the system call @io_uring_enter@. 158 169 The number of elements appended to the submission ring is passed as a parameter and the number of elements consumed is returned. 
159 The @io_uring@ instance can be constructed so this step is not required, but this requireselevated privilege.% and an early version of @io_uring@ had additional restrictions.170 The @io_uring@ instance can be constructed so this step is not required, but this feature requires that the process have elevated privilege.% and an early version of @io_uring@ had additional restrictions. 160 171 \end{enumerate} 161 172 … … 165 176 When operations do complete, the kernel appends a CQE to the completion ring and advances the head of the ring. 166 177 Each CQE contains the result of the operation as well as a copy of the @user_data@ field of the SQE that triggered the operation. 167 It is not necessary to call @io_uring_enter@ to get new events because the kernel can directly modify the completion ring.168 The system call is only needed if the application wants to block waiting for operations to complete.178 The @io_uring_enter@ system call is only needed if the application wants to block waiting for operations to complete or to flush the submission ring. 179 @io_uring@ supports option @IORING_SETUP_SQPOLL@ at creation, which can remove the need for the system call for submissions. 169 180 \end{sloppypar} 170 181 … … 179 190 This restriction means \io request bursts may have to be subdivided and submitted in chunks at a later time. 180 191 181 An important detail to keep in mind is that just like ``The cloud is just someone else's computer'' \cite{xkcd:cloud}, asynchronous operations are just operations using someone else's threads.192 An important detail to keep in mind is that just like ``The cloud is just someone else's computer''~\cite{xkcd:cloud}, asynchronous operations are just operations using someone else's threads. 182 193 Indeed, asynchronous operations can require computation time to complete, which means that if this time is not taken from the thread that triggered the asynchronous operation, it must be taken from some other threads. 
183 194 In this case, the @io_uring@ operations that cannot be handled directly in the system call must be delegated to some other \gls{kthrd}. 184 195 To this end, @io_uring@ maintains multiple \glspl{kthrd} inside the kernel that are not exposed to the user. 185 Three kinds of operations that can need the \glspl{kthrd} :196 Three kinds of operations that can need the \glspl{kthrd} are: 186 197 187 198 \paragraph{Operations using} @IOSQE_ASYNC@. … … 190 201 \paragraph{Bounded operations.} 191 202 This is also a fairly simple case. As mentioned earlier in this chapter, [@O_NONBLOCK@] has no effect for regular files and block devices. 192 @io_uring@ must also take this reality into accountby delegating operations on regular files and block devices.203 Therefore, @io_uring@ handles this case by delegating operations on regular files and block devices. 193 204 In fact, @io_uring@ maintains a pool of \glspl{kthrd} dedicated to these operations, which are referred to as \newterm{bounded workers}. 194 205 195 206 \paragraph{Unbounded operations that must be retried.} 196 207 While operations like reads on sockets can return @EAGAIN@ instead of blocking the \gls{kthrd}, in the case these operations return @EAGAIN@ they must be retried by @io_uring@ once the data is available on the socket. 197 Since this retry cannot necessarily be done in the system call, @io_uring@ must delegate these calls to a \gls{kthrd}.208 Since this retry cannot necessarily be done in the system call, \ie, using the application's \gls{kthrd}, @io_uring@ must delegate these calls to \glspl{kthrd} in the kernel. 198 209 @io_uring@ maintains a separate pool for these operations. 199 210 The \glspl{kthrd} in this pool are referred to as \newterm{unbounded workers}. 211 Once unbounded operations are ready to be retried, one of the workers is woken up and it will handle the retry inside the kernel. 200 212 Unbounded workers are also responsible for handling operations using @IOSQE_ASYNC@. 
201 213 … … 212 224 however, the duration of the system call scales with the number of entries submitted. 213 225 The consequence is that the amount of parallelism used to prepare submissions for the next system call is limited. 214 Beyond this limit, the length of the system call is the throughput 226 Beyond this limit, the length of the system call is the throughput-limiting factor. 215 227 I concluded from early experiments that preparing submissions seems to take almost as long as the system call itself, which means that with a single @io_uring@ instance, there is no benefit in terms of \io throughput to having more than two \glspl{hthrd}. 216 228 Therefore, the design of the submission engine must manage multiple instances of @io_uring@ running in parallel, effectively sharding @io_uring@ instances. 217 229 Since completions are sent to the instance where requests were submitted, all instances with pending operations must be polled continuously\footnote{ 218 As described in Chapter~\ref{practice}, this does not translate into constant CPU usage.}. 219 Note that once an operation completes, there is nothing that ties it to the @io_uring@ instance that handled it. 220 Nothing preventing a new operation, with for example the same file descriptor, to use a different @io_uring@ instance. 230 As described in Chapter~\ref{practice}, this does not translate into high CPU usage.}. 231 Note that once an operation completes, there is nothing that ties it to the @io_uring@ instance that handled it --- nothing prevents a new operation, with for example the same file descriptor, from using a different @io_uring@ instance. 221 232 222 233 A complicating aspect of submission is @io_uring@'s support for chains of operations, where the completion of an operation triggers the submission of the next operation on the link. 223 234 SQEs forming a chain must be allocated from the same instance and must be contiguous in the Submission Ring (see Figure~\ref{fig:iouring}). 
224 235 The consequence of this feature is that filling SQEs can be arbitrarily complex, and therefore, users may need to run arbitrary code between allocation and submission. 225 Supporting chains is not a requirement of the\io subsystem, but it is still valuable.236 For this work, supporting chains is not a requirement of the \CFA \io subsystem, but it is still valuable. 226 237 Support for this feature can be fulfilled simply by supporting arbitrary user code between allocation and submission. 227 238 … … 237 248 To remove this requirement, a \at needs the ability to ``yield to a specific \gls{proc}'', \ie, \park with the guarantee it unparks on a specific \gls{proc}, \ie the \gls{proc} attached to the correct ring.} 238 249 From the subsystem's point of view, the allocation and submission are sequential, greatly simplifying both. 239 In this design, allocation and submission form a partitioned ring buffer as shown in Figure~\ref{fig:pring}.250 In this design, allocation and submission form a partitioned ring buffer, as shown in Figure~\ref{fig:pring}. 240 251 Once added to the ring buffer, the attached \gls{proc} has a significant amount of flexibility with regard to when to perform the system call. 241 252 Possible options are: when the \gls{proc} runs out of \ats to run, after running a given number of \ats, \etc. … … 244 255 \centering 245 256 \input{pivot_ring.pstex_t} 246 \caption[Partitioned ring buffer]{Partitioned ring buffer \smallskip\newline Allocated sqes are appended to the first partition.257 \caption[Partitioned ring buffer]{Partitioned ring buffer \smallskip\newline Allocated SQEs are appended to the first partition. 247 258 When submitting, the partition is advanced. 248 259 The kernel considers the partition as the head of the ring.} … … 253 264 However, this benefit means \ats submitting \io operations have less flexibility: they cannot \park or yield, and several exceptional cases are handled poorly. 
254 265 Instances running out of SQEs cannot run \ats wanting to do \io operations. 255 In this case, the \io \at needs to be moved to a different \gls{proc}, and the only current way of achieving this is to @yield()@ hoping to be scheduled on a different \gls{proc} with free SQEs, which is not guaranteed .266 In this case, the \io \at needs to be moved to a different \gls{proc}, and the only current way of achieving this is to @yield()@ hoping to be scheduled on a different \gls{proc} with free SQEs, which is not guaranteed to ever occur. 256 267 257 268 A more involved version of this approach tries to solve these problems using a pattern called \newterm{helping}. 258 \ atsthat cannot submit \io operations, either because of an allocation failure or \glslink{atmig}{migration} to a different \gls{proc} between allocation and submission, create an \io object and add it to a list of pending submissions per \gls{proc} and a list of pending allocations, probably per cluster.269 \Glspl{at} that cannot submit \io operations, either because of an allocation failure or \glslink{atmig}{migration} to a different \gls{proc} between allocation and submission, create an \io object and add it to a list of pending submissions per \gls{proc} and a list of pending allocations, probably per cluster. 259 270 While there is still a strong coupling between \glspl{proc} and @io_uring@ instances, these data structures allow moving \ats to a specific \gls{proc}, when the current \gls{proc} cannot fulfill the \io request. 260 271 … … 263 274 In this case, the helping solution has the \io \at append an \io object to the submission list of the first \gls{proc}, where the allocation was made. 264 275 No other \gls{proc} can help the \at since @io_uring@ instances are strongly coupled to \glspl{proc}. 265 However, the \io \gls{proc} is unable to help because it is executing the spinning \at resulting in a deadlock. 
276 However, the \io \gls{proc} is unable to help because it is executing the spinning \at. 277 This results in a deadlock. 266 278 While this example is artificial, in the presence of many \ats, this problem can arise ``in the wild''. 267 279 Furthermore, this pattern is difficult to reliably detect and avoid. … … 274 286 \subsubsection{Public Instances} 275 287 The public approach creates decoupled pools of @io_uring@ instances and processors, \ie without one-to-one coupling. 276 \ atsattempting an \io operation pick one of the available instances and submit the operation to that instance.288 \Glspl{at} attempting an \io operation pick one of the available instances and submit the operation to that instance. 277 289 Since there is no coupling between @io_uring@ instances and \glspl{proc} in this approach, \ats running on more than one \gls{proc} can attempt to submit to the same instance concurrently. 278 290 Because @io_uring@ effectively sets the amount of sharding needed to avoid contention on its internal locks, performance in this approach is based on two aspects: … … 282 294 \item 283 295 The scheme to route \io requests to specific @io_uring@ instances does not introduce contention. 284 This aspect has oversized importancebecause it comes into play before the sharding of instances, and as such, all \glspl{hthrd} can contend on the routing algorithm.296 This aspect is very important because it comes into play before the sharding of instances, and as such, all \glspl{hthrd} can contend on the routing algorithm. 285 297 \end{itemize} 286 298 287 299 Allocation in this scheme is fairly easy. 288 Free SQEs, \ie, SQEs that are not currently being used to represent a request, can be written-to safely and have a field called @user_data@ that the kernel only reads to copy to CQEs.300 Free SQEs, \ie, SQEs that are not currently being used to represent a request, can be written-to safely, and have a field called @user_data@ that the kernel only reads to copy to CQEs. 
289 301 Allocation also does not require ordering guarantees as all free SQEs are interchangeable. 290 302 The only added complexity is that the number of SQEs is fixed, which means allocation can fail. … … 312 324 Since CQEs only own a signed 32-bit result, in addition to the copy of the @user_data@ field, all that is needed to communicate the result is a simple future~\cite{wiki:future}. 313 325 If the submission side does not designate submitters, polling can also submit all SQEs as it is polling events. 314 A simple approach to polling is to allocate a \at per @io_uring@ instance and simply let the poller \ats poll their respective instances when scheduled.315 316 With the pool of SQE instances approach, the big advantageis that it is fairly flexible.326 A simple approach to polling is to allocate a user-level \at per @io_uring@ instance and simply let the poller \ats poll their respective instances when scheduled. 327 328 The big advantage of the pool of SQE instances approach is that it is fairly flexible. 317 329 It does not impose restrictions on what \ats submitting \io operations can and cannot do between allocations and submissions. 318 330 It also can gracefully handle running out of resources, SQEs or the kernel returning @EBUSY@. … … 320 332 The routing and allocation algorithm needs to keep track of which ring instances have available SQEs, block incoming requests if no instance is available, prevent barging if \ats are already queued up waiting for SQEs and handle SQEs being freed. 321 333 The submission side needs to safely append SQEs to the ring buffer, correctly handle chains, make sure no SQE is dropped or left pending forever, notify the allocation side when SQEs can be reused, and handle the kernel returning @EBUSY@. 
322 Compared to the private-instance approach, all this synchronization has a significant cost and this synchronization is entirely overhead. 334 All this synchronization has a significant cost, compared to the private-instance approach, which does not have synchronization costs in most cases. 323 335 324 336 \subsubsection{Instance borrowing} 325 337 Both of the prior approaches have undesirable aspects that stem from tight or loose coupling between @io_uring@ and \glspl{proc}. 326 The first approach suffers from tight coupling causing problems when a \gls{proc} does not benefit from the coupling. 327 The second approach suffers from loose couplings causing operations to have synchronization overhead, which tighter coupling avoids. 338 The first approach suffers from tight coupling, causing problems when a \gls{proc} does not benefit from the coupling. 339 The second approach suffers from loose coupling, causing operations to have synchronization overhead, which tighter coupling avoids. 328 340 When \glspl{proc} are continuously issuing \io operations, tight coupling is valuable since it avoids synchronization costs. 329 341 However, in unlikely failure cases or when \glspl{proc} are not using their instances, tight coupling is no longer advantageous. … 332 344 While instance borrowing looks similar to work sharing and stealing, I think it is different enough to warrant a different verb to avoid confusion.} 333 345 346 As mentioned later in this section, this approach is not ultimately used, but here is still a high-level outline of the algorithm. 334 347 In this approach, each cluster, see Figure~\ref{fig:system}, owns a pool of @io_uring@ instances managed by an \newterm{arbiter}. 335 When a \at attempts to issue an \io operation, it ask for an instance from the arbiterand issues requests to that instance. 348 When a \at attempts to issue an \io operation, it asks for an instance from the arbiter, and issues requests to that instance. 
336 349 This instance is now bound to the \gls{proc} the \at is running on. 337 350 This binding is kept until the arbiter decides to revoke it, taking back the instance and reverting the \gls{proc} to its initial \io state. … … 343 356 \item The current \gls{proc} does not hold an instance. 344 357 \item The current instance does not have sufficient SQEs to satisfy the request. 345 \item The current \gls{proc} has a wrong instance, this happens if the submitting \at context-switched between allocation and submission, called \newterm{external submissions}. 358 \item The current \gls{proc} has a wrong instance. 359 This happens if the submitting \at context-switched between allocation and submission: \newterm{external submissions}. 346 360 \end{enumerate} 347 361 However, even when the arbiter is not directly needed, \glspl{proc} need to make sure that their instance ownership is not being revoked, which is accomplished by a lock-\emph{less} handshake.\footnote{ 348 Note the handshake is not lock-\emph{free} since it lacks the proper progress guarantee.}362 Note the handshake is not lock-\emph{free}~\cite{wiki:lockfree} since it lacks the proper progress guarantee.} 349 363 A \gls{proc} raises a local flag before using its borrowed instance and checks if the instance is marked as revoked or if the arbiter has raised its flag. 350 364 If not, it proceeds, otherwise it delegates the operation to the arbiter. … … 365 379 However, there is no need to immediately revoke the instance. 366 380 External submissions must simply be added to the ring before the next system call, \ie, when the submission ring is flushed. 367 This means whoever is responsible for the system call , first checks ifthe instance has any external submissions.381 This means whoever is responsible for the system call first checks whether the instance has any external submissions. 368 382 If so, it asks the arbiter to revoke the instance and add the external submissions to the ring. 
369 383 … … 382 396 383 397 \section{Interface} 384 The last importantpart of the \io subsystem is its interface.398 The final part of the \io subsystem is its interface. 385 399 Multiple approaches can be offered to programmers, each with advantages and disadvantages. 386 The new \ io subsystem can replace the C runtime API or extend it, and in the latter case, the interface can go from very similar to vastly different.387 The following sections discuss some useful options using @read@ as an example.400 The new \CFA \io subsystem can replace the C runtime API or extend it, and in the latter case, the interface can go from very similar to vastly different. 401 The following sections discuss some useful options, using @read@ as an example. 388 402 The standard Linux interface for C is: 389 403 \begin{cfa} … … 392 406 393 407 \subsection{Replacement} 394 Replacing the C \ glsxtrshort{api}is the more intrusive and draconian approach.395 The goal is to convince the compiler and linker to replace any calls to @read@ to direct themto the \CFA implementation instead of glibc's.396 This rerouting has the advantage of working transparently and supporting existing binaries without ne eding recompilation.397 It also offers a , presumably,well known and familiar API that C programmers can simply continue to work with.398 However, this approach also entails a plethora of subtle technical challenges, which generally boilsdown to making a perfect replacement.399 If the \CFA interface replaces only \emph{some} of the calls to glibc, then this can easily lead to esoteric concurrency bugs.400 Since the gcc ecosystem does not offer a scheme for perfect replacement, this approach was rejected as being laudable but infeasible.408 Replacing the C \io subsystem is the more intrusive and draconian approach. 409 The goal is to convince the compiler and linker to replace any calls to @read@ by calls to the \CFA implementation instead of glibc's. 
410 This rerouting has the advantage of working transparently and supporting existing binaries without necessarily needing recompilation. 411 It also offers a presumably well known and familiar API that C programmers can simply continue to work with. 412 %However, this approach also entails a plethora of subtle technical challenges, which generally boil down to making a perfect replacement. 413 However, when using this approach, any and all calls to the C \io subsystem must be replaced, since using a mix of the C and \CFA \io subsystems can easily lead to esoteric concurrency bugs. 414 This approach was rejected as being laudable but infeasible. 401 415 402 416 \subsection{Synchronous Extension} 403 417 Another interface option is to offer an interface different in name only. 418 In this approach, an alternative call is created for each supported system call. 404 419 For example: 405 420 \begin{cfa} 406 421 ssize_t cfa_read(int fd, void *buf, size_t count); 407 422 \end{cfa} 423 The new @cfa_read@ would have the same interface behaviour and guarantee as the @read@ system call, but allow the runtime system to use user-level blocking instead of kernel-level blocking. 424 408 425 This approach is feasible and still familiar to C programmers. 409 It comes with the caveat that any code attempting to use it must be recompiled, which is a problem considering the amount of existing legacy C binaries. 426 It comes with the caveat that any code attempting to use it must be modified, which is a problem considering the amount of existing legacy C binaries. 410 427 However, it has the advantage of implementation simplicity. 411 428 Finally, there is a certain irony to using a blocking synchronous interface for a feature often referred to as ``non-blocking'' \io. 
… 416 433 future(ssize_t) read(int fd, void *buf, size_t count); 417 434 \end{cfa} 418 where the generic @future@ is fulfilled when the read completes and it contains the number of bytesread, which may be less than the number of bytes requested. 435 where the generic @future@ is fulfilled when the read completes, with the count of bytes actually read, which may be less than the number of bytes requested. 419 436 The data read is placed in @buf@. 420 The problem is that both the bytes readand data form the synchronization object, not just the bytes read. 437 The problem is that both the byte count and data form the synchronization object, not just the bytes read. 421 438 Hence, the buffer cannot be reused until the operation completes but the synchronization does not cover the buffer. 439 Hence, the buffer cannot be reused until the operation completes, but the synchronization on the future does not enforce this. 422 440 A classical asynchronous API is: 423 441 \begin{cfa} … 438 455 However, it is not the most user-friendly option. 439 456 It obviously imposes a strong dependency between user code and @io_uring@ but at the same time restricts users to usages that are compatible with how \CFA internally uses @io_uring@. 457 458 As of the writing of this document, \CFA offers both a synchronous extension and the first approach to the asynchronous extension: 459 \begin{cfa} 460 ssize_t cfa_read(int fd, void *buf, size_t count); 461 future(ssize_t) async_read(int fd, void *buf, size_t count); 462 \end{cfa} -
doc/theses/thierry_delisle_PhD/thesis/text/practice.tex
r82a90d4 rddcaff6 2 2 The scheduling algorithm described in Chapter~\ref{core} addresses scheduling in a stable state. 3 3 This chapter addresses problems that occur when the system state changes. 4 Indeed the \CFA runtime ,supports expanding and shrinking the number of \procs, both manually and, to some extent, automatically. 4 Indeed the \CFA runtime supports expanding and shrinking the number of \procs, both manually and, to some extent, automatically. 5 5 These changes affect the scheduling algorithm, which must dynamically alter its behaviour. 6 6 7 In detail, \CFA supports adding \procs using the type @processor@, in both RAII and heap coding scenarios. 7 Specifically, \CFA supports adding \procs using the type @processor@, in both RAII and heap coding scenarios. 8 8 \begin{cfa} 9 9 { … 26 26 This requirement also means any references into these arrays, \eg pointers or indexes, may need to be updated if elements are moved for compaction or any other reason. 27 27 28 There are no performance requirements, within reason, for resizingsince it is expected to be rare. 29 However, this operation has strict correctness requirements since updating and idle sleep can easily lead to deadlocks. 30 It should also avoidas much as possible any effect on performance when the number of \procs remains constant. 31 This la terrequirement prohibits naive solutions, like simply adding a global lock to the ready-queue arrays. 28 There are no performance requirements, within reason, for the act of resizing itself, since it is expected to be rare. 29 However, this operation has strict correctness requirements, since updating and idle sleep can easily lead to deadlocks. 30 The resizing mechanism should also avoid, as much as possible, any effect on performance when the number of \procs remains constant. 31 This last requirement prohibits naive solutions, like simply adding a global lock to the ready-queue arrays. 
32 32 33 33 \subsection{Read-Copy-Update} 34 34 One solution is to use the Read-Copy-Update pattern~\cite{wiki:rcu}. 35 This is a very common pattern that avoids large critical sections, which is why it is worth mentioning. 35 36 In this pattern, resizing is done by creating a copy of the internal data structures, \eg see Figure~\ref{fig:base-ts2}, updating the copy with the desired changes, and then attempting an Indiana Jones Switch to replace the original with the copy. 36 This approach has the advantage that it may not need any synchronization to do the switch .37 This approach has the advantage that it may not need any synchronization to do the switch, depending on how reclamation of the original copy is handled. 37 38 However, there is a race where \procs still use the original data structure after the copy is switched. 38 39 This race not only requires adding a memory-reclamation scheme, but it also requires that operations made on the stale original version are eventually moved to the copy. … … 63 64 Acquiring all the local read-locks guarantees mutual exclusion among the readers and the writer, while the wait on the read side prevents readers from continuously starving the writer. 64 65 Figure~\ref{f:SpecializedReadersWriterLock} shows the outline for this specialized readers-writer lock. 65 The lock i nnonblocking, so both readers and writers spin while the lock is held.66 The lock is nonblocking, so both readers and writers spin while the lock is held. 66 67 This very wide sharding strategy means that readers have very good locality since they only ever need to access two memory locations. 67 68 … … 98 99 \section{Idle-Sleep}\label{idlesleep} 99 100 While manual resizing of \procs is expected to be rare, the number of \ats can vary significantly over an application's lifetime, which means there are times when there are too few or too many \procs. 
100 For this work, it is the programmer's responsibility to manually create \procs, so if there are too few \procs, the application must address this issue. 101 For this work, it is the application programmer's responsibility to manually create \procs, so if there are too few \procs, the application must address this issue. 101 102 This leaves too many \procs when there are not enough \ats for all the \procs to be useful. 102 103 These idle \procs cannot be removed because their lifetime is controlled by the application, and only the application knows when the number of \ats may increase or decrease. … 108 109 Because idle sleep is spurious, this data structure has strict performance requirements, in addition to strict correctness requirements. 109 110 Next, some mechanism is needed to block \glspl{kthrd}, \eg @pthread_cond_wait@ or a pthread semaphore. 110 The complexity here is to support \at \glslink{atblock}{parking} and \glslink{atsched}{unparking}, user-level locking, timers, \io operations, and all other \CFA features with minimal complexity. 111 The challenge here is to support user-level locking, timers, \io operations, and all other \CFA features with minimal complexity. 111 112 Finally, the scheduler needs a heuristic to determine when to block and unblock an appropriate number of \procs. 112 113 However, this third challenge is outside the scope of this thesis because developing a general heuristic is complex enough to justify its own work. … 115 116 An interesting subpart of this heuristic is what to do with bursts of \ats that become ready. 116 117 Since waking up a sleeping \proc can have notable latency, multiple \ats may become ready while a single \proc is waking up. 117 This fact begs the question, if many \procs are available, how many should be woken? 118 This fact raises the question: if many \procs are available, how many should be woken?
118 119 If the ready \ats will run longer than the wake-up latency, waking one \proc per \at will offer maximum parallelization. 119 120 If the ready \ats will run for a very short time, waking many \procs may be wasteful. 120 As mentioned, a heuristic to handle these complex cases is outside the scope of this thesis, the behaviour of the scheduler in this particular case is left unspecified. 121 As mentioned, since a heuristic to handle these complex cases is outside the scope of this thesis, the behaviour of the scheduler in this particular case is left unspecified. 121 122 122 123 \section{Sleeping} … 156 157 The notifier first makes sure the newly ready \at is visible to \procs searching for \ats, and then attempts to notify an idle \proc. 157 158 On the other side, \procs make themselves visible as idle \procs and then search for any \ats they may have missed. 158 Unlike regular work-stealing, this search must be exhaustive to make sure that pre-existing \at is missed. 159 Unlike regular work-stealing, this search must be exhaustive to make sure that no pre-existing \at is missed. 159 160 These steps from both sides guarantee that if the search misses a newly ready \at, then the notifier is guaranteed to see at least one idle \proc. 160 161 Conversely, if the notifier does not see any idle \proc, then a \proc is guaranteed to find the new \at in its exhaustive search. … 188 189 \centering 189 190 \input{idle1.pstex_t} 190 \caption[Basic Idle Sleep Data Structure]{Basic Idle Sleep Data Structure \smallskip\newline Each idle \proc is put unto a doubly-linked stack protected by a lock. 191 \caption[Basic Idle Sleep Data Structure]{Basic Idle Sleep Data Structure \smallskip\newline Each idle \proc is put onto a doubly-linked stack protected by a lock. 191 192 Each \proc has a private event \lstinline{fd}.} 192 193 \label{fig:idle1}
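The two-sided publish-then-check protocol described above (notifier: publish the \at, then look for an idle \proc; \proc: publish itself as idle, then search exhaustively) can be made concrete with C11 atomics. The sketch below is illustrative only: plain counters stand in for the ready queue and the idle list. The essential property is that each side publishes before checking, with sequentially consistent ordering in between, so in any interleaving at least one side observes the other and no wakeup is lost.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <assert.h>

static atomic_int ready_ats;   // stand-in for the ready queue
static atomic_int idle_count;  // stand-in for the idle-processor list

static bool need_wakeup;       // result flags, for illustration only
static bool found_work;

// Notifier side: make the ready task visible, THEN check for idle procs.
static void notify_side(void) {
	atomic_fetch_add(&ready_ats, 1);                // 1. publish the task
	atomic_thread_fence(memory_order_seq_cst);      //    order store before load
	need_wakeup = atomic_load(&idle_count) > 0;     // 2. wake someone if idle
}

// Processor side: advertise as idle, THEN do the exhaustive last search.
static void idle_side(void) {
	atomic_fetch_add(&idle_count, 1);               // 1. become visible as idle
	atomic_thread_fence(memory_order_seq_cst);      //    order store before load
	found_work = atomic_load(&ready_ats) > 0;       // 2. exhaustive search
	if (found_work)
		atomic_fetch_sub(&idle_count, 1);           //    cancel idling, run it
}
```

If the processor's search misses the new task, its idle announcement is already visible, so the notifier sees an idle \proc; conversely, if the notifier sees no idle \proc, the task is already visible to the search.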
doc/theses/thierry_delisle_PhD/thesis/text/runtime.tex
r82a90d4 rddcaff6 6 6 \Celeven introduced threading features, such as the @_Thread_local@ storage class, and libraries @stdatomic.h@ and @threads.h@. 7 7 Interestingly, almost a decade after the \Celeven standard, the most recent versions of gcc, clang, and msvc do not support the \Celeven include @threads.h@, indicating no interest in the C11 concurrency approach (possibly because of the recent effort to add concurrency to \CC). 8 While the \Celeven standard does not state a threading model, the historical association with pthreads suggests implementations would adopt kernel-level threading (1:1)~\cite{ThreadModel}, as for \CC. 8 While the \Celeven standard does not state a threading model, the historical association with pthreads suggests implementations would adopt kernel-level threading (1:1)~\cite{ThreadModel}, as \CC does. 9 9 This model uses \glspl{kthrd} to achieve parallelism and concurrency. In this model, every thread of computation maps to an object in the kernel. 10 10 The kernel then has the responsibility of managing these threads, \eg creating, scheduling, blocking. … 38 38 39 39 \section{\glsxtrshort{io}}\label{prev:io} 40 Prior to this work, the \CFA runtime did not add any particular support for \glsxtrshort{io} operations. While all \glsxtrshort{io} operations available in C are available in \CFA, \glsxtrshort{io} operations are designed for the POSIX threading model~\cite{pthreads}. Using these 1:1 threading operations in an M:N threading model means \glsxtrshort{io} operations block \glspl{proc} instead of \ats. While this can work in certain cases, it limits the number of concurrent operations to the number of \glspl{proc} rather than \ats. It also means deadlock can occur because all \glspl{proc} are blocked even if at least one \at is ready to run. A simple example of this type of deadlock would be as follows: 40 Prior to this work, the \CFA runtime did not have any particular support for \glsxtrshort{io} operations.
While all \glsxtrshort{io} operations available in C are available in \CFA, \glsxtrshort{io} operations are designed for the POSIX threading model~\cite{pthreads}. Using these 1:1 threading operations in an M:N threading model means \glsxtrshort{io} operations block \glspl{proc} instead of \ats. While this can work in certain cases, it limits the number of concurrent operations to the number of \glspl{proc} rather than \ats. It also means deadlock can occur because all \glspl{proc} are blocked even if at least one \at is ready to run. A simple example of this type of deadlock would be as follows: 41 41 42 42 \begin{quote} … … 49 49 \end{quote} 50 50 51 Therefore, one of the objectives of this work is to introduce \emph{User-Level \glsxtrshort{io}}, which like \glslink{uthrding}{User-Level \emph{Threading}}, blocks \ats rather than \glspl{proc} when doing \glsxtrshort{io} operations.51 Therefore, one of the objectives of this work is to introduce \emph{User-Level \glsxtrshort{io}}, which, like \glslink{uthrding}{User-Level \emph{Threading}}, blocks \ats rather than \glspl{proc} when doing \glsxtrshort{io} operations. 52 52 This feature entails multiplexing the \glsxtrshort{io} operations of many \ats onto fewer \glspl{proc}. 53 53 The multiplexing requires a single \gls{proc} to execute multiple \glsxtrshort{io} operations in parallel. … … 56 56 57 57 \section{Interoperating with C} 58 While \glsxtrshort{io} operations are the classical example of operations that block \glspl{kthrd}, the non-blocking challenge extends to all blocking system-calls. The POSIX standard states~\cite[\S~2.9.1]{POSIX17}: 58 While \glsxtrshort{io} operations are the classical example of operations that block \glspl{kthrd}, the non-blocking challenge extends to all blocking system-calls. 
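The user-level \glsxtrshort{io} idea described above can be illustrated in plain C. One classical way to keep a blocking read from stalling the \gls{kthrd} is to mark the file descriptor nonblocking and treat EAGAIN as a yield point, at which a scheduler would run another \at. This is a sketch of the general idea only, not the mechanism the \CFA runtime actually uses; the helper name and return convention are invented.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <assert.h>

// Attempt a read without ever blocking the kernel thread.
// Returns 0 if data was read (count in *got), 1 if the caller should
// yield to the scheduler and retry later, and -1 on a real error.
static int try_read(int fd, char * buf, size_t len, ssize_t * got) {
	*got = read(fd, buf, len);
	if (*got >= 0) return 0;
	if (errno == EAGAIN || errno == EWOULDBLOCK)
		return 1;   // would have blocked: park this task, run another
	return -1;
}
```

In a real M:N runtime, the "return 1" path is where the current \at is parked and the \gls{proc} picks up other work, so only the \at, never the \gls{kthrd}, waits for the \glsxtrshort{io}.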
59 The POSIX standard states~\cite[\S~2.9.1]{POSIX17}: 59 60 \begin{quote} 60 61 All functions defined by this volume of POSIX.1-2017 shall be thread-safe, except that the following functions need not be thread-safe. ... (list of 70+ excluded functions) 61 62 \end{quote} 62 Only UNIX @man@ pages identify whether a library function is thread-safe, and hence, may block on a pthreads lock or system call; hence interoperability with UNIX library functions is a challenge for an M:N threading model. 63 Only UNIX @man@ pages identify whether a library function is thread-safe, and hence may block on a pthreads lock or system call; therefore, interoperability with UNIX library functions is a challenge for an M:N threading model. 63 64 64 Languages like Go and Java, which have strict interoperability with C~\cite{wiki:jni,go:cgo}, can control operations in C by ``sandboxing'' them, \eg a blocking function may be delegated to a \gls{kthrd}. Sandboxing may help towards guaranteeing that the kind of deadlock mentioned above does not occur. 65 Languages like Go and Java, which have strict interoperability with C~\cite{wiki:jni,go:cgo}, can control operations in C by ``sandboxing'' them, \eg a blocking function may be delegated to a \gls{kthrd}. 66 Sandboxing may help towards guaranteeing that the kind of deadlock mentioned above does not occur. 65 67 66 As mentioned in Section~\ref{intro}, \CFA is binary compatible with C and, as such, must support all C library functions. Furthermore, interoperability can happen at the function-call level, inline code, or C and \CFA translation units linked together. This fine-grained interoperability between C and \CFA has two consequences: 68 As mentioned in Section~\ref{intro}, \CFA is binary compatible with C and, as such, must support all C library functions. 69 Furthermore, interoperability can happen at the function-call level, inline code, or C and \CFA translation units linked together.
70 This fine-grained interoperability between C and \CFA has two consequences: 67 71 \begin{enumerate} 68 72 \item Precisely identifying blocking C calls is difficult. … 71 75 Because of these consequences, this work does not attempt to ``sandbox'' calls to C. 72 76 Therefore, it is possible that calls to an unknown library function can block a \gls{kthrd}, leading to deadlocks in \CFA's M:N threading model, which would not occur in a traditional 1:1 threading model. 77 Since the blocking call is not known to the runtime, it is not necessarily possible to distinguish whether or not a deadlock occurs. 73 78 Currently, all M:N thread systems interacting with UNIX without sandboxing suffer from this problem but manage to work very well in the majority of applications. 74 Therefore, a complete solution to this problem is outside the scope of this thesis.\footnote{\CFA does provide a pthreads emulation, so any library function using embedded pthreads locks is redirected to \CFA user-level locks. This capability further reduces the chances of blocking a \gls{kthrd}.} 79 Therefore, a complete solution to this problem is outside the scope of this thesis. 80 \footnote{\CFA does provide a pthreads emulation, so any library function using embedded pthreads locks is redirected to \CFA user-level locks. 81 This capability further reduces the chances of blocking a \gls{kthrd}.} 82 Chapter~\ref{userio} discusses the approach to C interoperability chosen and used for the evaluation in Chapter~\ref{macrobench}.
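The "sandboxing" idea mentioned for Go and Java can be sketched in C with pthreads: delegate a potentially blocking C call to its own \gls{kthrd} so the scheduler's kernel thread stays free to run other \ats. This is explicitly not what \CFA does; all names here are illustrative, and a real runtime would park the calling \at instead of joining synchronously.

```c
#include <pthread.h>
#include <unistd.h>
#include <assert.h>

struct delegate {
	void * (*fn)(void *);   // the potentially blocking C function
	void * arg;
	void * result;
};

static void * delegate_trampoline(void * p) {
	struct delegate * d = p;
	d->result = d->fn(d->arg);
	return NULL;
}

// Run 'fn(arg)' on a dedicated kernel thread and wait for the result.
// Only this helper thread blocks; in a real M:N runtime the calling
// task would be parked here and its processor would run other tasks.
static void * run_sandboxed(void * (*fn)(void *), void * arg) {
	struct delegate d = { fn, arg, NULL };
	pthread_t tid;
	if (pthread_create(&tid, NULL, delegate_trampoline, &d) != 0)
		return fn(arg);   // fallback: run inline if no thread available
	pthread_join(tid, NULL);
	return d.result;
}

// Example "blocking" function: sleeps briefly, then returns its argument.
static void * slow_identity(void * arg) {
	usleep(1000);   // stands in for a blocking system call
	return arg;
}
```

The cost that makes this approach unattractive for fine-grained C interoperability is visible in the sketch: every delegated call pays for a kernel-thread handoff, which is unacceptable when C calls are pervasive and mostly non-blocking.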