Changeset fc96890
- Timestamp: Sep 13, 2022, 3:07:25 PM (2 years ago)
- Branches: ADT, ast-experimental, master, pthread-emulation
- Children: 3606fe4
- Parents: 1c334d1
- Location: doc/theses/thierry_delisle_PhD/thesis/text
- Files: 9 edited
doc/theses/thierry_delisle_PhD/thesis/text/conclusion.tex
r1c334d1 rfc96890 2 2 3 3 Building the \CFA runtime has been a challenging project. 4 The work was divided between high-level concurrency design and a user-level threading runtime (Masters' s thesis), and low-level support of the user-level runtime using OS kernel-threading and its (multiple) I/O subsystems (Ph.D. thesis).4 The work was divided between high-level concurrency design and a user-level threading runtime (Masters' thesis), and low-level support of the user-level runtime using OS kernel threading and its (multiple) I/O subsystems (Ph.D. thesis). 5 5 Because I am the main developer for both components of this project, there is strong continuity across the design and implementation. 6 6 This continuity provides a consistent approach to advanced control flow and concurrency, with easier development, management and maintenance of the runtime in the future. 7 7 8 I believed my Masters' swork would provide the background to make the Ph.D. work reasonably straightforward.8 I believed my Masters' work would provide the background to make the Ph.D. work reasonably straightforward. 9 9 However, I discovered two significant challenges. 10 10 … … 20 20 Second, the kernel locking, threading, and I/O in the Linux operating system offer very little flexibility and are not designed to facilitate user-level threading. 21 21 There are multiple concurrency aspects in Linux that require carefully following a strict procedure to achieve acceptable performance. 22 To be fair, many of these concurrency aspects were designed 30-40 years ago, when there were few multi -processor computers and concurrency knowledge was just developing.22 To be fair, many of these concurrency aspects were designed 30-40 years ago, when there were few multiprocessor computers and concurrency knowledge was just developing. 23 23 Unfortunately, little has changed in the intervening years. 24 24 … … 26 26 The positive is that @io_uring@ supports the panoply of I/O mechanisms in Linux; 27 27 hence, the \CFA runtime uses one I/O mechanism to provide non-blocking I/O, rather than using @select@ to handle TTY I/O, @epoll@ to handle network I/O, and managing a thread pool to handle disk I/O. 28 Merging all these different I/Omechanisms into a coherent scheduling implementation would require much more work than what is present in this thesis, as well as detailed knowledge of multiple I/O mechanisms.28 Merging all these different \io mechanisms into a coherent scheduling implementation would require much more work than what is present in this thesis, as well as detailed knowledge of multiple I/O mechanisms. 29 29 The negative is that @io_uring@ is new and developing. 30 30 As a result, there is limited documentation, few places to find usage examples, and multiple errors that required workarounds. … … 75 75 76 76 \section{Future Work} 77 While the \CFA runtime achieves a better compromise ,than other schedulers, in terms of performance and fairness, I believe further improvements can be made to reduce or eliminate the few cases where performance does deteriorate.77 While the \CFA runtime achieves a better compromise than other schedulers, in terms of performance and fairness, I believe further improvements can be made to reduce or eliminate the few cases where performance does deteriorate. 78 78 Fundamentally, achieving performance and starvation freedom will always be goals with opposing needs even outside of scheduling algorithms. 
79 79 80 80 \subsection{Idle Sleep} 81 A difficult challenge, not fully addressed in this thesis, is idle -sleep.81 A difficult challenge, not fully addressed in this thesis, is idle sleep. 82 82 While a correct and somewhat low-cost idle-sleep mechanism is presented, several of the benchmarks show notable performance degradation when too few \ats are present in the system. 83 83 The idle sleep mechanism could therefore benefit from a reduction of spurious cases of sleeping. … … 88 88 \begin{itemize} 89 89 \item 90 The mechanism uses a hand -shake between notification and sleep to ensure that no \at is missed.90 The mechanism uses a handshake between notification and sleep to ensure that no \at is missed. 91 91 \item 92 The hand -shake correctness is critical when the last \proc goes to sleep but could be relaxed when several \procs are awake.92 The handshake correctness is critical when the last \proc goes to sleep but could be relaxed when several \procs are awake. 93 93 \item 94 94 Furthermore, organizing the sleeping \procs as a LIFO stack makes sense to keep cold \procs as cold as possible, but it might be more appropriate to attempt to keep cold CPU sockets instead. -
doc/theses/thierry_delisle_PhD/thesis/text/core.tex
r1c334d1 rfc96890 13 13 To match expectations, the design must offer the programmer sufficient guarantees so that, as long as they respect the execution mental model, the system also respects this model. 14 14 15 For threading, a simple and common execution mental model is the `` Ideal multi-tasking CPU'' :15 For threading, a simple and common execution mental model is the ``ideal multitasking CPU'' : 16 16 17 17 \begin{displayquote}[Linux CFS\cite{MAN:linux/cfs}] 18 {[The]} `` Ideal multi-tasking CPU'' is a (non-existent :-)) CPU that has 100\% physical power and which can run each task at precise equal speed, in parallel, each at [an equal fraction of the] speed. For example: if there are 2 running tasks, then it runs each at 50\% physical power --- i.e., actually in parallel.18 {[The]} ``ideal multi-tasking CPU'' is a (non-existent :-)) CPU that has 100\% physical power and which can run each task at precise equal speed, in parallel, each at [an equal fraction of the] speed. For example: if there are 2 running tasks, then it runs each at 50\% physical power --- i.e., actually in parallel. 19 19 \label{q:LinuxCFS} 20 20 \end{displayquote} … … 22 22 Applied to \ats, this model states that every ready \at immediately runs in parallel with all other ready \ats. While a strict implementation of this model is not feasible, programmers still have expectations about scheduling that come from this model. 23 23 24 In general, the expectation at the cent erof this model is that ready \ats do not interfere with each other but simply share the hardware.24 In general, the expectation at the centre of this model is that ready \ats do not interfere with each other but simply share the hardware. 25 25 This assumption makes it easier to reason about threading because ready \ats can be thought of in isolation and the effect of the scheduler can be virtually ignored. 26 26 This expectation of \at independence means the scheduler is expected to offer two guarantees: … … 31 31 32 32 It is important to note that these guarantees are expected only up to a point. 33 \Glspl{at} that are ready to run should not be prevented to doso, but they still share the limited hardware resources.33 \Glspl{at} that are ready to run should not be prevented from doing so, but they still share the limited hardware resources. 34 34 Therefore, the guarantee is considered respected if a \at gets access to a \emph{fair share} of the hardware resources, even if that share is very small. 35 35 … … 92 92 \subsubsection{Scalability} 93 93 The most basic performance challenge of a scheduler is scalability. 94 Given a large number of \procs and an even larger number of \ats, scalability measures how fast \procs can enqueue and dequeue s\ats.95 One could expect that doubling the number of \procs would double the rate at which \ats are dequeued, but contention on the internal data structure of the scheduler can lead to worstimprovements.94 Given a large number of \procs and an even larger number of \ats, scalability measures how fast \procs can enqueue and dequeue \ats. 95 One could expect that doubling the number of \procs would double the rate at which \ats are dequeued, but contention on the internal data structure of the scheduler can diminish the improvements. 96 96 While the ready queue itself can be sharded to alleviate the main source of contention, auxiliary scheduling features, \eg counting ready \ats, can also be sources of contention. 97 97 … … 108 108 The problem is a single point of contention when adding/removing \ats. 
109 109 As shown in the evaluation sections, most production schedulers do scale when adding \glspl{hthrd}. 110 The solution to this problem is to shard the ready queue: create multiple \emph{sub-queues} forming the logical ready 110 The solution to this problem is to shard the ready queue: create multiple \emph{sub-queues} forming the logical ready-queue and the sub-queues are accessed by multiple \glspl{hthrd} without interfering. 111 111 112 112 Before going into the design of \CFA's scheduler, it is relevant to discuss two sharding solutions that served as the inspiration scheduler in this thesis. … … 114 114 \subsection{Work-Stealing} 115 115 116 As mentioned in \ref{existing:workstealing}, a popular sharding approach for the ready -queue is work-stealing.116 As mentioned in \ref{existing:workstealing}, a popular sharding approach for the ready queue is work-stealing. 117 117 In this approach, each \gls{proc} has its own local sub-queue and \glspl{proc} only access each other's sub-queue if they run out of work on their local ready-queue. 118 118 The interesting aspect of work stealing happens in the steady-state scheduling case, \ie all \glspl{proc} have work and no load balancing is needed. … … 134 134 Timestamps are added to each element of a sub-queue. 135 135 \item 136 A \gls{proc} randomly tests ready -queues until it has acquired one or two queues.136 A \gls{proc} randomly tests ready queues until it has acquired one or two queues. 137 137 \item 138 138 If two queues are acquired, the older of the two \ats is dequeued from the front of the acquired queues. … … 246 246 247 247 A simple solution to this problem is to use an exponential moving average\cite{wiki:ma} (MA) instead of a raw timestamp, as shown in Figure~\ref{fig:base-ma}. 248 Note ,that this is more complex because the \at at the head of a sub-queue is still waiting, so its wait time has not ended.248 Note that this is more complex because the \at at the head of a sub-queue is still waiting, so its wait time has not ended. 249 249 Therefore, the exponential moving average is an average of how long each dequeued \at has waited. 250 250 To compare sub-queues, the timestamp at the head must be compared to the current time, yielding the best-case wait time for the \at at the head of the queue. … … 259 259 Conversely, the active sub-queues do not benefit much from helping since starvation is already a non-issue. 260 260 This puts this algorithm in the awkward situation of paying for a largely unnecessary cost. 261 The good news is that this problem can be mitigated 261 The good news is that this problem can be mitigated. 262 262 263 263 \subsection{Redundant Timestamps}\label{relaxedtimes} … … 279 279 \input{base_ts2.pstex_t} 280 280 \caption[\CFA design with Redundant Timestamps]{\CFA design with Redundant Timestamps \smallskip\newline An array is added containing a copy of the timestamps. 281 These timestamps are written 281 These timestamps are written-to with relaxed atomics, so there is no order among concurrent memory accesses, leading to fewer cache invalidations.} 282 282 \label{fig:base-ts2} 283 283 \end{figure} … … 292 292 With redundant timestamps, this scheduling algorithm achieves both the fairness and performance requirements on most machines. 293 293 The problem is that the cost of polling and helping is not necessarily consistent across each \gls{hthrd}. 
294 For example ,on machines with a CPU containing multiple hyper threads and cores and multiple CPU sockets, cache misses can be satisfied from the caches on the same (local) CPU, or by a CPU on a different (remote) socket.294 For example on machines with a CPU containing multiple hyper threads and cores and multiple CPU sockets, cache misses can be satisfied from the caches on the same (local) CPU, or by a CPU on a different (remote) socket. 295 295 Cache misses satisfied by a remote CPU have significantly higher latency than from the local CPU. 296 296 However, these delays are not specific to systems with multiple CPUs. … … 344 344 Therefore, the approach used in the \CFA scheduler is to have per-\proc sub-queues, but have an explicit data structure to track which cache substructure each sub-queue is tied to. 345 345 This tracking requires some finesse because reading this data structure must lead to fewer cache misses than not having the data structure in the first place. 346 A key element howeveris that, like the timestamps for helping, reading the cache instance mapping only needs to give the correct result \emph{often enough}.346 A key element, however, is that, like the timestamps for helping, reading the cache instance mapping only needs to give the correct result \emph{often enough}. 347 347 Therefore the algorithm can be built as follows: before enqueueing or dequeuing a \at, each \proc queries the CPU id and the corresponding cache instance. 348 348 Since sub-queues are tied to \procs, each \proc can then update the cache instance mapped to the local sub-queue(s). -
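To make the redundant-timestamp and moving-average ideas from core.tex above concrete, here is a minimal C11 sketch of how a per-sub-queue timestamp copy could be published with relaxed atomics and how the wait times of dequeued tasks could be folded into an exponential moving average. The names (ts_copy, ema_wait, ALPHA, NQUEUES) and the averaging weight are assumptions for the illustration, not the \CFA runtime's actual code.

#include <stdatomic.h>
#include <stdint.h>

#define NQUEUES 64      // number of sub-queues (placeholder)
#define ALPHA   0.5     // weight of the newest sample in the moving average (assumed)

// Redundant copy of each sub-queue's head timestamp.  It is written and read
// with relaxed atomics: a stale value only delays helping slightly, which is
// acceptable because helping must be right only "often enough".
static _Atomic uint64_t ts_copy[NQUEUES];

// Exponential moving average of how long dequeued tasks waited, one per
// sub-queue, written only by the owning processor.
static double ema_wait[NQUEUES];

// Owner publishes the enqueue timestamp of the new head whenever it changes.
static void publish_head_timestamp(int q, uint64_t head_enqueue_ts) {
    atomic_store_explicit(&ts_copy[q], head_enqueue_ts, memory_order_relaxed);
}

// Owner folds the wait time of each dequeued task into the moving average.
static void record_dequeue(int q, uint64_t enqueue_ts, uint64_t now) {
    uint64_t waited = now - enqueue_ts;
    ema_wait[q] = ALPHA * (double)waited + (1.0 - ALPHA) * ema_wait[q];
}

// A helping processor estimates the best-case wait time of queue q's head by
// comparing the (possibly stale) copied timestamp to the current time.
static uint64_t head_wait_estimate(int q, uint64_t now) {
    uint64_t ts = atomic_load_explicit(&ts_copy[q], memory_order_relaxed);
    return now > ts ? now - ts : 0;
}

Because the copies are consulted only to decide whether helping is warranted, relaxed ordering avoids the cache invalidations that stronger atomics would cause, matching the rationale given in the hunk above.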
doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex
r1c334d1 rfc96890 13 13 Memcached~\cite{memcached} is an in-memory key-value store used in many production environments, \eg \cite{atikoglu2012workload}. 14 14 The Memcached server is so popular there exists a full-featured front-end for performance testing, called @mutilate@~\cite{GITHUB:mutilate}. 15 Experimenting on Memcached allows for a simple test of the \CFA runtime as a whole, exercising the scheduler, the idle-sleep mechanism, as well the \io subsystem for sockets.15 Experimenting on Memcached allows for a simple test of the \CFA runtime as a whole, exercising the scheduler, the idle-sleep mechanism, as well as the \io subsystem for sockets. 16 16 Note that this experiment does not exercise the \io subsystem with regard to disk operations because Memcached is an in-memory server. 17 17 … … 48 48 For UDP connections, all the threads listen to a single UDP socket for incoming requests. 49 49 Threads that are not currently dealing with another request ignore the incoming packet. 50 One of the remaining, non busy, threads reads the request and sends the response.50 One of the remaining, non-busy, threads reads the request and sends the response. 51 51 This implementation can lead to increased CPU \gls{load} as threads wake from sleep to potentially process the request. 52 52 \end{itemize} … … 110 110 Again, each experiment is run 15 times with the median, maximum and minimum plotted with different lines. 111 111 As expected, the latency starts low and increases as the server gets close to saturation, at which point, the latency increases dramatically because the web servers cannot keep up with the connection rate so client requests are disproportionally delayed. 112 Because of this dramatic increase, the Y 112 Because of this dramatic increase, the Y-axis is presented using a log scale. 113 113 Note that the graph shows the \emph{target} query rate, the actual response rate is given in Figure~\ref{fig:memcd:rate:qps} as this is the same underlying experiment. 114 114 … … 117 117 Vanilla Memcached achieves the lowest latency until 600K, after which all the web servers are struggling to respond to client requests. 118 118 \CFA begins to decline at 600K, indicating some bottleneck after saturation. 119 Overall, all three web servers achieve micro -second latencies and the increases in latency mostly follow each other.119 Overall, all three web servers achieve microsecond latencies and the increases in latency mostly follow each other. 120 120 121 121 \subsection{Update rate} … … 269 269 \begin{itemize} 270 270 \item 271 A client runs a 2.6.11-1 SMP Linux kernel, which permits each client load -generator to run on a separate CPU.271 A client runs a 2.6.11-1 SMP Linux kernel, which permits each client load generator to run on a separate CPU. 272 272 \item 273 273 It has two 2.8 GHz Xeon CPUs, and four one-gigabit Ethernet cards. … … 285 285 To measure web server throughput, the server computer is loaded with 21,600 files, sharded across 650 directories, occupying about 2.2GB of disk, distributed over the server's RAID-5 4-drives to achieve high throughput for disk I/O. 286 286 The clients run httperf~\cite{httperf} to request a set of static files. 287 The httperf load -generator is used with session files to simulate a large number of users and to implement a partially open-loop system.287 The httperf load generator is used with session files to simulate a large number of users and to implement a partially open-loop system. 
288 288 This permits httperf to produce overload conditions, generate multiple requests from persistent HTTP/1.1 connections, and include both active and inactive off periods to model browser processing times and user think times~\cite{Barford98}. 289 289 290 290 The experiments are run with 16 clients, each running a copy of httperf (one copy per CPU), requiring a set of 16 log files with requests conforming to a Zipf distribution. 291 291 This distribution is representative of users accessing static data through a web browser. 292 Each request reads a file name from its trace, establishes a connection, performs an HTTP get-request for the file name, receives the file data, closes the connection, and repeats the process.292 Each request reads a file name from its trace, establishes a connection, performs an HTTP GET request for the file name, receives the file data, closes the connection, and repeats the process. 293 293 Some trace elements have multiple file names that are read across a persistent connection. 294 294 A client times out if the server does not complete a request within 10 seconds. -
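The UDP handling described above for memcached (every worker blocks on the one shared UDP socket, and whichever idle thread the kernel wakes services the request) can be sketched with plain pthreads. This is an illustration of the pattern only, not memcached's code; the port, buffer size and thread count are placeholder values.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <sys/socket.h>

static int udp_fd;   // single UDP socket shared by all workers

static void *worker(void *arg) {
    (void)arg;
    char buf[1500];
    for (;;) {
        struct sockaddr_in client;
        socklen_t len = sizeof(client);
        // busy workers are simply not parked here, so they never see the packet
        ssize_t n = recvfrom(udp_fd, buf, sizeof(buf), 0,
                             (struct sockaddr *)&client, &len);
        if (n < 0) continue;
        // ... parse the request, look up the key, build the reply in buf ...
        sendto(udp_fd, buf, (size_t)n, 0, (struct sockaddr *)&client, len);
    }
    return NULL;
}

int main(void) {
    udp_fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(11211),   // memcached's default port
                                .sin_addr.s_addr = htonl(INADDR_ANY) };
    bind(udp_fd, (struct sockaddr *)&addr, sizeof(addr));
    pthread_t tid[4];
    for (int i = 0; i < 4; i++) pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < 4; i++) pthread_join(tid[i], NULL);
    return 0;
}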
doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex
r1c334d1 rfc96890 4 4 This chapter presents five different experimental setups for evaluating the basic features of the \CFA, libfibre~\cite{libfibre}, Go, and Tokio~\cite{Tokio} schedulers. 5 5 All of these systems have a \gls{uthrding} model. 6 The goal of this chapter is show that the \CFA scheduler obtains equivalent performance to other less fairschedulers through the different experiments.6 The goal of this chapter is to show that the \CFA scheduler obtains equivalent performance to other, less fair, schedulers through the different experiments. 7 7 Note that only the code of the \CFA tests is shown; 8 8 all tests in the other systems are functionally identical and available online~\cite{GITHUB:SchedulingBenchmarks}. … … 48 48 49 49 For each experiment, four graphs are generated showing traditional throughput on the top row and \newterm{scalability} or \newterm{latency} on the bottom row (peek ahead to Figure~\ref{fig:cycle:jax}). 50 Scalability uses the same data as throughput but the Y 50 Scalability uses the same data as throughput but the Y-axis is calculated as the number of \procs over the throughput. 51 51 In this representation, perfect scalability should appear as a horizontal line, \eg, if doubling the number of \procs doubles the throughput, then the relation stays the same. 52 52 … … 171 171 \CFA, Go and Tokio all obtain effectively the same throughput performance. 172 172 Libfibre is slightly behind in this case but still scales decently. 173 As a result of the \gls{kthrd} placement, additional \procs from 25 to 48 offer less performance improvement for all runtimes, which can be seen as a flatt ing of the line.173 As a result of the \gls{kthrd} placement, additional \procs from 25 to 48 offer less performance improvement for all runtimes, which can be seen as a flattening of the line. 174 174 This effect even causes a decrease in throughput in libfibre's case. 175 175 As expected, this pattern repeats between \proc count 72 and 96. … … 180 180 This decrease in performance is likely due to the additional overhead of the idle-sleep mechanism. 181 181 This can either be the result of \procs actually running out of work or simply additional overhead from tracking whether or not there is work available. 182 Indeed, unlike the left column, it is likely that the ready -queue is transiently empty, which likely triggers additional synchronization steps.182 Indeed, unlike the left column, it is likely that the ready queue is transiently empty, which likely triggers additional synchronization steps. 183 183 Interestingly, libfibre achieves better performance with 1 cycle. 184 184 … … 196 196 Clearly, the pathological case with 1 cycle per \proc can affect fairness algorithms managing mostly idle processors, \eg \CFA, but only at high core counts. 197 197 In this case, \emph{any} helping is likely to cause a cascade of \procs running out of work and attempting to steal. 198 For this experiment, the \CFA scheduler has achieved the goal of obtaining equivalent performance to other less fairschedulers.198 For this experiment, the \CFA scheduler has achieved the goal of obtaining equivalent performance to other, less fair, schedulers. 199 199 200 200 \section{Yield} … … 258 258 Figures~\ref{fig:yield:jax} and \ref{fig:yield:nasus} show the results for the yield experiment on Intel and AMD, respectively. 259 259 Looking at the left column on Intel, Figures~\ref{fig:yield:jax:ops} and \ref{fig:yield:jax:ns} show the results for 100 \ats for each \proc. 
260 Note , taht the Y-axis on this graph is twice as large as the Intel cycle graph.260 Note that the Y-axis on this graph is twice as large as the Intel cycle graph. 261 261 A visual glance between the left columns of the cycle and yield graphs confirms my claim that the yield benchmark is unreliable. 262 262 \CFA has no special handling for @yield@, but this experiment requires less synchronization than the @cycle@ experiment. 263 263 Hence, the @yield@ throughput and scalability graphs have similar shapes to the corresponding @cycle@ graphs. 264 264 The only difference is slightly better performance for @yield@ because of less synchronization. 265 Libfibre has special handling for @yield@ using the fact that the number of ready fibres does not change, and therefore, by -passing the idle-sleep mechanism entirely.265 Libfibre has special handling for @yield@ using the fact that the number of ready fibres does not change, and therefore, bypassing the idle-sleep mechanism entirely. 266 266 Hence, libfibre behaves very differently in the cycle and yield benchmarks, with a 4 times increase in performance on the left column. 267 267 Go has special handling for @yield@ by putting a yielding goroutine on a secondary global ready-queue, giving it a lower priority. … … 330 330 331 331 The Cycle and Yield benchmarks represent an \emph{easy} scenario for a scheduler, \eg an embarrassingly parallel application. 332 In these benchmarks ,\ats can be easily partitioned over the different \procs upfront and none of the \ats communicate with each other.332 In these benchmarks \ats can be easily partitioned over the different \procs upfront and none of the \ats communicate with each other. 333 333 334 334 The Churn benchmark represents more chaotic executions, where there is more communication among \ats but no relationship between the last \proc on which a \at ran and blocked, and the \proc that subsequently unblocks it. … … 486 486 487 487 In conclusion, the objective of this benchmark is to demonstrate that \glslink{atsched}{unparking} \ats from remote \procs does not cause too much contention on the local queues. 488 Indeed, the fact that most runtimes achieve some scaling between various \proc count demonstratemigrations do not need to be serialized.488 Indeed, the fact that most runtimes achieve some scaling between various \proc counts demonstrates migrations do not need to be serialized. 489 489 Again these results demonstrate that \CFA achieves satisfactory performance compared to the other runtimes. 490 490 … … 497 497 In the noshare variation, the array is not passed on and each thread continuously accesses its private array. 498 498 In the share variation, the array is passed to another thread via the semaphore's shadow queue (each blocking thread can save a word of user data in its blocking node), transferring ownership of the array to the woken thread. 499 Figure~\ref{fig:locality:code} shows the pseudo -code for this benchmark.499 Figure~\ref{fig:locality:code} shows the pseudo code for this benchmark. 500 500 501 501 The objective here is to highlight the different decisions made by the runtime when \glslink{atsched}{unparking}. … … 567 567 \CFA and Tokio slightly outperform libfibre, as expected, based on their \ats placement approach. 568 568 \CFA and Tokio both \unpark locally and do not suffer cache misses on the transferred array. 
569 Libfibre on the other handunparks remotely, and as such the unparked \at is likely to miss on the shared data.569 Libfibre, on the other hand, unparks remotely, and as such the unparked \at is likely to miss on the shared data. 570 570 Go trails behind in this experiment, presumably for the same reasons that were observable in the churn benchmark. 571 571 Otherwise, the results are similar to the churn benchmark, with lower throughput due to the array processing. … … 645 645 Again the overall performance is higher and slightly more variation is visible. 646 646 Looking at the left column first, Figures~\ref{fig:locality:nasus:share:ops} and \ref{fig:locality:nasus:share:ns}, \CFA and Tokio still outperform libfibre, this time more significantly. 647 This advantage is expected from the AMD server with its smaller and more narrowcaches that magnify the costs of processing the array.647 This advantage is expected from the AMD server with its smaller and narrower caches that magnify the costs of processing the array. 648 648 Go still has the same poor performance as on Intel. 649 649 … … 725 725 The semaphore version is an approximation of strictly FIFO scheduling, where none of the \ats \emph{attempt} to run more than once. 726 726 The benchmark effectively provides the fairness guarantee in this case. 727 In the yielding version, however the benchmark provides no such guarantee, which means the scheduler has full responsibility and any unfairness is measurable.727 In the yielding version, however, the benchmark provides no such guarantee, which means the scheduler has full responsibility and any unfairness is measurable. 728 728 729 729 While this is an artificial scenario, in real life it requires only a few simple pieces. -
doc/theses/thierry_delisle_PhD/thesis/text/existing.tex
r1c334d1 rfc96890 5 5 6 6 In general, \emph{selecting} a scheduling algorithm depends on how much information is available to the scheduler. 7 Workloads that are well -known, consistent, and homogeneous can benefit from a scheduler that is optimized to use this information, while ill-defined, inconsistent, heterogeneous workloads require general non-optimal algorithms.7 Workloads that are well known, consistent, and homogeneous can benefit from a scheduler that is optimized to use this information, while ill-defined, inconsistent, heterogeneous workloads require general non-optimal algorithms. 8 8 A secondary aspect is how much information can be gathered versus how much information must be given as part of the scheduler input. 9 This information adds to the spectrum of scheduling algorithms, going from static schedulers that are well -informed from the start, to schedulers that gather most of the information needed, to schedulers that can only rely on very limited information.9 This information adds to the spectrum of scheduling algorithms, going from static schedulers that are well informed from the start, to schedulers that gather most of the information needed, to schedulers that can only rely on very limited information. 10 10 Note, this description includes both information about each request, \eg time to complete or resources needed, and information about the relationships among requests, \eg whether some requests must be completed before another request starts. 11 11 … … 77 77 In general, fine granularity is better for load balancing and coarse granularity reduces communication overhead. 78 78 The best performance generally means finding a middle ground between the two. 79 Several methods can be employed, but I believe these are less relevant for threads, which are generally explicit and more coarse 79 Several methods can be employed, but I believe these are less relevant for threads, which are generally explicit and more coarse-grained. 80 80 81 81 \paragraph{Task Placement} Another aspect of work stealing that has been studied extensively is the mapping between \at and \proc. … … 97 97 \cite{DBLP:conf/ipps/ColeR13} also studied how randomized work-stealing affects false sharing among \ats. 98 98 99 However, as \cite{DBLP:journals/ijpp/YangH18} highlights, it is worth mentioning that this theoretical research has mainly focused on ``fully -strict'' computations, \ie workloads that can be fully represented with a direct acyclic graph.100 It is unclear how well these distributions represent workloads in real 99 However, as \cite{DBLP:journals/ijpp/YangH18} highlights, it is worth mentioning that this theoretical research has mainly focused on ``fully strict'' computations, \ie workloads that can be fully represented with a direct acyclic graph. 100 It is unclear how well these distributions represent workloads in real-world scenarios. 101 101 102 102 \section{Preemption} … … 115 115 116 116 \subsection{Operating System Schedulers} 117 Operating System Schedulers tend to be fairly complex as they generally support some amount of real -time, aim to balance interactive and non-interactive \ats and support multiple users sharing hardware without requiring these users to cooperate.117 Operating System Schedulers tend to be fairly complex as they generally support some amount of real time, aim to balance interactive and non-interactive \ats and support multiple users sharing hardware without requiring these users to cooperate. 
118 118 Here are more details on a few schedulers used in the common operating systems: Linux, FreeBSD, Microsoft Windows and Apple's OS X. 119 119 The information is less complete for closed source operating systems. … … 130 130 The issues highlighted stem from Linux's need to support fairness across \ats \emph{and} across users\footnote{Enforcing fairness across users means that given two users, one with a single \at and the other with one thousand \ats, the user with a single \at does not receive one-thousandth of the CPU time.}, increasing the complexity. 131 131 132 Linux also offers a FIFO scheduler, a real-time scheduler, which runs the highest-priority \ats, and a round-robin scheduler, which is an extension of the FIFO -scheduler that adds fixed time slices. \cite{MAN:linux/sched}132 Linux also offers a FIFO scheduler, a real-time scheduler, which runs the highest-priority \ats, and a round-robin scheduler, which is an extension of the FIFO scheduler that adds fixed time slices. \cite{MAN:linux/sched} 133 133 134 134 \paragraph{FreeBSD} … … 144 144 145 145 In~\cite{russinovich2009windows}, Chapter 1 section 2.3 ``Processes, Threads, and Jobs'' discusses the scheduling policy more in-depth. 146 Multicore scheduling is based on a combination of priorities and \proc prefer ance.146 Multicore scheduling is based on a combination of priorities and \proc preference. 147 147 Each \at is assigned an initial processor using a round-robin policy, called the \at's \newterm{ideal} \proc. 148 148 \Glspl{at} are distributed among the \procs according to their priority, preferring to match \ats to their ideal \proc and then to the last \proc they ran on. … … 195 195 \item The task returned by \textit{t}@.execute()@ 196 196 \item The successor of t if \textit{t} was its last completed predecessor. 197 \item A task popped from the end of the thread's own deque.197 \item A task popped from the end of the thread's own queue. 198 198 \item A task with an affinity for the thread. 199 199 \item A task popped from approximately the beginning of the shared queue. 200 \item A task popped from the beginning of another randomly chosen thread's deque.200 \item A task popped from the beginning of another randomly chosen thread's queue. 201 201 \end{enumerate} 202 202 -
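The randomized victim selection at the heart of the work-stealing schedulers surveyed above can be summarized in a short C sketch. The helpers task_t, pop_local, steal_from and the constant NPROCS are assumptions for the illustration and do not correspond to any particular runtime's API.

#include <stdlib.h>

typedef struct task task_t;
extern task_t *pop_local(int self);      // pop from this processor's own queue (assumed helper)
extern task_t *steal_from(int victim);   // try to steal from another queue (assumed helper)
#define NPROCS 16                        // number of processors (placeholder)

// A processor first drains its own queue, and only when that is empty does it
// probe randomly chosen victims, as in the steady-state case described above.
task_t *next_task(int self) {
    task_t *t = pop_local(self);
    if (t) return t;
    for (int tries = 0; tries < 4 * NPROCS; tries++) {
        int victim = rand() % NPROCS;
        if (victim == self) continue;
        t = steal_from(victim);
        if (t) return t;
    }
    return NULL;   // no work found; the caller may go to idle sleep
}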
doc/theses/thierry_delisle_PhD/thesis/text/intro.tex
r1c334d1 rfc96890 3 3 \Gls{uthrding} (M:N) is gaining popularity over kernel-level threading (1:1) in many programming languages. 4 4 The user threading approach is often a better mechanism to express complex concurrent applications by efficiently running 10,000+ threads on multicore systems. 5 Indeed, over-partitioning into small work -units with user threading significantly eases load bal\-ancing, while simultaneously providing advanced synchronization and mutual exclusion capabilities.5 Indeed, over-partitioning into small work units with user threading significantly eases load bal\-ancing, while simultaneously providing advanced synchronization and mutual exclusion capabilities. 6 6 To manage these high levels of concurrency, the underlying runtime must efficiently schedule many user threads across a few kernel threads; 7 7 which begs the question of how many kernel threads are needed and should the number be dynamically reevaluated. … … 11 11 otherwise, other user threads can experience short/long term starvation or kernel threads can deadlock waiting for events to occur on busy kernel threads. 12 12 13 This thesis analy ses multiple scheduler systems, where each system attempts to fulfill the requirements for \gls{uthrding}.14 The predominant technique for managing high levels of concurrency is sharding the ready queue with one queue per kernel -thread and using some form of work stealing/sharing to dynamically rebalance workload shifts.13 This thesis analyzes multiple scheduler systems, where each system attempts to fulfill the requirements for \gls{uthrding}. 14 The predominant technique for managing high levels of concurrency is sharding the ready queue with one queue per kernel thread and using some form of work stealing/sharing to dynamically rebalance workload shifts. 15 15 Preventing kernel blocking is accomplished by transforming kernel locks and I/O operations into user-level operations that do not block the kernel thread or spin up new kernel threads to manage the blocking. 16 Fairness is handled through preemption and/or ad -hoc solutions, which leads to coarse-grained fairness with some pathological cases.16 Fairness is handled through preemption and/or ad hoc solutions, which leads to coarse-grained fairness with some pathological cases. 17 17 18 After examining, testing and selecting specific approaches to these scheduling issues, a completely new scheduler was created and tested in the \CFA (C-for-all) user-threading runtime -system.18 After examining, testing and selecting specific approaches to these scheduling issues, a completely new scheduler was created and tested in the \CFA (C-for-all) user-threading runtime system. 19 19 The goal of the new scheduler is to offer increased safety and productivity without sacrificing performance. 20 20 The quality of the new scheduler is demonstrated by comparing it with other user-threading work-stealing schedulers with, the aim of showing equivalent or better performance while offering better fairness. … … 35 35 In general, work units without threads, like routines and coroutines, are self-scheduling, while work units with threads, like tasks and programs, are scheduled. 36 36 For scheduled work-units, a scheduler takes a sequence of threads and attempts to run them to completion, subject to shared resource restrictions and utilization. 
37 A general-purpose dynamic-scheduler for an open systemcannot anticipate work requests, so its performance is rarely optimal.37 In an open system, a general-purpose dynamic scheduler cannot anticipate work requests, so its performance is rarely optimal. 38 38 Even with complete knowledge of arrival order and work, creating an optimal solution is a bin packing problem~\cite{wiki:binpak}. 39 39 However, optimal solutions are often not required: schedulers often produce excellent solutions, without needing optimality, by taking advantage of regularities in work patterns. … … 53 53 timer alarm for preemption (running $\rightarrow$ ready) 54 54 \item 55 long 55 long-term delay versus spinning (running $\rightarrow$ blocked) 56 56 \item 57 57 completion of delay, \eg network or I/O completion (blocked $\rightarrow$ ready) … … 84 84 85 85 \noindent 86 Essentially, all multi -processor computers have non-uniform memory access (NUMA), with one or more quantized steps to access data at different levels in the memory hierarchy.86 Essentially, all multiprocessor computers have non-uniform memory access (NUMA), with one or more quantized steps to access data at different levels in the memory hierarchy. 87 87 When a system has a large number of independently executing threads, affinity becomes difficult because of \newterm{thread churn}. 88 88 That is, threads must be scheduled on different processors to obtain high processor utilization because the number of threads $\ggg$ processors. … … 120 120 To complete the scheduler, an idle sleep mechanism is implemented that significantly reduces wasted CPU cycles, which are then available outside the application. 121 121 122 As a research project, this work builds exclusively on newer versions of the Linux operating -system and gcc/clang compilers.122 As a research project, this work builds exclusively on newer versions of the Linux operating system and gcc/clang compilers. 123 123 The new scheduler implementation uses several optimizations to successfully balance the cost of fairness against performance; 124 124 some of these optimizations rely on interesting hardware optimizations only present on modern CPUs. 125 The \io implementation is based on the @io_uring@ kernel -interface, a recent addition to the Linux kernel, because it purports to handle nonblocking \emph{file} and network \io.125 The \io implementation is based on the @io_uring@ kernel interface, a recent addition to the Linux kernel, because it purports to handle nonblocking \emph{file} and network \io. 126 126 This decision allowed an interesting performance and fairness comparison with other threading systems using @select@, @epoll@, \etc. 127 127 While the current \CFA release supports older versions of Linux ($\ge$~Ubuntu 16.04) and gcc/clang compilers ($\ge$~gcc 6.0), it is not the purpose of this project to find workarounds in these older systems to provide backwards compatibility. … … 129 129 130 130 \section{Contributions}\label{s:Contributions} 131 This work provides the following scheduling contributions for advanced \gls{uthrding} runtime -systems:131 This work provides the following scheduling contributions for advanced \gls{uthrding} runtime systems: 132 132 \begin{enumerate}[leftmargin=*] 133 133 \item … … 140 140 A mechanism for adding fairness on top of MQMS algorithm through helping, used both for scalable scheduling algorithm and the user-level \glsxtrshort{io}. 
141 141 \item 142 An optimization of the helping -mechanism for load balancing to reduce scheduling costs.142 An optimization of the helping mechanism for load balancing to reduce scheduling costs. 143 143 \item 144 144 An optimization for the alternative relaxed-list for load balancing to reduce scheduling costs in embarrassingly parallel cases. -
doc/theses/thierry_delisle_PhD/thesis/text/io.tex
r1c334d1 rfc96890 53 53 Its interface lets programmers enqueue operations to be performed asynchronously by the kernel. 54 54 Completions of these operations can be communicated in various ways: either by spawning a new \gls{kthrd}, sending a Linux signal, or polling for completion of one or more operations. 55 For this work, spawning a new \gls{kthrd} is counter -productive but a related solution is discussed in Section~\ref{io:morethreads}.55 For this work, spawning a new \gls{kthrd} is counterproductive but a related solution is discussed in Section~\ref{io:morethreads}. 56 56 Using interrupt handlers can also lead to fairly complicated interactions between subsystems and has a non-trivial cost. 57 57 Leaving polling for completion, which is similar to the previous system calls. … … 100 100 101 101 \subsection{Extra Kernel Threads}\label{io:morethreads} 102 Finally, if the operating system does not offer a satisfactory form of asynchronous \io operations, an ad -hoc solution is to create a pool of \glspl{kthrd} and delegate operations to it to avoid blocking \glspl{proc}, which is a compromise for multiplexing.102 Finally, if the operating system does not offer a satisfactory form of asynchronous \io operations, an ad hoc solution is to create a pool of \glspl{kthrd} and delegate operations to it to avoid blocking \glspl{proc}, which is a compromise for multiplexing. 103 103 In the worst case, where all \ats are consistently blocking on \io, it devolves into 1-to-1 threading. 104 104 However, regardless of the frequency of \io operations, it achieves the fundamental goal of not blocking \glspl{proc} when \ats are ready to run. … … 290 290 291 291 Allocation failures need to be pushed to a routing algorithm: \ats attempting \io operations must not be directed to @io_uring@ instances without sufficient SQEs available. 292 Furthermore, the routing algorithm should block operations up -front,if none of the instances have available SQEs.292 Furthermore, the routing algorithm should block operations upfront if none of the instances have available SQEs. 293 293 294 294 Once an SQE is allocated, \ats insert the \io request information and keep track of the SQE index and the instance it belongs to. 295 295 296 Once an SQE is filled in, it is added to the submission ring buffer, an operation that is not thread -safe, and then the kernel must be notified using the @io_uring_enter@ system call.296 Once an SQE is filled in, it is added to the submission ring buffer, an operation that is not thread safe, and then the kernel must be notified using the @io_uring_enter@ system call. 297 297 The submission ring buffer is the same size as the pre-allocated SQE buffer, therefore pushing to the ring buffer cannot fail because it would mean an SQE multiple times in the ring buffer, which is undefined behaviour. 298 298 However, as mentioned, the system call itself can fail with the expectation that it can be retried once some submitted operations are complete. … … 307 307 Once the system call is done, the submitter must also free SQEs so that the allocator can reuse them. 308 308 309 Finally, the completion side is much simpler since the @io_uring@ system -call enforces a natural synchronization point.309 Finally, the completion side is much simpler since the @io_uring@ system call enforces a natural synchronization point. 310 310 Polling simply needs to regularly do the system call, go through the produced CQEs and communicate the result back to the originating \ats. 
311 311 Since CQEs only own a signed 32-bit result, in addition to the copy of the @user_data@ field, all that is needed to communicate the result is a simple future~\cite{wiki:future}. … … 368 368 369 369 \paragraph{Pending Allocations} are handled by the arbiter when it has available instances and can directly hand over the instance and satisfy the request. 370 Otherwise, it must hold on to the list of threads until SQEs are made available again.370 Otherwise, it must hold on to the list of threads until SQEs are made available again. 371 371 This handling is more complex when an allocation requires multiple SQEs, since the arbiter must make a decision between satisfying requests in FIFO ordering or for fewer SQEs. 372 372 … … 383 383 The last important part of the \io subsystem is its interface. 384 384 Multiple approaches can be offered to programmers, each with advantages and disadvantages. 385 The new \io subsystem can replace the C runtime API or extend it, and in the lat er case, the interface can go from very similar to vastly different.385 The new \io subsystem can replace the C runtime API or extend it, and in the latter case, the interface can go from very similar to vastly different. 386 386 The following sections discuss some useful options using @read@ as an example. 387 The standard Linux interface for C is 387 The standard Linux interface for C is: 388 388 \begin{cfa} 389 389 ssize_t read(int fd, void *buf, size_t count); … … 394 394 The goal is to convince the compiler and linker to replace any calls to @read@ to direct them to the \CFA implementation instead of glibc's. 395 395 This rerouting has the advantage of working transparently and supporting existing binaries without needing recompilation. 396 It also offers a, presumably, well -known and familiar API that C programmers can simply continue to work with.396 It also offers a, presumably, well known and familiar API that C programmers can simply continue to work with. 397 397 However, this approach also entails a plethora of subtle technical challenges, which generally boils down to making a perfect replacement. 398 398 If the \CFA interface replaces only \emph{some} of the calls to glibc, then this can easily lead to esoteric concurrency bugs. -
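The allocate/fill/submit/complete cycle described in io.tex maps closely onto liburing's wrappers around the raw @io_uring@ interface. The sketch below performs a single read in that style; the \CFA runtime instead multiplexes many user-level threads over shared instances and hands results back through futures, so this is only a one-operation illustration, and the file name is a placeholder.

#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    struct io_uring ring;
    if (io_uring_queue_init(32, &ring, 0) < 0) return 1;    // pre-allocates 32 SQEs

    int fd = open("/etc/hostname", O_RDONLY);                // placeholder file
    if (fd < 0) return 1;
    char buf[256];

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);      // NULL here is the SQE "allocation failure" case above
    if (!sqe) return 1;
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
    io_uring_sqe_set_data(sqe, buf);                         // user_data routed back on completion
    io_uring_submit(&ring);                                  // the io_uring_enter(2) notification

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);                          // the natural synchronization point
    printf("read returned %d\n", cqe->res);                  // signed 32-bit result, as noted above
    io_uring_cqe_seen(&ring, cqe);                           // release the CQE slot

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}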
doc/theses/thierry_delisle_PhD/thesis/text/practice.tex
r1c334d1 rfc96890 55 55 A simpler approach is to use a \newterm{Readers-Writer Lock}~\cite{wiki:rwlock}, where the resizing requires acquiring the lock as a writer while simply enqueueing/dequeuing \ats requires acquiring the lock as a reader. 56 56 Using a Readers-Writer lock solves the problem of dynamically resizing and leaves the challenge of finding or building a lock with sufficient good read-side performance. 57 Since this approach is not a very complex challenge and an ad -hoc solution is perfectly acceptable, building a Readers-Writer lock was the path taken.57 Since this approach is not a very complex challenge and an ad hoc solution is perfectly acceptable, building a Readers-Writer lock was the path taken. 58 58 59 59 To maximize reader scalability, readers should not contend with each other when attempting to acquire and release a critical section. … … 61 61 The read-acquire possibly waits for a writer to finish the critical section and then acquires a reader's local spinlock. 62 62 The writer acquires the global lock, guaranteeing mutual exclusion among writers, and then acquires each of the local reader locks. 63 Acquiring all the local read 63 Acquiring all the local read-locks guarantees mutual exclusion among the readers and the writer, while the wait on the read side prevents readers from continuously starving the writer. 64 64 Figure~\ref{f:SpecializedReadersWriterLock} shows the outline for this specialized readers-writer lock. 65 65 The lock in nonblocking, so both readers and writers spin while the lock is held. … … 113 113 Therefore, the \CFA scheduler simply follows the ``Race-to-Idle''~\cite{Albers12} approach where a sleeping \proc is woken any time a \at becomes ready and \procs go to idle sleep anytime they run out of work. 114 114 115 An interesting sub -part of this heuristic is what to do with bursts of \ats that become ready.115 An interesting subpart of this heuristic is what to do with bursts of \ats that become ready. 116 116 Since waking up a sleeping \proc can have notable latency, multiple \ats may become ready while a single \proc is waking up. 117 117 This fact begs the question, if many \procs are available, how many should be woken? 118 118 If the ready \ats will run longer than the wake-up latency, waking one \proc per \at will offer maximum parallelization. 119 If the ready \ats will run for a shortvery short time, waking many \procs may be wasteful.119 If the ready \ats will run for a very short time, waking many \procs may be wasteful. 120 120 As mentioned, a heuristic to handle these complex cases is outside the scope of this thesis, the behaviour of the scheduler in this particular case is left unspecified. 121 121 … … 173 173 174 174 This approach also simplifies notification. 175 Indeed, \procs not only need to be notified when a new \at is readied, but also mustbe notified during manual resizing, so the \gls{kthrd} can be joined.175 Indeed, \procs not only need to be notified when a new \at is readied, but must also be notified during manual resizing, so the \gls{kthrd} can be joined. 176 176 These requirements mean whichever entity removes idle \procs from the sleeper list must be able to do so in any order. 177 177 Using a simple lock over this data structure makes the removal much simpler than using a lock-free data structure. -
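A minimal C11 sketch of the specialized readers-writer lock described in practice.tex: each reader spins only on its own cache-line-padded lock word, while a writer takes a global flag and then every per-reader word. The field names, reader-slot count and padding are assumptions for the illustration, not the actual \CFA layout.

#include <stdatomic.h>
#include <stdbool.h>

#define MAX_READERS 128   // upper bound on processors (placeholder)
#define CACHE_LINE  64

struct reader_slot {
    _Alignas(CACHE_LINE) atomic_bool held;   // padded so readers do not share cache lines
};

static struct reader_slot readers[MAX_READERS];
static atomic_bool writer_flag;              // mutual exclusion among writers

static void spin_lock(atomic_bool *l) {
    while (atomic_exchange_explicit(l, true, memory_order_acquire)) { /* spin */ }
}
static void spin_unlock(atomic_bool *l) {
    atomic_store_explicit(l, false, memory_order_release);
}

static void read_lock(int id) {
    for (;;) {
        // wait for any writer first, so readers cannot starve a waiting writer
        while (atomic_load_explicit(&writer_flag, memory_order_acquire)) { /* spin */ }
        spin_lock(&readers[id].held);
        if (!atomic_load_explicit(&writer_flag, memory_order_acquire)) return;
        spin_unlock(&readers[id].held);      // a writer slipped in: back off and retry
    }
}
static void read_unlock(int id) { spin_unlock(&readers[id].held); }

static void write_lock(void) {
    spin_lock(&writer_flag);                 // writers serialize on the global flag
    for (int i = 0; i < MAX_READERS; i++)
        spin_lock(&readers[i].held);         // then take every reader's local lock
}
static void write_unlock(void) {
    for (int i = 0; i < MAX_READERS; i++)
        spin_unlock(&readers[i].held);
    spin_unlock(&writer_flag);
}

In the uncontended case a reader touches only the shared writer flag (read-only) and its own lock word, which is what gives the read side the scalability the hunk above asks for.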
doc/theses/thierry_delisle_PhD/thesis/text/runtime.tex
r1c334d1 rfc96890 17 17 The \CFA M:N threading model is implemented using many user-level threads mapped onto fewer \glspl{kthrd}. 18 18 The user-level threads have the same semantic meaning as a \glspl{kthrd} in the 1:1 model: they represent an independent thread of execution with its own stack. 19 The difference is that user-level threads do not have a corresponding object in the kernel; they are handled by the runtime in user space and scheduled onto \glspl{kthrd}, referred to as \glspl{proc} in this document. \Glspl{proc} run a \at until it context -switches out, it then chooses a different \at to run.19 The difference is that user-level threads do not have a corresponding object in the kernel; they are handled by the runtime in user space and scheduled onto \glspl{kthrd}, referred to as \glspl{proc} in this document. \Glspl{proc} run a \at until it context switches out, it then chooses a different \at to run. 20 20 21 21 \section{Clusters} … … 60 60 All functions defined by this volume of POSIX.1-2017 shall be thread-safe, except that the following functions need not be thread-safe. ... (list of 70+ excluded functions) 61 61 \end{quote} 62 Only UNIX @man@ pages identify whether or nota library function is thread-safe, and hence, may block on a pthreads lock or system call; hence interoperability with UNIX library functions is a challenge for an M:N threading model.62 Only UNIX @man@ pages identify whether a library function is thread-safe, and hence, may block on a pthreads lock or system call; hence interoperability with UNIX library functions is a challenge for an M:N threading model. 63 63 64 64 Languages like Go and Java, which have strict interoperability with C\cite{wiki:jni,go:cgo}, can control operations in C by ``sandboxing'' them, \eg a blocking function may be delegated to a \gls{kthrd}. Sandboxing may help towards guaranteeing that the kind of deadlock mentioned above does not occur.