Changeset 71cf630 for doc/theses/thierry_delisle_PhD/thesis/text
- Timestamp:
- Aug 16, 2022, 4:04:47 PM (4 years ago)
- Branches:
- ADT, ast-experimental, master, pthread-emulation, stuck-waitfor-destruct
- Children:
- aec2c022
- Parents:
- 741e22c (diff), 17c6edeb (diff)
Note: this is a merge changeset, the changes displayed below correspond to the merge itself.
Use the(diff)links above to see all the changes relative to each parent. - Location:
- doc/theses/thierry_delisle_PhD/thesis/text
- Files:
-
- 1 added
- 9 edited
-
conclusion.tex (added)
-
core.tex (modified) (6 diffs)
-
eval_macro.tex (modified) (6 diffs)
-
eval_micro.tex (modified) (20 diffs)
-
existing.tex (modified) (6 diffs)
-
front.tex (modified) (3 diffs)
-
intro.tex (modified) (5 diffs)
-
io.tex (modified) (4 diffs)
-
practice.tex (modified) (4 diffs)
-
runtime.tex (modified) (2 diffs)
Legend:
- Unmodified
- Added
- Removed
-
doc/theses/thierry_delisle_PhD/thesis/text/core.tex
r741e22c r71cf630 15 15 For threading, a simple and common execution mental-model is the ``Ideal multi-tasking CPU'' : 16 16 17 \begin{displayquote}[Linux CFS\cit {https://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt}]17 \begin{displayquote}[Linux CFS\cite{MAN:linux/cfs}] 18 18 {[The]} ``Ideal multi-tasking CPU'' is a (non-existent :-)) CPU that has 100\% physical power and which can run each task at precise equal speed, in parallel, each at [an equal fraction of the] speed. For example: if there are 2 tasks running, then it runs each at 50\% physical power --- i.e., actually in parallel. 19 19 \label{q:LinuxCFS} … … 183 183 This suggests to the following approach: 184 184 185 \subsection{Dynamic Entropy}\cit {https://xkcd.com/2318/}185 \subsection{Dynamic Entropy}\cite{xkcd:dynamicentropy} 186 186 The Relaxed-FIFO approach can be made to handle the case of mostly empty subqueues by tweaking the \glsxtrlong{prng}. 187 187 The \glsxtrshort{prng} state can be seen as containing a list of all the future subqueues that will be accessed. 188 188 While this concept is not particularly useful on its own, the consequence is that if the \glsxtrshort{prng} algorithm can be run \emph{backwards}, then the state also contains a list of all the subqueues that were accessed. 189 Luckily, bidirectional \glsxtrshort{prng} algorithms do exist, \eg some Linear Congruential Generators\cit {https://en.wikipedia.org/wiki/Linear\_congruential\_generator} support running the algorithm backwards while offering good quality and performance.189 Luckily, bidirectional \glsxtrshort{prng} algorithms do exist, \eg some Linear Congruential Generators\cite{wiki:lcg} support running the algorithm backwards while offering good quality and performance. 190 190 This particular \glsxtrshort{prng} can be used as follows: 191 191 \begin{itemize} … … 208 208 The alternative is to do it the other way around. 209 209 210 \section{Work Stealing++} 210 \section{Work Stealing++}\label{helping} 211 211 To add stronger fairness guarantees to work stealing a few changes are needed. 212 212 First, the relaxed-FIFO algorithm has fundamentally better fairness because each \proc always monitors all subqueues. … … 220 220 \input{base.pstex_t} 221 221 \caption[Base \CFA design]{Base \CFA design \smallskip\newline A pool of subqueues offers the sharding, two per \glspl{proc}. 222 Each \gls{proc} can access all of the subqueues. 222 Each \gls{proc} can access all of the subqueues. 223 223 Each \at is timestamped when enqueued.} 224 224 \label{fig:base} … … 245 245 \end{figure} 246 246 247 A simple solution to this problem is to use an exponential moving average\cit {https://en.wikipedia.org/wiki/Moving\_average\#Exponential\_moving\_average} (MA) instead of a raw timestamps, shown in Figure~\ref{fig:base-ma}.247 A simple solution to this problem is to use an exponential moving average\cite{wiki:ma} (MA) instead of a raw timestamps, shown in Figure~\ref{fig:base-ma}. 248 248 Note, this is more complex because the \at at the head of a subqueue is still waiting, so its wait time has not ended. 249 249 Therefore, the exponential moving average is actually an exponential moving average of how long each dequeued \at has waited. … … 261 261 The good news is that this problem can be mitigated 262 262 263 \subsection{Redundant Timestamps} 263 \subsection{Redundant Timestamps}\ref{relaxedtimes} 264 264 The problem with polling remote subqueues is that correctness is critical. 265 265 There must be a consensus among \procs on which subqueues hold which \ats, as the \ats are in constant motion. -
doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex
r741e22c r71cf630 10 10 11 11 \section{Memcached} 12 Memcached~\cit{memcached} is an in memory key-value store that is used in many production environments, \eg \cit{Berk Atikoglu et al., Workload Analysis of a Large-Scale Key-Value Store, 13 SIGMETRICS 2012}. 14 This also server also has the notable added benefit that there exists a full-featured front-end for performance testing called @mutilate@~\cit{mutilate}. 12 Memcached~\cite{memcached} is an in memory key-value store that is used in many production environments, \eg \cite{atikoglu2012workload}. 13 This also server also has the notable added benefit that there exists a full-featured front-end for performance testing called @mutilate@~\cite{GITHUB:mutilate}. 15 14 Experimenting on memcached allows for a simple test of the \CFA runtime as a whole, it will exercise the scheduler, the idle-sleep mechanism, as well the \io subsystem for sockets. 16 15 This experiment does not exercise the \io subsytem with regards to disk operations. 17 The experiments compare 3 different varitions of memcached:18 \begin{itemize}19 \item \emph{vanilla}: the official release of memcached, version~1.6.9.20 \item \emph{fibre}: a modification of vanilla which uses the thread per connection model on top of the libfibre runtime~\cite{DBLP:journals/pomacs/KarstenB20}.21 \item \emph{cfa}: a modification of the fibre webserver that replaces the libfibre runtime with \CFA.22 \end{itemize}23 16 24 17 \subsection{Benchmark Environment} … … 31 24 The network route uses 1 Mellanox SX1012 10/40 Gigabit Ethernet cluster switch. 32 25 33 \subsection{Throughput} 26 \subsection{Memcached with threads per connection} 27 Comparing against memcached using a user-level runtime only really make sense if the server actually uses this threading model. 28 Indeed, evaluating a user-level runtime with 1 \at per \proc is not meaningful since it does not exercise the runtime, it simply adds some overhead to the underlying OS scheduler. 29 30 One approach is to use a webserver that uses a thread-per-connection model, where each incoming connection is served by a single \at in a strict 1-to-1 pairing. 31 This models adds flexibility to the implementation, as the serving logic can now block on user-level primitives without affecting other connections. 32 33 Memcached is not built according to a thread-per-connection model, but there exists a port of it that is, which was built for libfibre in \cite{DBLP:journals/pomacs/KarstenB20}. 34 Therefore this version can both be compared to the original version and to a port to the \CFA runtime. 35 36 As such, this memcached experiment compares 3 different varitions of memcached: 37 \begin{itemize} 38 \item \emph{vanilla}: the official release of memcached, version~1.6.9. 39 \item \emph{fibre}: a modification of vanilla which uses the thread per connection model on top of the libfibre runtime~\cite{DBLP:journals/pomacs/KarstenB20}. 40 \item \emph{cfa}: a modification of the fibre webserver that replaces the libfibre runtime with \CFA. 41 \end{itemize} 42 43 \subsection{Throughput} \label{memcd:tput} 34 44 \begin{figure} 35 45 \centering … … 62 72 \begin{figure} 63 73 \centering 64 \input{result.memcd.updt.qps.pstex_t} 65 \caption[Churn Benchmark : Throughput on Intel]{Churn Benchmark : Throughput on Intel\smallskip\newline Description} 66 \label{fig:memcd:updt:qps} 67 \end{figure} 68 69 \begin{figure} 70 \centering 71 \input{result.memcd.updt.lat.pstex_t} 72 \caption[Churn Benchmark : Throughput on Intel]{Churn Benchmark : Throughput on Intel\smallskip\newline Description} 73 \label{fig:memcd:updt:lat} 74 \subfloat[][Throughput]{ 75 \input{result.memcd.forall.qps.pstex_t} 76 } 77 78 \subfloat[][Latency]{ 79 \input{result.memcd.forall.lat.pstex_t} 80 } 81 \caption[forall Latency results at different update rates]{forall Latency results at different update rates\smallskip\newline Description} 82 \label{fig:memcd:updt:forall} 83 \end{figure} 84 85 \begin{figure} 86 \centering 87 \subfloat[][Throughput]{ 88 \input{result.memcd.fibre.qps.pstex_t} 89 } 90 91 \subfloat[][Latency]{ 92 \input{result.memcd.fibre.lat.pstex_t} 93 } 94 \caption[fibre Latency results at different update rates]{fibre Latency results at different update rates\smallskip\newline Description} 95 \label{fig:memcd:updt:fibre} 96 \end{figure} 97 98 \begin{figure} 99 \centering 100 \subfloat[][Throughput]{ 101 \input{result.memcd.vanilla.qps.pstex_t} 102 } 103 104 \subfloat[][Latency]{ 105 \input{result.memcd.vanilla.lat.pstex_t} 106 } 107 \caption[vanilla Latency results at different update rates]{vanilla Latency results at different update rates\smallskip\newline Description} 108 \label{fig:memcd:updt:vanilla} 74 109 \end{figure} 75 110 … … 79 114 The memcached experiment has two aspects of the \io subsystem it does not exercise, accepting new connections and interacting with disks. 80 115 On the other hand, static webservers, servers that offer static webpages, do stress disk \io since they serve files from disk\footnote{Dynamic webservers, which construct pages as they are sent, are not as interesting since the construction of the pages do not exercise the runtime in a meaningfully different way.}. 81 The static webserver experiments will compare NGINX with a custom webserver developped for this experiment. 116 The static webserver experiments will compare NGINX~\cit{nginx} with a custom webserver developped for this experiment. 117 118 \subsection{\CFA webserver} 119 Unlike the memcached experiment, the webserver experiment relies on a custom designed webserver. 120 It is a simple thread-per-connection webserver where a fixed number of \ats are created upfront. 121 Each of the \at calls @accept@, through @io_uring@, on the listening port and handle the incomming connection once accepted. 122 Most of the implementation is fairly straight forward however the inclusion of file \io introduces a new challenge that had to be hacked around. 123 124 Normally, webservers use @sendfile@\cite{MAN:sendfile} to send files over the socket. 125 @io_uring@ does not support @sendfile@, it supports @splice@\cite{MAN:splice} instead, which is strictly more powerful. 126 However, because of how linux implements file \io, see Subsection~\ref{ononblock}, @io_uring@'s implementation must delegate calls to splice to worker threads inside the kernel. 127 As of Linux 5.13, @io_uring@ caps the numer of these worker threads to @RLIMIT_NPROC@ and therefore, when tens of thousands of splice requests are made, it can create tens of thousands of \glspl{kthrd}. 128 Such a high number of \glspl{kthrd} is more than Linux can handle in this scenario so performance suffers significantly. 129 For this reason, the \CFA webserver calls @sendfile@ directly. 130 This approach works up to a certain point, but once the server approaches saturation, it leads to a new problem. 131 132 When the saturation point of the server is attained, latency will increase and inevitably some client connections will timeout. 133 As these clients close there connections, the server must close these sockets without delay so the OS can reclaim the resources used by these connections. 134 Indeed, until they are closed on the server end, the connection will linger in the CLOSE-WAIT tcp state~\cite{rfc:tcp} and the tcp buffers will be preserved. 135 However, this poses a problem using blocking @sendfile@ calls. 136 The calls can block if they do not have suffcient memory, which can be caused by having too many connections in the CLOSE-WAIT state. 137 Since blocking in calls to @sendfile@ blocks the \proc rather than the \at, this prevents other connections from closing their sockets. 138 This leads to a vicious cycle where timeouts lead to @sendfile@ calls running out of resources, which lead to more timeouts. 139 140 Normally, this is address by marking the sockets as non-blocking and using @epoll@ to wait for sockets to have sufficient resources. 141 However, since @io_uring@ respects non-blocking semantics marking all sockets as non-blocking effectively circumvents the @io_uring@ subsystem entirely. 142 For this reason, the \CFA webserver sets and resets the @O_NONBLOCK@ flag before and after any calls to @sendfile@. 143 Normally @epoll@ would also be used when these calls to @sendfile@ return @EAGAIN@, but since this would not help in the evaluation of the \CFA runtime, the \CFA webserver simply yields and retries in these cases. 144 145 It is important to state that in Linux 5.15 @io_uring@ introduces the ability for users to limit the number of worker threads that are created, through the @IORING_REGISTER_IOWQ_MAX_WORKERS@ option. 146 However, as of writing this document Ubuntu does not have a stable release of Linux 5.15. 147 There exists versions of the kernel that are currently under testing, but these caused unrelated but nevertheless prohibitive issues in this experiment. 148 Presumably, the new kernel would remove the need for the hack described above, as it would allow connections in the CLOSE-WAIT state to be closed even while the calls to @splice@/@sendfile@ are underway. 149 However, since this could not be tested, this is purely a conjecture at this point. 82 150 83 151 \subsection{Benchmark Environment} … … 87 155 These CPUs has only 8 \glspl{hthrd} enabled by grub, which is sufficient to achieve line rate. 88 156 This cpus each have 64 KB, 256 KiB and 8 MB of L1, L2 and L3 caches respectively. 157 The kernel is setup to limit the memory at 25Gb. 89 158 90 159 The client machines each have two 2.8 GHz Xeon CPUs, and four one-gigabit Ethernet cards. … … 95 164 \todo{switch} 96 165 97 98 99 166 \subsection{Throughput} 100 \begin{figure} 101 \centering 102 \input{result.swbsrv.25gb.pstex_t} 103 \caption[Static Webserver Benchmark : Throughput]{Static Webserver Benchmark : Throughput\smallskip\newline } 167 To measure the throughput of both webservers, each server is loaded with over 30,000 files making over 4.5 Gigabytes in total. 168 Each client runs httperf~\cit{httperf} which establishes a connection, does an http request for one or more files, closes the connection and repeats the process. 169 The connections and requests are made according to a Zipfian distribution~\cite{zipf}. 170 Throughput is measured by aggregating the results from httperf of all the clients. 171 \begin{figure} 172 \subfloat[][Throughput]{ 173 \input{result.swbsrv.25gb.pstex_t} 174 \label{fig:swbsrv:ops} 175 } 176 177 \subfloat[][Rate of Errors]{ 178 \input{result.swbsrv.25gb.err.pstex_t} 179 \label{fig:swbsrv:err} 180 } 181 \caption[Static Webserver Benchmark : Throughput]{Static Webserver Benchmark : Throughput\smallskip\newline Throughput vs request rate for short lived connections connections.} 104 182 \label{fig:swbsrv} 105 183 \end{figure} 106 107 Networked ZIPF 108 109 Nginx : 5Gb still good, 4Gb starts to suffer 110 111 Cforall : 10Gb too high, 4 Gb too low 184 Figure~\ref{fig:swbsrv} shows the results comparing \CFA to NGINX in terms of throughput. 185 These results are fairly straight forward. 186 Both servers achieve the same throughput until around 57,500 requests per seconds. 187 Since the clients are asking for the same files, the fact that the throughput matches exactly is expected as long as both servers are able to serve the desired rate. 188 Once the saturation point is reached, both servers are still very close. 189 NGINX achieves slightly better throughtput. 190 However, Figure~\ref{fig:swbsrv:err} shows the rate of errors, a gross approximation of tail latency, where \CFA achives notably fewet errors once the machine reaches saturation. 191 This suggest that \CFA is slightly more fair and NGINX may sloghtly sacrifice some fairness for improved throughtput. 192 It demonstrate that the \CFA webserver described above is able to match the performance of NGINX up-to and beyond the saturation point of the machine. 193 194 \subsection{Disk Operations} 195 The throughput was made using a server with 25gb of memory, this was sufficient to hold the entire fileset in addition to all the code and data needed to run the webserver and the reste of the machine. 196 Previous work like \cit{Cite Ashif's stuff} demonstrate that an interesting follow-up experiment is to rerun the same throughput experiment but allowing significantly less memory on the machine. 197 If the machine is constrained enough, it will force the OS to evict files from the file cache and cause calls to @sendfile@ to have to read from disk. 198 However, what these low memory experiments demonstrate is how the memory footprint of the webserver affects the performance. 199 However, since what I am to evaluate in this thesis is the runtime of \CFA, I diceded to forgo experiments on low memory server. 200 The implementation of the webserver itself is simply too impactful to be an interesting evaluation of the underlying runtime. -
doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex
r741e22c r71cf630 4 4 This chapter presents five different experimental setup, evaluating some of the basic features of \CFA's scheduler. 5 5 6 \section{Benchmark Environment} 6 \section{Benchmark Environment}\label{microenv} 7 7 All benchmarks are run on two distinct hardware platforms. 8 8 \begin{description} … … 32 32 \centering 33 33 \input{cycle.pstex_t} 34 \caption[Cycle benchmark]{Cycle benchmark\smallskip\newline Each \ gls{at} unparks the next \gls{at}in the cycle before parking itself.}34 \caption[Cycle benchmark]{Cycle benchmark\smallskip\newline Each \at unparks the next \at in the cycle before parking itself.} 35 35 \label{fig:cycle} 36 36 \end{figure} 37 37 The most basic evaluation of any ready queue is to evaluate the latency needed to push and pop one element from the ready queue. 38 38 Since these two operation also describe a @yield@ operation, many systems use this operation as the most basic benchmark. 39 However, yielding can be treated as a special case by optimizing it away since the number of ready \glspl{at}does not change.40 Not all systems perform this optimization, but those that do have an artificial performance benefit because the yield becomes a \emph{nop}.39 However, yielding can be treated as a special case and some aspects of the scheduler can be optimized away since the number of ready \ats does not change. 40 Not all systems perform this type of optimization, but those that do have an artificial performance benefit because the yield becomes a \emph{nop}. 41 41 For this reason, I chose a different first benchmark, called \newterm{Cycle Benchmark}. 42 This benchmark arranges a number of \ glspl{at}into a ring, as seen in Figure~\ref{fig:cycle}, where the ring is a circular singly-linked list.43 At runtime, each \ gls{at} unparks the next \gls{at}before parking itself.44 Unparking the next \ gls{at} pushes that \gls{at} onto the ready queue as does the ensuing park.45 46 Hence, the underlying runtime cannot rely on the number of ready \ glspl{at}staying constant over the duration of the experiment.47 In fact, the total number of \ glspl{at} waiting on the ready queue is expected to vary because of the race between the next \gls{at} unparking and the current \gls{at}parking.42 This benchmark arranges a number of \ats into a ring, as seen in Figure~\ref{fig:cycle}, where the ring is a circular singly-linked list. 43 At runtime, each \at unparks the next \at before parking itself. 44 Unparking the next \at pushes that \at onto the ready queue while the ensuing park leads to a \at being popped from the ready queue. 45 46 Hence, the underlying runtime cannot rely on the number of ready \ats staying constant over the duration of the experiment. 47 In fact, the total number of \ats waiting on the ready queue is expected to vary because of the delay between the next \at unparking and the current \at parking. 48 48 That is, the runtime cannot anticipate that the current task will immediately park. 49 As well, the size of the cycle is also decided based on this race, \eg a small cycle may see the chain of unparks go full circle before the first \gls{at} parks because of time-slicing or multiple \procs. 50 Every runtime system must handle this race and cannot optimized away the ready-queue pushes and pops. 51 To prevent any attempt of silently omitting ready-queue operations, the ring of \glspl{at} is made big enough so the \glspl{at} have time to fully park before being unparked again. 52 (Note, an unpark is like a V on a semaphore, so the subsequent park (P) may not block.) 49 As well, the size of the cycle is also decided based on this delay. 50 Note that, an unpark is like a V on a semaphore, so the subsequent park (P) may not block. 51 If this happens, the scheduler push and pop are avoided and the results of the experiment would be skewed. 52 Because of time-slicing or because cycles can be spread over multiple \procs, a small cycle may see the chain of unparks go full circle before the first \at parks. 53 Every runtime system must handle this race and but cannot optimized away the ready-queue pushes and pops if the cycle is long enough. 54 To prevent any attempt of silently omitting ready-queue operations, the ring of \ats is made big enough so the \ats have time to fully park before being unparked again. 53 55 Finally, to further mitigate any underlying push/pop optimizations, especially on SMP machines, multiple rings are created in the experiment. 54 55 To avoid this benchmark being affected by idle-sleep handling, the number of rings is multiple times greater than the number of \glspl{proc}.56 This design avoids the case where one of the \glspl{proc} runs out of work because of the variation on the number of ready \glspl{at} mentioned above.57 56 58 57 Figure~\ref{fig:cycle:code} shows the pseudo code for this benchmark. … … 64 63 count := 0 65 64 for { 65 @this.next.wake()@ 66 66 @wait()@ 67 @this.next.wake()@68 67 count ++ 69 68 if must_stop() { break } … … 103 102 \label{fig:cycle:jax:low:ns} 104 103 } 105 \caption[Cycle Benchmark on Intel]{Cycle Benchmark on Intel\smallskip\newline Throughput and Scalability as a function of \proc count 5 \ats per cycle and different cycle count. For Throughput higher is better, for Scalability lower is better. }104 \caption[Cycle Benchmark on Intel]{Cycle Benchmark on Intel\smallskip\newline Throughput and Scalability as a function of \proc count 5 \ats per cycle and different cycle count. For Throughput higher is better, for Scalability lower is better. Each series represent 15 independent runs, the dotted lines are extremums while the solid line is the medium.} 106 105 \label{fig:cycle:jax} 107 106 \end{figure} … … 125 124 \input{result.cycle.nasus.ns.pstex_t} 126 125 } 127 126 \label{fig:cycle:nasus:ns} 128 127 } 129 128 \subfloat[][Scalability, 1 cycle per \proc]{ … … 133 132 \label{fig:cycle:nasus:low:ns} 134 133 } 135 \caption[Cycle Benchmark on AMD]{Cycle Benchmark on AMD\smallskip\newline Throughput and Scalability as a function of \proc count 5 \ats per cycle and different cycle count. For Throughput higher is better, for Scalability lower is better. }134 \caption[Cycle Benchmark on AMD]{Cycle Benchmark on AMD\smallskip\newline Throughput and Scalability as a function of \proc count 5 \ats per cycle and different cycle count. For Throughput higher is better, for Scalability lower is better. Each series represent 15 independent runs, the dotted lines are extremums while the solid line is the medium.} 136 135 \label{fig:cycle:nasus} 137 136 \end{figure} 138 137 Figure~\ref{fig:cycle:jax} and Figure~\ref{fig:cycle:nasus} shows the throughput as a function of \proc count on Intel and AMD respectively, where each cycle has 5 \ats. 139 138 The graphs show traditional throughput on the top row and \newterm{scalability} on the bottom row. 140 Where scalability uses the same data but the Y axis is calculated as th roughput over the number of \procs.139 Where scalability uses the same data but the Y axis is calculated as the number of \procs over the throughput. 141 140 In this representation, perfect scalability should appear as a horizontal line, \eg, if doubling the number of \procs doubles the throughput, then the relation stays the same. 142 141 The left column shows results for 100 cycles per \proc, enough cycles to always keep every \proc busy. … … 144 143 The distinction is meaningful because the idle sleep subsystem is expected to matter only in the right column, where spurious effects can cause a \proc to run out of work temporarily. 145 144 145 The experiment was run 15 times for each series and processor count and the \emph{$\times$}s on the graph show all of the results obtained. 146 Each series also has a solid and two dashed lines highlighting the median, maximum and minimum result respectively. 147 This presentation offers an overview of the distribution of the results for each series. 148 149 The experimental setup uses taskset to limit the placement of \glspl{kthrd} by the operating system. 150 As mentioned in Section~\ref{microenv}, the experiement is setup to prioritize running on 2 \glspl{hthrd} per core before running on multiple sockets. 151 For the Intel machine, this means that from 1 to 24 \procs, one socket and \emph{no} hyperthreading is used and from 25 to 48 \procs, still only one socket but \emph{with} hyperthreading. 152 This pattern is repeated between 49 and 96, between 97 and 144, and between 145 and 192. 153 On AMD, the same algorithm is used, but the machine only has 2 sockets. 154 So hyperthreading\footnote{Hyperthreading normally refers specifically to the technique used by Intel, however here it is loosely used to refer to AMD's equivalent feature.} is used when the \proc count reach 65 and 193. 155 156 Figure~\ref{fig:cycle:jax:ops} and Figure~\ref{fig:cycle:jax:ns} show that for 100 cycles per \proc, \CFA, Go and Tokio all obtain effectively the same performance. 157 Libfibre is slightly behind in this case but still scales decently. 158 As a result of the \gls{kthrd} placement, we can see that additional \procs from 25 to 48 offer less performance improvements for all runtimes. 159 As expected, this pattern repeats between \proc count 72 and 96. 146 160 The performance goal of \CFA is to obtain equivalent performance to other, less fair schedulers and that is what results show. 147 161 Figure~\ref{fig:cycle:jax:ops} and \ref{fig:cycle:jax:ns} show very good throughput and scalability for all runtimes. 148 The experimental setup prioritizes running on 2 \glspl{hthrd} per core before running on multiple sockets. 149 The effect of that setup is seen from 25 to 48 \procs, running on 24 core with 2 \glspl{hthrd} per core. 150 This effect is again repeated from 73 and 96 \procs, where it happens on the second CPU. 151 When running only a single cycle, most runtime achieve lower throughput because of the idle-sleep mechanism. 152 In Figure~\ref{fig:cycle:jax:ops} and \ref{fig:cycle:jax:ns} 153 154 Figure~\ref{fig:cycle:nasus} show effectively the same story happening on AMD as it does on Intel. 155 The different performance bumps due to cache topology happen at different locations and there is a little more variability. 156 However, in all cases \CFA is still competitive with other runtimes. 157 162 163 When running only a single cycle, the story is slightly different. 164 \CFA and tokio obtain very smiliar results overall, but tokio shows notably more variations in the results. 165 While \CFA, Go and tokio achive equivalent performance with 100 cycles per \proc, with only 1 cycle per \proc Go achieves slightly better performance. 166 This difference in throughput and scalability is due to the idle-sleep mechanism. 167 With very few cycles, stealing or helping can cause a cascade of tasks migration and trick \proc into very short idle sleeps. 168 Both effect will negatively affect performance. 169 170 An interesting and unusual result is that libfibre achieves better performance with fewer cycle. 171 This suggest that the cascade effect is never present in libfibre and that some bottleneck disappears in this context. 172 However, I did not investigate this result any deeper. 173 174 Figure~\ref{fig:cycle:nasus} show a similar story happening on AMD as it does on Intel. 175 The different performance improvements and plateaus due to cache topology appear at the expected \proc counts of 64, 128 and 192, for the same reasons as on Intel. 176 Unlike Intel, on AMD all 4 runtimes achieve very similar throughput and scalability for 100 cycles per \proc. 177 178 In the 1 cycle per \proc experiment, the same performance increase for libfibre is visible. 179 However, unlike on Intel, tokio achieves the same performance as Go rather than \CFA. 180 This leaves \CFA trailing behind in this particular case, but only at hight core counts. 181 Presumably this is because in this case, \emph{any} helping is likely to cause a cascade of \procs running out of work and attempting to steal. 182 Since this effect is only problematic in cases with 1 \at per \proc it is not very meaningful for the general performance. 183 184 The conclusion from both architectures is that all of the compared runtime have fairly equivalent performance in this scenario. 185 Which demonstrate that in this case \CFA achieves equivalent performance. 158 186 159 187 \section{Yield} 160 188 For completion, the classic yield benchmark is included. 161 This benchmark is simpler than the cycle test: it creates many \ glspl{at}that call @yield@.189 This benchmark is simpler than the cycle test: it creates many \ats that call @yield@. 162 190 As mentioned, this benchmark may not be representative because of optimization shortcuts in @yield@. 163 The only interesting variable in this benchmark is the number of \ glspl{at} per \glspl{proc}, where ratios close to 1 means the ready queue(s) can be empty.191 The only interesting variable in this benchmark is the number of \ats per \procs, where ratios close to 1 means the ready queue(s) can be empty. 164 192 This scenario can put a strain on the idle-sleep handling compared to scenarios where there is plenty of work. 165 193 Figure~\ref{fig:yield:code} shows pseudo code for this benchmark, where the @wait/next.wake@ is replaced by @yield@. … … 208 236 \label{fig:yield:jax:low:ns} 209 237 } 210 \caption[Yield Benchmark on Intel]{Yield Benchmark on Intel\smallskip\newline Throughput and Scalability as a function of \proc count, using 1 \ats per \proc. For Throughput higher is better, for Scalability lower is better. }238 \caption[Yield Benchmark on Intel]{Yield Benchmark on Intel\smallskip\newline Throughput and Scalability as a function of \proc count, using 1 \ats per \proc. For Throughput higher is better, for Scalability lower is better. Each series represent 15 independent runs, the dotted lines are extremums while the solid line is the medium.} 211 239 \label{fig:yield:jax} 212 240 \end{figure} … … 230 258 \input{result.yield.nasus.ns.pstex_t} 231 259 } 232 260 \label{fig:yield:nasus:ns} 233 261 } 234 262 \subfloat[][Scalability, 1 \at per \proc]{ … … 238 266 \label{fig:yield:nasus:low:ns} 239 267 } 240 \caption[Yield Benchmark on AMD]{Yield Benchmark on AMD\smallskip\newline Throughput and Scalability as a function of \proc count, using 1 \ats per \proc. For Throughput higher is better, for Scalability lower is better. }268 \caption[Yield Benchmark on AMD]{Yield Benchmark on AMD\smallskip\newline Throughput and Scalability as a function of \proc count, using 1 \ats per \proc. For Throughput higher is better, for Scalability lower is better. Each series represent 15 independent runs, the dotted lines are extremums while the solid line is the medium.} 241 269 \label{fig:yield:nasus} 242 270 \end{figure} 243 244 Figure~\ref{fig:yield:jax} shows the throughput as a function of \proc count, where each run uses 100 \ats per \proc. 271 Figure~\ref{fig:yield:jax} shows the throughput as a function of \proc count on Intel. 245 272 It is fairly obvious why I claim this benchmark is more artificial. 246 273 The throughput is dominated by the mechanism used to handle the @yield@. 247 \CFA does not have special handling for @yield@ and achieves very similar performance to the cycle benchmark. 248 Libfibre uses the fact that @yield@ doesn't change the number of ready fibres and by-passes the idle-sleep mechanism entirely, producing significantly better throughput. 249 Go puts yielding goroutines on a secondary global ready-queue, giving them lower priority. 250 The result is that multiple \glspl{hthrd} contend for the global queue and performance suffers drastically. 251 Based on the scalability, Tokio obtains the same poor performance and therefore it is likely it handles @yield@ in a similar fashion. 274 \CFA does not have special handling for @yield@ but the experiment requires less synchronization. 275 As a result achieves better performance than the cycle benchmark, but still comparable. 252 276 253 277 When the number of \ats is reduce to 1 per \proc, the cost of idle sleep also comes into play in a very significant way. 254 If anything causes a \at migration, where two \ats end-up on the same ready-queue, work-stealing will start occuring and c ause every \atto shuffle around.278 If anything causes a \at migration, where two \ats end-up on the same ready-queue, work-stealing will start occuring and could cause several \ats to shuffle around. 255 279 In the process, several \procs can go to sleep transiently if they fail to find where the \ats were shuffled to. 256 280 In \CFA, spurious bursts of latency can trick a \proc into helping, triggering this effect. 257 However, since user-level threading with equal number of \ats and \procs is a somewhat degenerate case, especially when ctxswitching very often, this result is not particularly meaningful and is only included for completness. 281 However, since user-level threading with equal number of \ats and \procs is a somewhat degenerate case, especially when context-switching very often, this result is not particularly meaningful and is only included for completness. 282 283 Libfibre uses the fact that @yield@ doesn't change the number of ready fibres and by-passes the idle-sleep mechanism entirely, producing significantly better throughput. 284 Additionally, when only running 1 \at per \proc, libfibre optimizes further and forgoes the context-switch entirely. 285 This results in incredible performance results comparing to the other runtimes. 286 287 In stark contrast with libfibre, Go puts yielding goroutines on a secondary global ready-queue, giving them lower priority. 288 The result is that multiple \glspl{hthrd} contend for the global queue and performance suffers drastically. 289 Based on the scalability, Tokio obtains the similarly poor performance and therefore it is likely it handles @yield@ in a similar fashion. 290 However, it must be doing something different since it does scale at low \proc count. 258 291 259 292 Again, Figure~\ref{fig:yield:nasus} show effectively the same story happening on AMD as it does on Intel. … … 265 298 \section{Churn} 266 299 The Cycle and Yield benchmark represent an \emph{easy} scenario for a scheduler, \eg an embarrassingly parallel application. 267 In these benchmarks, \ glspl{at} can be easily partitioned over the different \glspl{proc} upfront and none of the \glspl{at}communicate with each other.268 269 The Churn benchmark represents more chaotic execution , where there is no relation between the last \gls{proc} on which a \gls{at} ran and blocked and the \gls{proc}that subsequently unblocks it.270 With processor-specific ready-queues, when a \ gls{at} is unblocked by a different \gls{proc} that means the unblocking \gls{proc} must either ``steal'' the \gls{at} from another processor or find it on a globalqueue.271 This dequeuing results in either contention on the remote queue and/or \glspl{rmr} on \gls{at}data structure.272 In either case, this benchmark aims to highlight howeach scheduler handles these cases, since both cases can lead to performance degradation if not handled correctly.300 In these benchmarks, \ats can be easily partitioned over the different \procs upfront and none of the \ats communicate with each other. 301 302 The Churn benchmark represents more chaotic executions, where there is more communication among \ats but no apparent relation between the last \proc on which a \at ran and blocked, and the \proc that subsequently unblocks it. 303 With processor-specific ready-queues, when a \at is unblocked by a different \proc that means the unblocking \proc must either ``steal'' the \at from another processor or place it on a remote queue. 304 This enqueuing results in either contention on the remote queue and/or \glspl{rmr} on the \at data structure. 305 In either case, this benchmark aims to measure how well each scheduler handles these cases, since both cases can lead to performance degradation if not handled correctly. 273 306 274 307 This benchmark uses a fixed-size array of counting semaphores. 275 Each \ gls{at}picks a random semaphore, @V@s it to unblock any \at waiting, and then @P@s on the semaphore.276 This creates a flow where \ glspl{at}push each other out of the semaphores before being pushed out themselves.277 For this benchmark to work, the number of \ glspl{at} must be equal or greater than the number of semaphores plus the number of \glspl{proc}.308 Each \at picks a random semaphore, @V@s it to unblock any \at waiting, and then @P@s on the semaphore. 309 This creates a flow where \ats push each other out of the semaphores before being pushed out themselves. 310 For this benchmark to work, the number of \ats must be equal or greater than the number of semaphores plus the number of \procs. 278 311 Note, the nature of these semaphores mean the counter can go beyond 1, which can lead to nonblocking calls to @P@. 279 312 Figure~\ref{fig:churn:code} shows pseudo code for this benchmark, where the @yield@ is replaced by @V@ and @P@. … … 298 331 299 332 \subsection{Results} 300 Figure~\ref{fig:churn:jax} shows the throughput as a function of \proc count, where each run uses 100 cycles per \proc and 5 \ats per cycle.301 302 333 \begin{figure} 303 334 \subfloat[][Throughput, 100 \ats per \proc]{ … … 307 338 \label{fig:churn:jax:ops} 308 339 } 309 \subfloat[][Throughput, 1\ats per \proc]{340 \subfloat[][Throughput, 2 \ats per \proc]{ 310 341 \resizebox{0.5\linewidth}{!}{ 311 342 \input{result.churn.low.jax.ops.pstex_t} … … 318 349 \input{result.churn.jax.ns.pstex_t} 319 350 } 320 321 } 322 \subfloat[][Latency, 1\ats per \proc]{351 \label{fig:churn:jax:ns} 352 } 353 \subfloat[][Latency, 2 \ats per \proc]{ 323 354 \resizebox{0.5\linewidth}{!}{ 324 355 \input{result.churn.low.jax.ns.pstex_t} … … 326 357 \label{fig:churn:jax:low:ns} 327 358 } 328 \caption[Churn Benchmark on Intel]{\centering Churn Benchmark on Intel\smallskip\newline Throughput and latency of the Churn on the benchmark on the Intel machine. 329 Throughput is the total operation per second across all cores. Latency is the duration of each operation.} 359 \caption[Churn Benchmark on Intel]{\centering Churn Benchmark on Intel\smallskip\newline Throughput and latency of the Churn on the benchmark on the Intel machine. For Throughput higher is better, for Scalability lower is better. Each series represent 15 independent runs, the dotted lines are extremums while the solid line is the medium.} 330 360 \label{fig:churn:jax} 331 361 \end{figure} 332 362 333 \todo{results discussion} 363 \begin{figure} 364 \subfloat[][Throughput, 100 \ats per \proc]{ 365 \resizebox{0.5\linewidth}{!}{ 366 \input{result.churn.nasus.ops.pstex_t} 367 } 368 \label{fig:churn:nasus:ops} 369 } 370 \subfloat[][Throughput, 2 \ats per \proc]{ 371 \resizebox{0.5\linewidth}{!}{ 372 \input{result.churn.low.nasus.ops.pstex_t} 373 } 374 \label{fig:churn:nasus:low:ops} 375 } 376 377 \subfloat[][Latency, 100 \ats per \proc]{ 378 \resizebox{0.5\linewidth}{!}{ 379 \input{result.churn.nasus.ns.pstex_t} 380 } 381 \label{fig:churn:nasus:ns} 382 } 383 \subfloat[][Latency, 2 \ats per \proc]{ 384 \resizebox{0.5\linewidth}{!}{ 385 \input{result.churn.low.nasus.ns.pstex_t} 386 } 387 \label{fig:churn:nasus:low:ns} 388 } 389 \caption[Churn Benchmark on AMD]{\centering Churn Benchmark on AMD\smallskip\newline Throughput and latency of the Churn on the benchmark on the AMD machine. 390 For Throughput higher is better, for Scalability lower is better. Each series represent 15 independent runs, the dotted lines are extremums while the solid line is the medium.} 391 \label{fig:churn:nasus} 392 \end{figure} 393 Figure~\ref{fig:churn:jax} and Figure~\ref{fig:churn:nasus} show the throughput as a function of \proc count on Intel and AMD respectively. 394 It uses the same representation as the previous benchmark : 15 runs where the dashed line show the extremums and the solid line the median. 395 The performance cost of crossing the cache boundaries is still visible at the same \proc count. 396 However, this benchmark has performance dominated by the cache traffic as \proc are constantly accessing the eachother's data. 397 Scalability is notably worst than the previous benchmarks since there is inherently more communication between processors. 398 Indeed, once the number of \glspl{hthrd} goes beyond a single socket, performance ceases to improve. 399 An interesting aspect to note here is that the runtimes differ in how they handle this situation. 400 Indeed, when a \proc unparks a \at that was last run on a different \proc, the \at could be appended to the ready-queue local \proc or to the ready-queue of the remote \proc, which previously ran the \at. 401 \CFA, tokio and Go all use the approach of unparking to the local \proc while Libfibre unparks to the remote \proc. 402 In this particular benchmark, the inherent chaos of the benchmark in addition to small memory footprint means neither approach wins over the other. 403 404 Like for the cycle benchmark, here all runtimes achieve fairly similar performance. 405 Performance improves as long as all \procs fit on a single socket. 406 Beyond that performance starts to suffer from increased caching costs. 407 408 Indeed on Figures~\ref{fig:churn:jax:ops} and \ref{fig:churn:jax:ns} show that with 1 and 100 \ats per \proc, \CFA, libfibre, Go and tokio achieve effectively equivalent performance for most \proc count. 409 410 However, Figure~\ref{fig:churn:nasus} again shows a somewhat different story on AMD. 411 While \CFA, libfibre, and tokio achieve effectively equivalent performance for most \proc count, Go starts with better scaling at very low \proc counts but then performance quickly plateaus, resulting in worse performance at higher \proc counts. 412 This performance difference is visible at both high and low \at counts. 413 414 One possible explanation for this difference is that since Go has very few available concurrent primitives, a channel was used instead of a semaphore. 415 On paper a semaphore can be replaced by a channel and with zero-sized objects passed along equivalent performance could be expected. 416 However, in practice there can be implementation difference between the two. 417 This is especially true if the semaphore count can get somewhat high. 418 Note that this replacement is also made in the cycle benchmark, however in that context it did not seem to have a notable impact. 419 420 As second possible explanation is that Go may sometimes use the heap when allocating variables based on the result of escape analysis of the code. 421 It is possible that variables that should be placed on the stack are placed on the heap. 422 This could cause extra pointer chasing in the benchmark, heightning locality effects. 423 Depending on how the heap is structure, this could also lead to false sharing. 424 425 The objective of this benchmark is to demonstrate that unparking \ats from remote \procs do not cause too much contention on the local queues. 426 Indeed, the fact all runtimes achieve some scaling at lower \proc count demontrate that migrations do not need to be serialized. 427 Again these result demonstrate \CFA achieves satisfactory performance. 334 428 335 429 \section{Locality} 336 337 \todo{code, setup, results} 430 \begin{figure} 431 \begin{cfa} 432 Thread.main() { 433 count := 0 434 for { 435 r := random() % len(spots) 436 // go through the array 437 @work( a )@ 438 spots[r].V() 439 spots[r].P() 440 count ++ 441 if must_stop() { break } 442 } 443 global.count += count 444 } 445 \end{cfa} 446 \begin{cfa} 447 Thread.main() { 448 count := 0 449 for { 450 r := random() % len(spots) 451 // go through the array 452 @work( a )@ 453 // pass array to next thread 454 spots[r].V( @a@ ) 455 @a = @spots[r].P() 456 count ++ 457 if must_stop() { break } 458 } 459 global.count += count 460 } 461 \end{cfa} 462 \caption[Locality Benchmark : Pseudo Code]{Locality Benchmark : Pseudo Code} 463 \label{fig:locality:code} 464 \end{figure} 465 As mentionned in the churn benchmark, when unparking a \at, it is possible to either unpark to the local or remote ready-queue. 466 \footnote{It is also possible to unpark to a third unrelated ready-queue, but without additional knowledge about the situation, there is little to suggest this would not degrade performance.} 467 The locality experiment includes two variations of the churn benchmark, where an array of data is added. 468 In both variations, before @V@ing the semaphore, each \at increment random cells inside the array. 469 The @share@ variation then passes the array to the shadow-queue of the semaphore, transferring ownership of the array to the woken thread. 470 In the @noshare@ variation the array is not passed on and each thread continously accesses its private array. 471 472 The objective here is to highlight the different decision made by the runtime when unparking. 473 Since each thread unparks a random semaphore, it means that it is unlikely that a \at will be unparked from the last \proc it ran on. 474 In the @share@ version, this means that unparking the \at on the local \proc is appropriate since the data was last modified on that \proc. 475 In the @noshare@ version, the unparking the \at on the remote \proc is the appropriate approach. 476 477 The expectation for this benchmark is to see a performance inversion, where runtimes will fare notably better in the variation which matches their unparking policy. 478 This should lead to \CFA, Go and Tokio achieving better performance in @share@ while libfibre achieves better performance in @noshare@. 479 Indeed, \CFA, Go and Tokio have the default policy of unpark \ats on the local \proc, where as libfibre has the default policy of unparks \ats wherever they last ran. 480 481 \subsection{Results} 482 \begin{figure} 483 \subfloat[][Throughput share]{ 484 \resizebox{0.5\linewidth}{!}{ 485 \input{result.locality.share.jax.ops.pstex_t} 486 } 487 \label{fig:locality:jax:share:ops} 488 } 489 \subfloat[][Throughput noshare]{ 490 \resizebox{0.5\linewidth}{!}{ 491 \input{result.locality.noshare.jax.ops.pstex_t} 492 } 493 \label{fig:locality:jax:noshare:ops} 494 } 495 496 \subfloat[][Scalability share]{ 497 \resizebox{0.5\linewidth}{!}{ 498 \input{result.locality.share.jax.ns.pstex_t} 499 } 500 \label{fig:locality:jax:share:ns} 501 } 502 \subfloat[][Scalability noshare]{ 503 \resizebox{0.5\linewidth}{!}{ 504 \input{result.locality.noshare.jax.ns.pstex_t} 505 } 506 \label{fig:locality:jax:noshare:ns} 507 } 508 \caption[Locality Benchmark on Intel]{Locality Benchmark on Intel\smallskip\newline Throughput and Scalability as a function of \proc count. For Throughput higher is better, for Scalability lower is better. Each series represent 15 independent runs, the dotted lines are extremums while the solid line is the medium.} 509 \label{fig:locality:jax} 510 \end{figure} 511 \begin{figure} 512 \subfloat[][Throughput share]{ 513 \resizebox{0.5\linewidth}{!}{ 514 \input{result.locality.share.nasus.ops.pstex_t} 515 } 516 \label{fig:locality:nasus:share:ops} 517 } 518 \subfloat[][Throughput noshare]{ 519 \resizebox{0.5\linewidth}{!}{ 520 \input{result.locality.noshare.nasus.ops.pstex_t} 521 } 522 \label{fig:locality:nasus:noshare:ops} 523 } 524 525 \subfloat[][Scalability share]{ 526 \resizebox{0.5\linewidth}{!}{ 527 \input{result.locality.share.nasus.ns.pstex_t} 528 } 529 \label{fig:locality:nasus:share:ns} 530 } 531 \subfloat[][Scalability noshare]{ 532 \resizebox{0.5\linewidth}{!}{ 533 \input{result.locality.noshare.nasus.ns.pstex_t} 534 } 535 \label{fig:locality:nasus:noshare:ns} 536 } 537 \caption[Locality Benchmark on AMD]{Locality Benchmark on AMD\smallskip\newline Throughput and Scalability as a function of \proc count. For Throughput higher is better, for Scalability lower is better. Each series represent 15 independent runs, the dotted lines are extremums while the solid line is the medium.} 538 \label{fig:locality:nasus} 539 \end{figure} 540 541 Figure~\ref{fig:locality:jax} and \ref{fig:locality:nasus} shows the results on Intel and AMD respectively. 542 In both cases, the graphs on the left column show the results for the @share@ variation and the graphs on the right column show the results for the @noshare@. 543 544 On Intel, Figure~\ref{fig:locality:jax} shows Go trailing behind the 3 other runtimes. 545 On the left of the figure showing the results for the shared variation, where \CFA and tokio slightly outperform libfibre as expected. 546 And correspondingly on the right, we see the expected performance inversion where libfibre now outperforms \CFA and tokio. 547 Otherwise the results are similar to the churn benchmark, with lower throughtput due to the array processing. 548 Presumably the reason why Go trails behind are the same as in Figure~\ref{fig:churn:nasus}. 549 550 Figure~\ref{fig:locality:nasus} shows the same experiment on AMD. 551 \todo{why is cfa slower?} 552 Again, we see the same story, where tokio and libfibre swap places and Go trails behind. 338 553 339 554 \section{Transfer} 340 555 The last benchmark is more of an experiment than a benchmark. 341 556 It tests the behaviour of the schedulers for a misbehaved workload. 342 In this workload, one of the \ gls{at}is selected at random to be the leader.343 The leader then spins in a tight loop until it has observed that all other \ glspl{at}have acknowledged its leadership.344 The leader \ gls{at} then picks a new \gls{at} to be the ``spinner''and the cycle repeats.345 The benchmark comes in two flavours for the non-leader \ glspl{at}:557 In this workload, one of the \at is selected at random to be the leader. 558 The leader then spins in a tight loop until it has observed that all other \ats have acknowledged its leadership. 559 The leader \at then picks a new \at to be the next leader and the cycle repeats. 560 The benchmark comes in two flavours for the non-leader \ats: 346 561 once they acknowledged the leader, they either block on a semaphore or spin yielding. 347 562 348 563 The experiment is designed to evaluate the short-term load-balancing of a scheduler. 349 Indeed, schedulers where the runnable \ glspl{at} are partitioned on the \glspl{proc} may need to balance the \glspl{at}for this experiment to terminate.350 This problem occurs because the spinning \ gls{at} is effectively preventing the \gls{proc} from running any other \glspl{thrd}.351 In the semaphore flavour, the number of runnable \ glspl{at}eventually dwindles down to only the leader.352 This scenario is a simpler case to handle for schedulers since \ glspl{proc}eventually run out of work.353 In the yielding flavour, the number of runnable \ glspl{at}stays constant.564 Indeed, schedulers where the runnable \ats are partitioned on the \procs may need to balance the \ats for this experiment to terminate. 565 This problem occurs because the spinning \at is effectively preventing the \proc from running any other \at. 566 In the semaphore flavour, the number of runnable \ats eventually dwindles down to only the leader. 567 This scenario is a simpler case to handle for schedulers since \procs eventually run out of work. 568 In the yielding flavour, the number of runnable \ats stays constant. 354 569 This scenario is a harder case to handle because corrective measures must be taken even when work is available. 355 570 Note, runtime systems with preemption circumvent this problem by forcing the spinner to yield. 356 571 357 \todo{code, setup, results} 572 In both flavours, the experiment effectively measures how long it takes for all \ats to run once after a given synchronization point. 573 In an ideal scenario where the scheduler is strictly FIFO, every thread would run once after the synchronization and therefore the delay between leaders would be given by: 574 $ \frac{CSL + SL}{NP - 1}$, where $CSL$ is the context switch latency, $SL$ is the cost for enqueuing and dequeuing a \at and $NP$ is the number of \procs. 575 However, if the scheduler allows \ats to run many times before other \ats are able to run once, this delay will increase. 576 The semaphore version is an approximation of the strictly FIFO scheduling, where none of the \ats \emph{attempt} to run more than once. 577 The benchmark effectively provides the fairness guarantee in this case. 578 In the yielding version however, the benchmark provides no such guarantee, which means the scheduler has full responsability and any unfairness will be measurable. 579 580 While this is a fairly artificial scenario, it requires only a few simple pieces. 581 The yielding version of this simply creates a scenario where a \at runs uninterrupted in a saturated system, and starvation has an easily measured impact. 582 However, \emph{any} \at that runs uninterrupted for a significant period of time in a saturated system could lead to this kind of starvation. 358 583 359 584 \begin{figure} … … 365 590 return 366 591 } 367 368 592 // Wait for everyone to acknowledge my leadership 369 593 start: = timeNow() … … 374 598 } 375 599 } 376 377 600 // pick next leader 378 601 leader := threads[ prng() % len(threads) ] 379 380 602 // wake every one 381 603 if ! exhaust { … … 385 607 } 386 608 } 387 388 609 Thread.wait() { 389 610 this.idx_seen := lead_idx … … 391 612 else { yield() } 392 613 } 393 394 614 Thread.main() { 395 615 while !done { … … 404 624 405 625 \subsection{Results} 406 Figure~\ref{fig:transfer:jax} shows the throughput as a function of \proc count, where each run uses 100 cycles per \proc and 5 \ats per cycle. 407 408 \todo{results discussion} 626 \begin{figure} 627 \begin{centering} 628 \begin{tabular}{r | c c c c | c c c c } 629 Machine & \multicolumn{4}{c |}{Intel} & \multicolumn{4}{c}{AMD} \\ 630 Variation & \multicolumn{2}{c}{Park} & \multicolumn{2}{c |}{Yield} & \multicolumn{2}{c}{Park} & \multicolumn{2}{c}{Yield} \\ 631 \procs & 2 & 192 & 2 & 192 & 2 & 256 & 2 & 256 \\ 632 \hline 633 \CFA & 106 $\mu$s & ~19.9 ms & 68.4 $\mu$s & ~1.2 ms & 174 $\mu$s & ~28.4 ms & 78.8~~$\mu$s& ~~1.21 ms \\ 634 libfibre & 127 $\mu$s & ~33.5 ms & DNC & DNC & 156 $\mu$s & ~36.7 ms & DNC & DNC \\ 635 Go & 106 $\mu$s & ~64.0 ms & 24.6 ms & 74.3 ms & 271 $\mu$s & 121.6 ms & ~~1.21~ms & 117.4 ms \\ 636 tokio & 289 $\mu$s & 180.6 ms & DNC & DNC & 157 $\mu$s & 111.0 ms & DNC & DNC 637 \end{tabular} 638 \end{centering} 639 \caption[Transfer Benchmark on Intel and AMD]{Transfer Benchmark on Intel and AMD\smallskip\newline Average measurement of how long it takes for all \ats to acknowledge the leader \at. DNC stands for ``did not complete'', meaning that after 5 seconds of a new leader being decided, some \ats still had not acknowledged the new leader. } 640 \label{fig:transfer:res} 641 \end{figure} 642 Figure~\ref{fig:transfer:res} shows the result for the transfer benchmark with 2 \procs and all \procs, where each experiement runs 100 \at per \proc. 643 Note that the results here are only meaningful as a coarse measurement of fairness, beyond which small cost differences in the runtime and concurrent primitives begin to matter. 644 As such, data points that are the on the same order of magnitude as eachother should be basically considered equal. 645 The takeaway of this experiement is the presence of very large differences. 646 The semaphore variation is denoted ``Park'', where the number of \ats dwindles down as the new leader is acknowledged. 647 The yielding variation is denoted ``Yield''. 648 The experiement was only run for the extremums of the number of cores since the scaling per core behaves like previous experiements. 649 This experiments clearly demonstrate that while the other runtimes achieve similar performance in previous benchmarks, here \CFA achieves significantly better fairness. 650 The semaphore variation serves as a control group, where all runtimes are expected to transfer leadership fairly quickly. 651 Since \ats block after acknowledging the leader, this experiment effectively measures how quickly \procs can steal \ats from the \proc running leader. 652 Figure~\ref{fig:transfer:res} shows that while Go and Tokio are slower, all runtime achieve decent latency. 653 However, the yielding variation shows an entirely different picture. 654 Since libfibre and tokio have a traditional work-stealing scheduler, \procs that have \ats on their local queues will never steal from other \procs. 655 The result is that the experiement simply does not complete for these runtime. 656 Without \procs stealing from the \proc running the leader, the experiment will simply never terminate. 657 Go manages to complete the experiement because it adds preemption on top of classic work-stealing. 658 However, since preemption is fairly costly it achieves significantly worst performance. 659 In contrast, \CFA achieves equivalent performance in both variations, demonstrating very good fairness. 660 Interestingly \CFA achieves better delays in the yielding version than the semaphore version, however, that is likely due to fairness being equivalent but removing the cost of the semaphores and idle-sleep. -
doc/theses/thierry_delisle_PhD/thesis/text/existing.tex
r741e22c r71cf630 14 14 15 15 \section{Naming Convention} 16 Scheduling has been studied by various communities concentrating on different incarnation of the same problems. 17 As a result, there are no standard naming conventions for scheduling that is respected across these communities. 16 Scheduling has been studied by various communities concentrating on different incarnation of the same problems. 17 As a result, there are no standard naming conventions for scheduling that is respected across these communities. 18 18 This document uses the term \newterm{\Gls{at}} to refer to the abstract objects being scheduled and the term \newterm{\Gls{proc}} to refer to the concrete objects executing these \ats. 19 19 … … 28 28 \section{Dynamic Scheduling} 29 29 \newterm{Dynamic schedulers} determine \ats dependencies and costs during scheduling, if at all. 30 Hence, unlike static scheduling, \ats dependencies are conditional and detected at runtime. 30 Hence, unlike static scheduling, \ats dependencies are conditional and detected at runtime. 31 31 This detection takes the form of observing new \ats(s) in the system and determining dependencies from their behaviour, including suspending or halting a \ats that dynamically detects unfulfilled dependencies. 32 32 Furthermore, each \ats has the responsibility of adding dependent \ats back into the system once dependencies are fulfilled. … … 51 51 Most common operating systems use some variant on priorities with overlaps and dynamic priority adjustments. 52 52 For example, Microsoft Windows uses a pair of priorities 53 \cit {https://docs.microsoft.com/en-us/windows/win32/procthread/scheduling-priorities,https://docs.microsoft.com/en-us/windows/win32/taskschd/taskschedulerschema-priority-settingstype-element}, one specified by users out of ten possible options and one adjusted by the system.53 \cite{win:priority}, one specified by users out of ten possible options and one adjusted by the system. 54 54 55 55 \subsection{Uninformed and Self-Informed Dynamic Schedulers} … … 137 137 The scheduler may also temporarily adjust priorities after certain effects like the completion of I/O requests. 138 138 139 \todo{load balancing} 139 In~\cite{russinovich2009windows}, Chapter 1 section ``Processes, Threads, and Jobs'' discusses the scheduling policy more in depth. 140 Multicore scheduling is based on a combination of priorities, preferred \proc. 141 Each \at is assigned an \newterm{ideal} \proc using a round-robin policy. 142 \Gls{at} are distributed among the \procs according to their priority, preferring to match \ats to their ideal \proc and then to the last \proc they ran on. 143 This is similar to a variation of work stealing, where the stealing \proc restore the \at to its original \proc after running it, but with priorities added onto the mix. 140 144 141 145 \paragraph{Apple OS X} … … 156 160 \paragraph{Go}\label{GoSafePoint} 157 161 Go's scheduler uses a randomized work-stealing algorithm that has a global run-queue (\emph{GRQ}) and each processor (\emph{P}) has both a fixed-size run-queue (\emph{LRQ}) and a high-priority next ``chair'' holding a single element~\cite{GITHUB:go,YTUBE:go}. 158 Preemption is present, but only at safe-points,~\cit {https://go.dev/src/runtime/preempt.go} which are inserted detection code at various frequent access boundaries.162 Preemption is present, but only at safe-points,~\cite{go:safepoints} which are inserted detection code at various frequent access boundaries. 159 163 160 164 The algorithm is as follows : … … 199 203 200 204 \paragraph{Grand Central Dispatch} 201 An Apple\cit {Official GCD source} API that offers task parallelism~\cite{wiki:taskparallel}.205 An Apple\cite{apple:gcd} API that offers task parallelism~\cite{wiki:taskparallel}. 202 206 Its distinctive aspect is multiple ``Dispatch Queues'', some of which are created by programmers. 203 207 Each queue has its own local ordering guarantees, \eg \ats on queue $A$ are executed in \emph{FIFO} order. 204 208 205 \todo{load balancing and scheduling} 206 207 % http://web.archive.org/web/20090920043909/http://images.apple.com/macosx/technology/docs/GrandCentral_TB_brief_20090903.pdf 208 209 In terms of semantics, the Dispatch Queues seem to be very similar to Intel\textregistered ~TBB @execute()@ and predecessor semantics. 209 While the documentation only gives limited insight into the scheduling and load balancing approach, \cite{apple:gcd2} suggests an approach fairly classic; 210 Where each \proc has a queue of \newterm{blocks} to run, effectively \ats, and they drain their respective queues in \glsxtrshort{fifo}. 211 They seem to add the concept of dependent queues with clear ordering, where a executing a block ends-up scheduling more blocks. 212 In terms of semantics, these Dispatch Queues seem to be very similar to Intel\textregistered ~TBB @execute()@ and predecessor semantics. 210 213 211 214 \paragraph{LibFibre} -
doc/theses/thierry_delisle_PhD/thesis/text/front.tex
r741e22c r71cf630 106 106 % D E C L A R A T I O N P A G E 107 107 % ------------------------------- 108 % The following is a sample De laration Page as provided by the GSO108 % The following is a sample Declaration Page as provided by the GSO 109 109 % December 13th, 2006. It is designed for an electronic thesis. 110 110 \noindent … … 124 124 125 125 User-Level threading (M:N) is gaining popularity over kernel-level threading (1:1) in many programming languages. 126 The user -levelapproach is often a better mechanism to express complex concurrent applications by efficiently running 10,000+ threads on multi-core systems.127 Indeed, over-partitioning into small work-units significantly eases load balancing while providing user threads for each unit of work offers greater freedom to the programmer.126 The user threading approach is often a better mechanism to express complex concurrent applications by efficiently running 10,000+ threads on multi-core systems. 127 Indeed, over-partitioning into small work-units with user threading significantly eases load bal\-ancing, while simultaneously providing advanced synchronization and mutual exclusion capabilities. 128 128 To manage these high levels of concurrency, the underlying runtime must efficiently schedule many user threads across a few kernel threads; 129 which begs of the question of how many kernel threads are needed and when should the need be re-evaliated. 130 Furthermore, the scheduler must prevent kernel threads from blocking, otherwise user-thread parallelism drops, and put idle kernel-threads to sleep to avoid wasted resources. 129 which begs of the question of how many kernel threads are needed and should the number be dynamically reevaluated. 130 Furthermore, scheduling must prevent kernel threads from blocking, otherwise user-thread parallelism drops. 131 When user-threading parallelism does drop, how and when should idle kernel-threads be put to sleep to avoid wasting CPU resources. 131 132 Finally, the scheduling system must provide fairness to prevent a user thread from monopolizing a kernel thread; 132 otherwise other user threads can experience short/long term starvation or kernel threads can deadlock waiting for events to occur .133 otherwise other user threads can experience short/long term starvation or kernel threads can deadlock waiting for events to occur on busy kernel threads. 133 134 134 135 This thesis analyses multiple scheduler systems, where each system attempts to fulfill the necessary requirements for user-level threading. 135 The predominant technique for manage high levels of concurrency is sharding the ready-queue with one queue per kernel-threads and using some form of work stealing/sharing to dynamically rebalance workload shifts. 136 Fairness can be handled through preemption or ad-hoc solutions, which leads to coarse-grained fairness and pathological cases. 136 The predominant technique for managing high levels of concurrency is sharding the ready-queue with one queue per kernel-thread and using some form of work stealing/sharing to dynamically rebalance workload shifts. 137 137 Preventing kernel blocking is accomplish by transforming kernel locks and I/O operations into user-level operations that do not block the kernel thread or spin up new kernel threads to manage the blocking. 138 139 After selecting specific approaches to these scheduling issues, a complete implementation was created and tested in the \CFA (C-for-all) runtime system. 138 Fairness is handled through preemption and/or ad-hoc solutions, which leads to coarse-grained fairness with some pathological cases. 139 140 After examining, selecting and testing specific approaches to these scheduling issues, a complete implementation was created and tested in the \CFA (C-for-all) runtime system. 140 141 \CFA is a modern extension of C using user-level threading as its fundamental threading model. 141 142 As one of its primary goals, \CFA aims to offer increased safety and productivity without sacrificing performance. 142 143 The new scheduler achieves this goal by demonstrating equivalent performance to work-stealing schedulers while offering better fairness. 143 This is achieved through several optimization that successfully eliminate the cost of the additional fairness, some of these optimization relying on interesting hardware optimizations present on most modern cpus. 144 This work also includes support for user-level \io, allowing programmers to have many more user-threads blocking on \io operations than there are \glspl{kthrd}. 145 The implementation is based on @io_uring@, a recent addition to the Linux kernel, and achieves the same performance and fairness. 146 To complete the picture, the idle sleep mechanism that goes along is presented. 147 148 144 The implementation uses several optimizations that successfully balance the cost of fairness against performance; 145 some of these optimizations rely on interesting hardware optimizations present on modern CPUs. 146 The new scheduler also includes support for implicit nonblocking \io, allowing applications to have more user-threads blocking on \io operations than there are \glspl{kthrd}. 147 The implementation is based on @io_uring@, a recent addition to the Linux kernel, and achieves the same performance and fairness as systems using @select@, @epoll@, \etc. 148 To complete the scheduler, an idle sleep mechanism is implemented that significantly reduces wasted CPU cycles, which are then available outside of the application. 149 149 150 150 \cleardoublepage … … 155 155 \begin{center}\textbf{Acknowledgements}\end{center} 156 156 157 \todo{Acknowledgements} 157 I would like to thank my supervisor, Professor Peter Buhr, for his guidance through my degree as well as the editing of this document. 158 159 I would like to thank Professors Martin Karsten and Trevor Brown, for reading my thesis and providing helpful feedback. 160 161 Thanks to Andrew Beach, Michael Brooks, Colby Parsons, Mubeen Zulfiqar, Fangren Yu and Jiada Liang for their work on the \CFA project as well as all the discussions which have helped me concretize the ideas in this thesis. 162 163 Finally, I acknowledge that this has been possible thanks to the financial help offered by the David R. Cheriton School of Computer Science and the corporate partnership with Huawei Ltd. 158 164 \cleardoublepage 159 165 -
doc/theses/thierry_delisle_PhD/thesis/text/intro.tex
r741e22c r71cf630 1 1 \chapter{Introduction}\label{intro} 2 \section{\CFA programming language}3 2 4 The \CFA programming language~\cite{cfa:frontpage,cfa:typesystem} extends the C programming language by adding modern safety and productivity features, while maintaining backwards compatibility. 5 Among its productivity features, \CFA supports user-level threading~\cite{Delisle21} allowing programmers to write modern concurrent and parallel programs. 6 My previous master's thesis on concurrent in \CFA focused on features and interfaces. 7 This Ph.D.\ thesis focuses on performance, introducing \glsxtrshort{api} changes only when required by performance considerations. 8 Specifically, this work concentrates on scheduling and \glsxtrshort{io}. 9 Prior to this work, the \CFA runtime used a strict \glsxtrshort{fifo} \gls{rQ} and no \glsxtrshort{io} capabilities at the user-thread level\footnote{C supports \glsxtrshort{io} capabilities at the kernel level, which means blocking operations block kernel threads where blocking user-level threads whould be more appropriate for \CFA.}. 3 \Gls{uthrding} (M:N) is gaining popularity over kernel-level threading (1:1) in many programming languages. 4 The user threading approach is often a better mechanism to express complex concurrent applications by efficiently running 10,000+ threads on multi-core systems. 5 Indeed, over-partitioning into small work-units with user threading significantly eases load bal\-ancing, while simultaneously providing advanced synchronization and mutual exclusion capabilities. 6 To manage these high levels of concurrency, the underlying runtime must efficiently schedule many user threads across a few kernel threads; 7 which begs of the question of how many kernel threads are needed and should the number be dynamically reevaluated. 8 Furthermore, scheduling must prevent kernel threads from blocking, otherwise user-thread parallelism drops. 9 When user-threading parallelism does drop, how and when should idle kernel-threads be put to sleep to avoid wasting CPU resources. 10 Finally, the scheduling system must provide fairness to prevent a user thread from monopolizing a kernel thread; 11 otherwise other user threads can experience short/long term starvation or kernel threads can deadlock waiting for events to occur on busy kernel threads. 10 12 11 As a research project, this work builds exclusively on newer versions of the Linux operating-system and gcc/clang compilers. 12 While \CFA is released, supporting older versions of Linux ($<$~Ubuntu 16.04) and gcc/clang compilers ($<$~gcc 6.0) is not a goal of this work. 13 This thesis analyses multiple scheduler systems, where each system attempts to fulfill the necessary requirements for \gls{uthrding}. 14 The predominant technique for managing high levels of concurrency is sharding the ready-queue with one queue per kernel-thread and using some form of work stealing/sharing to dynamically rebalance workload shifts. 15 Preventing kernel blocking is accomplish by transforming kernel locks and I/O operations into user-level operations that do not block the kernel thread or spin up new kernel threads to manage the blocking. 16 Fairness is handled through preemption and/or ad-hoc solutions, which leads to coarse-grained fairness with some pathological cases. 17 18 After examining, testing and selecting specific approaches to these scheduling issues, a completely new scheduler was created and tested in the \CFA (C-for-all) user-threading runtime-system. 19 The goal of the new scheduler is to offer increased safety and productivity without sacrificing performance. 20 The quality of the new scheduler is demonstrated by comparing it with other user-threading work-stealing schedulers with the aim of showing equivalent or better performance while offering better fairness. 21 22 Chapter~\ref{intro} defines scheduling and its general goals. 23 Chapter~\ref{existing} discusses how scheduler implementations attempt to achieve these goals, but all implementations optimize some workloads better than others. 24 Chapter~\ref{cfaruntime} presents the relevant aspects of the \CFA runtime system that have a significant affect on the new scheduler design and implementation. 25 Chapter~\ref{core} analyses different scheduler approaches, while looking for scheduler mechanisms that provide both performance and fairness. 26 Chapter~\ref{userio} covers the complex mechanisms that must be used to achieve nonblocking I/O to prevent the blocking of \glspl{kthrd}. 27 Chapter~\ref{practice} presents the mechanisms needed to adjust the amount of parallelism, both manually and automatically. 28 Chapters~\ref{microbench} and~\ref{macrobench} present micro and macro benchmarks used to evaluate and compare the new scheduler with similar schedulers. 29 13 30 14 31 \section{Scheduling} 15 Computer systems share multiple resources across many threads of execution, even on single user computers like laptops or smartphones.16 On a computer system with multiple processors and work units , there exists the problem of mapping work ontoprocessors in an efficient manner, called \newterm{scheduling}.17 These systems are normally \newterm{open}, meaning new work arrives from an external source or isspawned from an existing work unit.18 On a computer system, the scheduler takes a sequence of work requests in the form of threads and attempts to complete the work, subject to performance objectives, such as resource utilization.19 A general-purpose dynamic-scheduler for an open system cannot anticipate future work requests, so its performance is rarely optimal.20 With complete knowledge of arrive order and work, creating an optimal solution still effectively needs solving the bin packing problem\cite{wiki:binpak}.21 However, optimal solutions are often not required.22 Schedulers do produce excellent solutions, whitout needing optimality, by taking advantage of regularities in work patterns.32 Computer systems share multiple resources across many threads of execution, even on single-user computers like laptops or smartphones. 33 On a computer system with multiple processors and work units (routines, coroutines, threads, programs, \etc), there exists the problem of mapping many different kinds of work units onto many different kinds of processors in an efficient manner, called \newterm{scheduling}. 34 Scheduling systems are normally \newterm{open}, meaning new work arrives from an external source or is randomly spawned from an existing work unit. 35 In general, work units without threads, like routines and coroutines, are self-scheduling, while work units with threads, like tasks and programs, are scheduled. 36 For scheduled work-units, a scheduler takes a sequence of threads and attempts to run them to completion, subject to shared resource restrictions and utilization. 37 A general-purpose dynamic-scheduler for an open system cannot anticipate work requests, so its performance is rarely optimal. 38 Even with complete knowledge of arrive order and work, creating an optimal solution is a bin packing problem~\cite{wiki:binpak}. 39 However, optimal solutions are often not required: schedulers often produce excellent solutions, without needing optimality, by taking advantage of regularities in work patterns. 23 40 24 41 Scheduling occurs at discreet points when there are transitions in a system. … … 27 44 \input{executionStates.pstex_t} 28 45 \end{center} 29 These \newterm{state transition}s are initiated in response to events (\Index{interrupt}s):46 These \newterm{state transition}s are initiated in response to events, \eg blocking, interrupts, errors: 30 47 \begin{itemize} 31 48 \item 32 49 entering the system (new $\rightarrow$ ready) 50 \item 51 scheduler assigns a thread to a computing resource, \eg CPU (ready $\rightarrow$ running) 33 52 \item 34 53 timer alarm for preemption (running $\rightarrow$ ready) … … 36 55 long term delay versus spinning (running $\rightarrow$ blocked) 37 56 \item 38 blocking ends, \ienetwork or I/O completion (blocked $\rightarrow$ ready)57 completion of delay, \eg network or I/O completion (blocked $\rightarrow$ ready) 39 58 \item 40 normal completion or error, \ie segment fault (running $\rightarrow$ halted) 41 \item 42 scheduler assigns a thread to a resource (ready $\rightarrow$ running) 59 normal completion or error, \eg segment fault (running $\rightarrow$ halted) 43 60 \end{itemize} 44 Key to scheduling is that a thread cannot bypass the ``ready'' state during a transition so the scheduler maintains complete control of the system .61 Key to scheduling is that a thread cannot bypass the ``ready'' state during a transition so the scheduler maintains complete control of the system, \ie no self-scheduling among threads. 45 62 46 63 When the workload exceeds the capacity of the processors, \ie work cannot be executed immediately, it is placed on a queue for subsequent service, called a \newterm{ready queue}. 47 64 Ready queues organize threads for scheduling, which indirectly organizes the work to be performed. 48 The structure of ready queues can take many different forms. 49 Where simple examples include single-queue multi-server (SQMS) and the multi-queue multi-server (MQMS). 65 The structure of ready queues can take many different forms, where the basic two are the single-queue multi-server (SQMS) and the multi-queue multi-server (MQMS). 50 66 \begin{center} 51 67 \begin{tabular}{l|l} … … 55 71 \end{tabular} 56 72 \end{center} 57 Beyond these two schedulers are a host of options, \ ie adding an optional global, shared queue to MQMS.73 Beyond these two schedulers are a host of options, \eg adding an global shared queue to MQMS or adding multiple private queues with distinc characteristics. 58 74 59 The three major optimization criteria for a scheduler are:75 Once there are multiple resources and ready queues, a scheduler is faced with three major optimization criteria: 60 76 \begin{enumerate}[leftmargin=*] 61 77 \item … … 70 86 Essentially, all multi-processor computers have non-uniform memory access (NUMA), with one or more quantized steps to access data at different levels in the memory hierarchy. 71 87 When a system has a large number of independently executing threads, affinity becomes difficult because of \newterm{thread churn}. 72 That is, threads must be scheduled on multipleprocessors to obtain high processors utilization because the number of threads $\ggg$ processors.88 That is, threads must be scheduled on different processors to obtain high processors utilization because the number of threads $\ggg$ processors. 73 89 74 90 \item 75 \newterm{contention}: safe access of shared objects by multiple processors requires mutual exclusion in some form, generally locking\footnote{ 76 Lock-free data-structures do not involve locking but incurr similar costs to achieve mutual exclusion.} 77 78 \noindent 79 Mutual exclusion cost and latency increases significantly with the number of processors accessing a shared object. 91 \newterm{contention}: safe access of shared objects by multiple processors requires mutual exclusion in some form, generally locking.\footnote{ 92 Lock-free data-structures do not involve locking but incur similar costs to achieve mutual exclusion.} 93 Mutual exclusion cost and latency increases significantly with the number of processors access\-ing a shared object. 80 94 \end{enumerate} 81 95 82 Nevertheless, schedulers are a series of compromises, occasionally with some static or dynamic tuning parameters to enhance specific patterns. 83 Scheduling is a zero-sum game as computer processors normally have a fixed, maximum number of cycles per unit time\footnote{Frequency scaling and turbot boost add a degree of complexity that can be ignored in this discussion without loss of generality.}. 84 SQMS has perfect load-balancing but poor affinity and high contention by the processors, because of the single queue. 85 MQMS has poor load-balancing but perfect affinity and no contention, because each processor has its own queue. 96 Scheduling is a zero-sum game as computer processors normally have a fixed, maximum number of cycles per unit time.\footnote{ 97 Frequency scaling and turbo-boost add a degree of complexity that can be ignored in this discussion without loss of generality.} 98 Hence, schedulers are a series of compromises, occasionally with some static or dynamic tuning parameters to enhance specific workload patterns. 99 For example, SQMS has perfect load-balancing but poor affinity and high contention by the processors, because of the single queue. 100 While MQMS has poor load-balancing but perfect affinity and no contention, because each processor has its own queue. 86 101 87 Significant research effort has also looked at load sharing/stealing among queues, when a ready queue is too long or short, respectively.102 Significant research effort has looked at load balancing by stealing/sharing work units among queues: when a ready queue is too short or long, respectively, load stealing/sharing schedulers attempt to push/pull work units to/from other ready queues. 88 103 These approaches attempt to perform better load-balancing at the cost of affinity and contention. 89 Load sharing/stealing schedulers attempt to push/pull work units to/from other ready queues 104 However, \emph{all} approaches come at a cost, but not all compromises are necessarily equivalent, especially across workloads. 105 Hence, some schedulers perform very well for specific workloads, while others offer acceptable performance over a wider range of workloads. 90 106 91 Note however that while any change comes at a cost, hence the zero-sum game, not all compromises are necessarily equivalent. 92 Some schedulers can perform very well only in very specific workload scenarios, others might offer acceptable performance but be applicable to a wider range of workloads. 93 Since \CFA attempts to improve the safety and productivity of C, the scheduler presented in this thesis attempts to achieve the same goals. 107 \section{\CFA programming language} 108 109 The \CFA programming language~\cite{Cforall,Moss18} extends the C programming language by adding modern safety and productivity features, while maintaining backwards compatibility. 110 Among its productivity features, \CFA supports \gls{uthrding}~\cite{Delisle21} as its fundamental threading model allowing programmers to easily write modern concurrent and parallel programs. 111 My previous master's thesis on concurrency in \CFA focused on features and interfaces~\cite{Delisle18}. 112 This Ph.D.\ thesis focuses on performance, introducing \glsxtrshort{api} changes only when required by performance considerations. 113 Specifically, this work concentrates on advanced thread and \glsxtrshort{io} scheduling. 114 Prior to this work, the \CFA runtime used a strict SQMS \gls{rQ} and provided no nonblocking \glsxtrshort{io} capabilities at the user-thread level.\footnote{ 115 C/\CC only support \glsxtrshort{io} capabilities at the kernel level, which means many \io operations block \glspl{kthrd} reducing parallelism at the user level.} 116 117 Since \CFA attempts to improve the safety and productivity of C, the new scheduler presented in this thesis attempts to achieve the same goals. 94 118 More specifically, safety and productivity for scheduling means supporting a wide range of workloads so that programmers can rely on progress guarantees (safety) and more easily achieve acceptable performance (productivity). 119 The new scheduler also includes support for implicit nonblocking \io, allowing applications to have more user-threads blocking on \io operations than there are \glspl{kthrd}. 120 To complete the scheduler, an idle sleep mechanism is implemented that significantly reduces wasted CPU cycles, which are then available outside of the application. 95 121 122 As a research project, this work builds exclusively on newer versions of the Linux operating-system and gcc/clang compilers. 123 The new scheduler implementation uses several optimizations to successfully balance the cost of fairness against performance; 124 some of these optimizations rely on interesting hardware optimizations only present on modern CPUs. 125 The \io implementation is based on the @io_uring@ kernel-interface, a recent addition to the Linux kernel, because it purports to handle nonblocking \emph{file} and network \io. 126 This decision allowed an interesting performance and fairness comparison with other threading systems using @select@, @epoll@, \etc. 127 While the current \CFA release supports older versions of Linux ($\ge$~Ubuntu 16.04) and gcc/clang compilers ($\ge$~gcc 6.0), it is not the purpose of this project to find workarounds in these older systems to provide backwards compatibility. 128 The hope is that these new features will soon become mainstream features. 96 129 97 130 \section{Contributions}\label{s:Contributions} 98 This work provides the following contributions in the area of user-level scheduling in an advanced programming-language runtime-system:131 This work provides the following scheduling contributions for advanced \gls{uthrding} runtime-systems: 99 132 \begin{enumerate}[leftmargin=*] 100 133 \item 101 134 A scalable scheduling algorithm that offers progress guarantees. 102 135 \item 136 Support for user-level \glsxtrshort{io} capabilities based on Linux's @io_uring@. 137 \item 103 138 An algorithm for load-balancing and idle sleep of processors, including NUMA awareness. 104 139 \item 105 Support for user-level \glsxtrshort{io} capabilities based on Linux's @io_uring@. 140 A mechanism for adding fairness on top of MQMS algorithm through helping, used both for scalable scheduling algorithm and the user-level \glsxtrshort{io}. 141 \item 142 An optimization of the helping-mechanism for load balancing to reduce scheduling costs. 143 \item 144 An optimization for the alternative relaxed-list for load balancing to reduce scheduling costs in embarrassingly parallel cases. 106 145 \end{enumerate} -
doc/theses/thierry_delisle_PhD/thesis/text/io.tex
r741e22c r71cf630 1 \chapter{User Level \io} 1 \chapter{User Level \io}\label{userio} 2 2 As mentioned in Section~\ref{prev:io}, user-level \io requires multiplexing the \io operations of many \glspl{thrd} onto fewer \glspl{proc} using asynchronous \io operations. 3 3 Different operating systems offer various forms of asynchronous operations and, as mentioned in Chapter~\ref{intro}, this work is exclusively focused on the Linux operating-system. … … 6 6 Since this work fundamentally depends on operating-system support, the first step of this design is to discuss the available interfaces and pick one (or more) as the foundation for the non-blocking \io subsystem in this work. 7 7 8 \subsection{\lstinline{O_NONBLOCK}} 8 \subsection{\lstinline{O_NONBLOCK}}\label{ononblock} 9 9 In Linux, files can be opened with the flag @O_NONBLOCK@~\cite{MAN:open} (or @SO_NONBLOCK@~\cite{MAN:accept}, the equivalent for sockets) to use the file descriptors in ``nonblocking mode''. 10 10 In this mode, ``Neither the @open()@ nor any subsequent \io operations on the [opened file descriptor] will cause the calling process to wait''~\cite{MAN:open}. … … 141 141 In the worst case, where all \glspl{thrd} are consistently blocking on \io, it devolves into 1-to-1 threading. 142 142 However, regardless of the frequency of \io operations, it achieves the fundamental goal of not blocking \glspl{proc} when \glspl{thrd} are ready to run. 143 This approach is used by languages like Go\cit {Go}, frameworks like libuv\cit{libuv}, and web servers like Apache~\cite{apache} and Nginx~\cite{nginx}, since it has the advantage that it can easily be used across multiple operating systems.143 This approach is used by languages like Go\cite{GITHUB:go}, frameworks like libuv\cite{libuv}, and web servers like Apache~\cite{apache} and Nginx~\cite{nginx}, since it has the advantage that it can easily be used across multiple operating systems. 144 144 This advantage is especially relevant for languages like Go, which offer a homogeneous \glsxtrshort{api} across all platforms. 145 145 As opposed to C, which has a very limited standard api for \io, \eg, the C standard library has no networking. … … 148 148 These options effectively fall into two broad camps: waiting for \io to be ready versus waiting for \io to complete. 149 149 All operating systems that support asynchronous \io must offer an interface along one of these lines, but the details vary drastically. 150 For example, Free BSD offers @kqueue@~\cite{MAN:bsd/kqueue}, which behaves similarly to @epoll@, but with some small quality of use improvements, while Windows (Win32)~\cit {https://docs.microsoft.com/en-us/windows/win32/fileio/synchronous-and-asynchronous-i-o} offers ``overlapped I/O'', which handles submissions similarly to @O_NONBLOCK@ with extra flags on the synchronous system call, but waits for completion events, similarly to @io_uring@.150 For example, Free BSD offers @kqueue@~\cite{MAN:bsd/kqueue}, which behaves similarly to @epoll@, but with some small quality of use improvements, while Windows (Win32)~\cite{win:overlap} offers ``overlapped I/O'', which handles submissions similarly to @O_NONBLOCK@ with extra flags on the synchronous system call, but waits for completion events, similarly to @io_uring@. 151 151 152 152 For this project, I selected @io_uring@, in large parts because of its generality. -
doc/theses/thierry_delisle_PhD/thesis/text/practice.tex
r741e22c r71cf630 60 60 To achieve this goal requires each reader to have its own memory to mark as locked and unlocked. 61 61 The read acquire possibly waits for a writer to finish the critical section and then acquires a reader's local spinlock. 62 The write acquire acquires the global lock, guaranteeing mutual exclusion among writers, and then acquires each of the local reader locks.62 The write acquires the global lock, guaranteeing mutual exclusion among writers, and then acquires each of the local reader locks. 63 63 Acquiring all the local read locks guarantees mutual exclusion among the readers and the writer, while the wait on the read side prevents readers from continuously starving the writer. 64 65 64 Figure~\ref{f:SpecializedReadersWriterLock} shows the outline for this specialized readers-writer lock. 66 65 The lock in nonblocking, so both readers and writers spin while the lock is held. 67 \todo{finish explanation} 66 This very wide sharding strategy means that readers have very good locality, since they only ever need to access two memory location. 68 67 69 68 \begin{figure} … … 113 112 However, this third challenge is outside the scope of this thesis because developing a general heuristic is complex enough to justify its own work. 114 113 Therefore, the \CFA scheduler simply follows the ``Race-to-Idle''~\cite{Albers12} approach where a sleeping \proc is woken any time a \at becomes ready and \procs go to idle sleep anytime they run out of work. 114 115 115 An interesting sub-part of this heuristic is what to do with bursts of \ats that become ready. 116 116 Since waking up a sleeping \proc can have notable latency, it is possible multiple \ats become ready while a single \proc is waking up. … … 137 137 138 138 \subsection{Event FDs} 139 Another interesting approach is to use an event file descriptor\cit {eventfd}.139 Another interesting approach is to use an event file descriptor\cite{eventfd}. 140 140 This Linux feature is a file descriptor that behaves like \io, \ie, uses @read@ and @write@, but also behaves like a semaphore. 141 141 Indeed, all reads and writes must use a word-sized values, \ie 64 or 32 bits. … … 217 217 \end{figure} 218 218 219 The next optimization is to avoid the latency of the event @fd@, which can be done by adding what is effectively a binary benaphore\cit {benaphore} in front of the event @fd@.219 The next optimization is to avoid the latency of the event @fd@, which can be done by adding what is effectively a binary benaphore\cite{schillings1996engineering} in front of the event @fd@. 220 220 The benaphore over the event @fd@ logically provides a three state flag to avoid unnecessary system calls, where the states are expressed explicit in Figure~\ref{fig:idle:state}. 221 221 A \proc begins its idle sleep by adding itself to the idle list before searching for an \at. -
doc/theses/thierry_delisle_PhD/thesis/text/runtime.tex
r741e22c r71cf630 1 \chapter{\CFA Runtime} 1 \chapter{\CFA Runtime}\label{cfaruntime} 2 2 This chapter presents an overview of the capabilities of the \CFA runtime prior to this thesis work. 3 3 … … 62 62 Only UNIX @man@ pages identify whether or not a library function is thread safe, and hence, may block on a pthreads lock or system call; hence interoperability with UNIX library functions is a challenge for an M:N threading model. 63 63 64 Languages like Go and Java, which have strict interoperability with C\cit {JNI, GoLang with C}, can control operations in C by ``sandboxing'' them, \eg a blocking function may be delegated to a \gls{kthrd}. Sandboxing may help towards guaranteeing that the kind of deadlock mentioned above does not occur.64 Languages like Go and Java, which have strict interoperability with C\cite{wiki:jni,go:cgo}, can control operations in C by ``sandboxing'' them, \eg a blocking function may be delegated to a \gls{kthrd}. Sandboxing may help towards guaranteeing that the kind of deadlock mentioned above does not occur. 65 65 66 66 As mentioned in Section~\ref{intro}, \CFA is binary compatible with C and, as such, must support all C library functions. Furthermore, interoperability can happen at the function-call level, inline code, or C and \CFA translation units linked together. This fine-grained interoperability between C and \CFA has two consequences:
Note:
See TracChangeset
for help on using the changeset viewer.