Changeset ff370d8 for doc/theses/thierry_delisle_PhD/thesis
- Timestamp:
- Aug 18, 2022, 9:43:37 AM (2 years ago)
- Branches:
- ADT, ast-experimental, master, pthread-emulation
- Children:
- e9e3d02
- Parents:
- 36cc24a
- File:
-
- 1 edited
Legend:
- Unmodified
- Added
- Removed
-
doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex
r36cc24a rff370d8 2 2 3 3 The first step in evaluating this work is to test small controlled cases to ensure the basics work properly. 4 This chapter presents five different experimental setups ,evaluating the basic features of the \CFA, libfibre~\cite{libfibre}, Go, and Tokio~\cite{Tokio} schedulers.4 This chapter presents five different experimental setups for evaluating the basic features of the \CFA, libfibre~\cite{libfibre}, Go, and Tokio~\cite{Tokio} schedulers. 5 5 All of these systems have a \gls{uthrding} model. 6 Note, all tests in each system are functionally identical and available online~\cite{SchedulingBenchmarks}. 6 The goal in this chapter is show the \CFA scheduler obtains equivalent performance to other less fair schedulers through the different experiments. 7 Note, only the code of the \CFA tests is shown; 8 all tests in the other systems are functionally identical and available online~\cite{SchedulingBenchmarks}. 7 9 8 10 \section{Benchmark Environment}\label{microenv} 11 9 12 All benchmarks are run on two distinct hardware platforms. 10 13 \begin{description} … … 25 28 If more \glspl{hthrd} are needed, then 1 NUMA node with hyperthreading is used. 26 29 If still more \glspl{hthrd} are needed, then the experiment is limited to as few NUMA nodes as needed. 30 For the Intel machine, this means that from 1 to 24 \procs one socket and \emph{no} hyperthreading is used, and from 25 to 48 \procs still only one socket is used but \emph{with} hyperthreading. 31 This pattern is repeated between 49 and 96, between 97 and 144, and between 145 and 192. 32 On AMD, the same algorithm is used, but the machine only has 2 sockets. 33 So hyperthreading\footnote{ 34 Hyperthreading normally refers specifically to the technique used by Intel, however it is often used generically to refer to any equivalent feature.} 35 is used when the \proc count reach 65 and 193. 27 36 28 37 The limited sharing of the last-level cache on the AMD machine is markedly different than the Intel machine. 29 38 Indeed, while on both architectures L2 cache misses that are served by L3 caches on a different CPU incur a significant latency, on the AMD it is also the case that cache misses served by a different L3 instance on the same CPU also incur high latency. 30 39 40 \section{Experimental setup} 41 42 Each experiment is run 15 times varying the number of processors depending on the two different computers. 43 All experiments gather throughput data and secondary data for scalability or latency. 44 The data is graphed using a solid and two dashed lines representing the median, maximum and minimum result respectively, where the minimum/maximum lines are referred to as the \emph{extremes}.\footnote{ 45 An alternative display is to use error bars with min/max as the bottom/top for the bar. 46 However, this approach is not truly an error bar around a mean value and I felt the connected lines are easier to read.} 47 This graph presentation offers an overview of the distribution of the results for each experiment. 48 49 For each experiment, four graphs are generated showing traditional throughput on the top row and \newterm{scalability} or \newterm{latency} on the bottom row (peek ahead to Figure~\ref{fig:cycle:jax}). 50 Scalability uses the same data as throughput but the Y axis is calculated as the number of \procs over the throughput. 51 In this representation, perfect scalability should appear as a horizontal line, \eg, if doubling the number of \procs doubles the throughput, then the relation stays the same. 52 53 The left column shows results for 100 cycles per \proc, enough cycles to always keep every \proc busy. 54 The right column shows results for 1 cycle per \proc, where the ready queues are expected to be near empty most of the time. 55 The distinction between 100 and 1 cycles is meaningful because the idle sleep subsystem is expected to matter only in the right column, where spurious effects can cause a \proc to run out of work temporarily. 31 56 32 57 \section{Cycling latency} … … 36 61 However, yielding can be treated as a special case by optimizing it away since the number of ready \ats does not change. 37 62 Hence, systems that perform this optimization have an artificial performance benefit because the yield becomes a \emph{nop}. 38 For this reason, I chose a different firstbenchmark, called \newterm{Cycle Benchmark}.63 For this reason, I designed a different push/pop benchmark, called \newterm{Cycle Benchmark}. 39 64 This benchmark arranges a number of \ats into a ring, as seen in Figure~\ref{fig:cycle}, where the ring is a circular singly-linked list. 40 65 At runtime, each \at unparks the next \at before parking itself. … … 58 83 Finally, to further mitigate any underlying push/pop optimizations, especially on SMP machines, multiple rings are created in the experiment. 59 84 60 Figure~\ref{fig:cycle:code} shows the pseudo code for this benchmark .85 Figure~\ref{fig:cycle:code} shows the pseudo code for this benchmark, where each cycle has 5 \ats. 61 86 There is additional complexity to handle termination (not shown), which requires a binary semaphore or a channel instead of raw @park@/@unpark@ and carefully picking the order of the @P@ and @V@ with respect to the loop condition. 62 87 … … 110 135 \end{figure} 111 136 112 \subsection{Results}113 114 Figures~\ref{fig:cycle:jax} and~\ref{fig:cycle:nasus} show the throughput as a function of \proc count on Intel and AMD respectively, where each cycle has 5 \ats.115 The graphs show traditional throughput on the top row and \newterm{scalability} on the bottom row.116 Scalability uses the same data as throughput but the Y axis is calculated as the number of \procs over the throughput.117 In this representation, perfect scalability should appear as a horizontal line, \eg, if doubling the number of \procs doubles the throughput, then the relation stays the same.118 The left column shows results for 100 cycles per \proc, enough cycles to always keep every \proc busy.119 The right column shows results for 1 cycle per \proc, where the ready queues are expected to be near empty most of the time.120 The distinction between 100 and 1 cycles is meaningful because the idle sleep subsystem is expected to matter only in the right column, where spurious effects can cause a \proc to run out of work temporarily.121 122 137 \begin{figure} 123 138 \subfloat[][Throughput, 100 cycles per \proc]{ … … 150 165 \end{figure} 151 166 152 The experiment ran 15 times for each series and processor count. 153 Each series has a solid and two dashed lines representing the median, maximum and minimum result respectively, where the minimum/maximum lines are referred to as the \emph{extremes}.\footnote{ 154 An alternative display is to use error bars with min/max as the bottom/top for the bar. 155 However, this approach is not truly an error bar around a mean value and I felt the connected lines are easier to read.} 156 This graph presentation offers an overview of the distribution of the results for each series. 157 158 The experimental setup uses taskset to limit the placement of \glspl{kthrd} by the operating system. 159 As mentioned in Section~\ref{microenv}, the experiment is setup to prioritize running on two \glspl{hthrd} per core before running on multiple sockets. 160 For the Intel machine, this means that from 1 to 24 \procs one socket and \emph{no} hyperthreading is used, and from 25 to 48 \procs still only one socket is used but \emph{with} hyperthreading. 161 This pattern is repeated between 49 and 96, between 97 and 144, and between 145 and 192. 162 On AMD, the same algorithm is used, but the machine only has 2 sockets. 163 So hyperthreading\footnote{ 164 Hyperthreading normally refers specifically to the technique used by Intel, however it is often used generically to refer to any equivalent feature.} 165 is used when the \proc count reach 65 and 193. 166 167 The performance goal of \CFA is to obtain equivalent performance to other less fair schedulers. 168 Figure~\ref{fig:cycle:jax:ops} and Figure~\ref{fig:cycle:jax:ns} show that for 100 cycles per \proc, \CFA, Go and Tokio all obtain effectively the same performance. 167 \subsection{Results} 168 169 For the Intel architecture, Figure~\ref{fig:cycle:jax}: 170 \begin{itemize} 171 \item 172 For 100 cycles per \proc (first column), \CFA, Go and Tokio all obtain effectively the same throughput performance. 169 173 Libfibre is slightly behind in this case but still scales decently. 170 As a result of the \gls{kthrd} placement, additional \procs from 25 to 48 offer less performance improvements for all runtimes. 171 As expected, this pattern repeats between \proc count 72 and 96. 172 Hence, Figures~\ref{fig:cycle:jax:ops} and \ref{fig:cycle:jax:ns} show very good throughput and scalability for all runtimes. 173 174 When running only a single cycle, the story is slightly different. 175 \CFA and Tokio obtain very similar results overall, but Tokio shows notably more variations in the results. 176 While \CFA, Go and Tokio achieve equivalent performance with 100 cycles per \proc, with only 1 cycle per \proc, Go achieves slightly better performance. 177 This difference in throughput and scalability is due to the idle-sleep mechanism. 178 With very few cycles, stealing or helping can cause a cascade of tasks migration and trick a \proc into very short idle sleeps, which negatively affect performance. 179 180 An interesting and unusual result is that libfibre achieves better performance with 1 cycle. 181 This suggest that the cascade effect is never present in libfibre and that some bottleneck disappears in this context. 182 However, I did not investigate this result any deeper. 183 184 Figure~\ref{fig:cycle:nasus} show a similar story happening on AMD as it does on Intel. 174 As a result of the \gls{kthrd} placement, additional \procs from 25 to 48 offer less performance improvement (flatting of the line) for all runtimes. 175 As expected, this pattern repeats again between \proc count 72 and 96. 176 \item 177 For 1 cycle per \proc, \CFA and Tokio obtain very similar results overall, but Tokio shows more variations in the results. 178 Go achieves slightly better performance. 179 Interestingly, libfibre achieves better performance with 1 cycle. 180 \end{itemize} 181 182 For the AMD architecture, Figure~\ref{fig:cycle:nasus}, the results show the same story as on the Intel, with close to double the performance overall but with slightly increased variation. 185 183 The different performance improvements and plateaus are due to cache topology and appear at the expected \proc counts of 64, 128 and 192, for the same reasons as on Intel. 186 Unlike Intel, on AMD all 4 runtimes achieve very similar throughput and scalability for 100 cycles per \proc, with some variations in the results. 187 188 In the 1 cycle per \proc experiment, the same performance increase for libfibre is visible. 189 However, unlike on Intel, Tokio achieves the same performance as Go rather than \CFA. 190 This leaves \CFA trailing behind in this particular case, but only at hight core counts. 184 \begin{itemize} 185 \item 186 For 100 cycles per \proc, unlike Intel, all 4 runtimes achieve very similar throughput and scalability. 187 \item 188 For 1 cycle per \proc, unlike on Intel, Tokio and Go have the same throughput performance, while \CFA is slightly slower. 189 Again, the same performance increase for libfibre is visible. 190 \end{itemize} 191 Note, I did not investigate the libfibre performance boost for 1 cycle in this experiment. 192 193 The conclusion from both architectures is that all of the compared runtime have fairly equivalent performance for this micro-benchmark. 194 Clearly, the pathological case with 1 \at per \proc, can affect fairness algorithms managing mostly idle processors, \eg \CFA, but only at high core counts. 191 195 For this case, \emph{any} helping is likely to cause a cascade of \procs running out of work and attempting to steal. 192 Essentially, \CFA's fairness must result in slower performance in some workload. 193 Fortunately, this effect is only problematic in pathological cases, \eg with 1 \at per \proc, which seldom occurs in most workloads. 194 195 The conclusion from both architectures is that all of the compared runtime have fairly equivalent performance for this micro-benchmark. 196 This result shows that the \CFA scheduler has achieved the goal of obtaining equivalent performance to other less fair schedulers. 196 For this experiment, the \CFA scheduler has achieved the goal of obtaining equivalent performance to other less fair schedulers, except for very unusual workloads. 197 197 198 198 \section{Yield} 199 199 200 For completion, the classic yield benchmark is included. 200 201 Here, the throughput is dominated by the mechanism used to handle the @yield@ function. … … 305 306 \end{figure} 306 307 307 For the AMD , Figure~\ref{fig:yield:nasus}, the results show the same story as on the Intel, with slightly increased jitter.308 For the AMD architecture, Figure~\ref{fig:yield:nasus}, the results show the same story as on the Intel, with slightly increased variations. 308 309 Also, some transition points on the X-axis differ because of the architectures, like at 16 versus 24 processors. 309 310
Note: See TracChangeset
for help on using the changeset viewer.