\chapter{Micro-Benchmarks}\label{microbench}

The first step in evaluating this work is to test out small, controlled cases to ensure the basics work properly.
This chapter presents five experimental setups, evaluating some of the basic features of \CFA's scheduler.

\section{Benchmark Environment}
All benchmarks are run on two distinct hardware platforms.
\begin{description}
\item[AMD] is a server with two AMD EPYC 7662 CPUs and 256GB of DDR4 RAM.
The EPYC CPU has 64 cores with 2 \glspl{hthrd} per core, for 128 \glspl{hthrd} per socket with 2 sockets for a total of 256 \glspl{hthrd}.
Each CPU has 4 MB, 64 MB and 512 MB of L1, L2 and L3 caches, respectively.
Each L1 and L2 instance is shared only by the \glspl{hthrd} on a given core, but each L3 instance is shared by 4 cores, therefore 8 \glspl{hthrd}.
The server runs Ubuntu 20.04.2 LTS on top of Linux Kernel 5.8.0-55.

\item[Intel] is a server with four Intel Xeon Platinum 8160 CPUs and 384GB of DDR4 RAM.
The Xeon CPU has 24 cores with 2 \glspl{hthrd} per core, for 48 \glspl{hthrd} per socket with 4 sockets for a total of 192 \glspl{hthrd}.
Each CPU has 3 MB, 96 MB and 132 MB of L1, L2 and L3 caches, respectively.
Each L1 and L2 instance is shared only by the \glspl{hthrd} on a given core, but each L3 instance is shared across the entire CPU, therefore 48 \glspl{hthrd}.
The server runs Ubuntu 20.04.2 LTS on top of Linux Kernel 5.8.0-55.
\end{description}

For all benchmarks, @taskset@ is used to limit the experiment to 1 NUMA node with no hyperthreading.
If more \glspl{hthrd} are needed, then 1 NUMA node with hyperthreading is used.
If still more \glspl{hthrd} are needed, then the experiment is limited to as few NUMA nodes as needed.
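For example, a hypothetical invocation pinning an experiment to the first 64 \glspl{hthrd} is @taskset -c 0-63 ./benchmark@; the benchmark name is illustrative and the correct CPU list varies by machine, so it must be taken from the topology reported by @lscpu@.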

The limited sharing of the last-level cache on the AMD machine is markedly different from the Intel machine.
Indeed, on both architectures an L2 cache miss served by the L3 cache of a different CPU incurs significant latency; on the AMD machine, however, a cache miss served by a different L3 instance on the same CPU also incurs high latency.


\section{Cycle}
\begin{figure}
	\centering
	\input{cycle.pstex_t}
	\caption[Cycle benchmark]{Cycle benchmark\smallskip\newline Each \gls{at} unparks the next \gls{at} in the cycle before parking itself.}
	\label{fig:cycle}
\end{figure}
The most basic evaluation of any ready queue is to measure the latency needed to push and pop one element from the ready queue.
Since these two operations also describe a @yield@ operation, many systems use yielding as their most basic benchmark.
However, yielding can be treated as a special case and optimized away (dead code), since the number of ready \glspl{at} does not change.
Not all systems perform this optimization, but those that do gain an artificial performance benefit because the yield becomes a \emph{nop}.
For this reason, I chose a different first benchmark, called the \newterm{Cycle Benchmark}.
This benchmark arranges a number of \glspl{at} into a ring, as seen in Figure~\ref{fig:cycle}, where the ring is a circular singly-linked list.
At runtime, each \gls{at} unparks the next \gls{at} before parking itself.
Unparking the next \gls{at} pushes that \gls{at} onto the ready queue, while the ensuing park forces a pop, as the \gls{proc} must select its next \gls{at}.

Hence, the underlying runtime cannot rely on the number of ready \glspl{at} staying constant over the duration of the experiment.
In fact, the total number of \glspl{at} waiting on the ready queue is expected to vary because of the race between the next \gls{at} unparking and the current \gls{at} parking.
That is, the runtime cannot anticipate that the current \gls{at} will immediately park.
As well, the size of the cycle is also decided based on this race, \eg a small cycle may see the chain of unparks go full circle before the first \gls{at} parks, because of time-slicing or multiple \procs.
Every runtime system must handle this race and cannot optimize away the ready-queue pushes and pops.
To prevent any attempt at silently omitting ready-queue operations, the ring of \glspl{at} is made large enough that the \glspl{at} have time to fully park before being unparked again.
(Note, an unpark is like a V on a semaphore, so the subsequent park (P) may not block.)
Finally, to further mitigate any underlying push/pop optimizations, especially on SMP machines, multiple rings are created in the experiment.

To avoid this benchmark being affected by idle-sleep handling, the number of rings is several times greater than the number of \glspl{proc}.
This design avoids the case where one of the \glspl{proc} runs out of work because of the variation in the number of ready \glspl{at} mentioned above.

Figure~\ref{fig:cycle:code} shows the pseudo code for this benchmark.
There is additional complexity to handle termination (not shown), which requires a binary semaphore or a channel instead of raw @park@/@unpark@, and carefully picking the order of the @P@ and @V@ with respect to the loop condition; a runnable sketch of this ordering is given below.

\begin{figure}
\begin{cfa}
Thread.main() {
	count := 0
	for {
		@wait()@
		@this.next.wake()@
		count ++
		if must_stop() { break }
	}
	global.count += count
}
\end{cfa}
\caption[Cycle Benchmark : Pseudo Code]{Cycle Benchmark : Pseudo Code}
\label{fig:cycle:code}
\end{figure}

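To make the termination ordering concrete, the following is a minimal runnable analogue of a single ring, written in Go; it is an illustrative sketch, not the \CFA implementation. A buffered channel of capacity 1 per \gls{at} plays the role of the binary semaphore mentioned above: a send is the V (unpark) and a receive is the P (park). The V for the successor is deliberately performed before the stop test, so the token keeps circulating during shutdown and every \gls{at} observes termination.
\begin{cfa}
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

const (
	ringSize = 5         // threads per ring (the real benchmark uses many rings)
	totalOps = 1_000_000 // global stop condition
)

func main() {
	// One capacity-1 channel per thread acts as a binary semaphore:
	// a send is V (unpark), a receive is P (park).
	sems := make([]chan struct{}, ringSize)
	for i := range sems {
		sems[i] = make(chan struct{}, 1)
	}
	var ops, total int64
	var wg sync.WaitGroup
	for i := 0; i < ringSize; i++ {
		wg.Add(1)
		go func(me int) {
			defer wg.Done()
			next := sems[(me+1)%ringSize]
			count := int64(0)
			for {
				<-sems[me] // park (P)
				count++
				stop := atomic.AddInt64(&ops, 1) >= totalOps
				// Unpark the successor (V) *before* exiting, so the token
				// keeps circling and every thread gets to observe the stop.
				next <- struct{}{}
				if stop {
					break
				}
			}
			atomic.AddInt64(&total, count)
		}(i)
	}
	sems[0] <- struct{}{} // inject the single token that drives the ring
	wg.Wait()
	fmt.Println("total operations:", total)
}
\end{cfa}
The single token injected by @main@ drives the ring, so at most one \gls{at} per ring is ready at any time, matching the push/pop pair described above.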
\subsection{Results}
Figure~\ref{fig:cycle:jax} shows the throughput as a function of \proc count, where each run uses 100 cycles per \proc and 5 \ats per cycle.

\begin{figure}
	\subfloat[][Throughput, 100 \ats per \proc]{
		\resizebox{0.5\linewidth}{!}{
			\input{result.cycle.jax.ops.pstex_t}
		}
		\label{fig:cycle:jax:ops}
	}
	\subfloat[][Throughput, 1 \at per \proc]{
		\resizebox{0.5\linewidth}{!}{
			\input{result.cycle.low.jax.ops.pstex_t}
		}
		\label{fig:cycle:jax:low:ops}
	}

	\subfloat[][Latency, 100 \ats per \proc]{
		\resizebox{0.5\linewidth}{!}{
			\input{result.cycle.jax.ns.pstex_t}
		}
		\label{fig:cycle:jax:ns}
	}
	\subfloat[][Latency, 1 \at per \proc]{
		\resizebox{0.5\linewidth}{!}{
			\input{result.cycle.low.jax.ns.pstex_t}
		}
		\label{fig:cycle:jax:low:ns}
	}
	\caption[Cycle Benchmark on Intel]{Cycle Benchmark on Intel\smallskip\newline Throughput and latency as a function of \proc count, with 100 cycles per \proc and 5 \ats per cycle.}
	\label{fig:cycle:jax}
\end{figure}

\todo{results discussion}

\section{Yield}
For completeness, the classic yield benchmark is included.
This benchmark is simpler than the cycle test: it creates many \glspl{at} that call @yield@.
As mentioned, this benchmark may not be representative because of optimization shortcuts in @yield@.
The only interesting variable in this benchmark is the number of \glspl{at} per \gls{proc}, where ratios close to 1 mean the ready queue(s) can be empty.
This scenario can put a strain on the idle-sleep handling compared to scenarios where there is plenty of work.
Figure~\ref{fig:yield:code} shows pseudo code for this benchmark, where the @wait()@/@next.wake()@ pair is replaced by @yield@.

\begin{figure}
\begin{cfa}
Thread.main() {
	count := 0
	for {
		@yield()@
		count ++
		if must_stop() { break }
	}
	global.count += count
}
\end{cfa}
\caption[Yield Benchmark : Pseudo Code]{Yield Benchmark : Pseudo Code}
\label{fig:yield:code}
\end{figure}

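For reference, the yield loop has a direct runnable analogue in Go, where @runtime.Gosched()@ plays the role of @yield@; the thread and operation counts below are illustrative, not the values used in the experiments.
\begin{cfa}
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

func main() {
	const nThreads = 100       // threads; GOMAXPROCS plays the role of the procs
	const totalOps = 1_000_000 // global stop condition
	var ops, total int64
	var wg sync.WaitGroup
	for i := 0; i < nThreads; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			count := int64(0)
			for atomic.AddInt64(&ops, 1) <= totalOps {
				runtime.Gosched() // cooperative yield back to the scheduler
				count++
			}
			atomic.AddInt64(&total, count)
		}()
	}
	wg.Wait()
	fmt.Println("total yields:", total)
}
\end{cfa}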
\subsection{Results}

Figure~\ref{fig:yield:jax} shows the throughput as a function of \proc count, where each run uses 100 \ats per \proc.

\begin{figure}
	\subfloat[][Throughput, 100 \ats per \proc]{
		\resizebox{0.5\linewidth}{!}{
			\input{result.yield.jax.ops.pstex_t}
		}
		\label{fig:yield:jax:ops}
	}
	\subfloat[][Throughput, 1 \at per \proc]{
		\resizebox{0.5\linewidth}{!}{
			\input{result.yield.low.jax.ops.pstex_t}
		}
		\label{fig:yield:jax:low:ops}
	}

	\subfloat[][Latency, 100 \ats per \proc]{
		\resizebox{0.5\linewidth}{!}{
			\input{result.yield.jax.ns.pstex_t}
		}
		\label{fig:yield:jax:ns}
	}
	\subfloat[][Latency, 1 \at per \proc]{
		\resizebox{0.5\linewidth}{!}{
			\input{result.yield.low.jax.ns.pstex_t}
		}
		\label{fig:yield:jax:low:ns}
	}
	\caption[Yield Benchmark on Intel]{Yield Benchmark on Intel\smallskip\newline Throughput and latency as a function of \proc count, with 100 \ats and 1 \at per \proc.}
	\label{fig:yield:jax}
\end{figure}

\todo{results discussion}

\section{Churn}
The Cycle and Yield benchmarks represent an \emph{easy} scenario for a scheduler, \eg an embarrassingly parallel application.
In these benchmarks, \glspl{at} can be easily partitioned over the different \glspl{proc} upfront and none of the \glspl{at} communicate with each other.

The Churn benchmark represents a more chaotic execution, where there is no relation between the last \gls{proc} on which a \gls{at} ran and blocked and the \gls{proc} that subsequently unblocks it.
With processor-specific ready-queues, when a \gls{at} is unblocked by a different \gls{proc}, the unblocking \gls{proc} must either ``steal'' the \gls{at} from another processor or find it on a global queue.
This dequeuing results in contention on the remote queue and/or \glspl{rmr} on the \gls{at} data structure.
In either case, this benchmark aims to highlight how each scheduler handles these cases, since both can lead to performance degradation if not handled correctly.

This benchmark uses a fixed-size array of counting semaphores.
Each \gls{at} picks a random semaphore, @V@s it to unblock any waiting \at, and then @P@s on the same semaphore.
This creates a flow where \glspl{at} push each other out of the semaphores before being pushed out themselves.
For this benchmark to work, the number of \glspl{at} must be equal to or greater than the number of semaphores plus the number of \glspl{proc}.
Note, the nature of these semaphores means the counter can go beyond 1, which can lead to nonblocking calls to @P@.
Figure~\ref{fig:churn:code} shows pseudo code for this benchmark, where the @yield@ is replaced by a @V@ and a @P@.

\begin{figure}
\begin{cfa}
Thread.main() {
	count := 0
	for {
		r := random() % len(spots)
		@spots[r].V()@
		@spots[r].P()@
		count ++
		if must_stop() { break }
	}
	global.count += count
}
\end{cfa}
\caption[Churn Benchmark : Pseudo Code]{Churn Benchmark : Pseudo Code}
\label{fig:churn:code}
\end{figure}

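As an illustration of this @V@/@P@ pattern, and of the termination wrinkle it creates, the following is a hedged Go analogue; buffered channels stand in for the counting semaphores (a send is a @V@, a receive is a @P@) and all names and sizes are mine. Because a \gls{at} can be left parked on a semaphore that nobody @V@s again after the stop condition triggers, each exiting \gls{at} performs one final @V@ on every spot.
\begin{cfa}
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"sync/atomic"
)

const (
	nThreads = 8
	nSpots   = 4
	totalOps = 1_000_000
)

func main() {
	// Buffered channels model the counting semaphores: a send is V, a
	// receive is P. The capacity is sized so a V can never block, letting
	// the count rise above 1 exactly as described in the text.
	spots := make([]chan struct{}, nSpots)
	for i := range spots {
		spots[i] = make(chan struct{}, 2*nThreads)
	}
	var ops, total int64
	var wg sync.WaitGroup
	for i := 0; i < nThreads; i++ {
		wg.Add(1)
		go func(seed int64) {
			defer wg.Done()
			r := rand.New(rand.NewSource(seed))
			count := int64(0)
			for atomic.AddInt64(&ops, 1) <= totalOps {
				s := spots[r.Intn(nSpots)]
				s <- struct{}{} // V: wake a waiter, or bank a token
				<-s             // P: may not block if a token is banked
				count++
			}
			// Shutdown: V every spot once so no peer stays parked forever.
			for _, s := range spots {
				s <- struct{}{}
			}
			atomic.AddInt64(&total, count)
		}(int64(i))
	}
	wg.Wait()
	fmt.Println("total operations:", total)
}
\end{cfa}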
\subsection{Results}
Figure~\ref{fig:churn:jax} shows the throughput as a function of \proc count, where each run uses 100 \ats per \proc.

\begin{figure}
	\subfloat[][Throughput, 100 \ats per \proc]{
		\resizebox{0.5\linewidth}{!}{
			\input{result.churn.jax.ops.pstex_t}
		}
		\label{fig:churn:jax:ops}
	}
	\subfloat[][Throughput, 1 \at per \proc]{
		\resizebox{0.5\linewidth}{!}{
			\input{result.churn.low.jax.ops.pstex_t}
		}
		\label{fig:churn:jax:low:ops}
	}

	\subfloat[][Latency, 100 \ats per \proc]{
		\resizebox{0.5\linewidth}{!}{
			\input{result.churn.jax.ns.pstex_t}
		}
		\label{fig:churn:jax:ns}
	}
	\subfloat[][Latency, 1 \at per \proc]{
		\resizebox{0.5\linewidth}{!}{
			\input{result.churn.low.jax.ns.pstex_t}
		}
		\label{fig:churn:jax:low:ns}
	}
	\caption[Churn Benchmark on Intel]{Churn Benchmark on Intel\smallskip\newline Throughput and latency of the Churn benchmark on the Intel machine.
	Throughput is the total operations per second across all cores; latency is the duration of each operation.}
	\label{fig:churn:jax}
\end{figure}

\todo{results discussion}

\section{Locality}

\todo{code, setup, results}

\section{Transfer}
The last benchmark is more of an experiment than a benchmark.
It tests the behaviour of the schedulers for a misbehaved workload.
In this workload, one \gls{at} is selected at random to be the leader.
The leader then spins in a tight loop until it has observed that all other \glspl{at} have acknowledged its leadership.
The leader \gls{at} then picks a new \gls{at} to be the next leader and the cycle repeats.
The benchmark comes in two flavours for the non-leader \glspl{at}:
once they have acknowledged the leader, they either block on a semaphore or spin yielding.

The experiment is designed to evaluate the short-term load-balancing of a scheduler.
Indeed, schedulers where the runnable \glspl{at} are partitioned over the \glspl{proc} may need to balance the \glspl{at} for this experiment to terminate.
This problem occurs because the spinning \gls{at} is effectively preventing the \gls{proc} from running any other \gls{at}.
In the semaphore flavour, the number of runnable \glspl{at} eventually dwindles to only the leader.
This scenario is a simpler case for schedulers to handle, since \glspl{proc} eventually run out of work.
In the yielding flavour, the number of runnable \glspl{at} stays constant.
This scenario is a harder case to handle because corrective measures must be taken even when work is available.
Note, runtime systems with preemption circumvent this problem by forcing the spinner to yield.

\todo{code, setup, results}

\begin{figure}
\begin{cfa}
Thread.lead() {
	this.idx_seen = ++lead_idx
	if lead_idx > stop_idx {
		done = true
		return
	}

	// wait for everyone to acknowledge my leadership
	start := timeNow()
	for t in threads {
		while t.idx_seen != lead_idx {
			asm pause
			if (timeNow() - start) > 5 seconds { error() }
		}
	}

	// pick next leader
	leader = threads[ prng() % len(threads) ]

	// wake everyone
	if ! exhaust {
		for t in threads {
			if t != me { t.wake() }
		}
	}
}

Thread.wait() {
	this.idx_seen = lead_idx
	if exhaust { wait() }
	else { yield() }
}

Thread.main() {
	while !done {
		if leader == me { this.lead() }
		else { this.wait() }
	}
}
\end{cfa}
\caption[Transfer Benchmark : Pseudo Code]{Transfer Benchmark : Pseudo Code}
\label{fig:transfer:code}
\end{figure}

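The following is a runnable Go analogue of the yielding flavour (the semaphore flavour is omitted), with atomics standing in for @lead_idx@ and @idx_seen@; all names and constants are illustrative. Because Go's scheduler is preemptive, the experiment always terminates there, which is exactly the circumvention noted above for preemptive runtimes.
\begin{cfa}
package main

import (
	"fmt"
	"math/rand"
	"runtime"
	"sync"
	"sync/atomic"
)

const (
	nThreads = 8    // number of threads (illustrative)
	nRounds  = 1000 // leadership transfers before stopping
)

func main() {
	var (
		leadIdx int64           // lead_idx: current leadership round
		leader  int64           // index of the current leader
		seen    [nThreads]int64 // idx_seen: last round acknowledged per thread
		wg      sync.WaitGroup
	)
	for i := 0; i < nThreads; i++ {
		wg.Add(1)
		go func(me int64) {
			defer wg.Done()
			for {
				round := atomic.LoadInt64(&leadIdx)
				// Acknowledge *before* the stop test, so a finishing leader
				// is never left waiting on a thread that has already exited.
				atomic.StoreInt64(&seen[me], round)
				if round > nRounds {
					return
				}
				if atomic.LoadInt64(&leader) != me {
					runtime.Gosched() // non-leader: spin yielding
					continue
				}
				// Leader: open a new round, spin until everyone acknowledges it.
				next := atomic.AddInt64(&leadIdx, 1)
				atomic.StoreInt64(&seen[me], next)
				for t := range seen {
					for atomic.LoadInt64(&seen[t]) < next {
						runtime.Gosched() // Go preempts, so a tight spin would also terminate
					}
				}
				atomic.StoreInt64(&leader, rand.Int63n(nThreads)) // pick the next leader
			}
		}(int64(i))
	}
	wg.Wait()
	fmt.Println("completed about", nRounds, "leadership transfers")
}
\end{cfa}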
\subsection{Results}
Figure~\ref{fig:transfer:jax} shows the throughput as a function of \proc count.

\todo{results discussion}