\chapter{Micro-Benchmarks}\label{microbench}

The first step of evaluation is always to test small controlled cases to ensure the basics are working properly.
This section presents five different experimental setups, evaluating some of the basic features of \CFA's scheduler.
---|

\section{Cycling latency}
The most basic evaluation of any ready queue is to measure the latency needed to push and pop one element.
Since these two operations also describe a \texttt{yield} operation, many systems use this as the most basic benchmark.
However, yielding can be treated as a special case, since it also carries the information that the number of ready \glspl{at} will not change.
Not all systems use this information, but those that do may appear to have better performance than they would for disconnected push/pop pairs.
For this reason, I chose a different first benchmark, which I call the Cycle benchmark.
This benchmark arranges many \glspl{at} into multiple rings of \glspl{at}.
Each ring is effectively a circular singly-linked list.
At runtime, each \gls{at} unparks the next \gls{at} before parking itself.
This corresponds to the desired pair of ready-queue operations.
Unparking the next \gls{at} requires pushing that \gls{at} onto the ready queue, and the ensuing park causes the runtime to pop a \gls{at} from the ready queue.
Figure~\ref{fig:cycle} shows a visual representation of this arrangement.
---|

The goal of this ring is that the underlying runtime cannot rely on the guarantee that the number of ready \glspl{at} stays constant over the duration of the experiment.
In fact, the total number of \glspl{at} waiting on the ready queue is expected to vary because of the race between the next \gls{at} unparking and the current \gls{at} parking.
The size of the cycle is also decided based on this race: cycles that are too small may see the chain of unparks go full circle before the first \gls{at} can park.
While this would not be a correctness problem (every runtime system must handle that race), it could lead to pushes and pops being optimized away.
Since silently omitting ready-queue operations would throw off the measurement of these operations, the ring of \glspl{at} must be big enough that the \glspl{at} have time to fully park before they are unparked.
Note that this problem is only present on SMP machines and is significantly mitigated by the fact that there are multiple rings in the system.
---|

\begin{figure}
	\centering
	\input{cycle.pstex_t}
	\caption[Cycle benchmark]{Cycle benchmark\smallskip\newline Each \gls{at} unparks the next \gls{at} in the cycle before parking itself.}
	\label{fig:cycle}
\end{figure}
---|

\todo{check term ``idle sleep handling''}
To prevent this benchmark from being dominated by idle sleep handling, the number of rings is kept at least as high as the number of \glspl{proc} available.
Beyond this point, adding more rings further mitigates the impact of idle sleep handling.
This avoids the case where one of the worker \glspl{proc} runs out of work because of the variation in the number of ready \glspl{at} mentioned above.

The actual benchmark is more complicated in order to handle termination, but that simply requires using a binary semaphore or a channel instead of raw \texttt{park}/\texttt{unpark} and carefully picking the order of the \texttt{P} and \texttt{V} operations with respect to the loop condition.

\todo{code, setup, results}
---|
\begin{lstlisting}
Thread.main() {
	count := 0
	for {
		wait()
		this.next.wake()
		count ++
		if must_stop() { break }
	}
	global.count += count
}
\end{lstlisting}
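
To make the pseudocode concrete, the following is a minimal runnable sketch in Go, chosen only because goroutines and buffered channels map naturally onto \glspl{at} and binary semaphores; the function \texttt{runCycles}, the ring counts, and the lap bound are illustrative choices for this sketch, not part of the \CFA runtime:

```go
package main

import (
	"fmt"
	"sync"
)

// runCycles sketches the Cycle benchmark: `rings` independent rings of
// `ringSize` threads each. As in the pseudocode, every thread parks
// (receives), then unparks its successor (sends), and counts one handoff.
// A buffered channel of capacity one stands in for each thread's binary
// semaphore; `laps` bounds the handoffs so the program terminates.
func runCycles(rings, ringSize, laps int) int {
	total := 0
	var mu sync.Mutex
	var wg sync.WaitGroup

	for r := 0; r < rings; r++ {
		// One capacity-1 channel per thread acts as its semaphore.
		sems := make([]chan struct{}, ringSize)
		for i := range sems {
			sems[i] = make(chan struct{}, 1)
		}
		for i := 0; i < ringSize; i++ {
			wg.Add(1)
			go func(i int) {
				defer wg.Done()
				count := 0
				for lap := 0; lap < laps; lap++ {
					<-sems[i]                          // park until unparked
					sems[(i+1)%ringSize] <- struct{}{} // unpark the next thread
					count++
				}
				mu.Lock()
				total += count
				mu.Unlock()
			}(i)
		}
		sems[0] <- struct{}{} // inject one token to start the ring
	}
	wg.Wait()
	return total
}

func main() {
	fmt.Println(runCycles(4, 8, 1000)) // 4 rings * 8 threads * 1000 laps
}
```

The capacity-one buffer matters for termination: the last thread's final unpark targets the ring's first thread, which has already exited, and the buffered send lets that unpark complete without blocking, mirroring the \texttt{P}/\texttt{V} ordering concern discussed above.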
---|


\section{Yield}
For completeness, I also include the yield benchmark.
This benchmark is much simpler than the cycle tests: it simply creates many \glspl{at} that call \texttt{yield}.
As mentioned in the previous section, this benchmark may be less representative of usages that make only limited use of \texttt{yield}, due to potential shortcuts in the routine.
Its only interesting variable is the ratio of \glspl{at} to \glspl{proc}, where ratios close to 1 mean the ready queue(s) could be empty.
This sometimes puts more strain on the idle sleep handling, compared to scenarios where there is clearly plenty of work to be done.

\todo{code, setup, results}
---|

\begin{lstlisting}
Thread.main() {
	count := 0
	while !stop {
		yield()
		count ++
	}
	global.count += count
}
\end{lstlisting}
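
The corresponding Go sketch uses \texttt{runtime.Gosched} in the role of \texttt{yield}; a fixed iteration count stands in for the stop flag so the result is deterministic, and the names and parameters are again illustrative rather than part of any benchmark suite:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// runYield sketches the Yield benchmark: `threads` goroutines repeatedly
// yield and count their iterations, then add their local count to a
// shared total, exactly as in the pseudocode above.
func runYield(threads, yields int) int64 {
	var total int64
	var wg sync.WaitGroup
	for t := 0; t < threads; t++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			count := int64(0)
			for i := 0; i < yields; i++ {
				runtime.Gosched() // yield: a push/pop pair on the ready queue
				count++
			}
			atomic.AddInt64(&total, count)
		}()
	}
	wg.Wait()
	return total
}

func main() {
	fmt.Println(runYield(8, 10000)) // 8 threads * 10000 yields
}
```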
---|


\section{Churn}
The Cycle and Yield benchmarks represent an ``easy'' scenario for a scheduler, \eg, an embarrassingly parallel application.
In these benchmarks, \glspl{at} can be easily partitioned over the different \glspl{proc} up-front and none of the \glspl{at} communicate with each other.

The Churn benchmark represents more chaotic usages, where there is no relation between the last \gls{proc} on which a \gls{at} ran and the \gls{proc} that unblocked it.
When a \gls{at} is unblocked from a different \gls{proc} than the one on which it last ran, the unblocking \gls{proc} must either ``steal'' the \gls{at} or place it on a remote queue.
This can result in either contention on the remote queue or \glspl{rmr} on the \gls{at} data structure.
In either case, this benchmark aims to highlight how each scheduler handles these situations, since both can lead to performance degradation if not handled correctly.

To achieve this, the benchmark uses a fixed-size array of \newterm{chair}s, where a chair is a data structure that holds a single blocked \gls{at}.
When a \gls{at} attempts to block on a chair, it must first unblock the \gls{at} currently blocked on said chair, if any.
This creates a flow where \glspl{at} push each other out of the chairs before being pushed out themselves.
For this benchmark to work, however, the number of \glspl{at} must be equal to or greater than the number of chairs plus the number of \glspl{proc}.

\todo{code, setup, results}
---|
\begin{lstlisting}
Thread.main() {
	count := 0
	for {
		r := random() % len(spots)
		next := xchg(spots[r], this)
		if next { next.wake() }
		wait()
		count ++
		if must_stop() { break }
	}
	global.count += count
}
\end{lstlisting}
---|

\section{Locality}

\todo{code, setup, results}
---|

\section{Transfer}
The last benchmark is more accurately characterized as an experiment than a benchmark.
It tests the behavior of the schedulers for a particularly misbehaved workload.
In this workload, one of the \glspl{at} is selected at random to be the leader.
The leader then spins in a tight loop until it has observed that all other \glspl{at} have acknowledged its leadership.
The leader \gls{at} then picks a new \gls{at} to be the next leader and the cycle repeats.

The benchmark comes in two flavours, depending on the behavior of the non-leader \glspl{at}:
once they have acknowledged the leader, they either block on a semaphore or yield repeatedly.

This experiment is designed to evaluate the short-term load balancing of the scheduler.
Indeed, schedulers where the runnable \glspl{at} are partitioned over the \glspl{proc} may need to rebalance the \glspl{at} for this experiment to terminate.
This is because the spinning \gls{at} effectively prevents its \gls{proc} from running any other \gls{at}.
In the semaphore flavour, the number of runnable \glspl{at} eventually dwindles down to only the leader.
This is a simpler case for schedulers to handle, since \glspl{proc} eventually run out of work.
In the yielding flavour, the number of runnable \glspl{at} stays constant.
This is a harder case to handle because corrective measures must be taken even though work is still available.
Note that languages with mandatory preemption circumvent this problem by forcing the spinner to yield.

\todo{code, setup, results}
---|
\begin{lstlisting}
Thread.lead() {
	this.idx_seen = ++lead_idx
	if lead_idx > stop_idx {
		done := true
		return
	}

	// wait for everyone to acknowledge my leadership
	start := timeNow()
	for t in threads {
		while t.idx_seen != lead_idx {
			asm pause
			if (timeNow() - start) > 5 seconds { error() }
		}
	}

	// pick next leader
	leader := threads[ prng() % len(threads) ]

	// wake everyone
	if !exhaust {
		for t in threads {
			if t != me { t.wake() }
		}
	}
}

Thread.wait() {
	this.idx_seen := lead_idx
	if exhaust { wait() }
	else { yield() }
}

Thread.main() {
	while !done {
		if leader == me { this.lead() }
		else { this.wait() }
	}
}
\end{lstlisting}
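
A simplified runnable Go sketch of the yielding flavour follows. It deviates from the pseudocode in two labeled ways: leadership rotates round-robin instead of being picked at random, and the leader re-polls acknowledgments from its outer loop (with a compare-and-swap guarding the round advance) rather than blocking in an inner wait. Go's preemptive scheduler is what lets the polling leader coexist with the other threads, which is precisely the property the experiment probes:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// runTransfer sketches the Transfer experiment (yielding flavour): the
// current leader polls until every thread has acknowledged the current
// round, then hands leadership to the next thread; non-leaders
// acknowledge the round and yield.
func runTransfer(threads, rounds int) int {
	seen := make([]atomic.Int64, threads) // last round acknowledged, per thread
	var leader atomic.Int64               // id of the current leader
	var leadIdx atomic.Int64              // current round number
	leadIdx.Store(1)

	var wg sync.WaitGroup
	for id := 0; id < threads; id++ {
		wg.Add(1)
		go func(id int64) {
			defer wg.Done()
			for {
				idx := leadIdx.Load()
				if idx > int64(rounds) {
					return // all rounds completed
				}
				seen[id].Store(idx) // acknowledge the current round
				if leader.Load() == id {
					// Leader: poll whether everyone has acknowledged.
					acked := true
					for i := range seen {
						if seen[i].Load() < idx {
							acked = false
							break
						}
					}
					if acked {
						// Hand off leadership, then advance the round; the
						// CAS prevents a stale leader from double-advancing.
						leader.Store((id + 1) % int64(threads))
						leadIdx.CompareAndSwap(idx, idx+1)
					}
				} else {
					runtime.Gosched() // yielding flavour: non-leaders yield
				}
			}
		}(int64(id))
	}
	wg.Wait()
	return int(leadIdx.Load())
}

func main() {
	fmt.Println(runTransfer(4, 20)) // returns rounds+1 on completion
}
```

In a cooperative runtime without such preemption, the spinning leader could monopolize its \gls{proc} indefinitely, which is exactly the pathological case the semaphore and yielding flavours are designed to expose.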
---|