\chapter{Micro-Benchmarks}\label{microbench}

The first step of the evaluation is always to test small controlled cases, to ensure that the basics are working properly.
This section presents five different experimental setups, evaluating some of the basic features of \CFA's scheduler.

\section{Cycling latency}
The most basic evaluation of any ready queue is the latency needed to push and pop one element from it.
Since these two operations also describe a \texttt{yield} operation, many systems use this as the most basic benchmark.
However, yielding can be treated as a special case, since it also carries the information that the number of ready \glspl{at} will not change.
Not all systems use this information, but those that do may appear to have better performance than they would for disconnected push/pop pairs.
For this reason, I chose a different first benchmark, which I call the Cycle Benchmark.
This benchmark arranges many \glspl{at} into multiple rings of \glspl{at}.
Each ring is effectively a circular singly-linked list.
At runtime, each \gls{at} unparks the next \gls{at} before parking itself.
This corresponds to the desired pair of ready-queue operations.
Unparking the next \gls{at} requires pushing that \gls{at} onto the ready queue, and the ensuing park causes the runtime to pop a \gls{at} from the ready queue.
Figure~\ref{fig:cycle} shows a visual representation of this arrangement.

The goal of this ring is that the underlying runtime cannot rely on the guarantee that the number of ready \glspl{at} stays constant over the duration of the experiment.
In fact, the total number of \glspl{at} waiting on the ready queue is expected to vary because of the race between the next \gls{at} unparking and the current \gls{at} parking.
The size of the cycle is also decided based on this race: cycles that are too small may see the chain of unparks go full circle before the first \gls{at} can park.
While this would not be a correctness problem (every runtime system must handle this race), it could lead to pushes and pops being optimized away.
Since silently omitting ready-queue operations would throw off the measurement of these operations, the ring of \glspl{at} must be big enough that the \glspl{at} have time to fully park before being unparked again.
Note that this problem is only present on SMP machines and is significantly mitigated by the fact that there are multiple rings in the system.

\begin{figure}
	\centering
	\input{cycle.pstex_t}
	\caption[Cycle benchmark]{Cycle benchmark\smallskip\newline Each \gls{at} unparks the next \gls{at} in the cycle before parking itself.}
	\label{fig:cycle}
\end{figure}

| 32 | |
---|
| 33 | \todo{check term ``idle sleep handling''} |
---|
[729c991] | 34 | To avoid this benchmark from being dominated by the idle sleep handling, the number of rings is kept at least as high as the number of \glspl{proc} available. |
---|
[36a05d7] | 35 | Beyond this point, adding more rings serves to mitigate even more the idle sleep handling. |
---|
[729c991] | 36 | This is to avoid the case where one of the worker \glspl{at} runs out of work because of the variation on the number of ready \glspl{at} mentionned above. |
---|
[36a05d7] | 37 | |
---|
| 38 | The actual benchmark is more complicated to handle termination, but that simply requires using a binary semphore or a channel instead of raw \texttt{park}/\texttt{unpark} and carefully picking the order of the \texttt{P} and \texttt{V} with respect to the loop condition. |
---|
| 39 | |
---|
[729c991] | 40 | \todo{code, setup, results} |
---|
\begin{lstlisting}
Thread.main() {
	count := 0
	for {
		wait()
		this.next.wake()
		count ++
		if must_stop() { break }
	}
	global.count += count
}
\end{lstlisting}


\section{Yield}
For completeness, I also include the yield benchmark.
This benchmark is much simpler than the cycle test: it simply creates many \glspl{at} that call \texttt{yield}.
As mentioned in the previous section, this benchmark may be less representative of usages that only make limited use of \texttt{yield}, due to potential shortcuts in the yield routine.
Its only interesting variable is the number of \glspl{at} per \gls{proc}, where ratios close to 1 mean the ready queue(s) can be empty.
This sometimes puts more strain on the idle sleep handling, compared to scenarios where there is clearly plenty of work to be done.

\todo{code, setup, results}

\begin{lstlisting}
Thread.main() {
	count := 0
	while !stop {
		yield()
		count ++
	}
	global.count += count
}
\end{lstlisting}


\section{Churn}
The Cycle and Yield benchmarks represent an ``easy'' scenario for a scheduler, \eg, an embarrassingly parallel application.
In these benchmarks, \glspl{at} can be easily partitioned over the different \glspl{proc} up-front and none of the \glspl{at} communicate with each other.

The Churn benchmark represents more chaotic usages, where there is no relation between the last \gls{proc} on which a \gls{at} ran and the \gls{proc} that unblocked it.
When a \gls{at} is unblocked from a different \gls{proc} than the one on which it last ran, the unblocking \gls{proc} must either ``steal'' the \gls{at} or place it on a remote queue.
This can result in either contention on the remote queue or \glspl{rmr} on the \gls{at} data structure.
In either case, this benchmark aims to highlight how each scheduler handles these cases, since both can lead to performance degradation if not handled correctly.

To achieve this, the benchmark uses a fixed-size array of \newterm{chair}s, where a chair is a data structure that holds a single blocked \gls{at}.
When a \gls{at} attempts to block on a chair, it must first unblock the \gls{at} currently blocked on that chair, if any.
This creates a flow where \glspl{at} push each other out of the chairs before being pushed out themselves.
For this benchmark to work, however, the number of \glspl{at} must be greater than or equal to the number of chairs plus the number of \glspl{proc}.

\todo{code, setup, results}
\begin{lstlisting}
Thread.main() {
	count := 0
	for {
		r := random() % len(spots)
		next := xchg(spots[r], this)
		if next { next.wake() }
		wait()
		count ++
		if must_stop() { break }
	}
	global.count += count
}
\end{lstlisting}

\section{Locality}

\todo{code, setup, results}

\section{Transfer}
The last benchmark is more accurately characterized as an experiment than a benchmark.
It tests the behavior of the schedulers for a particularly misbehaved workload.
In this workload, one \gls{at} is selected at random to be the leader.
The leader then spins in a tight loop until it has observed that all other \glspl{at} have acknowledged its leadership.
The leader \gls{at} then picks a new \gls{at} to be the next leader (the ``spinner'') and the cycle repeats.

The benchmark comes in two flavours, differing in the behavior of the non-leader \glspl{at}:
once they have acknowledged the leader, they either block on a semaphore or yield repeatedly.

This experiment is designed to evaluate the short-term load balancing of the scheduler.
Indeed, schedulers where the runnable \glspl{at} are partitioned over the \glspl{proc} may need to rebalance the \glspl{at} for this experiment to terminate.
This is because the spinning \gls{at} effectively prevents its \gls{proc} from running any other \gls{at}.
In the semaphore flavour, the number of runnable \glspl{at} eventually dwindles down to only the leader.
This is a simpler case for schedulers to handle, since \glspl{proc} eventually run out of work.
In the yielding flavour, the number of runnable \glspl{at} stays constant.
This is a harder case to handle because corrective measures must be taken even though work is still available.
Note that languages with mandatory preemption circumvent this problem by forcing the spinner to yield.

\todo{code, setup, results}
\begin{lstlisting}
Thread.lead() {
	this.idx_seen = ++lead_idx
	if lead_idx > stop_idx {
		done := true
		return
	}

	// Wait for everyone to acknowledge my leadership
	start := timeNow()
	for t in threads {
		while t.idx_seen != lead_idx {
			asm pause
			if (timeNow() - start) > 5 seconds { error() }
		}
	}

	// pick next leader
	leader := threads[ prng() % len(threads) ]

	// wake everyone
	if !exhaust {
		for t in threads {
			if t != me { t.wake() }
		}
	}
}

Thread.wait() {
	this.idx_seen := lead_idx
	if exhaust { wait() }
	else { yield() }
}

Thread.main() {
	while !done {
		if leader == me { this.lead() }
		else { this.wait() }
	}
}
\end{lstlisting}