\chapter{Micro-Benchmarks}\label{microbench}

The first step of evaluation is always to test small controlled cases, to ensure that the basics are working properly.
This section presents five different experimental setups, evaluating some of the basic features of \CFA's scheduler.

\section{Cycling latency}
The most basic evaluation of any ready queue is to measure the latency needed to push and pop one element from the ready queue.
Since these two operations also describe a \texttt{yield} operation, many systems use this as the most basic benchmark.
However, yielding can be treated as a special case, since it also carries the information that the number of ready \glspl{at} will not change.
Not all systems use this information, but those that do may appear to have better performance than they would for disconnected push/pop pairs.
For this reason, I chose a different first benchmark, which I call the Cycle benchmark.
This benchmark arranges many \glspl{at} into multiple rings of \glspl{at}.
Each ring is effectively a circular singly-linked list.
At runtime, each \gls{at} unparks the next \gls{at} before parking itself.
This corresponds to the desired pair of ready-queue operations.
Unparking the next \gls{at} requires pushing that \gls{at} onto the ready queue, and the ensuing park causes the runtime to pop a \gls{at} from the ready queue.
Figure~\ref{fig:cycle} shows a visual representation of this arrangement.

The goal of this ring is that the underlying runtime cannot rely on the guarantee that the number of ready \glspl{at} stays constant over the duration of the experiment.
In fact, the total number of \glspl{at} waiting on the ready queue is expected to vary because of the race between the next \gls{at} unparking and the current \gls{at} parking.
The size of the cycle is also decided based on this race: cycles that are too small may see the chain of unparks go full circle before the first \gls{at} can park.
While this would not be a correctness problem, since every runtime system must handle that race, it could lead to pushes and pops being optimized away.
Since silently omitting ready-queue operations would throw off the measurement of these operations, the ring of \glspl{at} must be big enough that the \glspl{at} have time to fully park before they are unparked.
Note that this problem is only present on SMP machines and is significantly mitigated by the fact that there are multiple rings in the system.

\begin{figure}
	\centering
	\input{cycle.pstex_t}
	\caption[Cycle benchmark]{Cycle benchmark\smallskip\newline Each \gls{at} unparks the next \gls{at} in the cycle before parking itself.}
	\label{fig:cycle}
\end{figure}

\todo{check term ``idle sleep handling''}
To avoid this benchmark being dominated by the idle-sleep handling, the number of rings is kept at least as high as the number of \glspl{proc} available.
Beyond this point, adding more rings serves to further mitigate the idle-sleep handling.
This avoids the case where one of the worker \glspl{at} runs out of work because of the variation in the number of ready \glspl{at} mentioned above.

The actual benchmark is more complicated in order to handle termination, but that simply requires using a binary semaphore or a channel instead of raw \texttt{park}/\texttt{unpark} and carefully picking the order of the \texttt{P} and \texttt{V} operations with respect to the loop condition.

\todo{code, setup, results}
\begin{lstlisting}
Thread.main() {
	count := 0
	for {
		wait()
		this.next.wake()
		count++
		if must_stop() { break }
	}
	global.count += count
}
\end{lstlisting}


\section{Yield}
For completeness, I also include the yield benchmark.
This benchmark is much simpler than the cycle test: it simply creates many \glspl{at} that call \texttt{yield}.
As mentioned in the previous section, this benchmark may be less representative of usages that only make limited use of \texttt{yield}, due to potential shortcuts in the routine.
Its only interesting variable is the number of \glspl{at} per \gls{proc}, where ratios close to 1 mean the ready queue(s) may be empty.
This sometimes puts more strain on the idle-sleep handling, compared to scenarios where there is clearly plenty of work to be done.

\todo{code, setup, results}

\begin{lstlisting}
Thread.main() {
	count := 0
	while !stop {
		yield()
		count++
	}
	global.count += count
}
\end{lstlisting}


\section{Churn}
The Cycle and Yield benchmarks represent an ``easy'' scenario for a scheduler, \eg, an embarrassingly parallel application.
In these benchmarks, \glspl{at} can be easily partitioned over the different \glspl{proc} up-front and none of the \glspl{at} communicate with each other.

The Churn benchmark represents more chaotic usages, where there is no relation between the last \gls{proc} on which a \gls{at} ran and the \gls{proc} that unblocked it.
When a \gls{at} is unblocked from a different \gls{proc} than the one on which it last ran, the unblocking \gls{proc} must either ``steal'' the \gls{at} or place it on a remote queue.
This can result in either contention on the remote queue or \glspl{rmr} on the \gls{at} data structure.
In either case, this benchmark aims to highlight how each scheduler handles these cases, since both can lead to performance degradation if not handled correctly.

To achieve this, the benchmark uses a fixed-size array of \newterm{chair}s, where a chair is a data structure that holds a single blocked \gls{at}.
When a \gls{at} attempts to block on a chair, it must first unblock the \gls{at} currently blocked on that chair, if any.
This creates a flow where \glspl{at} push each other out of the chairs before being pushed out themselves.
For this benchmark to work, however, the number of \glspl{at} must be equal to or greater than the number of chairs plus the number of \glspl{proc}.

\todo{code, setup, results}
\begin{lstlisting}
Thread.main() {
	count := 0
	for {
		r := random() % len(chairs)
		next := xchg(chairs[r], this)
		if next { next.wake() }
		wait()
		count++
		if must_stop() { break }
	}
	global.count += count
}
\end{lstlisting}

\section{Locality}

\todo{code, setup, results}

\section{Transfer}
The last benchmark is more accurately characterized as an experiment than a benchmark.
It tests the behavior of the schedulers for a particularly misbehaved workload.
In this workload, one of the \glspl{at} is selected at random to be the leader.
The leader then spins in a tight loop until it has observed that all other \glspl{at} have acknowledged its leadership.
The leader \gls{at} then picks a new \gls{at} to be the next leader and the cycle repeats.

The benchmark comes in two flavours, differing in the behavior of the non-leader \glspl{at}:
once they have acknowledged the leader, they either block on a semaphore or yield repeatedly.

This experiment is designed to evaluate the short-term load balancing of the scheduler.
Indeed, schedulers where the runnable \glspl{at} are partitioned over the \glspl{proc} may need to rebalance the \glspl{at} for this experiment to terminate.
This is because the spinning \gls{at} effectively prevents its \gls{proc} from running any other \glspl{at}.
In the semaphore flavour, the number of runnable \glspl{at} eventually dwindles down to only the leader.
This is the simpler case for schedulers to handle, since \glspl{proc} eventually run out of work.
In the yielding flavour, the number of runnable \glspl{at} stays constant.
This is a harder case to handle because corrective measures must be taken even though work is still available.
Note that languages with mandatory preemption circumvent this problem by forcing the spinner to yield.

\todo{code, setup, results}
\begin{lstlisting}
Thread.lead() {
	this.idx_seen = ++lead_idx
	if lead_idx > stop_idx {
		done = true
		return
	}

	// Wait for everyone to acknowledge my leadership
	start := timeNow()
	for t in threads {
		while t.idx_seen != lead_idx {
			asm pause
			if (timeNow() - start) > 5 seconds { error() }
		}
	}

	// pick next leader
	leader = threads[ prng() % len(threads) ]

	// wake everyone
	if exhaust {
		for t in threads {
			if t != this { t.wake() }
		}
	}
}

Thread.wait() {
	this.idx_seen = lead_idx
	if exhaust { wait() }
	else { yield() }
}

Thread.main() {
	while !done {
		if leader == this { this.lead() }
		else { this.wait() }
	}
}
\end{lstlisting}