| 1 | \chapter{Micro-Benchmarks}\label{microbench}
|
|---|
| 2 |
|
|---|
| 3 | The first step of evaluation is always to test-out small controlled cases, to ensure that the basics are working properly.
|
|---|
| 4 | This sections presents four different experimental setup, evaluating some of the basic features of \CFA's scheduler.
|
|---|
| 5 |
|
|---|
| 6 | \section{Cycling latency}
|
|---|
| 7 | The most basic evaluation of any ready queue is to evaluate the latency needed to push and pop one element from the ready-queue.
|
|---|
| 8 | While these two operation also describe a \texttt{yield} operation, many systems use this as the most basic benchmark.
|
|---|
| 9 | However, yielding can be treated as a special case, since it also carries the information that the length of the ready queue will not change.
|
|---|
| 10 | Not all systems use this information, but those which do may appear to have better performance than they would for disconnected push/pop pairs.
|
|---|
| 11 | For this reason, I chose a different first benchmark, which I call the Cycle Benchmark.
|
|---|
| 12 | This benchmark arranges many threads into multiple rings of threads.
|
|---|
| 13 | Each ring is effectively a circular singly-linked list.
|
|---|
| 14 | At runtime, each thread unparks the next thread before parking itself.
|
|---|
| 15 | This corresponds to the desired pair of ready queue operations.
|
|---|
| 16 | Unparking the next thread requires pushing that thread onto the ready queue and the ensuing park will cause the runtime to pop a thread from the ready-queue.
|
|---|
| 17 | Figure~\ref{fig:cycle} shows a visual representation of this arrangement.
|
|---|
| 18 |
|
|---|
| 19 | The goal of this ring is that the underlying runtime cannot rely on the guarantee that the number of ready threads will stay constant over the duration of the experiment.
|
|---|
| 20 | In fact, the total number of threads waiting on the ready is expected to vary a little because of the race between the next thread unparking and the current thread parking.
|
|---|
| 21 | The size of the cycle is also decided based on this race: cycles that are too small may see the
|
|---|
| 22 | chain of unparks go full circle before the first thread can park.
|
|---|
| 23 | While this would not be a correctness problem, every runtime system must handle that race, it could lead to pushes and pops being optimized away.
|
|---|
| 24 | Since silently omitting ready-queue operations would throw off the measuring of these operations.
|
|---|
| 25 | Therefore the ring of threads must be big enough so the threads have the time to fully park before they are unparked.
|
|---|
| 26 | Note that this problem is only present on SMP machines and is significantly mitigated by the fact that there are multiple rings in the system.
|
|---|
| 27 |
|
|---|
| 28 | \begin{figure}
|
|---|
| 29 | \centering
|
|---|
| 30 | \input{cycle.pstex_t}
|
|---|
| 31 | \caption[Cycle benchmark]{Cycle benchmark\smallskip\newline Each thread unparks the next thread in the cycle before parking itself.}
|
|---|
| 32 | \label{fig:cycle}
|
|---|
| 33 | \end{figure}
|
|---|
| 34 |
|
|---|
| 35 | \todo{check term ``idle sleep handling''}
|
|---|
| 36 | To avoid this benchmark from being dominated by the idle sleep handling, the number of rings is kept at least as high as the number of processors available.
|
|---|
| 37 | Beyond this point, adding more rings serves to mitigate even more the idle sleep handling.
|
|---|
| 38 | This is to avoid the case where one of the worker threads runs out of work because of the variation on the number of ready threads mentionned above.
|
|---|
| 39 |
|
|---|
| 40 | The actual benchmark is more complicated to handle termination, but that simply requires using a binary semphore or a channel instead of raw \texttt{park}/\texttt{unpark} and carefully picking the order of the \texttt{P} and \texttt{V} with respect to the loop condition.
|
|---|
| 41 |
|
|---|
| 42 | \todo{mention where to get the code.}
|
|---|
| 43 |
|
|---|
| 44 | \section{Yield}
|
|---|
| 45 | For completion, I also include the yield benchmark.
|
|---|
| 46 | This benchmark is much simpler than the cycle tests, it simply creates many threads that call \texttt{yield}.
|
|---|
| 47 |
|
|---|
| 48 | \section{Locality}
|
|---|
| 49 |
|
|---|
| 50 | \section{Transfer}
|
|---|