\chapter{Micro-Benchmarks}\label{microbench}
The first step of the evaluation is always to test small controlled cases, to ensure that the basics work properly. This chapter presents four different experimental setups, evaluating some of the basic features of \CFA's scheduler.

\section{Cycling latency}
The most basic evaluation of any ready queue is the latency needed to push one element onto and pop one element from the ready queue. While this pair of operations also describes a \texttt{yield} operation, and many systems use yielding as their most basic benchmark, yielding can be treated as a special case: it carries the extra information that the length of the ready queue does not change. Not all systems use this information, but those that do may appear to perform better than they would for disconnected push/pop pairs. For this reason, I chose a different first benchmark, which I call the Cycle Benchmark. This benchmark arranges many threads into multiple rings of threads. Each ring is effectively a circular singly-linked list. At runtime, each thread unparks the next thread in its ring before parking itself. This corresponds to the desired pair of ready-queue operations: unparking the next thread pushes that thread onto the ready queue, and the ensuing park causes the runtime to pop a thread from the ready queue. Figure~\ref{fig:cycle} shows a visual representation of this arrangement.

The goal of this ring is to deny the underlying runtime any guarantee that the number of ready threads stays constant over the duration of the experiment. In fact, the total number of threads waiting on the ready queue is expected to vary slightly because of the race between the next thread unparking and the current thread parking. The size of the cycle is also decided based on this race: in cycles that are too small, the chain of unparks may go full circle before the first thread can park.
While this race is not a correctness problem, since every runtime system must handle it, it could lead to pushes and pops being optimized away, and silently omitting ready-queue operations would throw off the measurement of these operations. Therefore, the ring of threads must be big enough that the threads have time to fully park before they are unparked. Note that this problem is only present on SMP machines and is significantly mitigated by the fact that there are multiple rings in the system.

\begin{figure}
	\centering
	\input{cycle.pstex_t}
	\caption[Cycle benchmark]{Cycle benchmark\smallskip\newline Each thread unparks the next thread in the cycle before parking itself.}
	\label{fig:cycle}
\end{figure}

\todo{check term ``idle sleep handling''}
To prevent this benchmark from being dominated by idle sleep handling, the number of rings is kept at least as high as the number of available processors. Beyond this point, adding more rings serves to mitigate idle sleep handling even further, avoiding the case where a worker thread runs out of work because of the variation in the number of ready threads mentioned above.

The actual benchmark is more complicated in order to handle termination, but that simply requires using a binary semaphore or a channel instead of raw \texttt{park}/\texttt{unpark} and carefully picking the order of the \texttt{P} and \texttt{V} with respect to the loop condition.
\todo{mention where to get the code.}

\section{Yield}
For completeness, I also include the yield benchmark. This benchmark is much simpler than the cycle test: it simply creates many threads that repeatedly call \texttt{yield}.

\section{Locality}

\section{Transfer}