# Changeset 622a358 for doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex

Timestamp:
May 18, 2022, 3:59:14 PM
Branches:
Children:
288927f
Parents:
fa2a3b1
Message:

A whole lot of results and some text section done

File:
1 edited

\section{Benchmark Environment}
All benchmarks are run on two distinct hardware environments, an AMD and an Intel machine.

For all benchmarks, \texttt{taskset} is used to limit the experiment to one NUMA node with no hyper-threading.
If more \glspl{hthrd} are needed, then one NUMA node with hyper-threading is used.
If still more \glspl{hthrd} are needed, then the experiment is limited to as few NUMA nodes as needed.
For example, an experiment needing 24 \glspl{hthrd} could be launched as \texttt{taskset -c 0-23 ./benchmark}, assuming hardware threads 0 through 23 share a NUMA node.

\paragraph{AMD}
The AMD machine is a server with two AMD EPYC 7662 CPUs and 256GB of DDR4 RAM.

\section{Cycling latency}
\begin{figure}
	\centering
	\input{cycle.pstex_t}
	\caption[Cycle benchmark]{Cycle benchmark\smallskip\newline Each \gls{at} unparks the next \gls{at} in the cycle before parking itself.}
	\label{fig:cycle}
\end{figure}

The most basic evaluation of any ready queue is the latency needed to push one element onto and pop one element from it.
Since these two operations also describe a \texttt{yield} operation, many systems use this as their most basic benchmark.
Note that this problem is only present on SMP machines and is significantly mitigated by the fact that there are multiple rings in the system.

To prevent this benchmark from being dominated by the idle-sleep handling, the number of rings is kept at least as high as the number of \glspl{proc} available.
Beyond this point, adding more rings further mitigates the idle-sleep handling.
The actual benchmark is more complicated in order to handle termination, but that simply requires using a binary semaphore or a channel instead of raw \texttt{park}/\texttt{unpark} and carefully picking the order of the \texttt{P} and \texttt{V} with respect to the loop condition.
Figure~\ref{fig:cycle:code} shows pseudo code for this benchmark.

\begin{figure}
	\begin{lstlisting}
	Thread.main() {
		count := 0
		for {
			wait()
			this.next.wake()
			count ++
			if must_stop() { break }
		}
		global.count += count
	}
	\end{lstlisting}
	\caption[Cycle Benchmark : Pseudo Code]{Cycle Benchmark : Pseudo Code}
	\label{fig:cycle:code}
\end{figure}

\subsection{Results}
\begin{figure}
	\subfloat[][Throughput, 100 \ats per \proc]{
		\resizebox{0.5\linewidth}{!}{\input{result.cycle.jax.ops.pstex_t}}
		\label{fig:cycle:jax:ops}
	}
	\subfloat[][Throughput, 1 \at per \proc]{
		\resizebox{0.5\linewidth}{!}{\input{result.cycle.low.jax.ops.pstex_t}}
		\label{fig:cycle:jax:low:ops}
	}

	\subfloat[][Latency, 100 \ats per \proc]{
		\resizebox{0.5\linewidth}{!}{\input{result.cycle.jax.ns.pstex_t}}
		\label{fig:cycle:jax:ns}
	}
	\subfloat[][Latency, 1 \at per \proc]{
		\resizebox{0.5\linewidth}{!}{\input{result.cycle.low.jax.ns.pstex_t}}
		\label{fig:cycle:jax:low:ns}
	}
	\caption[Cycle Benchmark on Intel]{Cycle Benchmark on Intel\smallskip\newline Throughput as a function of \proc count, using 100 cycles per \proc and 5 \ats per cycle.}
	\label{fig:cycle:jax}
\end{figure}

Figure~\ref{fig:cycle:jax} shows the throughput as a function of \proc count, with the following constants: each run uses 100 cycles per \proc and 5 \ats per cycle.
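Complementing the pseudo code in Figure~\ref{fig:cycle:code}, the following is a minimal, runnable sketch of the cycle benchmark in Go; it is illustrative rather than the implementation measured here. It assumes a buffered channel of capacity one as the binary semaphore standing in for raw \texttt{park}/\texttt{unpark}, a single cycle of 5 \ats, and a fixed one-second measurement window.

\begin{lstlisting}
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

type thread struct {
	sema chan struct{} // binary semaphore: receive = park, send = unpark
	next *thread
}

func main() {
	const cycleLen = 5 // 5 threads per cycle, as in the benchmark setup
	var stop atomic.Bool
	var total atomic.Uint64

	// Build one ring of threads; the real benchmark uses many rings.
	ring := make([]*thread, cycleLen)
	for i := range ring {
		ring[i] = &thread{sema: make(chan struct{}, 1)}
	}
	for i, t := range ring {
		t.next = ring[(i+1)%cycleLen]
	}

	var wg sync.WaitGroup
	for _, t := range ring {
		wg.Add(1)
		go func(t *thread) {
			defer wg.Done()
			count := uint64(0)
			for {
				<-t.sema                  // wait(): park until unparked
				t.next.sema <- struct{}{} // wake the next thread in the cycle
				count++
				// Check for termination only after the wake, so the
				// circulating token is always passed on and no thread
				// is left parked forever.
				if stop.Load() {
					break
				}
			}
			total.Add(count)
		}(t)
	}

	ring[0].sema <- struct{}{} // inject the token that starts the cycle
	time.Sleep(time.Second)    // measurement window
	stop.Store(true)
	wg.Wait()
	fmt.Println("total operations:", total.Load())
}
\end{lstlisting}

Checking the stop flag only after the wake mirrors the careful ordering of \texttt{P} and \texttt{V} with respect to the loop condition: the token is always passed on before a thread exits, so no thread is left parked at termination.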
\todo{results discussion}

\section{Yield}
The only interesting variable in this benchmark is the number of \glspl{at} per \gls{proc}, where ratios close to 1 mean the ready queue(s) could be empty.
This sometimes puts more strain on the idle-sleep handling, compared to scenarios where there is clearly plenty of work to be done.
Figure~\ref{fig:yield:code} shows pseudo code for this benchmark, where the ``wait/wake-next'' is simply replaced by a \texttt{yield}.
\todo{code, setup, results}

\begin{figure}
	\begin{lstlisting}
	Thread.main() {
		count := 0
		for {
			yield()
			count ++
			if must_stop() { break }
		}
		global.count += count
	}
	\end{lstlisting}
	\caption[Yield Benchmark : Pseudo Code]{Yield Benchmark : Pseudo Code}
	\label{fig:yield:code}
\end{figure}

\subsection{Results}
\begin{figure}
	\subfloat[][Throughput, 100 \ats per \proc]{
		\resizebox{0.5\linewidth}{!}{\input{result.yield.jax.ops.pstex_t}}
		\label{fig:yield:jax:ops}
	}
	\subfloat[][Throughput, 1 \at per \proc]{
		\resizebox{0.5\linewidth}{!}{\input{result.yield.low.jax.ops.pstex_t}}
		\label{fig:yield:jax:low:ops}
	}

	\subfloat[][Latency, 100 \ats per \proc]{
		\resizebox{0.5\linewidth}{!}{\input{result.yield.jax.ns.pstex_t}}
		\label{fig:yield:jax:ns}
	}
	\subfloat[][Latency, 1 \at per \proc]{
		\resizebox{0.5\linewidth}{!}{\input{result.yield.low.jax.ns.pstex_t}}
		\label{fig:yield:jax:low:ns}
	}
	\caption[Yield Benchmark on Intel]{Yield Benchmark on Intel\smallskip\newline Throughput as a function of \proc count, using 100 \ats per \proc.}
	\label{fig:yield:jax}
\end{figure}

Figure~\ref{fig:yield:jax} shows the throughput as a function of \proc count, with the following constants: each run uses 100 \ats per \proc.
\todo{results discussion}

\section{Churn}
In either case, this benchmark aims to highlight how each scheduler handles these cases, since both can lead to performance degradation if not handled correctly.
To achieve this, the benchmark uses a fixed-size array of semaphores.
Each \gls{at} picks a random semaphore, \texttt{V}s it to unblock any waiting \at, and then \texttt{P}s on the same semaphore.
This creates a flow where \glspl{at} push each other out of the semaphores before being pushed out themselves.
For this benchmark to work, however, the number of \glspl{at} must be greater than or equal to the number of semaphores plus the number of \glspl{proc}.
Note that the nature of these semaphores means the counter can go beyond 1, which could lead to calls to \texttt{P} not blocking.
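For concreteness, the following is a minimal, runnable sketch of this scheme in Go (the benchmark's pseudo code appears in Figure~\ref{fig:churn:code} below); it is illustrative rather than the implementation measured here. It assumes buffered channels as the counting semaphores, where a send is a \texttt{V} and a receive is a \texttt{P}, and all constants are placeholders.

\begin{lstlisting}
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	const (
		nspots = 8               // fixed-size array of semaphores
		nprocs = 4               // stand-in for the processor count
		nats   = nspots + nprocs // #threads >= #semaphores + #procs
	)
	var stop atomic.Bool
	var total atomic.Uint64

	// Counting semaphores as buffered channels; capacity nats means a
	// send (V) can never block, so the counter can climb above 1.
	spots := make([]chan struct{}, nspots)
	for i := range spots {
		spots[i] = make(chan struct{}, nats)
	}

	var wg sync.WaitGroup
	for i := 0; i < nats; i++ {
		wg.Add(1)
		go func(seed int64) {
			defer wg.Done()
			rng := rand.New(rand.NewSource(seed))
			count := uint64(0)
			for {
				r := rng.Intn(nspots)
				spots[r] <- struct{}{} // V: unblock a waiting thread, if any
				<-spots[r]             // P: may not block if the counter is above zero
				count++
				if stop.Load() {
					break
				}
			}
			total.Add(count)
		}(int64(i))
	}

	time.Sleep(time.Second) // measurement window
	stop.Store(true)
	wg.Wait()
	fmt.Println("total operations:", total.Load())
}
\end{lstlisting}

Giving each channel a capacity equal to the number of threads guarantees a \texttt{V} never blocks, and a \texttt{P} returns immediately whenever the counter is above zero, matching the non-blocking behaviour noted above.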
\todo{code, setup, results}

Figure~\ref{fig:churn:code} shows pseudo code for this benchmark.

\begin{figure}
	\begin{lstlisting}
	Thread.main() {
		count := 0
		for {
			r := random() % len(spots)
			spots[r].V()
			spots[r].P()
			count ++
			if must_stop() { break }
		}
		global.count += count
	}
	\end{lstlisting}
	\caption[Churn Benchmark : Pseudo Code]{Churn Benchmark : Pseudo Code}
	\label{fig:churn:code}
\end{figure}

\subsection{Results}
\begin{figure}
	\subfloat[][Throughput, 100 \ats per \proc]{
		\resizebox{0.5\linewidth}{!}{\input{result.churn.jax.ops.pstex_t}}
		\label{fig:churn:jax:ops}
	}
	\subfloat[][Throughput, 1 \at per \proc]{
		\resizebox{0.5\linewidth}{!}{\input{result.churn.low.jax.ops.pstex_t}}
		\label{fig:churn:jax:low:ops}
	}

	\subfloat[][Latency, 100 \ats per \proc]{
		\resizebox{0.5\linewidth}{!}{\input{result.churn.jax.ns.pstex_t}}
		\label{fig:churn:jax:ns}
	}
	\subfloat[][Latency, 1 \at per \proc]{
		\resizebox{0.5\linewidth}{!}{\input{result.churn.low.jax.ns.pstex_t}}
		\label{fig:churn:jax:low:ns}
	}
	\caption[Churn Benchmark on Intel]{\centering Churn Benchmark on Intel\smallskip\newline Throughput and latency of the Churn benchmark on the Intel machine. Throughput is the total operations per second across all cores. Latency is the duration of each operation.}
	\label{fig:churn:jax}
\end{figure}

\section{Locality}