Ignore:
Timestamp:
Aug 18, 2022, 9:43:37 AM (2 years ago)
Author:
Peter A. Buhr <pabuhr@…>
Branches:
ADT, ast-experimental, master, pthread-emulation
Children:
e9e3d02
Parents:
36cc24a
Message:

more structuring work in chapter eval_micro

File:
1 edited

Legend:

Unmodified
Added
Removed
  • doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex

    r36cc24a rff370d8  
    22
    33The first step in evaluating this work is to test small controlled cases to ensure the basics work properly.
    4 This chapter presents five different experimental setups, evaluating the basic features of the \CFA, libfibre~\cite{libfibre}, Go, and Tokio~\cite{Tokio} schedulers.
     4This chapter presents five different experimental setups for evaluating the basic features of the \CFA, libfibre~\cite{libfibre}, Go, and Tokio~\cite{Tokio} schedulers.
    55All of these systems have a \gls{uthrding} model.
    6 Note, all tests in each system are functionally identical and available online~\cite{SchedulingBenchmarks}.
     6The goal in this chapter is show the \CFA scheduler obtains equivalent performance to other less fair schedulers through the different experiments.
     7Note, only the code of the \CFA tests is shown;
     8all tests in the other systems are functionally identical and available online~\cite{SchedulingBenchmarks}.
    79
    810\section{Benchmark Environment}\label{microenv}
     11
    912All benchmarks are run on two distinct hardware platforms.
    1013\begin{description}
     
    2528If more \glspl{hthrd} are needed, then 1 NUMA node with hyperthreading is used.
    2629If still more \glspl{hthrd} are needed, then the experiment is limited to as few NUMA nodes as needed.
     30For the Intel machine, this means that from 1 to 24 \procs one socket and \emph{no} hyperthreading is used, and from 25 to 48 \procs still only one socket is used but \emph{with} hyperthreading.
     31This pattern is repeated between 49 and 96, between 97 and 144, and between 145 and 192.
     32On AMD, the same algorithm is used, but the machine only has 2 sockets.
     33So hyperthreading\footnote{
     34Hyperthreading normally refers specifically to the technique used by Intel, however it is often used generically to refer to any equivalent feature.}
     35is used when the \proc count reach 65 and 193.
    2736
    2837The limited sharing of the last-level cache on the AMD machine is markedly different than the Intel machine.
    2938Indeed, while on both architectures L2 cache misses that are served by L3 caches on a different CPU incur a significant latency, on the AMD it is also the case that cache misses served by a different L3 instance on the same CPU also incur high latency.
    3039
     40\section{Experimental setup}
     41
     42Each experiment is run 15 times varying the number of processors depending on the two different computers.
     43All experiments gather throughput data and secondary data for scalability or latency.
     44The data is graphed using a solid and two dashed lines representing the median, maximum and minimum result respectively, where the minimum/maximum lines are referred to as the \emph{extremes}.\footnote{
     45An alternative display is to use error bars with min/max as the bottom/top for the bar.
     46However, this approach is not truly an error bar around a mean value and I felt the connected lines are easier to read.}
     47This graph presentation offers an overview of the distribution of the results for each experiment.
     48
     49For each experiment, four graphs are generated showing traditional throughput on the top row and \newterm{scalability} or \newterm{latency} on the bottom row (peek ahead to Figure~\ref{fig:cycle:jax}).
     50Scalability uses the same data as throughput but the Y axis is calculated as the number of \procs over the throughput.
     51In this representation, perfect scalability should appear as a horizontal line, \eg, if doubling the number of \procs doubles the throughput, then the relation stays the same.
     52
     53The left column shows results for 100 cycles per \proc, enough cycles to always keep every \proc busy.
     54The right column shows results for 1 cycle per \proc, where the ready queues are expected to be near empty most of the time.
     55The distinction between 100 and 1 cycles is meaningful because the idle sleep subsystem is expected to matter only in the right column, where spurious effects can cause a \proc to run out of work temporarily.
    3156
    3257\section{Cycling latency}
     
    3661However, yielding can be treated as a special case by optimizing it away since the number of ready \ats does not change.
    3762Hence, systems that perform this optimization have an artificial performance benefit because the yield becomes a \emph{nop}.
    38 For this reason, I chose a different first benchmark, called \newterm{Cycle Benchmark}.
     63For this reason, I designed a different push/pop benchmark, called \newterm{Cycle Benchmark}.
    3964This benchmark arranges a number of \ats into a ring, as seen in Figure~\ref{fig:cycle}, where the ring is a circular singly-linked list.
    4065At runtime, each \at unparks the next \at before parking itself.
     
    5883Finally, to further mitigate any underlying push/pop optimizations, especially on SMP machines, multiple rings are created in the experiment.
    5984
    60 Figure~\ref{fig:cycle:code} shows the pseudo code for this benchmark.
     85Figure~\ref{fig:cycle:code} shows the pseudo code for this benchmark, where each cycle has 5 \ats.
    6186There is additional complexity to handle termination (not shown), which requires a binary semaphore or a channel instead of raw @park@/@unpark@ and carefully picking the order of the @P@ and @V@ with respect to the loop condition.
    6287
     
    110135\end{figure}
    111136
    112 \subsection{Results}
    113 
    114 Figures~\ref{fig:cycle:jax} and~\ref{fig:cycle:nasus} show the throughput as a function of \proc count on Intel and AMD respectively, where each cycle has 5 \ats.
    115 The graphs show traditional throughput on the top row and \newterm{scalability} on the bottom row.
    116 Scalability uses the same data as throughput but the Y axis is calculated as the number of \procs over the throughput.
    117 In this representation, perfect scalability should appear as a horizontal line, \eg, if doubling the number of \procs doubles the throughput, then the relation stays the same.
    118 The left column shows results for 100 cycles per \proc, enough cycles to always keep every \proc busy.
    119 The right column shows results for 1 cycle per \proc, where the ready queues are expected to be near empty most of the time.
    120 The distinction between 100 and 1 cycles is meaningful because the idle sleep subsystem is expected to matter only in the right column, where spurious effects can cause a \proc to run out of work temporarily.
    121 
    122137\begin{figure}
    123138        \subfloat[][Throughput, 100 cycles per \proc]{
     
    150165\end{figure}
    151166
    152 The experiment ran 15 times for each series and processor count.
    153 Each series has a solid and two dashed lines representing the median, maximum and minimum result respectively, where the minimum/maximum lines are referred to as the \emph{extremes}.\footnote{
    154 An alternative display is to use error bars with min/max as the bottom/top for the bar.
    155 However, this approach is not truly an error bar around a mean value and I felt the connected lines are easier to read.}
    156 This graph presentation offers an overview of the distribution of the results for each series.
    157 
    158 The experimental setup uses taskset to limit the placement of \glspl{kthrd} by the operating system.
    159 As mentioned in Section~\ref{microenv}, the experiment is setup to prioritize running on two \glspl{hthrd} per core before running on multiple sockets.
    160 For the Intel machine, this means that from 1 to 24 \procs one socket and \emph{no} hyperthreading is used, and from 25 to 48 \procs still only one socket is used but \emph{with} hyperthreading.
    161 This pattern is repeated between 49 and 96, between 97 and 144, and between 145 and 192.
    162 On AMD, the same algorithm is used, but the machine only has 2 sockets.
    163 So hyperthreading\footnote{
    164 Hyperthreading normally refers specifically to the technique used by Intel, however it is often used generically to refer to any equivalent feature.}
    165 is used when the \proc count reach 65 and 193.
    166 
    167 The performance goal of \CFA is to obtain equivalent performance to other less fair schedulers.
    168 Figure~\ref{fig:cycle:jax:ops} and Figure~\ref{fig:cycle:jax:ns} show that for 100 cycles per \proc, \CFA, Go and Tokio all obtain effectively the same performance.
     167\subsection{Results}
     168
     169For the Intel architecture, Figure~\ref{fig:cycle:jax}:
     170\begin{itemize}
     171\item
     172For 100 cycles per \proc (first column), \CFA, Go and Tokio all obtain effectively the same throughput performance.
    169173Libfibre is slightly behind in this case but still scales decently.
    170 As a result of the \gls{kthrd} placement, additional \procs from 25 to 48 offer less performance improvements for all runtimes.
    171 As expected, this pattern repeats between \proc count 72 and 96.
    172 Hence, Figures~\ref{fig:cycle:jax:ops} and \ref{fig:cycle:jax:ns} show very good throughput and scalability for all runtimes.
    173 
    174 When running only a single cycle, the story is slightly different.
    175 \CFA and Tokio obtain very similar results overall, but Tokio shows notably more variations in the results.
    176 While \CFA, Go and Tokio achieve equivalent performance with 100 cycles per \proc, with only 1 cycle per \proc, Go achieves slightly better performance.
    177 This difference in throughput and scalability is due to the idle-sleep mechanism.
    178 With very few cycles, stealing or helping can cause a cascade of tasks migration and trick a \proc into very short idle sleeps, which negatively affect performance.
    179 
    180 An interesting and unusual result is that libfibre achieves better performance with 1 cycle.
    181 This suggest that the cascade effect is never present in libfibre and that some bottleneck disappears in this context.
    182 However, I did not investigate this result any deeper.
    183 
    184 Figure~\ref{fig:cycle:nasus} show a similar story happening on AMD as it does on Intel.
     174As a result of the \gls{kthrd} placement, additional \procs from 25 to 48 offer less performance improvement (flatting of the line) for all runtimes.
     175As expected, this pattern repeats again between \proc count 72 and 96.
     176\item
     177For 1 cycle per \proc, \CFA and Tokio obtain very similar results overall, but Tokio shows more variations in the results.
     178Go achieves slightly better performance.
     179Interestingly, libfibre achieves better performance with 1 cycle.
     180\end{itemize}
     181
     182For the AMD architecture, Figure~\ref{fig:cycle:nasus}, the results show the same story as on the Intel, with close to double the performance overall but with slightly increased variation.
    185183The different performance improvements and plateaus are due to cache topology and appear at the expected \proc counts of 64, 128 and 192, for the same reasons as on Intel.
    186 Unlike Intel, on AMD all 4 runtimes achieve very similar throughput and scalability for 100 cycles per \proc, with some variations in the results.
    187 
    188 In the 1 cycle per \proc experiment, the same performance increase for libfibre is visible.
    189 However, unlike on Intel, Tokio achieves the same performance as Go rather than \CFA.
    190 This leaves \CFA trailing behind in this particular case, but only at hight core counts.
     184\begin{itemize}
     185\item
     186For 100 cycles per \proc, unlike Intel, all 4 runtimes achieve very similar throughput and scalability.
     187\item
     188For 1 cycle per \proc, unlike on Intel, Tokio and Go have the same throughput performance, while \CFA is slightly slower.
     189Again, the same performance increase for libfibre is visible.
     190\end{itemize}
     191Note, I did not investigate the libfibre performance boost for 1 cycle in this experiment.
     192
     193The conclusion from both architectures is that all of the compared runtime have fairly equivalent performance for this micro-benchmark.
     194Clearly, the pathological case with 1 \at per \proc, can affect fairness algorithms managing mostly idle processors, \eg \CFA, but only at high core counts.
    191195For this case, \emph{any} helping is likely to cause a cascade of \procs running out of work and attempting to steal.
    192 Essentially, \CFA's fairness must result in slower performance in some workload.
    193 Fortunately, this effect is only problematic in pathological cases, \eg with 1 \at per \proc, which seldom occurs in most workloads.
    194 
    195 The conclusion from both architectures is that all of the compared runtime have fairly equivalent performance for this micro-benchmark.
    196 This result shows that the \CFA scheduler has achieved the goal of obtaining equivalent performance to other less fair schedulers.
     196For this experiment, the \CFA scheduler has achieved the goal of obtaining equivalent performance to other less fair schedulers, except for very unusual workloads.
    197197
    198198\section{Yield}
     199
    199200For completion, the classic yield benchmark is included.
    200201Here, the throughput is dominated by the mechanism used to handle the @yield@ function.
     
    305306\end{figure}
    306307
    307 For the AMD, Figure~\ref{fig:yield:nasus}, the results show the same story as on the Intel, with slightly increased jitter.
     308For the AMD architecture, Figure~\ref{fig:yield:nasus}, the results show the same story as on the Intel, with slightly increased variations.
    308309Also, some transition points on the X-axis differ because of the architectures, like at 16 versus 24 processors.
    309310
Note: See TracChangeset for help on using the changeset viewer.