Timestamp: Nov 24, 2022, 3:41:44 PM
Author: Thierry Delisle <tdelisle@…>
Branches: ADT, ast-experimental, master
Children: dacd8e6e
Parents: 82a90d4
Message: Last corrections to my thesis... hopefully

File: 1 edited
  • doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex

    r82a90d4 rddcaff6  
    44This chapter presents five different experimental setups for evaluating the basic features of the \CFA, libfibre~\cite{libfibre}, Go, and Tokio~\cite{Tokio} schedulers.
    55All of these systems have a \gls{uthrding} model.
    6 The goal of this chapter is to show that the \CFA scheduler obtains equivalent performance to other, less fair, schedulers through the different experiments.
     6The goal of this chapter is to show, through the different experiments, that the \CFA scheduler obtains equivalent performance to other schedulers with lesser fairness guarantees.
    77Note that only the code of the \CFA tests is shown;
    8 all tests in the other systems are functionally identical and available online~\cite{GITHUB:SchedulingBenchmarks}.
      8all tests in the other systems are functionally identical and available both online~\cite{GITHUB:SchedulingBenchmarks} and in the UWSpace submission accompanying the thesis.
    99
    1010\section{Benchmark Environment}\label{microenv}
     
    129129        \caption[Cycle Benchmark on Intel]{Cycle Benchmark on Intel\smallskip\newline Throughput and scalability as a function of \proc count, 5 \ats per cycle, and different cycle counts.
    130130        For throughput, higher is better; for scalability, lower is better.
    131         Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
     131        Each series represents 15 independent runs.
     132        The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
    132133        \label{fig:cycle:jax}
    133134\end{figure}
     
    161162        \caption[Cycle Benchmark on AMD]{Cycle Benchmark on AMD\smallskip\newline Throughput and scalability as a function of \proc count, 5 \ats per cycle, and different cycle counts.
    162163        For throughput, higher is better; for scalability, lower is better.
    163         Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
     164        Each series represents 15 independent runs.
     165        The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
    164166        \label{fig:cycle:nasus}
    165167\end{figure}
     
    177179Looking next at the right column on Intel, Figures~\ref{fig:cycle:jax:low:ops} and \ref{fig:cycle:jax:low:ns} show the results for 1 cycle of 5 \ats for each \proc.
    178180\CFA and Tokio obtain very similar results overall, but Tokio shows more variation in the results.
    179 Go achieves slightly better performance than \CFA and Tokio, but all three display significantly worst performance compared to the left column.
     181Go achieves slightly better performance than \CFA and Tokio, but all three display significantly worse performance compared to the left column.
    180182This decrease in performance is likely due to the additional overhead of the idle-sleep mechanism.
    181183This can either be the result of \procs actually running out of work or simply additional overhead from tracking whether or not there is work available.
     
    185187Looking now at the AMD architecture, Figure~\ref{fig:cycle:nasus}, the results are overall similar to the Intel results, but with close to double the performance, slightly increased variation, and some differences in the details.
    186188Note that the maximums of the Y-axes on Intel and AMD differ significantly.
    187 Looking at the left column on AMD, Figures~\ref{fig:cycle:nasus:ops} and \ref{fig:cycle:nasus:ns} all 4 runtimes achieve very similar throughput and scalability.
     189Looking at the left column on AMD, Figures~\ref{fig:cycle:nasus:ops} and \ref{fig:cycle:nasus:ns}, all 4 runtimes achieve very similar throughput and scalability.
    188190However, as the number of \procs increases, the results on AMD show notably more variability than on Intel.
    189191The different performance improvements and plateaus are due to cache topology and appear at the expected \proc counts of 64, 128 and 192, for the same reasons as on Intel.
     
    191193This result differs from the one on Intel, where Tokio behaved like \CFA rather than like Go.
    192194Again, the same performance increase for libfibre is visible when running fewer \ats.
    193 Note, I did not investigate the libfibre performance boost for 1 cycle in this experiment.
     195I did not investigate the libfibre performance boost for 1 cycle in this experiment.
    194196
    195197The conclusion from both architectures is that all of the compared runtimes have fairly equivalent performance for this micro-benchmark.
    196198Clearly, the pathological case with 1 cycle per \proc can affect fairness algorithms managing mostly idle processors, \eg \CFA, but only at high core counts.
    197199In this case, \emph{any} helping is likely to cause a cascade of \procs running out of work and attempting to steal.
    198 For this experiment, the \CFA scheduler has achieved the goal of obtaining equivalent performance to other, less fair, schedulers.
     200For this experiment, the \CFA scheduler has achieved the goal of obtaining equivalent performance to other schedulers with lesser fairness guarantees.
    199201
    200202\section{Yield}
     
    250252        \caption[Yield Benchmark on Intel]{Yield Benchmark on Intel\smallskip\newline Throughput and scalability as a function of \proc count.
    251253        For throughput, higher is better; for scalability, lower is better.
    252         Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
     254        Each series represents 15 independent runs.
     255        The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
    253256        \label{fig:yield:jax}
    254257\end{figure}
     
    309312        \caption[Yield Benchmark on AMD]{Yield Benchmark on AMD\smallskip\newline Throughput and scalability as a function of \proc count.
    310313        For throughput, higher is better; for scalability, lower is better.
    311         Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
     314        Each series represents 15 independent runs.
     315        The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
    312316        \label{fig:yield:nasus}
    313317\end{figure}
     
    317321Looking at the left column first, Figures~\ref{fig:yield:nasus:ops} and \ref{fig:yield:nasus:ns}, \CFA achieves very similar throughput and scaling.
    318322Libfibre still outpaces all other runtimes, but it encounters a performance hit at 64 \procs.
    319 This anomaly suggests some amount of communication between the \procs that the Intel machine is able to mask where the AMD is not once hyperthreading is needed.
      323This anomaly suggests some amount of communication between the \procs that the Intel machine is able to mask but the AMD machine cannot once hyperthreading is needed.
    320324Go and Tokio still display the same performance collapse as on Intel.
    321325Looking next at the right column on AMD, Figures~\ref{fig:yield:nasus:low:ops} and \ref{fig:yield:nasus:low:ns}, all runtime systems effectively behave the same as they did on the Intel machine.
     
    324328
    325329It is difficult to draw conclusions for this benchmark when runtime systems treat @yield@ so differently.
    326 The win for \CFA is its consistency between the cycle and yield benchmarks making it simpler for programmers to use and understand, \ie the \CFA semantics match with programmer intuition.
      330The win for \CFA is its consistency between the cycle and yield benchmarks, making it simpler for programmers to use and understand, \ie the \CFA semantics match programmer intuition.
    327331
    328332
     
    333337
    334338The Churn benchmark represents more chaotic executions, where there is more communication among \ats but no relationship between the last \proc on which a \at ran and blocked, and the \proc that subsequently unblocks it.
    335 With processor-specific ready-queues, when a \at is unblocked by a different \proc that means the unblocking \proc must either ``steal'' the \at from another processor or find it on a remote queue.
      339With processor-specific ready-queues, when a \at is unblocked by a different \proc, the unblocking \proc must either ``steal'' the \at from another processor or find it on a remote queue.
    336340This dequeuing results in contention on the remote queue and/or \glspl{rmr} on the \at data structure.
    337 Hence, this benchmark has performance dominated by the cache traffic as \procs are constantly accessing each other's data.
      341Hence, the performance of this benchmark is dominated by cache traffic as \procs are constantly accessing each other's data.
    338342This benchmark aims to measure how well a scheduler handles either case, since both can lead to performance degradation if not handled correctly.
    339343
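To make this pattern concrete, the following is a rough sketch of the churn idea written in Go, one of the compared runtimes. It is an illustration only, not the thesis's \CFA benchmark code: the worker count, the fixed one-second duration, and the channel-as-semaphore encoding are all assumptions made for the sketch. Each worker hands a token to a randomly chosen peer (the ``V'') and then blocks until someone hands it a token back (the ``P''), so wake-ups routinely cross \procs.

// Illustrative sketch of the churn pattern; not the thesis benchmark.
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	const nWorkers = 8   // assumed number of user-level threads
	var ops atomic.Int64 // completed unblock/block pairs

	// One buffered channel per worker stands in for a semaphore.
	// Capacity nWorkers guarantees the send (the "V") never blocks.
	sems := make([]chan struct{}, nWorkers)
	for i := range sems {
		sems[i] = make(chan struct{}, nWorkers)
	}

	stop := make(chan struct{})
	var wg sync.WaitGroup
	for i := 0; i < nWorkers; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for {
				// V: hand a token to a randomly chosen worker, so the
				// wake-up usually lands on a different processor.
				sems[rand.Intn(nWorkers)] <- struct{}{}
				// P: block until some worker hands us a token back.
				select {
				case <-sems[id]:
					ops.Add(1)
				case <-stop:
					return
				}
			}
		}(i)
	}

	time.Sleep(time.Second) // fixed experiment duration (assumed)
	close(stop)
	wg.Wait()
	fmt.Println("churn operations:", ops.Load())
}

Giving each channel capacity nWorkers keeps the unblocking step non-blocking, mirroring semaphore V semantics, while the blocking receive plays the role of P; the random target is what forces the remote-queue traffic this benchmark is designed to stress.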
     
    392396        \caption[Churn Benchmark on Intel]{Churn Benchmark on Intel\smallskip\newline Throughput and scalability as a function of \proc count.
    393397        For throughput, higher is better; for scalability, lower is better.
    394         Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
     398        Each series represents 15 independent runs.
     399        The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
    395400        \label{fig:churn:jax}
    396401\end{figure}
     
    404409Tokio achieves very similar performance to \CFA, with the starting boost: it scales decently until 48 \procs, drops from 48 to 72 \procs, and increases again up to 192 \procs.
    405410Libfibre obtains effectively the same results as Tokio with slightly less scaling, \ie the scaling curve is the same but with slightly lower values.
    406 Finally, Go gets the most peculiar results, scaling worst than other runtimes until 48 \procs.
     411Finally, Go gets the most peculiar results, scaling worse than other runtimes until 48 \procs.
    407412At 72 \procs, the results of the Go runtime vary significantly, sometimes scaling, sometimes plateauing.
    408413However, beyond this point Go keeps this level of variation but does not scale further in any of the runs.
    409414
    410 Throughput and scalability are notably worst for all runtimes than the previous benchmarks since there is inherently more communication between processors.
      415Throughput and scalability are notably worse for all runtimes than in the previous benchmarks since there is inherently more communication between processors.
    411416Indeed, none of the runtimes reach 40 million operations per second, while in the cycle benchmark all but libfibre reached 400 million operations per second.
    412417Figures~\ref{fig:churn:jax:ns} and \ref{fig:churn:jax:low:ns} show that for all \proc counts, all runtimes produce poor scaling.
    413 However, once the number of \glspl{hthrd} goes beyond a single socket, at 48 \procs, scaling goes from bad to worst and performance completely ceases to improve.
     418However, once the number of \glspl{hthrd} goes beyond a single socket, at 48 \procs, scaling goes from bad to worse and performance completely ceases to improve.
    414419At this point, the benchmark is dominated by inter-socket communication costs for all runtimes.
    415420
     
    457462        \caption[Churn Benchmark on AMD]{Churn Benchmark on AMD\smallskip\newline Throughput and scalability as a function of \proc count.
    458463        For throughput, higher is better; for scalability, lower is better.
    459         Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
     464        Each series represents 15 independent runs.
     465        The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
    460466        \label{fig:churn:nasus}
    461467\end{figure}
     
    600606        \caption[Locality Benchmark on Intel]{Locality Benchmark on Intel\smallskip\newline Throughput and scalability as a function of \proc count.
    601607        For throughput, higher is better; for scalability, lower is better.
    602         Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
     608        Each series represents 15 independent runs.
     609        The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
    603610        \label{fig:locality:jax}
    604611\end{figure}
     
    632639        \caption[Locality Benchmark on AMD]{Locality Benchmark on AMD\smallskip\newline Throughput and scalability as a function of \proc count.
    633640        For throughput, higher is better; for scalability, lower is better.
    634         Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
     641        Each series represents 15 independent runs.
     642        The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
    635643        \label{fig:locality:nasus}
    636644\end{figure}
     
    648656Go still has the same poor performance as on Intel.
    649657
    650 Finally looking at the right column, Figures~\ref{fig:locality:nasus:noshare:ops} and \ref{fig:locality:nasus:noshare:ns}, like on Intel, the same performance inversion is present between libfibre and \CFA/Tokio.
     658Finally, looking at the right column, Figures~\ref{fig:locality:nasus:noshare:ops} and \ref{fig:locality:nasus:noshare:ns}, like on Intel, the same performance inversion is present between libfibre and \CFA/Tokio.
    651659Go still has the same poor performance.
    652660
     
    751759\end{centering}
    752760\caption[Transfer Benchmark on Intel and AMD]{Transfer Benchmark on Intel and AMD\smallskip\newline Average measurement of how long it takes for all \ats to acknowledge the leader \at.
      761For each runtime, the average is calculated over 100'000 transfers, except for Go, which only has 1000 transfers (due to the difference in transfer time).
    753762DNC stands for ``did not complete'', meaning that after 5 seconds of a new leader being decided, some \ats still had not acknowledged the new leader.}
    754763\label{fig:transfer:res}
     
    764773
    765774The first two columns show the results for the semaphore variation on Intel.
    766 While there are some differences in latencies, \CFA is consistently the fastest and Tokio the slowest, all runtimes achieve fairly close results.
    767 Again, this experiment is meant to highlight major differences so latencies within $10\times$ of each other are considered equal.
     775While there are some differences in latencies, with \CFA consistently the fastest and Tokio the slowest, all runtimes achieve fairly close results.
     776Again, this experiment is meant to highlight major differences, so latencies within $10\times$ of each other are considered equal.
    768777
    769778Looking at the next two columns, the results for the yield variation on Intel, the story is very different.
     
    780789Neither libfibre nor Tokio completes the experiment.
    781790
    782 This experiment clearly demonstrates that \CFA achieves significantly better fairness.
     791This experiment clearly demonstrates that \CFA achieves a stronger fairness guarantee.
    783792The semaphore variation serves as a control, where all runtimes are expected to transfer leadership fairly quickly.
    784793Since \ats block after acknowledging the leader, this experiment effectively measures how quickly \procs can steal \ats from the \proc running the leader.
     
    790799Without \procs stealing from the \proc running the leader, the experiment cannot terminate.
    791800Go manages to complete the experiment because it adds preemption on top of classic work-stealing.
    792 However, since preemption is fairly infrequent, it achieves significantly worst performance.
     801However, since preemption is fairly infrequent, it achieves significantly worse performance.
    793802In contrast, \CFA achieves equivalent performance in both variations, demonstrating very good fairness.
    794803Interestingly, \CFA achieves better delays in the yielding version than in the semaphore version; however, that is likely because fairness is equivalent while the cost of the semaphores and idle sleep is removed.
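For illustration, the following Go sketch mirrors the yield variation of the transfer idea discussed above. It is not the thesis benchmark: the real benchmark transfers leadership between \ats and also has a semaphore variation, whereas this sketch keeps one fixed leader, and the thread count, round count, and epoch/acknowledgement scheme are assumptions made for the sketch.

// Illustrative sketch of the yield-variation transfer idea; not the thesis benchmark.
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
	"time"
)

func main() {
	const nThreads = 8 // assumed thread count
	const rounds = 1000

	var epoch atomic.Int64 // current leader "generation"
	var acks atomic.Int64  // followers that have seen the current epoch

	// Followers spin, yielding between checks; each follower
	// acknowledges a new epoch exactly once.
	for t := 0; t < nThreads-1; t++ {
		go func() {
			seen := int64(0)
			for {
				if e := epoch.Load(); e != seen {
					seen = e
					acks.Add(1)
				}
				runtime.Gosched() // the "yield" in the yield variation
			}
		}()
	}

	// Single fixed leader; it publishes a new epoch and measures how
	// long all followers take to acknowledge it.
	var total time.Duration
	for r := int64(1); r <= rounds; r++ {
		acks.Store(0)
		start := time.Now()
		epoch.Store(r)
		for acks.Load() < nThreads-1 {
			runtime.Gosched()
		}
		total += time.Since(start)
	}
	fmt.Println("average acknowledgement latency:", total/rounds)
}

If the scheduler never moves \ats away from the \proc running the leader, spin loops like the followers' can keep a round from completing; Go's preemption, as noted above, is what lets such code eventually finish, albeit slowly.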