Changeset e9e3d02


Timestamp:
Aug 18, 2022, 9:56:43 PM (6 weeks ago)
Author:
Peter A. Buhr <pabuhr@…>
Branches:
master, pthread-emulation
Children:
fcfbc52
Parents:
ff370d8
Message:

small changes to Churn

File:
1 edited

  • doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex

    rff370d8 re9e3d02  
    55 55 The distinction between 100 and 1 cycles is meaningful because the idle sleep subsystem is expected to matter only in the right column, where spurious effects can cause a \proc to run out of work temporarily.
    56 56
    57    \section{Cycling latency}
       57 \section{Cycle}
    58 58
    59 59 The most basic evaluation of any ready queue is the latency needed to push and pop one element from the ready queue.
     
    310 310
    311 311 It is difficult to draw conclusions for this benchmark when runtime systems treat @yield@ so differently.
    312     The win for \CFA is its consistency between the cycle and yield benchmarks making it simpler for programmers to use and understand, \ie it the \CFA semantics matches with programmer intuition..
        312 The win for \CFA is its consistency between the cycle and yield benchmarks, making it simpler for programmers to use and understand, \ie the \CFA semantics match programmer intuition.
    313 313
    314 314
    315 315 \section{Churn}
        316
    316 317 The Cycle and Yield benchmarks represent an \emph{easy} scenario for a scheduler, \eg an embarrassingly parallel application.
    317 318 In these benchmarks, \ats can be easily partitioned over the different \procs upfront and none of the \ats communicate with each other.
     
    320 321 With processor-specific ready-queues, when a \at is unblocked by a different \proc, the unblocking \proc must either ``steal'' the \at from another processor or find it on a remote queue.
    321 322 This dequeuing results in contention on the remote queue and/or \glspl{rmr} on the \at data structure.
    322     In either case, this benchmark aims to measure how well each scheduler handles these cases, since both cases can lead to performance degradation if not handled correctly.
        323 Hence, this benchmark's performance is dominated by cache traffic, as \procs constantly access each other's data.
        324 In either case, this benchmark aims to measure how well a scheduler handles these cases, since both can lead to performance degradation if not handled correctly.
    323 325
    324 326 This benchmark uses a fixed-size array of counting semaphores.
    325     Each \at picks a random semaphore, @V@s it to unblock any \at waiting, and then @P@s on the semaphore.
        327 Each \at picks a random semaphore, @V@s it to unblock any waiting \at, and then @P@s on the semaphore, possibly blocking.
    326 328 This creates a flow where \ats push each other out of the semaphores before being pushed out themselves.
    327     For this benchmark to work, the number of \ats must be equal or greater than the number of semaphores plus the number of \procs.
        329 For this benchmark to work, the number of \ats must be equal to or greater than the number of semaphores plus the number of \procs;
        330 \eg if there are 10 semaphores and 5 \procs, but only 3 \ats, all 3 \ats can block (@P@) on a random semaphore, leaving no \ats to unblock (@V@) them.
    328 331 Note, the nature of these semaphores means the counter can go beyond 1, which can lead to nonblocking calls to @P@.
    329 332 Figure~\ref{fig:churn:code} shows pseudo code for this benchmark, where the @yield@ is replaced by @V@ and @P@.
     
    378 381
    379 382 \subsection{Results}
    380     Figures~\ref{fig:churn:jax} and Figure~\ref{fig:churn:nasus} show the throughput as a function of \proc count on Intel and AMD respectively.
    381     It uses the same representation as the previous benchmark : 15 runs where the dashed line show the extremes and the solid line the median.
        383
        384 Figures~\ref{fig:churn:jax} and \ref{fig:churn:nasus} show the throughput on Intel and AMD, respectively.
        385
    382 386 The performance cost of crossing the cache boundaries is still visible at the same \proc count.
    383     However, this benchmark has performance dominated by the cache traffic as \proc are constantly accessing the each other's data.
        387
    384 388 Scalability is notably worse than in the previous benchmarks, since there is inherently more communication between processors.
    385 389 Indeed, once the number of \glspl{hthrd} goes beyond a single socket, performance ceases to improve.