# Changeset e9e3d02

Timestamp: Aug 18, 2022, 9:56:43 PM

Children: fcfbc52

Parents: ff370d8

Message: small changes to Churn

File: 1 edited

The distinction between 100 and 1 cycles is meaningful because the idle sleep subsystem is expected to matter only in the right column, where spurious effects can cause a \proc to run out of work temporarily.

\section{Cycling latency}
\section{Cycle}

The most basic evaluation of any ready queue is the latency needed to push and pop one element from the ready queue.

It is difficult to draw conclusions for this benchmark when runtime system treat @yield@ so differently.

The win for \CFA is its consistency between the cycle and yield benchmarks making it simpler for programmers to use and understand, \ie it the \CFA semantics matches with programmer intuition..
The win for \CFA is its consistency between the cycle and yield benchmarks making it simpler for programmers to use and understand, \ie the \CFA semantics match with programmer intuition.

\section{Churn}
The Cycle and Yield benchmark represent an \emph{easy} scenario for a scheduler, \eg an embarrassingly parallel application.
In these benchmarks, \ats can be easily partitioned over the different \procs upfront and none of the \ats communicate with each other.

With processor-specific ready-queues, when a \at is unblocked by a different \proc that means the unblocking \proc must either ``steal'' the \at from another processor or find it on a remote queue.
This dequeuing results in either contention on the remote queue and/or \glspl{rmr} on the \at data structure.

In either case, this benchmark aims to measure how well each scheduler handles these cases, since both cases can lead to performance degradation if not handled correctly.
Hence, this benchmark has performance dominated by the cache traffic as \proc are constantly accessing the each other's data.
In either case, this benchmark aims to measure how well a scheduler handles these cases, since both cases can lead to performance degradation if not handled correctly.

This benchmark uses a fixed-size array of counting semaphores.

Each \at picks a random semaphore, @V@s it to unblock any \at waiting, and then @P@s on the semaphore.
Each \at picks a random semaphore, @V@s it to unblock any waiting \at, and then @P@s (maybe blocks) the \ats on the semaphore.

This creates a flow where \ats push each other out of the semaphores before being pushed out themselves.

For this benchmark to work, the number of \ats must be equal or greater than the number of semaphores plus the number of \procs.
For this benchmark to work, the number of \ats must be equal or greater than the number of semaphores plus the number of \procs; \eg if there are 10 semaphores and 5 \procs, but only 3 \ats, all 3 \ats can block (P) on a random semaphore and now there is no \ats to unblock (V) them.

Note, the nature of these semaphores mean the counter can go beyond 1, which can lead to nonblocking calls to @P@.
Figure~\ref{fig:churn:code} shows pseudo code for this benchmark, where the @yield@ is replaced by @V@ and @P@.

\subsection{Results}

Figures~\ref{fig:churn:jax} and Figure~\ref{fig:churn:nasus} show the throughput as a function of \proc count on Intel and AMD respectively.
It uses the same representation as the previous benchmark : 15 runs where the dashed line show the extremes and the solid line the median.
Figures~\ref{fig:churn:jax} and Figure~\ref{fig:churn:nasus} show the throughput on Intel and AMD respectively.

The performance cost of crossing the cache boundaries is still visible at the same \proc count.
However, this benchmark has performance dominated by the cache traffic as \proc are constantly accessing the each other's data.
Scalability is notably worst than the previous benchmarks since there is inherently more communication between processors.
Indeed, once the number of \glspl{hthrd} goes beyond a single socket, performance ceases to improve.
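
The sizing requirement stated for the Churn benchmark (the number of \ats must be at least the number of semaphores plus the number of \procs) can be written as a one-line check. The following is only an illustrative sketch in Python, not part of the changeset or of the benchmark code; the function name `churn_has_enough_threads` and its parameter names are hypothetical.

```python
def churn_has_enough_threads(n_threads: int, n_semaphores: int, n_procs: int) -> bool:
    """Check the sizing rule quoted in the text (hypothetical helper):
    threads >= semaphores + processors."""
    return n_threads >= n_semaphores + n_procs


if __name__ == "__main__":
    # Example from the text: 10 semaphores, 5 procs, but only 3 threads.
    print(churn_has_enough_threads(3, 10, 5))
    # Smallest thread count that satisfies the rule for the same setup.
    print(churn_has_enough_threads(15, 10, 5))
```

With the configuration from the text (10 semaphores, 5 \procs, 3 \ats) the check fails, while 15 \ats is the smallest count that passes it for the same setup.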