Changeset e9e3d02
Timestamp: Aug 18, 2022, 9:56:43 PM
Branches: ADT, ast-experimental, master, pthread-emulation
Children: fcfbc52
Parents: ff370d8
Files: 1 edited
--- doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex (rff370d8)
+++ doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex (re9e3d02)
@@ -55,5 +55,5 @@
 The distinction between 100 and 1 cycles is meaningful because the idle sleep subsystem is expected to matter only in the right column, where spurious effects can cause a \proc to run out of work temporarily.
 
-\section{Cycling latency}
+\section{Cycle}
 
 The most basic evaluation of any ready queue is the latency needed to push and pop one element from the ready queue.
@@ -310,8 +310,9 @@
 
 It is difficult to draw conclusions for this benchmark when runtime system treat @yield@ so differently.
-The win for \CFA is its consistency between the cycle and yield benchmarks making it simpler for programmers to use and understand, \ie it the \CFA semantics matches with programmer intuition..
+The win for \CFA is its consistency between the cycle and yield benchmarks making it simpler for programmers to use and understand, \ie the \CFA semantics match with programmer intuition.
 
 
 \section{Churn}
 
+The Cycle and Yield benchmark represent an \emph{easy} scenario for a scheduler, \eg an embarrassingly parallel application.
 In these benchmarks, \ats can be easily partitioned over the different \procs upfront and none of the \ats communicate with each other.
@@ -320,10 +321,12 @@
 With processor-specific ready-queues, when a \at is unblocked by a different \proc that means the unblocking \proc must either ``steal'' the \at from another processor or find it on a remote queue.
 This dequeuing results in either contention on the remote queue and/or \glspl{rmr} on the \at data structure.
-In either case, this benchmark aims to measure how well each scheduler handles these cases, since both cases can lead to performance degradation if not handled correctly.
+Hence, this benchmark has performance dominated by the cache traffic as \proc are constantly accessing the each other's data.
+In either case, this benchmark aims to measure how well a scheduler handles these cases, since both cases can lead to performance degradation if not handled correctly.
 
 This benchmark uses a fixed-size array of counting semaphores.
-Each \at picks a random semaphore, @V@s it to unblock any \at waiting, and then @P@s on the semaphore.
+Each \at picks a random semaphore, @V@s it to unblock any waiting \at, and then @P@s (maybe blocks) the \ats on the semaphore.
 This creates a flow where \ats push each other out of the semaphores before being pushed out themselves.
-For this benchmark to work, the number of \ats must be equal or greater than the number of semaphores plus the number of \procs.
+For this benchmark to work, the number of \ats must be equal or greater than the number of semaphores plus the number of \procs;
+\eg if there are 10 semaphores and 5 \procs, but only 3 \ats, all 3 \ats can block (P) on a random semaphore and now there is no \ats to unblock (V) them.
 Note, the nature of these semaphores mean the counter can go beyond 1, which can lead to nonblocking calls to @P@.
 Figure~\ref{fig:churn:code} shows pseudo code for this benchmark, where the @yield@ is replaced by @V@ and @P@.
@@ -378,8 +381,9 @@
 
 \subsection{Results}
-Figures~\ref{fig:churn:jax} and Figure~\ref{fig:churn:nasus} show the throughput as a function of \proc count on Intel and AMD respectively.
-It uses the same representation as the previous benchmark : 15 runs where the dashed line show the extremes and the solid line the median.
+
+Figures~\ref{fig:churn:jax} and Figure~\ref{fig:churn:nasus} show the throughput on Intel and AMD respectively.
+
 The performance cost of crossing the cache boundaries is still visible at the same \proc count.
-However, this benchmark has performance dominated by the cache traffic as \proc are constantly accessing the each other's data.
+
 Scalability is notably worst than the previous benchmarks since there is inherently more communication between processors.
 Indeed, once the number of \glspl{hthrd} goes beyond a single socket, performance ceases to improve.
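Two statements in the Churn text touched by this changeset can be illustrated with a small single-threaded model: the sizing rule (the number of \ats must be at least the number of semaphores plus the number of \procs) and the note that a counting semaphore's counter can go beyond 1, so some calls to @P@ do not block. The following is only a sketch under stated assumptions: `CountingSemaphore` and `enough_threads` are hypothetical names chosen for illustration, not part of the thesis or the \CFA benchmark code, and no real blocking, scheduling, or concurrency is modeled.

```python
from dataclasses import dataclass


@dataclass
class CountingSemaphore:
    """Single-threaded model of a counting semaphore (hypothetical, no real blocking)."""
    count: int = 0    # banked permits; may exceed 1
    waiters: int = 0  # number of callers that would currently be blocked

    def V(self) -> bool:
        """Release. Returns True if a waiter would be woken, else banks a permit."""
        if self.waiters > 0:
            self.waiters -= 1
            return True
        self.count += 1
        return False

    def P(self) -> bool:
        """Acquire. Returns True if the caller would block, False otherwise."""
        if self.count > 0:
            self.count -= 1
            return False
        self.waiters += 1
        return True


def enough_threads(n_threads: int, n_semaphores: int, n_procs: int) -> bool:
    """Sizing rule quoted in the text: threads >= semaphores + processors."""
    return n_threads >= n_semaphores + n_procs


if __name__ == "__main__":
    sem = CountingSemaphore()
    sem.V()
    sem.V()                      # counter goes beyond 1
    print(sem.count)             # banked permits
    print(sem.P(), sem.P())      # both calls find a permit, so neither blocks
    print(sem.P())               # no permit left, so this caller would block
    print(enough_threads(3, 10, 5))
```

In the example from the text (10 semaphores, 5 \procs, 3 \ats), `enough_threads(3, 10, 5)` is false, which matches the described failure mode where every \at can end up blocked with no \at left to unblock it.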