- Timestamp: Nov 24, 2022, 3:41:44 PM
- Branches: ADT, ast-experimental, master
- Children: dacd8e6e
- Parents: 82a90d4
- File: 1 edited
Legend:
- Unmodified lines are shown with a leading space
- Removed lines are prefixed with -
- Added lines are prefixed with +
doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex
--- r82a90d4
+++ rddcaff6
@@ -4,7 +4,7 @@
 This chapter presents five different experimental setups for evaluating the basic features of the \CFA, libfibre~\cite{libfibre}, Go, and Tokio~\cite{Tokio} schedulers.
 All of these systems have a \gls{uthrding} model.
-The goal of this chapter is to show that the \CFA scheduler obtains equivalent performance to other, less fair, schedulers through the different experiments.
+The goal of this chapter is to show, through the different experiments, that the \CFA scheduler obtains equivalent performance to other schedulers with lesser fairness guarantees.
 Note that only the code of the \CFA tests is shown;
-all tests in the other systems are functionally identical and available online~\cite{GITHUB:SchedulingBenchmarks}.
+all tests in the other systems are functionally identical and available both online~\cite{GITHUB:SchedulingBenchmarks} and submitted to UWSpace with the thesis itself.

 \section{Benchmark Environment}\label{microenv}
@@ -129,5 +129,6 @@
 \caption[Cycle Benchmark on Intel]{Cycle Benchmark on Intel\smallskip\newline Throughput and scalability as a function of \proc count, 5 \ats per cycle, and different cycle counts.
 For throughput, higher is better, for scalability, lower is better.
-Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
+Each series represents 15 independent runs.
+The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
 \label{fig:cycle:jax}
 \end{figure}
@@ -161,5 +162,6 @@
 \caption[Cycle Benchmark on AMD]{Cycle Benchmark on AMD\smallskip\newline Throughput and scalability as a function of \proc count, 5 \ats per cycle, and different cycle counts.
 For throughput, higher is better, for scalability, lower is better.
-Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
+Each series represents 15 independent runs.
+The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
 \label{fig:cycle:nasus}
 \end{figure}
@@ -177,5 +179,5 @@
 Looking next at the right column on Intel, Figures~\ref{fig:cycle:jax:low:ops} and \ref{fig:cycle:jax:low:ns} show the results for 1 cycle of 5 \ats for each \proc.
 \CFA and Tokio obtain very similar results overall, but Tokio shows more variations in the results.
-Go achieves slightly better performance than \CFA and Tokio, but all three display significantly worst performance compared to the left column.
+Go achieves slightly better performance than \CFA and Tokio, but all three display significantly worse performance compared to the left column.
 This decrease in performance is likely due to the additional overhead of the idle-sleep mechanism.
 This can either be the result of \procs actually running out of work or simply additional overhead from tracking whether or not there is work available.
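For context on the hunks above: the cycle benchmark arranges several \ats in a ring where each \at unblocks its successor and then blocks itself, so roughly one \at per ring is ready at any time. The thesis's tests are written in \CFA and live in the benchmark suite cited in the first hunk; the sketch below is only a hypothetical Go analogue of that shape, and the names (oneRing, ats, laps) are inventions of the sketch, not taken from the suite.

{{{#!go
// Hypothetical Go analogue of the cycle benchmark; the real tests are
// written in \CFA. Each goroutine blocks on its own channel and
// unblocks its successor, so a single token circulates per ring.
package main

import (
	"fmt"
	"sync"
	"time"
)

// oneRing passes a token `laps` full trips around a ring of `ats`
// goroutines, then circulates it one extra lap so everyone can exit.
func oneRing(ats, laps int) {
	chans := make([]chan int, ats)
	for i := range chans {
		chans[i] = make(chan int, 1)
	}
	limit := ats * laps
	var wg sync.WaitGroup
	for i := 0; i < ats; i++ {
		wg.Add(1)
		go func(in, out chan int) {
			defer wg.Done()
			for tok := range in {
				if tok >= limit { // shutdown lap: forward once, exit
					out <- tok
					return
				}
				out <- tok + 1 // unblock the successor, then block again
			}
		}(chans[i], chans[(i+1)%ats])
	}
	chans[0] <- 0 // seed a single token
	wg.Wait()
}

func main() {
	const ats, laps = 5, 100_000 // 5 \ats per cycle, as in the figures
	start := time.Now()
	oneRing(ats, laps)
	fmt.Printf("~%v per hand-off\n",
		time.Since(start)/time.Duration(ats*laps))
}
}}}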
@@ -185,5 +187,5 @@
 Looking now at the results for the AMD architecture, Figure~\ref{fig:cycle:nasus}, the results are overall similar to the Intel results, but with close to double the performance, slightly increased variation, and some differences in the details.
 Note the maximum of the Y-axis on Intel and AMD differ significantly.
-Looking at the left column on AMD, Figures~\ref{fig:cycle:nasus:ops} and \ref{fig:cycle:nasus:ns} all 4 runtimes achieve very similar throughput and scalability.
+Looking at the left column on AMD, Figures~\ref{fig:cycle:nasus:ops} and \ref{fig:cycle:nasus:ns}, all 4 runtimes achieve very similar throughput and scalability.
 However, as the number of \procs grows higher, the results on AMD show notably more variability than on Intel.
 The different performance improvements and plateaus are due to cache topology and appear at the expected \proc counts of 64, 128 and 192, for the same reasons as on Intel.
@@ -191,10 +193,10 @@
 This result is different than on Intel, where Tokio behaved like \CFA rather than behaving like Go.
 Again, the same performance increase for libfibre is visible when running fewer \ats.
-Note, I did not investigate the libfibre performance boost for 1 cycle in this experiment.
+I did not investigate the libfibre performance boost for 1 cycle in this experiment.

 The conclusion from both architectures is that all of the compared runtimes have fairly equivalent performance for this micro-benchmark.
 Clearly, the pathological case with 1 cycle per \proc can affect fairness algorithms managing mostly idle processors, \eg \CFA, but only at high core counts.
 In this case, \emph{any} helping is likely to cause a cascade of \procs running out of work and attempting to steal.
-For this experiment, the \CFA scheduler has achieved the goal of obtaining equivalent performance to other, less fair, schedulers.
+For this experiment, the \CFA scheduler has achieved the goal of obtaining equivalent performance to other schedulers with lesser fairness guarantees.

 \section{Yield}
@@ -250,5 +252,6 @@
 \caption[Yield Benchmark on Intel]{Yield Benchmark on Intel\smallskip\newline Throughput and scalability as a function of \proc count.
 For throughput, higher is better, for scalability, lower is better.
-Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
+Each series represents 15 independent runs.
+The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
 \label{fig:yield:jax}
 \end{figure}
@@ -309,5 +312,6 @@
 \caption[Yield Benchmark on AMD]{Yield Benchmark on AMD\smallskip\newline Throughput and scalability as a function of \proc count.
 For throughput, higher is better, for scalability, lower is better.
-Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
+Each series represents 15 independent runs.
+The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
 \label{fig:yield:nasus}
 \end{figure}
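The hunks above and below edit the Yield section, whose benchmark replaces the blocking hand-off of the cycle benchmark with a bare @yield@. A minimal, hypothetical Go analogue follows (again, not the thesis's \CFA code); note that Go's runtime.Gosched() is merely a rescheduling hint, which is precisely the kind of divergence in @yield@ semantics the edited text cautions about.

{{{#!go
// Hypothetical Go analogue of the yield benchmark: several goroutines
// per proc spin on runtime.Gosched() and count completed yields.
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	const procs = 4
	const ats = 5 * procs // several yielding goroutines per proc
	runtime.GOMAXPROCS(procs)

	var ops atomic.Uint64
	stop := make(chan struct{})
	var wg sync.WaitGroup
	for i := 0; i < ats; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				select {
				case <-stop:
					return
				default:
					runtime.Gosched() // the yield under test
					ops.Add(1)
				}
			}
		}()
	}
	time.Sleep(time.Second)
	close(stop)
	wg.Wait()
	fmt.Printf("%d yields per second\n", ops.Load())
}
}}}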
@@ -317,5 +321,5 @@
 Looking at the left column first, Figures~\ref{fig:yield:nasus:ops} and \ref{fig:yield:nasus:ns}, \CFA achieves very similar throughput and scaling.
 Libfibre still outpaces all other runtimes, but it encounters a performance hit at 64 \procs.
-This anomaly suggests some amount of communication between the \procs that the Intel machine is able to mask where the AMD is not once hyperthreading is needed.
+This anomaly suggests some amount of communication between the \procs that the Intel machine is able to mask where the AMD is not, once hyperthreading is needed.
 Go and Tokio still display the same performance collapse as on Intel.
 Looking next at the right column on AMD, Figures~\ref{fig:yield:nasus:low:ops} and \ref{fig:yield:nasus:low:ns}, all runtime systems effectively behave the same as they did on the Intel machine.
@@ -324,5 +328,5 @@

 It is difficult to draw conclusions for this benchmark when runtime systems treat @yield@ so differently.
-The win for \CFA is its consistency between the cycle and yield benchmarks making it simpler for programmers to use and understand, \ie the \CFA semantics match with programmer intuition.
+The win for \CFA is its consistency between the cycle and yield benchmarks, making it simpler for programmers to use and understand, \ie the \CFA semantics match with programmer intuition.

@@ -333,6 +337,6 @@

 The Churn benchmark represents more chaotic executions, where there is more communication among \ats but no relationship between the last \proc on which a \at ran and blocked, and the \proc that subsequently unblocks it.
-With processor-specific ready-queues, when a \at is unblocked by a different \proc that means the unblocking \proc must either ``steal'' the \at from another processor or find it on a remote queue.
+With processor-specific ready-queues, when a \at is unblocked by a different \proc, that means the unblocking \proc must either ``steal'' the \at from another processor or find it on a remote queue.
 This dequeuing results in either contention on the remote queue and/or \glspl{rmr} on the \at data structure.
-Hence, this benchmark has performance dominated by the cache traffic as \procs are constantly accessing each other's data.
+Hence, this benchmark has performance dominated by the cache traffic as \procs are constantly accessing each others' data.
 In either case, this benchmark aims to measure how well a scheduler handles these cases since both cases can lead to performance degradation if not handled correctly.
@@ -392,5 +396,6 @@
 \caption[Churn Benchmark on Intel]{Churn Benchmark on Intel\smallskip\newline Throughput and scalability as a function of \proc count.
 For throughput, higher is better, for scalability, lower is better.
-Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
+Each series represents 15 independent runs.
+The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
 \label{fig:churn:jax}
 \end{figure}
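One plausible shape for the churn pattern just described, where the \proc that unblocks a \at has no relation to the \proc the \at last ran on, is sketched below in Go. It is an illustration only, not the benchmark's actual structure: buffered channels stand in for counting semaphores, and the constants and shutdown channel are inventions of this sketch.

{{{#!go
// A plausible Go shape for the churn pattern: V a random semaphore,
// waking an arbitrary blocked goroutine, then P a random semaphore.
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	const ats, nsems = 8, 4
	sems := make([]chan struct{}, nsems)
	for i := range sems {
		// capacity 2*ats bounds the tokens ever in flight, so a V
		// (send) can never block; only a P (receive) can
		sems[i] = make(chan struct{}, 2*ats)
	}
	for i := 0; i < ats; i++ { // one token per goroutine to start
		sems[i%nsems] <- struct{}{}
	}

	var ops atomic.Uint64
	stop := make(chan struct{})
	var wg sync.WaitGroup
	for i := 0; i < ats; i++ {
		wg.Add(1)
		go func(seed int64) {
			defer wg.Done()
			r := rand.New(rand.NewSource(seed)) // per-goroutine RNG
			for {
				sems[r.Intn(nsems)] <- struct{}{} // V: unblock someone
				select {
				case <-sems[r.Intn(nsems)]: // P: the waker may run on
					ops.Add(1) // any \proc, unrelated to ours
				case <-stop:
					return
				}
			}
		}(int64(i))
	}
	time.Sleep(time.Second)
	close(stop)
	wg.Wait()
	fmt.Printf("%d V/P pairs per second\n", ops.Load())
}
}}}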
@@ -404,11 +409,11 @@
 Tokio achieves very similar performance to \CFA, with the starting boost, scaling decently until 48 \procs, drops from 48 to 72 \procs, and starts increasing again to 192 \procs.
 Libfibre obtains effectively the same results as Tokio with slightly less scaling, \ie the scaling curve is the same but with slightly lower values.
-Finally, Go gets the most peculiar results, scaling worst than other runtimes until 48 \procs.
+Finally, Go gets the most peculiar results, scaling worse than other runtimes until 48 \procs.
 At 72 \procs, the results of the Go runtime vary significantly, sometimes scaling sometimes plateauing.
 However, beyond this point Go keeps this level of variation but does not scale further in any of the runs.

-Throughput and scalability are notably worst for all runtimes than the previous benchmarks since there is inherently more communication between processors.
+Throughput and scalability are notably worse for all runtimes than the previous benchmarks since there is inherently more communication between processors.
 Indeed, none of the runtimes reach 40 million operations per second while in the cycle benchmark all but libfibre reached 400 million operations per second.
 Figures~\ref{fig:churn:jax:ns} and \ref{fig:churn:jax:low:ns} show that for all \proc counts, all runtimes produce poor scaling.
-However, once the number of \glspl{hthrd} goes beyond a single socket, at 48 \procs, scaling goes from bad to worst and performance completely ceases to improve.
+However, once the number of \glspl{hthrd} goes beyond a single socket, at 48 \procs, scaling goes from bad to worse and performance completely ceases to improve.
 At this point, the benchmark is dominated by inter-socket communication costs for all runtimes.
@@ -457,5 +462,6 @@
 \caption[Churn Benchmark on AMD]{Churn Benchmark on AMD\smallskip\newline Throughput and scalability as a function of \proc count.
 For throughput, higher is better, for scalability, lower is better.
-Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
+Each series represents 15 independent runs.
+The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
 \label{fig:churn:nasus}
 \end{figure}
@@ -600,5 +606,6 @@
 \caption[Locality Benchmark on Intel]{Locality Benchmark on Intel\smallskip\newline Throughput and scalability as a function of \proc count.
 For throughput, higher is better, for scalability, lower is better.
-Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
+Each series represents 15 independent runs.
+The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
 \label{fig:locality:jax}
 \end{figure}
@@ -632,5 +639,6 @@
 \caption[Locality Benchmark on AMD]{Locality Benchmark on AMD\smallskip\newline Throughput and scalability as a function of \proc count.
 For throughput, higher is better, for scalability, lower is better.
-Each series represent 15 independent runs, the dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
+Each series represents 15 independent runs.
+The dashed lines are the maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
 \label{fig:locality:nasus}
 \end{figure}
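All of the edited captions share one plotting convention: each series is 15 independent runs, drawn as dashed (maximum), solid (median) and dotted (minimum) lines. A small helper shows how one series reduces to those three plotted values; the input numbers are placeholders for illustration, not measurements from the thesis.

{{{#!go
// How each plotted series reduces to the three lines described in
// the captions: minimum (dotted), median (solid), maximum (dashed).
package main

import (
	"fmt"
	"sort"
)

func seriesStats(runs []float64) (min, median, max float64) {
	s := append([]float64(nil), runs...) // sort a copy
	sort.Float64s(s)
	min, max = s[0], s[len(s)-1]
	if n := len(s); n%2 == 1 {
		median = s[n/2]
	} else {
		median = (s[n/2-1] + s[n/2]) / 2
	}
	return
}

func main() {
	// 15 independent runs at one \proc count (placeholder values)
	runs := []float64{391, 402, 399, 404, 410, 395, 401, 403,
		398, 400, 405, 397, 396, 408, 394}
	lo, mid, hi := seriesStats(runs)
	fmt.Printf("min %.0f, median %.0f, max %.0f\n", lo, mid, hi)
}
}}}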
@@ -648,4 +656,4 @@
 Go still has the same poor performance as on Intel.

-Finally looking at the right column, Figures~\ref{fig:locality:nasus:noshare:ops} and \ref{fig:locality:nasus:noshare:ns}, like on Intel, the same performance inversion is present between libfibre and \CFA/Tokio.
+Finally, looking at the right column, Figures~\ref{fig:locality:nasus:noshare:ops} and \ref{fig:locality:nasus:noshare:ns}, like on Intel, the same performance inversion is present between libfibre and \CFA/Tokio.
 Go still has the same poor performance.
@@ -751,4 +759,5 @@
 \end{centering}
 \caption[Transfer Benchmark on Intel and AMD]{Transfer Benchmark on Intel and AMD\smallskip\newline Average measurement of how long it takes for all \ats to acknowledge the leader \at.
+For each runtime, the average is calculated over 100'000 transfers, except for Go which only has 1000 transfer (due to the difference in transfer time).
 DNC stands for ``did not complete'', meaning that after 5 seconds of a new leader being decided, some \ats still had not acknowledged the new leader.}
 \label{fig:transfer:res}
@@ -764,5 +773,5 @@

 The first two columns show the results for the semaphore variation on Intel.
-While there are some differences in latencies, \CFA is consistently the fastest and Tokio the slowest, all runtimes achieve fairly close results.
-Again, this experiment is meant to highlight major differences so latencies within $10\times$ of each other are considered equal.
+While there are some differences in latencies, with \CFA consistently the fastest and Tokio the slowest, all runtimes achieve fairly close results.
+Again, this experiment is meant to highlight major differences, so latencies within $10\times$ of each other are considered equal.
 Looking at the next two columns, the results for the yield variation on Intel, the story is very different.
@@ -780,5 +789,5 @@
 Neither Libfibre nor Tokio complete the experiment.

-This experiment clearly demonstrates that \CFA achieves significantly better fairness.
+This experiment clearly demonstrates that \CFA achieves a stronger fairness guarantee.
 The semaphore variation serves as a control, where all runtimes are expected to transfer leadership fairly quickly.
 Since \ats block after acknowledging the leader, this experiment effectively measures how quickly \procs can steal \ats from the \proc running the leader.
@@ -790,5 +799,5 @@
 Without \procs stealing from the \proc running the leader, the experiment cannot terminate.
 Go manages to complete the experiment because it adds preemption on top of classic work-stealing.
-However, since preemption is fairly infrequent, it achieves significantly worst performance.
+However, since preemption is fairly infrequent, it achieves significantly worse performance.
 In contrast, \CFA achieves equivalent performance in both variations, demonstrating very good fairness.
 Interestingly \CFA achieves better delays in the yielding version than the semaphore version, however, that is likely due to fairness being equivalent but removing the cost of the semaphores and idle sleep.
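The final hunks concern the transfer benchmark, in which a leader \at is repeatedly chosen and the time until all other \ats acknowledge it is measured, with the non-leaders either blocking on a semaphore or spinning on @yield@. The sketch below is a hypothetical Go rendering of the yield variation with a fixed leader (the real benchmark rotates leadership and also has the semaphore variation); it illustrates why, without preemption, spinning workers can starve and produce the DNC outcome described in the caption.

{{{#!go
// Hypothetical, reduced rendering of the transfer benchmark's yield
// variation: a fixed leader publishes a "generation" and waits until
// every worker acknowledges it. Workers never block, so a
// work-stealing scheduler without preemption may never run them all.
package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
	"time"
)

func main() {
	const workers = 7
	var gen atomic.Uint64                  // generation the leader publishes
	acks := make([]atomic.Uint64, workers) // last generation each worker saw

	for i := range acks {
		go func(me int) {
			for {
				acks[me].Store(gen.Load()) // acknowledge the leader
				runtime.Gosched()          // yield variation: spin, never block
			}
		}(i)
	}

	for round := uint64(1); round <= 3; round++ {
		start := time.Now()
		gen.Store(round)      // publish a new leader round
		for i := range acks { // wait for every acknowledgement
			for acks[i].Load() != round {
				runtime.Gosched()
			}
		}
		fmt.Printf("round %d acknowledged in %v\n", round, time.Since(start))
	}
}
}}}

Go completes this experiment only because its runtime adds preemption on top of classic work stealing, matching the explanation in the edited text above.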