Changeset 72d1118 for doc/theses/thierry_delisle_PhD
- Timestamp: Sep 20, 2022, 2:42:15 PM (2 years ago)
- Branches: ADT, ast-experimental, master, pthread-emulation
- Children: 8f1e035
- Parents: aa9f215
- Location: doc/theses/thierry_delisle_PhD/thesis
- Files: 4 edited
doc/theses/thierry_delisle_PhD/thesis/local.bib
raa9f215 r72d1118

@@ lines 583-587 @@
     title={Per-entity load tracking},
     author={Corbet, Jonathan},
-    journal={LWN article, available at: https://lwn.net/Articles/531853},
+    journal={LWN article, available at: {\href{https://lwn.net/Articles/531853}{https://\-lwn.net/\-Articles/\-531853}}},
     year={2013}
 }

@@ lines 717-720 @@
     title = {Scheduling Benchmarks},
     author = {Thierry Delisle},
-    howpublished = {\href{https://github.com/cforall/SchedulingBenchmarks_PhD22}{https://\-github.com/\-cforall/\-Scheduling Benchmarks\_\-PhD22}},
+    howpublished = {\href{https://github.com/cforall/SchedulingBenchmarks_PhD22}{https://\-github.com/\-cforall/\-Scheduling\-Benchmarks\_\-PhD22}},
 }

@@ lines 832-835 @@
     title = "eventfd(2) Linux User's Manual",
     year = "2019",
-    month = "M Arch",
+    month = "March",
 }

@@ lines 1060-1064 @@
     year = "2020",
     month = "June",
-    howpublished = "\href{https://xkcd.com/2318/} ",
+    howpublished = "\href{https://xkcd.com/2318/}{https://\-xkcd.com/\-2318/}",
     note = "[Online; accessed 10-June-2020]"
 }

@@ lines 1069-1073 @@
     year = "2011",
     month = "June",
-    howpublished = "\href{https://xkcd.com/908/} ",
+    howpublished = "\href{https://xkcd.com/908/}{https://\-xkcd.com/\-908/}",
     note = "[Online; accessed 25-August-2022]"
 }
doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex
raa9f215 r72d1118

@@ lines 187-191 @@
 Looking at the left column on AMD, Figures~\ref{fig:cycle:nasus:ops} and \ref{fig:cycle:nasus:ns} all 4 runtimes achieve very similar throughput and scalability.
 However, as the number of \procs grows higher, the results on AMD show notably more variability than on Intel.
-The different performance improvements and plateaus are due to cache topology and appear at the expected :\proc counts of 64, 128 and 192, for the same reasons as on Intel.
+The different performance improvements and plateaus are due to cache topology and appear at the expected \proc counts of 64, 128 and 192, for the same reasons as on Intel.
 Looking next at the right column on AMD, Figures~\ref{fig:cycle:nasus:low:ops} and \ref{fig:cycle:nasus:low:ns}, Tokio and Go have the same throughput performance, while \CFA is slightly slower.
 This result is different than on Intel, where Tokio behaved like \CFA rather than behaving like Go.

@@ lines 491-496 @@
 \section{Locality}

-As mentioned in the churn benchmark, when \glslink{atsched}{unparking} a \at, it is possible to either \unpark to the local or remote ready-queue.\footnote{
-It is also possible to \unpark to a third unrelated ready-queue, but without additional knowledge about the situation, it is likely to degrade performance.}
+As mentioned in the churn benchmark, when \glslink{atsched}{unparking} a \at, it is possible to either \unpark to the local or remote sub-queue.\footnote{
+It is also possible to \unpark to a third unrelated sub-queue, but without additional knowledge about the situation, it is likely to degrade performance.}
 The locality experiment includes two variations of the churn benchmark, where a data array is added.
 In both variations, before @V@ing the semaphore, each \at calls a @work@ function which increments random cells inside the data array.

@@ lines 720-725 @@

 In both variations, the experiment effectively measures how long it takes for all \ats to run once after a given synchronization point.
-In an ideal scenario where the scheduler is strictly FIFO, every thread would run once after the synchronization and therefore the delay between leaders would be given by, $ (CSL + SL) / (NP - 1)$,
-where $CSL$ is the context-switch latency, $SL$ is the cost for enqueueing and dequeuing a \at, and $NP$ is the number of \procs.
+In an ideal scenario where the scheduler is strictly FIFO, every thread would run once after the synchronization and therefore the delay between leaders would be given by, $NT(CSL + SL) / (NP - 1)$,
+where $CSL$ is the context-switch latency, $SL$ is the cost for enqueueing and dequeuing a \at, $NT$ is the number of \ats, and $NP$ is the number of \procs.
 However, if the scheduler allows \ats to run many times before other \ats can run once, this delay increases.
 The semaphore version is an approximation of strictly FIFO scheduling, where none of the \ats \emph{attempt} to run more than once.

@@ old lines 734-740 / new lines 734-737 @@

 \begin{table}
-\caption[Transfer Benchmark on Intel and AMD]{Transfer Benchmark on Intel and AMD\smallskip\newline Average measurement of how long it takes for all \ats to acknowledge the leader \at.
-DNC stands for ``did not complete'', meaning that after 5 seconds of a new leader being decided, some \ats still had not acknowledged the new leader.}
-\label{fig:transfer:res}
 \setlength{\extrarowheight}{2pt}
 \setlength{\tabcolsep}{5pt}

@@ old lines 753-756 / new lines 750-756 @@
 \end{tabular}
 \end{centering}
+\caption[Transfer Benchmark on Intel and AMD]{Transfer Benchmark on Intel and AMD\smallskip\newline Average measurement of how long it takes for all \ats to acknowledge the leader \at.
+DNC stands for ``did not complete'', meaning that after 5 seconds of a new leader being decided, some \ats still had not acknowledged the new leader.}
+\label{fig:transfer:res}
 \end{table}

@@ old lines 768-776 / new lines 767-775 @@

 Looking at the next two columns, the results for the yield variation on Intel, the story is very different.
-\CFA achieves better latencies, presumably due to nosynchronization with the yield.
+\CFA achieves better latencies, presumably due to the lack of synchronization with the yield.
 Go does complete the experiment, but with drastically higher latency:
 latency at 2 \procs is $350\times$ higher than \CFA and $70\times$ higher at 192 \procs.
-This difference is because Go has a classic work-stealing scheduler, but it adds coarse-grain preemption
-, which interrupts the spinning leader after a period.
+This difference is because Go has a classic work-stealing scheduler, but it adds coarse-grain preemption, which interrupts the spinning leader after a period.
 Neither Libfibre nor Tokio complete the experiment.
 Both runtimes also use classical work-stealing scheduling without preemption, and therefore, none of the work queues are ever emptied so no load balancing occurs.
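To make the corrected bound in the third hunk concrete, here is a small worked instance of the formula; the values are invented round numbers chosen for illustration, not measurements from the benchmark.

{{{
% Illustrative arithmetic only; NT, NP, CSL and SL values are invented.
% With NT = 100 ats, NP = 4 procs, CSL = 1 us and SL = 0.5 us per at:
\[
    \frac{NT \, (CSL + SL)}{NP - 1}
  = \frac{100 \times (1\,\mu s + 0.5\,\mu s)}{4 - 1}
  = 50\,\mu s
\]
}}}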
doc/theses/thierry_delisle_PhD/thesis/text/io.tex
raa9f215 r72d1118

@@ lines 25-29 @@
 \paragraph{\lstinline{select}} is the oldest of these options, and takes as input a contiguous array of bits, where each bit represents a file descriptor of interest.
 Hence, the array length must be as long as the largest FD currently of interest.
-On return, it outputs the set in place to identify which of the file descriptors changed state.
+On return, it outputs the set motified in place to identify which of the file descriptors changed state.
 This destructive change means selecting in a loop requires re-initializing the array for each iteration.
 Another limit of @select@ is that calls from different \glspl{kthrd} sharing FDs are independent.

@@ lines 35-38 @@
 \paragraph{\lstinline{poll}} is the next oldest option, and takes as input an array of structures containing the FD numbers rather than their position in an array of bits, allowing a more compact input for interest sets that contain widely spaced FDs.
 For small interest sets with densely packed FDs, the @select@ bit mask can take less storage, and hence, copy less information into the kernel.
-Furthermore, @poll@ is non-destructive, so the array of structures does not have to be re-initialized on every call.
+However, @poll@ is non-destructive, so the array of structures does not have to be re-initialized on every call.
 Like @select@, @poll@ suffers from the limitation that the interest set cannot be changed by other \glspl{kthrd}, while a manager thread is blocked in @poll@.

@@ lines 314-318 @@
 A simple approach to polling is to allocate a \at per @io_uring@ instance and simply let the poller \ats poll their respective instances when scheduled.

-With the pool of S EQinstances approach, the big advantage is that it is fairly flexible.
+With the pool of SQE instances approach, the big advantage is that it is fairly flexible.
 It does not impose restrictions on what \ats submitting \io operations can and cannot do between allocations and submissions.
 It also can gracefully handle running out of resources, SQEs or the kernel returning @EBUSY@.

@@ lines 320-324 @@
 The routing and allocation algorithm needs to keep track of which ring instances have available SQEs, block incoming requests if no instance is available, prevent barging if \ats are already queued up waiting for SQEs and handle SQEs being freed.
 The submission side needs to safely append SQEs to the ring buffer, correctly handle chains, make sure no SQE is dropped or left pending forever, notify the allocation side when SQEs can be reused, and handle the kernel returning @EBUSY@.
-Compared to the private-instance approach, all this synchronization has a significant cost this synchronization is entirely overhead.
+Compared to the private-instance approach, all this synchronization has a significant cost and this synchronization is entirely overhead.

 \subsubsection{Instance borrowing}
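The select/poll hunks above hinge on one behavioural difference: select overwrites its interest set in place, while poll reports results in a separate field. The following sketch is not part of the changeset and only assumes the standard POSIX interfaces; it watches standard input with each call to illustrate the contrast.

{{{
/* Hypothetical illustration (not from the changeset): contrast select()'s
 * destructive interest set with poll()'s separate revents field.
 * Both calls watch standard input (FD 0) for readability. */
#include <poll.h>
#include <stdio.h>
#include <sys/select.h>

int main(void) {
	/* select: the fd_set is modified in place on return, so the interest
	 * set must be rebuilt (or copied) before every call in a loop. */
	fd_set rfds;
	FD_ZERO(&rfds);
	FD_SET(0, &rfds);                          /* interest set: FD 0 only */
	struct timeval tv = { .tv_sec = 1, .tv_usec = 0 };
	if (select(0 + 1, &rfds, NULL, NULL, &tv) > 0 && FD_ISSET(0, &rfds))
		printf("select: stdin is readable\n");

	/* poll: the requested events field is left untouched; results come
	 * back in revents, so the same array can be reused as-is. */
	struct pollfd pfd = { .fd = 0, .events = POLLIN, .revents = 0 };
	if (poll(&pfd, 1, 1000) > 0 && (pfd.revents & POLLIN))
		printf("poll: stdin is readable\n");
	return 0;
}
}}}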
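The later hunks describe the bookkeeping a shared pool of SQE instances requires: tracking which instances have free SQEs, blocking requesters when none do, and waking them when SQEs are released. The sketch below is a hypothetical, deliberately simplified allocator (all names invented; this is not the Cforall runtime's code) showing roughly where that synchronization cost comes from, with every acquire and release funnelled through one lock.

{{{
/* Hypothetical, simplified sketch (all names invented): a shared pool of
 * ring instances, each with a bounded number of free SQEs. Requesters block
 * when no instance has a free SQE and are woken when one is released. */
#include <pthread.h>

#define NINSTANCES        4
#define SQES_PER_INSTANCE 32

struct sqe_pool {
	pthread_mutex_t lock;
	pthread_cond_t  freed;              /* signalled when an SQE is released */
	int free_sqes[NINSTANCES];          /* free SQEs left in each instance */
};

void pool_init(struct sqe_pool *p) {
	pthread_mutex_init(&p->lock, NULL);
	pthread_cond_init(&p->freed, NULL);
	for (int i = 0; i < NINSTANCES; i++)
		p->free_sqes[i] = SQES_PER_INSTANCE;
}

/* Reserve one SQE; returns the index of the instance to submit on. */
int pool_acquire(struct sqe_pool *p) {
	pthread_mutex_lock(&p->lock);
	for (;;) {
		for (int i = 0; i < NINSTANCES; i++) {
			if (p->free_sqes[i] > 0) {
				p->free_sqes[i]--;
				pthread_mutex_unlock(&p->lock);
				return i;
			}
		}
		/* No instance has a free SQE: block until one is released. */
		pthread_cond_wait(&p->freed, &p->lock);
	}
}

/* Called when the kernel has consumed the SQE (or after an EBUSY rollback). */
void pool_release(struct sqe_pool *p, int i) {
	pthread_mutex_lock(&p->lock);
	p->free_sqes[i]++;
	pthread_cond_signal(&p->freed);
	pthread_mutex_unlock(&p->lock);
}
}}}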
doc/theses/thierry_delisle_PhD/thesis/text/practice.tex
raa9f215 r72d1118

@@ lines 101-104 @@
 This leaves too many \procs when there are not enough \ats for all the \procs to be useful.
 These idle \procs cannot be removed because their lifetime is controlled by the application, and only the application knows when the number of \ats may increase or decrease.
-While idle \procs can spin until work appears, this approach wastes energy, unnecessarily produces heat and prevents other applications from using the processor.
+While idle \procs can spin until work appears, this approach wastes energy, unnecessarily produces heat and prevents other applications from using the \gls{hthrd}.
 Therefore, idle \procs are put into an idle state, called \newterm{Idle-Sleep}, where the \gls{kthrd} is blocked until the scheduler deems it is needed.

@@ lines 107-111 @@
 First, a data structure needs to keep track of all \procs that are in idle sleep.
 Because idle sleep is spurious, this data structure has strict performance requirements, in addition to strict correctness requirements.
-Next, some mechanism is needed to block \glspl{kthrd}, \eg @pthread_cond_wait@ o na pthread semaphore.
+Next, some mechanism is needed to block \glspl{kthrd}, \eg @pthread_cond_wait@ or a pthread semaphore.
 The complexity here is to support \at \glslink{atblock}{parking} and \glslink{atsched}{unparking}, user-level locking, timers, \io operations, and all other \CFA features with minimal complexity.
 Finally, the scheduler needs a heuristic to determine when to block and unblock an appropriate number of \procs.
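As one possible shape for the blocking mechanism mentioned in the second hunk, the following hypothetical sketch (not the actual Cforall runtime) parks an idle proc's kernel thread on pthread_cond_wait and wakes one when the scheduler decides it is needed again.

{{{
/* Hypothetical sketch (not the Cforall implementation): an idle proc parks
 * its kernel thread on a condition variable until the scheduler needs it. */
#include <pthread.h>
#include <stdbool.h>

struct idle_list {
	pthread_mutex_t lock;
	pthread_cond_t  wake;
	int  sleeping;          /* procs currently in idle sleep */
	bool work_ready;        /* wake token set by the scheduler */
};

/* Called by a proc that found no work: register as idle and block. */
void idle_sleep(struct idle_list *il) {
	pthread_mutex_lock(&il->lock);
	il->sleeping++;
	while (!il->work_ready)                 /* guard against spurious wakeups */
		pthread_cond_wait(&il->wake, &il->lock);
	il->work_ready = false;                 /* consume the wake token */
	il->sleeping--;
	pthread_mutex_unlock(&il->lock);
}

/* Called by the scheduler when new work arrives and an idle proc is needed. */
void idle_wake_one(struct idle_list *il) {
	pthread_mutex_lock(&il->lock);
	if (il->sleeping > 0) {
		il->work_ready = true;
		pthread_cond_signal(&il->wake);
	}
	pthread_mutex_unlock(&il->lock);
}
}}}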