Changeset 72d1118 for doc/theses/thierry_delisle_PhD
- Timestamp: Sep 20, 2022, 2:42:15 PM (2 years ago)
- Branches: ADT, ast-experimental, master, pthread-emulation
- Children: 8f1e035
- Parents: aa9f215
- Location: doc/theses/thierry_delisle_PhD/thesis
- Files: 4 edited
doc/theses/thierry_delisle_PhD/thesis/local.bib
raa9f215 r72d1118

@@ lines 583-587 @@
     title={Per-entity load tracking},
     author={Corbet, Jonathan},
-    journal={LWN article, available at: https://lwn.net/Articles/531853},
+    journal={LWN article, available at: {\href{https://lwn.net/Articles/531853}{https://\-lwn.net/\-Articles/\-531853}}},
     year={2013}
 }

@@ lines 717-720 @@
     title = {Scheduling Benchmarks},
     author = {Thierry Delisle},
-    howpublished = {\href{https://github.com/cforall/SchedulingBenchmarks_PhD22}{https://\-github.com/\-cforall/\-Scheduling Benchmarks\_\-PhD22}},
+    howpublished = {\href{https://github.com/cforall/SchedulingBenchmarks_PhD22}{https://\-github.com/\-cforall/\-Scheduling\-Benchmarks\_\-PhD22}},
 }

@@ lines 832-835 @@
     title = "eventfd(2) Linux User's Manual",
     year = "2019",
-    month = "M Arch",
+    month = "March",
 }

@@ lines 1060-1064 @@
     year = "2020",
     month = "June",
-    howpublished = "\href{https://xkcd.com/2318/} ",
+    howpublished = "\href{https://xkcd.com/2318/}{https://\-xkcd.com/\-2318/}",
     note = "[Online; accessed 10-June-2020]"
 }

@@ lines 1069-1073 @@
     year = "2011",
     month = "June",
-    howpublished = "\href{https://xkcd.com/908/} ",
+    howpublished = "\href{https://xkcd.com/908/}{https://\-xkcd.com/\-908/}",
     note = "[Online; accessed 25-August-2022]"
 }
doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex
raa9f215 r72d1118

@@ lines 187-191 @@
 Looking at the left column on AMD, Figures~\ref{fig:cycle:nasus:ops} and \ref{fig:cycle:nasus:ns} all 4 runtimes achieve very similar throughput and scalability.
 However, as the number of \procs grows higher, the results on AMD show notably more variability than on Intel.
-The different performance improvements and plateaus are due to cache topology and appear at the expected :\proc counts of 64, 128 and 192, for the same reasons as on Intel.
+The different performance improvements and plateaus are due to cache topology and appear at the expected \proc counts of 64, 128 and 192, for the same reasons as on Intel.
 Looking next at the right column on AMD, Figures~\ref{fig:cycle:nasus:low:ops} and \ref{fig:cycle:nasus:low:ns}, Tokio and Go have the same throughput performance, while \CFA is slightly slower.
 This result is different than on Intel, where Tokio behaved like \CFA rather than behaving like Go.

@@ lines 491-496 @@
 \section{Locality}

-As mentioned in the churn benchmark, when \glslink{atsched}{unparking} a \at, it is possible to either \unpark to the local or remote ready-queue.\footnote{
-It is also possible to \unpark to a third unrelated ready-queue, but without additional knowledge about the situation, it is likely to degrade performance.}
+As mentioned in the churn benchmark, when \glslink{atsched}{unparking} a \at, it is possible to either \unpark to the local or remote sub-queue.\footnote{
+It is also possible to \unpark to a third unrelated sub-queue, but without additional knowledge about the situation, it is likely to degrade performance.}
 The locality experiment includes two variations of the churn benchmark, where a data array is added.
 In both variations, before @V@ing the semaphore, each \at calls a @work@ function which increments random cells inside the data array.

@@ lines 720-725 @@

 In both variations, the experiment effectively measures how long it takes for all \ats to run once after a given synchronization point.
-In an ideal scenario where the scheduler is strictly FIFO, every thread would run once after the synchronization and therefore the delay between leaders would be given by, $ (CSL + SL) / (NP - 1)$,
-where $CSL$ is the context-switch latency, $SL$ is the cost for enqueueing and dequeuing a \at, and $NP$ is the number of \procs.
+In an ideal scenario where the scheduler is strictly FIFO, every thread would run once after the synchronization and therefore the delay between leaders would be given by, $NT(CSL + SL) / (NP - 1)$,
+where $CSL$ is the context-switch latency, $SL$ is the cost for enqueueing and dequeuing a \at, $NT$ is the number of \ats, and $NP$ is the number of \procs.
 However, if the scheduler allows \ats to run many times before other \ats can run once, this delay increases.
 The semaphore version is an approximation of strictly FIFO scheduling, where none of the \ats \emph{attempt} to run more than once.

@@ old lines 734-740 / new lines 734-737 @@

 \begin{table}
-\caption[Transfer Benchmark on Intel and AMD]{Transfer Benchmark on Intel and AMD\smallskip\newline Average measurement of how long it takes for all \ats to acknowledge the leader \at.
-DNC stands for ``did not complete'', meaning that after 5 seconds of a new leader being decided, some \ats still had not acknowledged the new leader.}
-\label{fig:transfer:res}
 \setlength{\extrarowheight}{2pt}
 \setlength{\tabcolsep}{5pt}

@@ old lines 753-756 / new lines 750-756 @@
 \end{tabular}
 \end{centering}
+\caption[Transfer Benchmark on Intel and AMD]{Transfer Benchmark on Intel and AMD\smallskip\newline Average measurement of how long it takes for all \ats to acknowledge the leader \at.
+DNC stands for ``did not complete'', meaning that after 5 seconds of a new leader being decided, some \ats still had not acknowledged the new leader.}
+\label{fig:transfer:res}
 \end{table}

@@ old lines 768-776 / new lines 767-775 @@

 Looking at the next two columns, the results for the yield variation on Intel, the story is very different.
-\CFA achieves better latencies, presumably due to nosynchronization with the yield.
+\CFA achieves better latencies, presumably due to the lack of synchronization with the yield.
 Go does complete the experiment, but with drastically higher latency:
 latency at 2 \procs is $350\times$ higher than \CFA and $70\times$ higher at 192 \procs.
-This difference is because Go has a classic work-stealing scheduler, but it adds coarse-grain preemption
-, which interrupts the spinning leader after a period.
+This difference is because Go has a classic work-stealing scheduler, but it adds coarse-grain preemption, which interrupts the spinning leader after a period.
 Neither Libfibre nor Tokio complete the experiment.
 Both runtimes also use classical work-stealing scheduling without preemption, and therefore, none of the work queues are ever emptied so no load balancing occurs.
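To make the corrected bound in the third hunk concrete, here is a small worked instance of the formula; the values are invented round numbers chosen for illustration, not measurements from the benchmark.

{{{
% Illustrative arithmetic only; NT, NP, CSL and SL values are invented.
% With NT = 100 ats, NP = 4 procs, CSL = 1 us and SL = 0.5 us per at:
\[
    \frac{NT \, (CSL + SL)}{NP - 1}
  = \frac{100 \times (1\,\mu s + 0.5\,\mu s)}{4 - 1}
  = 50\,\mu s
\]
}}}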
doc/theses/thierry_delisle_PhD/thesis/text/io.tex
raa9f215 r72d1118

@@ lines 25-29 @@
 \paragraph{\lstinline{select}} is the oldest of these options, and takes as input a contiguous array of bits, where each bit represents a file descriptor of interest.
 Hence, the array length must be as long as the largest FD currently of interest.
-On return, it outputs the set in place to identify which of the file descriptors changed state.
+On return, it outputs the set motified in place to identify which of the file descriptors changed state.
 This destructive change means selecting in a loop requires re-initializing the array for each iteration.
 Another limit of @select@ is that calls from different \glspl{kthrd} sharing FDs are independent.

@@ lines 35-38 @@
 \paragraph{\lstinline{poll}} is the next oldest option, and takes as input an array of structures containing the FD numbers rather than their position in an array of bits, allowing a more compact input for interest sets that contain widely spaced FDs.
 For small interest sets with densely packed FDs, the @select@ bit mask can take less storage, and hence, copy less information into the kernel.
-Furthermore, @poll@ is non-destructive, so the array of structures does not have to be re-initialized on every call.
+However, @poll@ is non-destructive, so the array of structures does not have to be re-initialized on every call.
 Like @select@, @poll@ suffers from the limitation that the interest set cannot be changed by other \glspl{kthrd}, while a manager thread is blocked in @poll@.

@@ lines 314-318 @@
 A simple approach to polling is to allocate a \at per @io_uring@ instance and simply let the poller \ats poll their respective instances when scheduled.

-With the pool of S EQinstances approach, the big advantage is that it is fairly flexible.
+With the pool of SQE instances approach, the big advantage is that it is fairly flexible.
 It does not impose restrictions on what \ats submitting \io operations can and cannot do between allocations and submissions.
 It also can gracefully handle running out of resources, SQEs or the kernel returning @EBUSY@.

@@ lines 320-324 @@
 The routing and allocation algorithm needs to keep track of which ring instances have available SQEs, block incoming requests if no instance is available, prevent barging if \ats are already queued up waiting for SQEs and handle SQEs being freed.
 The submission side needs to safely append SQEs to the ring buffer, correctly handle chains, make sure no SQE is dropped or left pending forever, notify the allocation side when SQEs can be reused, and handle the kernel returning @EBUSY@.
-Compared to the private-instance approach, all this synchronization has a significant cost this synchronization is entirely overhead.
+Compared to the private-instance approach, all this synchronization has a significant cost and this synchronization is entirely overhead.

 \subsubsection{Instance borrowing}
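The select/poll hunks above hinge on one behavioural difference: select overwrites its interest set in place, while poll reports results in a separate field. The following sketch is not part of the changeset and only assumes the standard POSIX interfaces; it watches standard input with each call to illustrate the contrast.

{{{
/* Hypothetical illustration (not from the changeset): contrast select()'s
 * destructive interest set with poll()'s separate revents field.
 * Both calls watch standard input (FD 0) for readability. */
#include <poll.h>
#include <stdio.h>
#include <sys/select.h>

int main(void) {
	/* select: the fd_set is modified in place on return, so the interest
	 * set must be rebuilt (or copied) before every call in a loop. */
	fd_set rfds;
	FD_ZERO(&rfds);
	FD_SET(0, &rfds);                          /* interest set: FD 0 only */
	struct timeval tv = { .tv_sec = 1, .tv_usec = 0 };
	if (select(0 + 1, &rfds, NULL, NULL, &tv) > 0 && FD_ISSET(0, &rfds))
		printf("select: stdin is readable\n");

	/* poll: the requested events field is left untouched; results come
	 * back in revents, so the same array can be reused as-is. */
	struct pollfd pfd = { .fd = 0, .events = POLLIN, .revents = 0 };
	if (poll(&pfd, 1, 1000) > 0 && (pfd.revents & POLLIN))
		printf("poll: stdin is readable\n");
	return 0;
}
}}}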
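The later hunks describe the bookkeeping a shared pool of SQE instances requires: tracking which instances have free SQEs, blocking requesters when none do, and waking them when SQEs are released. The sketch below is a hypothetical, deliberately simplified allocator (all names invented; this is not the Cforall runtime's code) showing roughly where that synchronization cost comes from, with every acquire and release funnelled through one lock.

{{{
/* Hypothetical, simplified sketch (all names invented): a shared pool of
 * ring instances, each with a bounded number of free SQEs. Requesters block
 * when no instance has a free SQE and are woken when one is released. */
#include <pthread.h>

#define NINSTANCES        4
#define SQES_PER_INSTANCE 32

struct sqe_pool {
	pthread_mutex_t lock;
	pthread_cond_t  freed;              /* signalled when an SQE is released */
	int free_sqes[NINSTANCES];          /* free SQEs left in each instance */
};

void pool_init(struct sqe_pool *p) {
	pthread_mutex_init(&p->lock, NULL);
	pthread_cond_init(&p->freed, NULL);
	for (int i = 0; i < NINSTANCES; i++)
		p->free_sqes[i] = SQES_PER_INSTANCE;
}

/* Reserve one SQE; returns the index of the instance to submit on. */
int pool_acquire(struct sqe_pool *p) {
	pthread_mutex_lock(&p->lock);
	for (;;) {
		for (int i = 0; i < NINSTANCES; i++) {
			if (p->free_sqes[i] > 0) {
				p->free_sqes[i]--;
				pthread_mutex_unlock(&p->lock);
				return i;
			}
		}
		/* No instance has a free SQE: block until one is released. */
		pthread_cond_wait(&p->freed, &p->lock);
	}
}

/* Called when the kernel has consumed the SQE (or after an EBUSY rollback). */
void pool_release(struct sqe_pool *p, int i) {
	pthread_mutex_lock(&p->lock);
	p->free_sqes[i]++;
	pthread_cond_signal(&p->freed);
	pthread_mutex_unlock(&p->lock);
}
}}}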
doc/theses/thierry_delisle_PhD/thesis/text/practice.tex
raa9f215 r72d1118

@@ lines 101-104 @@
 This leaves too many \procs when there are not enough \ats for all the \procs to be useful.
 These idle \procs cannot be removed because their lifetime is controlled by the application, and only the application knows when the number of \ats may increase or decrease.
-While idle \procs can spin until work appears, this approach wastes energy, unnecessarily produces heat and prevents other applications from using the processor.
+While idle \procs can spin until work appears, this approach wastes energy, unnecessarily produces heat and prevents other applications from using the \gls{hthrd}.
 Therefore, idle \procs are put into an idle state, called \newterm{Idle-Sleep}, where the \gls{kthrd} is blocked until the scheduler deems it is needed.

@@ lines 107-111 @@
 First, a data structure needs to keep track of all \procs that are in idle sleep.
 Because idle sleep is spurious, this data structure has strict performance requirements, in addition to strict correctness requirements.
-Next, some mechanism is needed to block \glspl{kthrd}, \eg @pthread_cond_wait@ o na pthread semaphore.
+Next, some mechanism is needed to block \glspl{kthrd}, \eg @pthread_cond_wait@ or a pthread semaphore.
 The complexity here is to support \at \glslink{atblock}{parking} and \glslink{atsched}{unparking}, user-level locking, timers, \io operations, and all other \CFA features with minimal complexity.
 Finally, the scheduler needs a heuristic to determine when to block and unblock an appropriate number of \procs.
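As one possible shape for the blocking mechanism mentioned in the second hunk, the following hypothetical sketch (not the actual Cforall runtime) parks an idle proc's kernel thread on pthread_cond_wait and wakes one when the scheduler decides it is needed again.

{{{
/* Hypothetical sketch (not the Cforall implementation): an idle proc parks
 * its kernel thread on a condition variable until the scheduler needs it. */
#include <pthread.h>
#include <stdbool.h>

struct idle_list {
	pthread_mutex_t lock;
	pthread_cond_t  wake;
	int  sleeping;          /* procs currently in idle sleep */
	bool work_ready;        /* wake token set by the scheduler */
};

/* Called by a proc that found no work: register as idle and block. */
void idle_sleep(struct idle_list *il) {
	pthread_mutex_lock(&il->lock);
	il->sleeping++;
	while (!il->work_ready)                 /* guard against spurious wakeups */
		pthread_cond_wait(&il->wake, &il->lock);
	il->work_ready = false;                 /* consume the wake token */
	il->sleeping--;
	pthread_mutex_unlock(&il->lock);
}

/* Called by the scheduler when new work arrives and an idle proc is needed. */
void idle_wake_one(struct idle_list *il) {
	pthread_mutex_lock(&il->lock);
	if (il->sleeping > 0) {
		il->work_ready = true;
		pthread_cond_signal(&il->wake);
	}
	pthread_mutex_unlock(&il->lock);
}
}}}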