Changeset c04a19e


Timestamp:
Jan 4, 2021, 3:49:08 PM
Author:
Thierry Delisle <tdelisle@…>
Branches:
ADT, arm-eh, ast-experimental, enum, forall-pointer-decay, jacob/cs343-translation, master, new-ast-unique-expr, pthread-emulation, qualifiedEnum
Children:
95eec2c
Parents:
abc2a643
Message:

Major update to chapter 3

Location:
doc/theses/thierry_delisle_PhD/thesis
Files:
1 added
5 edited

  • doc/theses/thierry_delisle_PhD/thesis/Makefile

    rabc2a643 rc04a19e  
    2828        base \
    2929        empty \
     30        emptybit \
     31        emptytls \
     32        emptytree \
     33        fairness \
    3034        system \
    3135}
     
    7377        ${LaTeX} $<
    7478
     79build/fairness.svg : fig/fairness.py | ${Build}
     80        python3 $< $@
     81
    7582## Define the default recipes.
    7683
     
    8895        fig2dev -L pstex_t -p ${Build}/$@ $< > ${Build}/$@_t
    8996
     97%.pstex : build/%.svg | ${Build}
     98        inkscape -z -D --file=$< --export-eps=${Build}/$@ --export-latex
     99        mv ${Build}/$@_tex ${Build}/$@_t
     100        echo "sed -i 's/$@/${Build}/$@/g' ${Build}/$@_t"
     101        sed -i 's/$@/${Build}\/$@/g' ${Build}/$@_t
     102
    90103## pstex with inverted colors
    91104%.dark.pstex : fig/%.fig Makefile | ${Build}
  • doc/theses/thierry_delisle_PhD/thesis/glossary.tex

    rabc2a643 rc04a19e  
    77\newacronym{io}{I/O}{Input and Output}
    88\newacronym{numa}{NUMA}{Non-Uniform Memory Access}
     9\newacronym{prng}{PRNG}{Pseudo Random Number Generator}
    910\newacronym{raii}{RAII}{Resource Acquisition Is Initialization}
    1011\newacronym{tls}{TLS}{Thread Local Storage}
  • doc/theses/thierry_delisle_PhD/thesis/local.bib

    rabc2a643 rc04a19e  
    609609  note = "[Online; accessed 23-October-2020]"
    610610}
     611
     612@misc{wiki:lcg,
     613  author = "{Wikipedia contributors}",
     614  title = "Linear congruential generator --- {W}ikipedia{,} The Free Encyclopedia",
     615  year = "2020",
     616  url = "https://en.wikipedia.org/wiki/Linear_congruential_generator",
     617  note = "[Online; accessed 2-January-2021]"
     618}
  • doc/theses/thierry_delisle_PhD/thesis/text/core.tex

    rabc2a643 rc04a19e  
    11\chapter{Scheduling Core}\label{core}
    22
    3 Before discussing scheduling in general, where it is important to address systems that are changing states, this document discusses scheduling in a somewhat ideal scenerio, where the system has reached a steady state. For this purpose, a steady state is loosely defined as a state where there are always \glspl{thrd} ready to run and but the system has the ressources necessary to accomplish the work. In short, the system is neither overloaded or underloaded.
     3Before discussing scheduling in general, where it is important to address systems that are changing states, this document discusses scheduling in a somewhat ideal scenario, where the system has reached a steady state. For this purpose, a steady state is loosely defined as a state where there are always \glspl{thrd} ready to run and the system has the resources necessary to accomplish the work, \eg, enough workers. In short, the system is neither overloaded nor underloaded.
    44
    5 I believe it is important to discuss the steady state first because it is the easiest case to handle and, relatedly, the case in which the best performance is to be expected. As such, when the system is either overloaded or underloaded, a common approach is to try to adapt the system to the new load and return to the steady state. Flaws in the scheduling in the steady state tend therefore to be pervasive in all states.
     5I believe it is important to discuss the steady state first because it is the easiest case to handle and, relatedly, the case in which the best performance is to be expected. As such, when the system is either overloaded or underloaded, a common approach is to try to adapt the system to the new load and return to the steady state, \eg, adding or removing workers. Flaws in the scheduling when the system is in the steady state can therefore be pervasive in all states.
    66
    77\section{Design Goals}
    8 As with most of the design decisions behind \CFA, the main goal is to match the expectation of the programmer, according to their probable mental model. To match these expectations, the design must offer the programmers sufficient guarantees so that, as long as the programmer respects the mental model, the system will also respect this model.
     8As with most of the design decisions behind \CFA, an important goal is to match the expectation of the programmer, according to their probable mental model. To match these expectations, the design must offer the programmers sufficient guarantees so that, as long as they respect the mental model, the system will also respect this model.
    99
    1010For threading, a simple and common mental model is the ``Ideal multi-tasking CPU'' :
     
    1616Applied to threads, this model states that every ready \gls{thrd} immediately runs in parallel with all other ready \glspl{thrd}. While a strict implementation of this model is not feasible, programmers still have expectations about scheduling that come from this model.
    1717
    18 In general, the expectation at the center of this model is that ready \glspl{thrd} do not interfere with eachother but simply share the hardware. This makes it easier to reason about threading because ready \glspl{thrd} can be taken in isolation and the effect of the scheduler can be virtually ignored. This expectation of \gls{thrd} independence means the scheduler is expected to offer 2 guarantees:
     18In general, the expectation at the center of this model is that ready \glspl{thrd} do not interfere with each other but simply share the hardware. This makes it easier to reason about threading because ready \glspl{thrd} can be taken in isolation and the effect of the scheduler can be virtually ignored. This expectation of \gls{thrd} independence means the scheduler is expected to offer two guarantees:
    1919\begin{enumerate}
    2020        \item A fairness guarantee: a \gls{thrd} that is ready to run will not be prevented from doing so by another thread.
     
    2424It is important to note that these guarantees are expected only up to a point. \Glspl{thrd} that are ready to run should not be prevented from doing so, but they still need to share a limited amount of hardware. Therefore, the guarantee is considered respected if a \gls{thrd} gets access to a \emph{fair share} of the hardware, even if that share is very small.
    2525
    26 Similarly the performance guarantee, the lack of interferance between threads is only relevant op to a point. Ideally the cost of running and blocking would be constant regardless of contention, but the guarantee is considered satisfied if the cost is not \emph{too high} with or without contention. How much is an acceptable cost is obviously highly variable. For this document the performance experimentation will attempt to show that the cost of scheduling is not a major factor in application performance. This demonstration can be made by comparing application built in \CFA to applications built with other languages or other models. If the performance of an application built in \CFA is not meaningfully different than one built with a different runtime, then the scheduler has a negigeable impact on performance, \ie its impact can be ignored. Recall from a few paragraphs ago that the expectation of programmers is that the impact of the scheduler can be ignored. Therefore, if the cost of scheduling is not a significant portion of the runtime of several different application, I will consider the guarantee achieved.
     26Similarly, the performance guarantee, the lack of interference between threads, is only relevant up to a point. Ideally the cost of running and blocking would be constant regardless of contention, but the guarantee is considered satisfied if the cost is not \emph{too high} with or without contention. What constitutes an acceptable cost is obviously highly variable. For this document, the performance experimentation will attempt to show that the cost of scheduling is at worst equivalent to existing algorithms used in popular languages. This demonstration can be made by comparing applications built in \CFA to applications built with other languages or other models. Recall from a few paragraphs ago that the expectation of programmers is that the impact of the scheduler can be ignored. Therefore, if the cost of scheduling is equivalent to or lower than that of other popular languages, I will consider the guarantee achieved.
    2727
    28 \todo{This paragraph should be moved later}
    29 % The next step is then to decide what is considered a \emph{fair share}, \ie what metric is used to measure fairness. Since \CFA is intended to allow numerous short lived threads, I decided to avoid total CPU time as the measure of fairness. Total CPU time inherently favors new \glspl{thrd} over older ones which isn't necessarily a good thing. Instead, fairness is measured in terms of opportunities to run. This metric is more appropriate for a mix of short and long lived \glspl{thrd}.
     28More precisely, the scheduler should be:
     29\begin{itemize}
     30        \item As fast as other schedulers that are less fair.
     31        \item Faster than other schedulers that have equal or better fairness.
     32\end{itemize}
     33
     34\subsection{Fairness vs Scheduler Locality}
     35An important performance factor in modern architectures is cache locality. Waiting for data not present in the cache can have a major impact on performance, and having multiple \glspl{hthrd} write to the same cache lines causes those lines to be repeatedly invalidated and waited on again. It is therefore preferable to divide the data among each \gls{hthrd}\footnote{This can be an explicit division up front or using data structures where different \glspl{hthrd} are naturally routed to different cache lines.}.
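
A minimal sketch of such a division, assuming 64-byte cache lines (common on x86 but not universal) and hypothetical names; each \gls{hthrd}'s hot data is padded onto its own cache line so writes by one \gls{hthrd} never invalidate a line another is using:

    #include <stdatomic.h>

    #define CACHE_LINE 64   // assumption: 64-byte cache lines
    #define NPROC      8    // hypothetical number of hardware threads

    // One slot per hardware thread; _Alignas pads each slot onto its own
    // cache line, eliminating false sharing between neighbouring slots.
    struct per_hthrd {
        _Alignas(CACHE_LINE) atomic_ulong scheduled;  // written only by its owner
    };
    static struct per_hthrd stats[NPROC];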
     36
     37For a scheduler, having good locality\footnote{This section discusses \emph{internal} locality, \ie, the locality of the data used by the scheduler. \emph{External locality}, \ie, how the data used by the application is affected by scheduling, is a much more complicated subject and will be discussed in the chapters on evaluation.}, \ie, having the data be local to each \gls{hthrd}, generally conflicts with fairness. Indeed, good locality often requires avoiding the movement of cache lines, while fairness requires dynamically moving \glspl{thrd}, and as a consequence cache lines, to \glspl{hthrd} that are currently more appropriate.
     38
     39However, I claim that in practice it is possible to strike a balance between fairness and performance because the needs for these do not necessarily overlap temporally. Figure~\ref{fig:fair} shows a visual representation of this behaviour. As mentioned, a little unfairness can be acceptable; therefore it can be desirable to have an algorithm that prioritizes cache locality as long as no thread is left behind for too long.
     40
     41\begin{figure}
     42        \begin{center}
     43                \input{fairness.pstex_t}
     44        \end{center}
     45        \caption{Fairness vs Locality}
     46        \label{fig:fair}
     47        Rule of thumb graph: Importance of Fairness and Locality while a ready \gls{thrd} waits to run.
     48        As the time a ready \gls{thrd} waits increases, ``Ready Time'', the chances that its data is still in cache decrease. At the same time, the need for fairness increases since other \glspl{thrd} may have the chance to run many times, breaking the fairness model mentioned above. Since the actual values and curves of this graph can be highly variable, the graph is left intentionally fuzzy and inaccurate.
     49\end{figure}
    3050
    3151\section{Design}
    32 While avoiding the pitfalls of Feedback Scheduling is fairly easy, scheduling does not innately require feedback, avoiding prioritization of \glspl{thrd} is more difficult because of implicitly priorities, see Subsection~\ref{priority}. A strictly \glsxtrshort{fifo} rea
     52A naive strictly \glsxtrshort{fifo} ready-queue does not offer sufficient performance. As shown in the evaluation sections, most production schedulers scale when adding multiple \glspl{hthrd}, which is not possible with a single point of contention. Therefore it is vital to shard the ready-queue so that multiple \glspl{hthrd} can access it without performance degradation.
    3353
    34 \subsection{Sharding}
     54\subsection{Sharding} \label{sec:sharding}
     55An interesting approach to sharding a queue is presented in \cit{Trevors paper}. This algorithm represents a queue with a relaxed \glsxtrshort{fifo} guarantee using an array of strictly \glsxtrshort{fifo} sublists, as shown in Figure~\ref{fig:base}. Each cell of the array contains a linked list with a lock, and each node in these lists is marked with a timestamp indicating when it was added to the list. Push operations are done by picking a random cell and attempting to push to its list. If the cell is already locked, the operation is simply retried on a new cell until a lock is acquired. Pop operations are done in a similar fashion, except two random cells are picked. If both cells are unlocked and both contain non-empty lists, the operation pops the node with the oldest timestamp. If only one of the cells is unlocked and non-empty, the operation pops from that cell. If both cells are either locked or empty, the operation picks two new cells and tries again.
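
The following is a hedged C sketch of these push and pop paths, not the cited algorithm verbatim: it uses a global counter for timestamps, rand() for cell selection, try-locks via atomic_flag (whose zero-initialized state is clear on mainstream platforms), and all names are hypothetical:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdlib.h>

    #define NCELLS 64

    struct node { struct node * next; unsigned long long ts; /* payload ... */ };

    struct cell {                       // one strictly-FIFO sublist
        atomic_flag lock;
        struct node * head, * tail;
    };

    static struct cell cells[NCELLS];
    static _Atomic unsigned long long timestamp = 1;

    static bool try_lock(struct cell * c) { return !atomic_flag_test_and_set(&c->lock); }
    static void unlock  (struct cell * c) { atomic_flag_clear(&c->lock); }

    void push(struct node * n) {
        n->ts = atomic_fetch_add(&timestamp, 1);
        for (;;) {
            struct cell * c = &cells[rand() % NCELLS];  // pick one random cell
            if (!try_lock(c)) continue;                 // locked: retry elsewhere
            n->next = NULL;
            if (c->tail) c->tail->next = n; else c->head = n;
            c->tail = n;
            unlock(c);
            return;
        }
    }

    static struct node * take(struct cell * c) {        // caller holds the lock
        struct node * n = c->head;
        c->head = n->next;
        if (!c->head) c->tail = NULL;
        return n;
    }

    struct node * pop(void) {
        for (;;) {
            struct cell * a = &cells[rand() % NCELLS];  // pick two random cells
            struct cell * b = &cells[rand() % NCELLS];
            if (b == a) b = NULL;
            bool la = try_lock(a), lb = b && try_lock(b);
            struct cell * best = (la && a->head) ? a : NULL;  // oldest head wins
            if (lb && b->head && (!best || b->head->ts < best->head->ts)) best = b;
            struct node * n = best ? take(best) : NULL;
            if (la) unlock(a);
            if (lb) unlock(b);
            if (n) return n;            // both locked or empty: try two new cells
        }
    }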
    3556
    3657\begin{figure}
     
    4566
    4667\subsection{Finding threads}
    47 Once threads have been distributed onto multiple queues, indentifying which queues are empty and which aren't can become a problem.
    48 Indeed, if the number of \glspl{thrd} does not far exceed the number of queues, it is probable that several of these queues are empty.
    49 Figure~\ref{fig:empty} shows an example with 2 \glspl{thrd} running on 8 queues, where the chances of getting an empty queue is 75\% per pick, meaning two random picks yield a \gls{thrd} only half the time.
     68Once threads have been distributed onto multiple queues, identifying which queues are empty and which are not can become a problem. Indeed, if the number of \glspl{thrd} does not far exceed the number of queues, it is probable that several of these queues are empty. Figure~\ref{fig:empty} shows an example with 2 \glspl{thrd} running on 8 queues, where the chance of picking an empty queue is 75\% per pick, meaning two random picks yield a \gls{thrd} only half the time.
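
As a quick sanity check of the ``half the time'' claim, assuming the two picks are independent and with replacement:

    \[
        P(\text{at least one pick non-empty}) \;=\; 1 - \left(\frac{6}{8}\right)^{2} \;=\; \frac{7}{16} \approx 44\%,
    \]

and with two distinct cells, $1 - \frac{6}{8}\cdot\frac{5}{7} = \frac{26}{56} \approx 46\%$; both are indeed roughly half.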
    5069
    5170
     
    6180This can lead to performance problems since picks that do not yield a \gls{thrd} are not useful and do not necessarily help make more informed guesses.
    6281
    63 Solutions to this problem can take many forms, but they ultimately all have to encode where the threads are in some form. My results show that the density and locality of this encoding is generally the dominating factor in these scheme.
     82Solutions to this problem can take many forms, but they ultimately all have to encode where the threads are in some form. My results show that the density and locality of this encoding is generally the dominating factor in these schemes. Classic solutions to this problem use one of three techniques to encode the information:
    6483
    65 \paragraph{Dense Information}
     84\begin{figure}
     85        \begin{center}
     86                {\resizebox{0.73\textwidth}{!}{\input{emptybit.pstex_t}}}
     87        \end{center}
     88        \vspace*{-5pt}
     89        \caption{Underloaded queue with added bitmask to indicate which array cells have items.}
     90        \label{fig:emptybit}
     91        \begin{center}
     92                {\resizebox{0.73\textwidth}{!}{\input{emptytree.pstex_t}}}
     93        \end{center}
     94        \vspace*{-5pt}
     95        \caption{Underloaded queue with added binary search tree to indicate which array cells have items.}
     96        \label{fig:emptytree}
     97        \begin{center}
     98                {\resizebox{0.9\textwidth}{!}{\input{emptytls.pstex_t}}}
     99        \end{center}
     100        \vspace*{-5pt}
     101        \caption{Underloaded queue with added per processor bitmask to indicate which array cells have items.}
     102        \label{fig:emptytls}
     103\end{figure}
     104
     105\paragraph{Dense Information} Figure~\ref{fig:emptybit} shows a dense bitmask to identify which inner queues are currently in use. This approach means processors can often find \glspl{thrd} in constant time, regardless of how many underlying queues are empty. Furthermore, modern x86 CPUs have extended bit manipulation instructions (BMI2) that allow using the bitmask with very little overhead compared to the randomized selection approach for a filled ready queue, offering good performance even in cases with many empty inner queues. However, this technique has its limits: with a single word\footnote{Word refers here to however many bits can be written atomically.} bitmask, the total number of underlying queues in the ready queue is limited to the number of bits in the word. With a multi-word bitmask, this maximum limit can be increased arbitrarily, but the look-up is no longer constant time. Finally, a dense bitmap, either single- or multi-word, causes additional contention problems which reduce performance because of cache misses after updates. This central update bottleneck also means the information in the bitmask is more often stale before a processor can use it to find an item, \ie the mask read says there are available \glspl{thrd} but none are on the queues.
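
As a rough illustration, a random pick over such a bitmask could use the BMI2 PDEP instruction as follows; the intrinsics _pdep_u64 and _tzcnt_u64 are real and require compiling with -mbmi2, but the surrounding names are hypothetical and the staleness race is deliberately ignored:

    #include <immintrin.h>   // _pdep_u64, _tzcnt_u64 (x86, -mbmi2)
    #include <stdatomic.h>
    #include <stdlib.h>

    static _Atomic unsigned long long mask;   // bit i set <=> queue i non-empty

    // Return the index of a uniformly random set bit, or -1 if none are set.
    int pick_nonempty(void) {
        unsigned long long m = atomic_load_explicit(&mask, memory_order_relaxed);
        if (m == 0) return -1;
        int k = rand() % __builtin_popcountll(m);         // choose the k-th set bit
        unsigned long long bit = _pdep_u64(1ull << k, m); // deposit a 1 at that bit
        return (int)_tzcnt_u64(bit);                      // its index == queue number
        // NB: by the time the caller locks this queue, the mask may be stale.
    }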
     106
     107\paragraph{Sparse Information} Figure~\ref{fig:emptytree} shows an approach using a hierarchical tree data structure to reduce contention, which has been shown to work in similar cases~\cite{ellen2007snzi}. However, this approach may lead to poorer performance due to the inherent pointer-chasing cost, while the nodes of the tree can still suffer contention if the tree is not deep enough.
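
For intuition only, a drastically simplified tree of counters is sketched below; this naive variant is not linearizable, which is precisely the problem the actual SNZI algorithm~\cite{ellen2007snzi} solves, but it shows both the pointer chasing and the contention-reducing idea that only empty/non-empty transitions propagate upward:

    #include <stdatomic.h>
    #include <stdbool.h>

    struct tnode { _Atomic long count; struct tnode * parent; };  // root's parent is NULL

    void arrive(struct tnode * n) {        // leaf's queue went empty -> non-empty
        while (n && atomic_fetch_add(&n->count, 1) == 0)
            n = n->parent;                 // propagate only on a 0 -> 1 transition
    }

    void depart(struct tnode * n) {        // leaf's queue went non-empty -> empty
        while (n && atomic_fetch_sub(&n->count, 1) == 1)
            n = n->parent;                 // propagate only on a 1 -> 0 transition
    }

    bool maybe_nonempty(struct tnode * root) {   // cheap hint, read at the root
        return atomic_load_explicit(&root->count, memory_order_relaxed) > 0;
    }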
     108
     109\paragraph{Local Information} Figure~\ref{fig:emptytls} shows an approach using dense information, similar to the bitmap, but where each thread keeps its own independent copy. While this approach can offer good scalability \emph{and} low latency, the liveness and discovery of the information can become a problem. This case is made worse in systems with few processors, where even blind random picks can find \glspl{thrd} in a few tries.
     110
     111I built a prototype of these approaches, and none of these techniques offers satisfying performance when few threads are present. All of these approaches hit the same two problems. First, blindly picking two sub-queues is very fast, which means that any improvement to the hit rate can easily be countered by a slow-down in look-up speed. Second, the array is already as sharded as is needed to avoid contention bottlenecks, so any denser data structure tends to become a bottleneck. In all cases, these factors meant that the best-case scenario, many threads, would get worse throughput, while the worst-case scenario, few threads, would get a better hit rate but an equivalent throughput. As a result I tried an entirely different approach.
     112
     113\subsection{Dynamic Entropy}\cit{https://xkcd.com/2318/}
     114In the worst-case scenario there are few \glspl{thrd} ready to run, or more accurately, given $P$ \glspl{proc}, $T$ \glspl{thrd} and $\epsilon$, as usual, a very small number, in this case $\epsilon \ll P$, we have $T = P + \epsilon$. An important thing to note is that in this case, fairness is effectively irrelevant. Indeed, this case is close to \emph{actually matching} the model of the ``Ideal multi-tasking CPU'' presented in this chapter\footnote{For simplicity, this assumes there is a one-to-one match between \glspl{proc} and \glspl{hthrd}.}. Therefore, in this context it is possible to use a purely internal-locality-based approach and still meet the fairness requirements. This approach would simply have each \gls{proc} run a single \gls{thrd} repeatedly. Or, from the shared ready-queue viewpoint, each \gls{proc} would push to a given subqueue and then pop from the \emph{same} subqueue. Ideally, the scheduler would achieve this without affecting the fairness guarantees in cases where $T \gg P$.
     115
     116To achieve this, I use a communication channel I have not mentioned yet and which I believe I use in a novel way: the \glsxtrshort{prng}. If the scheduler has a \glsxtrshort{prng} instance per \gls{proc} exclusively used for scheduling, its seed effectively encodes a list of all the accessed subqueues, from the latest to the oldest. The only requirement to achieve this is to be able to ``replay'' the \glsxtrshort{prng} backwards. As it turns out, this is an entirely reasonable requirement, and there already exist \glsxtrshort{prng}s that are fast, compact \emph{and} can be run both forwards and backwards. Linear congruential generators~\cite{wiki:lcg} are an example of \glsxtrshort{prng}s that match these requirements.
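
A sketch of such a reversible \glsxtrshort{prng}, using Knuth's well-known MMIX LCG constants with modulus $2^{64}$; the modulus is implicit in the unsigned wraparound, and the multiplicative inverse exists because the multiplier is odd and is computed here by Newton iteration:

    #include <stdint.h>
    #include <stdio.h>

    #define A 6364136223846793005ULL   // Knuth's MMIX multiplier
    #define C 1442695040888963407ULL   // Knuth's MMIX increment

    static uint64_t inv_a(void) {      // A^-1 mod 2^64 by Newton iteration:
        uint64_t x = A;                // each step doubles the correct low bits
        for (int i = 0; i < 5; i++) x *= 2 - A * x;
        return x;
    }

    static uint64_t fwd(uint64_t s) { return A * s + C; }           // next state
    static uint64_t bwd(uint64_t s) { return inv_a() * (s - C); }   // previous state

    int main(void) {
        uint64_t s = 42;
        printf("%d\n", bwd(bwd(fwd(fwd(s)))) == s);   // prints 1: fully reversible
    }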
     117
     118The algorithm works as follows (a C sketch is given after the list):
     119\begin{itemize}
     120        \item Each \gls{proc} has two \glsxtrshort{prng} instances, $F$ and $B$.
     121        \item Push and Pop operations happen as mentioned in Section~\ref{sec:sharding} with the following exceptions:
     122        \begin{itemize}
     123                \item Push operations use $F$ going forward on each try and on success $F$ is copied into $B$.
     124                \item Pop operations use $B$ going backwards on each try.
     125        \end{itemize}
     126\end{itemize}
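
A hedged sketch of how the two \glsxtrshort{prng} instances could drive cell selection; the helper names are hypothetical, fwd/bwd are the reversible LCG from the previous sketch, and the locking from Section~\ref{sec:sharding} is elided:

    #include <stdint.h>

    #define NCELLS 64
    #define A 6364136223846793005ULL
    #define C 1442695040888963407ULL

    static uint64_t inv_a(void) { uint64_t x = A; for (int i = 0; i < 5; i++) x *= 2 - A * x; return x; }
    static uint64_t fwd(uint64_t s) { return A * s + C; }
    static uint64_t bwd(uint64_t s) { return inv_a() * (s - C); }

    struct proc_rng { uint64_t F, B; };        // per-proc forward/backward states

    // One push attempt: advance F and return the cell index to try.
    int push_try(struct proc_rng * r) { r->F = fwd(r->F); return r->F % NCELLS; }

    // Call once the push succeeded: pops now start from this point.
    void push_done(struct proc_rng * r) { r->B = r->F; }

    // One pop attempt: return a cell index, then step B backwards so
    // successive attempts revisit this proc's pushes newest-first.
    int pop_try(struct proc_rng * r) { int c = r->B % NCELLS; r->B = bwd(r->B); return c; }

In this sketch a \gls{proc} that just pushed pops first from the very cell it pushed to, which is what yields the locality behaviour described next.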
     127
     128The main benefit of this technique is that it basically respects the desired properties of Figure~\ref{fig:fair}. When looking for work, \glspl{proc} will look first at the last cells they pushed to, if any, and then move backwards through the cells. As the \glspl{proc} continue looking for work, $B$ moves backwards while $F$ stays in place. As a result the relation between the two becomes weaker, which means that the probabilistic fairness of the algorithm reverts to normal. Chapter~\ref{proofs} discusses more formally the fairness guarantees of this algorithm.
     129
     130\section{Details}
  • doc/theses/thierry_delisle_PhD/thesis/text/existing.tex

    rabc2a643 rc04a19e  
    4242\paragraph{Task Placement} Since modern computers rely heavily on cache hierarchies\cit{Do I need a citation for this}, migrating tasks from one core to another can be expensive.  \cite{DBLP:journals/tpds/SquillanteL93}
    4343
    44 \TODO{The survey is not great on this subject}
     44\todo{The survey is not great on this subject}
    4545
    4646\paragraph{Complex Machine Architecture} Another aspect that has been looked at is how well Work Stealing is applicable to different machine architectures.