Context Navigation

← Previous Changeset
Next Changeset →

Changeset 7a0f798b

Timestamp:

Sep 6, 2022, 4:05:00 PM (4 years ago)

Author:

Thierry Delisle <tdelisle@…>

Branches:

ADT, ast-experimental, master, pthread-emulation, stuck-waitfor-destruct

Children:

Parents:

Message:

Merged peter's last changes and filled in most of the TODOs

Location:

doc/theses/thierry_delisle_PhD/thesis

Files:

: 7 edited

local.bib (modified) (1 diff)
text/conclusion.tex (modified) (1 diff)
text/core.tex (modified) (1 diff)
text/eval_macro.tex (modified) (2 diffs)
text/eval_micro.tex (modified) (8 diffs)
text/existing.tex (modified) (5 diffs)
text/front.tex (modified) (2 diffs)

Legend:

: Unmodified
: Added
: Removed

doc/theses/thierry_delisle_PhD/thesis/local.bib

-              r9f99799
+              r7a0f798b
+}
+% Trevor's relaxed FIFO list
+@inproceedings{alistarh2018relaxed,
+  title={Relaxed schedulers can efficiently parallelize iterative algorithms},
+  author={Alistarh, Dan and Brown, Trevor and Kopinsky, Justin and Nadiradze, Giorgi},
+  booktitle={Proceedings of the 2018 ACM Symposium on Principles of Distributed Computing},
+  pages={377--386},
+  year={2018}
+}
+@article{zhuravlev2012survey,
+  title={Survey of energy-cognizant scheduling techniques},
+  author={Zhuravlev, Sergey and Saez, Juan Carlos and Blagodurov, Sergey and Fedorova, Alexandra and Prieto, Manuel},
+  journal={IEEE Transactions on Parallel and Distributed Systems},
+  volume={24},
+  number={7},
+  pages={1447--1464},
+  year={2012},
+  publisher={IEEE}
+}
+@article{vikranth2013topology,
+  title={Topology aware task stealing for on-chip NUMA multi-core processors},
+  author={Vikranth, BRWACRR and Wankar, Rajeev and Rao, C Raghavendra},
+  journal={Procedia Computer Science},
+  volume={18},
+  pages={379--388},
+  year={2013},
+  publisher={Elsevier}
+}
+@inproceedings{min2011hierarchical,
+  title={Hierarchical work stealing on manycore clusters},
+  author={Min, Seung-Jai and Iancu, Costin and Yelick, Katherine},
+  booktitle={Fifth Conference on Partitioned Global Address Space Programming Models (PGAS11)},
+  volume={625},
+  year={2011},
+  organization={Citeseer}
+}
+@article{ribic2014energy,
+  title={Energy-efficient work-stealing language runtimes},
+  author={Ribic, Haris and Liu, Yu David},
+  journal={ACM SIGARCH Computer Architecture News},
+  volume={42},
+  number={1},
+  pages={513--528},
+  year={2014},
+  publisher={ACM New York, NY, USA}
+}
+@inproceedings{torng2016asymmetry,
+  title={Asymmetry-aware work-stealing runtimes},
+  author={Torng, Christopher and Wang, Moyang and Batten, Christopher},
+  booktitle={2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)},
+  pages={40--52},
+  year={2016},
+  organization={IEEE}
+}
 % --------------------------------------------------

doc/theses/thierry_delisle_PhD/thesis/text/conclusion.tex

-              r9f99799
+              r7a0f798b
 A MQMS scheduler, with its \proc-specific ready-queues, has poor load-balancing but perfect affinity often resulting in significantly reduced communication.
 However, implementing fairness for an MQMS scheduler is difficult, since fairness requires \procs to be aware of each other's ready-queue progress, \ie communicated knowledge.
+% This challenge is made harder when comparing against MQMS schedulers (see Section\ref{sched}) which have very little inter-\proc communication.
+For balanced workloads with little or no data sharing (embarrassingly parallel), an MQMS scheduler is near optimal, \eg state-of-the-art work-stealing schedulers.
+For balanced workloads with little or no data sharing (embarrassingly parallel), an MQMS scheduler, \eg a state-of-the-art work-stealing scheduler, is near optimal.
 For these kinds of fair workloads, adding fairness must be low-cost to hide the communication costs needed for global ready-queue progress or performance suffers.
+While I was aware of these realities, I underestimated how little performance margin there is for communication.
+Several of my attempts at building a fair scheduler compared poorly to work-stealing schedulers because of how thin the margin is.
 Second, the kernel locking, threading, and I/O in the Linux operating system offers very little flexibility, and are not designed to facilitate user-level threading.

doc/theses/thierry_delisle_PhD/thesis/text/core.tex

r9f99799	r7a0f798b
125	125
126	126	\subsection{Relaxed-FIFO}
127		A different scheduling approach is to create a ``relaxed-FIFO'' queue, as in \~~todo{cite Trevor's paper~~}.
	127	A different scheduling approach is to create a ``relaxed-FIFO'' queue, as in \cite{alistarh2018relaxed}.
128	128	This approach forgoes any ownership between \gls{proc} and subqueue, and simply creates a pool of ready-queues from which \glspl{proc} pick.
129	129	Scheduling is performed as follows:

doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex

-              r9f99799
+              r7a0f798b
                 \label{fig:memcd:updt:vanilla:lat}
+        }
+        \caption[Throughput and Latency results at different update rates (percentage of writes).]{Throughput and Latency results at different update rates (percentage of writes).\smallskip\newline \todo{Description}}
+        \caption[Throughput and Latency results at different update rates (percentage of writes).]{Throughput and Latency results at different update rates (percentage of writes).\smallskip\newline On the left, throughput as Desired vs Actual query rate.
+        Target QPS is the query rate that the clients are attempting to maintain and Actual QPS is the rate at which the server is able to respond.
+        On the right, tail latency, \ie 99th Percentile of the response latency as a function of \emph{desired} query rate.
+        For throughput, higher is better, for tail-latency, lower is better.
+        Each series represent 15 independent runs, the dashed lines are maximums of each series while the solid lines are the median and the dotted lines are the minimums.}
+        All runs have 15,360 client connections.
         \label{fig:memcd:updt}
 \end{figure}
 …
 However, for the following experiments, NGINX is configured to let the master process decided the appropriate number of threads.
-% Like memcached, NGINX can be made to use multiple \glspl{kthrd}.
-% It has a very similar architecture to the memcached architecture described in Section~\ref{memcd:thrd}, where multiple \glspl{kthrd} each run a mostly independent network logic.
-% While it does not necessarily use a dedicated listening thread, each connection is arbitrarily assigned to one of the \newterm{worker} threads.
-% Each worker thread handles multiple connections exclusively, effectively dividing the connections into distinct sets.
-% Again, this is effectively the \emph{event-based server} approach.
+%
-% \cit{https://www.nginx.com/blog/inside-nginx-how-we-designed-for-performance-scale/}
-% NGINX 1.13.7 is an high-performance, \emph{full-service}, event-driven, with multiple operating-system processes or multiple kernel-threads within a process to handle blocking I/O.
-% It can also .
-% Concurrent connections are handled using a complex event-driven architecture.
-% The NGINX server runs a master process that performs operations such as reading configuration files, binding to ports, and controlling worker processes.
-% NGINX uses a disk-based cache for performance, and assigns a dedicated process to manage the cache.
-% This process, known as the \textit{cache manager}, is spun-off by the master process.
-% Additionally, there can be many \textit{worker processes}, each handling network connections, reading and writing disk files, and communicating with upstream servers, such as reverse proxies or databases.
-% A worker is a single-threaded process, running independently of other workers.
-% The worker process handles new incoming connections and processes them.
-% Workers communicate using shared memory for shared cache and session data, and other shared resources.
-% Each worker assigns
-% As in a typical event-driven architecture, the worker listens for events from the clients, and responds immediately without blocking.
-% Memory use in NGINX is very conservative, because it does not spin up a new process or thread per connection, like Apache~\cite{apache} or \CFA.
-% All operations are asynchronous -- implemented using event notifications, callback functions and fine-tuned timers.
 \subsection{\CFA webserver}
 The \CFA webserver is a straightforward thread-per-connection webserver, where a fixed number of \ats are created upfront.

doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex

-              r9f99799
+              r7a0f798b
 \caption[Cycle Benchmark : Pseudo Code]{Cycle Benchmark : Pseudo Code}
 \label{fig:cycle:code}
-%\end{figure}ll have a physical key so it's not urgent.
 \bigskip
-%\begin{figure}
         \subfloat[][Throughput, 100 cycles per \proc]{
                 \resizebox{0.5\linewidth}{!}{
 …
 Figures~\ref{fig:churn:jax} and Figure~\ref{fig:churn:nasus} show the results for the churn experiment on Intel and AMD, respectively.
 Looking at the left column on Intel, Figures~\ref{fig:churn:jax:ops} and \ref{fig:churn:jax:ns} show the results for 100 \ats for each \proc have, and all runtimes obtain fairly similar throughput for most \proc counts.
+Looking at the left column on Intel, Figures~\ref{fig:churn:jax:ops} and \ref{fig:churn:jax:ns} show the results for 100 \ats for each \proc, and all runtimes obtain fairly similar throughput for most \proc counts.
 \CFA does very well on a single \proc but quickly loses its advantage over the other runtimes.
 As expected, it scales decently up to 48 \procs, drops from 48 to 72 \procs, and then plateaus.
 …
 Libfibre follows very closely behind with basically the same performance and scaling.
 Tokio maintains effectively the same curve shapes as \CFA and libfibre, but it incurs extra costs for all \proc counts.
-% As a result it is slightly outperformed by \CFA and libfibre.
 While Go maintains overall similar results to the others, it again encounters significant variation at high \proc counts.
 Inexplicably resulting in super-linear scaling for some runs, \ie the scalability curves displays a negative slope.
 …
 It is also possible to unpark to a third unrelated ready-queue, but without additional knowledge about the situation, it is likely to degrade performance.}
 The locality experiment includes two variations of the churn benchmark, where a data array is added.
 In both variations, before @V@ing the semaphore, each \at increments random cells inside the data array by calling a @work@ function.
+In both variations, before @V@ing the semaphore, each \at calls a @work@ function which increments random cells inside the data array.
 In the noshare variation, the array is not passed on and each thread continuously accesses its private array.
 In the share variation, the array is passed to another thread via the semaphore's shadow-queue (each blocking thread can save a word of user data in its blocking node), transferring ownership of the array to the woken thread.
 …
 In the noshare variation, unparking the \at on the local \proc is an appropriate choice since the data was last modified on that \proc.
 In the shared variation, unparking the \at on a remote \proc is an appropriate choice.
-\todo{PAB: I changed these sentences around.}
 The expectation for this benchmark is to see a performance inversion, where runtimes fare notably better in the variation which matches their unparking policy.
 …
 This scenario is a harder case to handle because corrective measures must be taken even when work is available.
 Note, runtimes with preemption circumvent this problem by forcing the spinner to yield.
+In \CFA preemption was disabled as it only obfuscates the results.
+I am not aware of a method to disable preemption in Go.
 In both variations, the experiment effectively measures how long it takes for all \ats to run once after a given synchronization point.
 …
 The semaphore variation is denoted ``Park'', where the number of \ats dwindles down as the new leader is acknowledged.
 The yielding variation is denoted ``Yield''.
 The experiment is only run for many \procs, since scaling is not the focus of this experiment.
+The experiment is only run for few and many \procs, since scaling is not the focus of this experiment.
 The first two columns show the results for the semaphore variation on Intel.
 …
 Looking at the next two columns, the results for the yield variation on Intel, the story is very different.
 \CFA achieves better latencies, presumably due to no synchronization with the yield.
-\todo{PAB: what about \CFA preemption? How does that come into play for your scheduler?}
 Go does complete the experiment, but with drastically higher latency:
 latency at 2 \procs is $350\times$ higher than \CFA and $70\times$ higher at 192 \procs.
+This difference is because Go has a classic work-stealing scheduler, but it adds coarse-grain preemption\footnote{
+Preemption is done at the function prolog when the goroutine's stack is increasing;
+whereas \CFA uses fine-grain preemption between any two instructions.}
+This difference is because Go has a classic work-stealing scheduler, but it adds coarse-grain preemption
 , which interrupts the spinning leader after a period.
 Neither Libfibre or Tokio complete the experiment.

doc/theses/thierry_delisle_PhD/thesis/text/existing.tex

-              r9f99799
+              r7a0f798b
 \subsection{Explicitly Informed Dynamic Schedulers}
 While dynamic schedulers may not have an exhaustive list of dependencies for a \ats, some information may be available about each \ats, \eg expected duration, required resources, relative importance, \etc.
+When available, a scheduler can then use this information to direct the scheduling decisions. \cit{Examples of schedulers with more information}
+However, most programmers do not determine or even \emph{predict} this information;
+When available, a scheduler can then use this information to direct the scheduling decisions.
+For example, when scheduling in a cloud computing context, \ats will commonly have extra information that was manually entered, \eg caps on compute time or \io usage.
+However, in the context of user-level threading, most programmers do not determine or even \emph{predict} this information;
 at best, the scheduler has only some imprecise information provided by the programmer, \eg, indicating a \ats takes approximately 3--7 seconds to complete, rather than exactly 5 seconds.
 Providing this kind of information is a significant programmer burden especially if the information does not scale with the number of \ats and their complexity.
 …
 Several methods can be employed, but I believe these are less relevant for threads, which are generally explicit and more coarse grained.
+\paragraph{Task Placement} Since modern computers rely heavily on cache hierarchies\cit{Do I need a citation for this}, migrating \ats from one core to another can be .  \cite{DBLP:journals/tpds/SquillanteL93}
+\todo{The survey is not great on this subject}
+\paragraph{Task Placement} Another aspect of work stealing that has been studied extensively is the mapping between \at and \proc.
+In its simplest form, work stealing assumes that all \procs are interchangeable and therefore the mapping between \at and \proc is not interesting.
+However, in real-life architectures there are contexts where different \procs can have different characteristics, which makes some mapping more interesting than others.
+An common example where this is statically true is architectures with \acrshort{numa}.
+In these cases, it can be relevant to change the scheduler to be cognizent of the topology~\cite{vikranth2013topology,min2011hierarchical}.
+Another example is energy usage, where the scheduler is modified to optimize for energy efficiency in addition/instead of performance~\cite{ribic2014energy,torng2016asymmetry}.
 \paragraph{Complex Machine Architecture} Another aspect that has been examined is how well work stealing is applicable to different machine architectures.
+This is arguably strongly related to Task Placement but extends into more heterogeneous architectures.
+As \CFA offers no particular support for heterogeneous architecture, this is also a area that is less relevant to this thesis.
+Althought it could be an interesting avenue for future work.
 \subsection{Theoretical Results}
 …
 The scheduler may also temporarily adjust priorities after certain effects like the completion of I/O requests.
 In~\cite{russinovich2009windows}, Chapter 1 section ``Processes, Threads, and Jobs''\todo{Look up section number.} discusses the scheduling policy more in depth.
+In~\cite{russinovich2009windows}, Chapter 1 section 2.3 ``Processes, Threads, and Jobs'' discusses the scheduling policy more in depth.
 Multicore scheduling is based on a combination of priorities and preferred \proc.
 Each \at is assigned an initial processor using a round-robin policy, called the \at's \newterm{ideal} \proc.
 …
 \end{displayquote}
+\todo{load balancing}
+There is very little documentation on the internals of this scheduler.
+However, the documentation do describe a feature set that is very similar to the Windows and Linux OS schedulers.
+Presumably, this means that the internals are also fairly similar overall.
 \subsection{User-Level Schedulers}
 …
 While the documentation only gives limited insight into the scheduling and load balancing approach, \cite{apple:gcd2} suggests a fairly classic approach.
 Each \proc has a queue of \ats to run, called \newterm{blocks}, which are drained in \glsxtrshort{fifo}.
+\todo{update: They seem to add the concept of dependent queues with clear ordering, where executing a block ends-up scheduling more blocks.
+In terms of semantics, these Dispatch Queues seem to be very similar to Intel\textregistered ~TBB \lstinline{execute()} and predecessor semantics.}
+GCD also has secondary queues, called \newterm{Dispatch Queues} with clear ordering, where executing a block ends-up scheduling more blocks.
+In terms of semantics, these Dispatch Queues seem to be very similar to Intel\textregistered ~TBB \lstinline{execute()} and predecessor semantics.
+The similarity of API and semantics between GCD and Intel\textregistered ~TBB suggest the underlying scheduling algorithms are similar.
 \paragraph{LibFibre}

doc/theses/thierry_delisle_PhD/thesis/text/front.tex

-              r9f99799
+              r7a0f798b
 \noindent
         The following served on the Examining Committee for this thesis. The decision of the Examining Committee is by majority vote.
+        \todo{External Examiners}
+\bigskip
 \noindent
+\begin{tabbing}
         Internal-External Member: \=  \kill % using longest text to define tab length
         External Examiner: \>  TBD \\
         \> TBD \\
+\bigskip
+\noindent
+\begin{tabbing}
+        Internal-External Member: \=  \kill % using longest text to define tab length
+        External Examiner: \>  Doug Lea \\
+        \> Professor and Department Chair, Computer Science Department \\
+        \> State University of New York at Oswego \\
 \end{tabbing}
 \bigskip
 …
 \begin{tabbing}
         Internal-External Member: \=  \kill % using longest text to define tab length
         Internal-External Member: \> TBD \\
         \> TBD \\
+        Internal-External Member: \> Patrick Lam \\
+        \> Associate Professor, Department of Electrical and Computer Engineering \\
         \> University of Waterloo \\
 \end{tabbing}

Note: See TracChangeset for help on using the changeset viewer.

Download in other formats: