Timestamp:
Sep 6, 2022, 4:05:00 PM (2 years ago)
Author:
Thierry Delisle <tdelisle@…>
Branches:
ADT, ast-experimental, master, pthread-emulation
Children:
a44514e
Parents:
9f99799
Message:

Merged peter's last changes and filled in most of the TODOs

Location:
doc/theses/thierry_delisle_PhD/thesis
Files:
7 edited

  • doc/theses/thierry_delisle_PhD/thesis/local.bib

    r9f99799 r7a0f798b  
    458458}
    459459
    460 
     460% Trevor's relaxed FIFO list
     461@inproceedings{alistarh2018relaxed,
     462  title={Relaxed schedulers can efficiently parallelize iterative algorithms},
     463  author={Alistarh, Dan and Brown, Trevor and Kopinsky, Justin and Nadiradze, Giorgi},
     464  booktitle={Proceedings of the 2018 ACM Symposium on Principles of Distributed Computing},
     465  pages={377--386},
     466  year={2018}
     467}
     468
     469@article{zhuravlev2012survey,
     470  title={Survey of energy-cognizant scheduling techniques},
     471  author={Zhuravlev, Sergey and Saez, Juan Carlos and Blagodurov, Sergey and Fedorova, Alexandra and Prieto, Manuel},
     472  journal={IEEE Transactions on Parallel and Distributed Systems},
     473  volume={24},
     474  number={7},
     475  pages={1447--1464},
     476  year={2012},
     477  publisher={IEEE}
     478}
     479
     480@article{vikranth2013topology,
     481  title={Topology aware task stealing for on-chip NUMA multi-core processors},
     482  author={Vikranth, BRWACRR and Wankar, Rajeev and Rao, C Raghavendra},
     483  journal={Procedia Computer Science},
     484  volume={18},
     485  pages={379--388},
     486  year={2013},
     487  publisher={Elsevier}
     488}
     489
     490@inproceedings{min2011hierarchical,
     491  title={Hierarchical work stealing on manycore clusters},
     492  author={Min, Seung-Jai and Iancu, Costin and Yelick, Katherine},
     493  booktitle={Fifth Conference on Partitioned Global Address Space Programming Models (PGAS11)},
     494  volume={625},
     495  year={2011},
     496  organization={Citeseer}
     497}
     498
     499@article{ribic2014energy,
     500  title={Energy-efficient work-stealing language runtimes},
     501  author={Ribic, Haris and Liu, Yu David},
     502  journal={ACM SIGARCH Computer Architecture News},
     503  volume={42},
     504  number={1},
     505  pages={513--528},
     506  year={2014},
     507  publisher={ACM}
     508}
     509
     510@inproceedings{torng2016asymmetry,
     511  title={Asymmetry-aware work-stealing runtimes},
     512  author={Torng, Christopher and Wang, Moyang and Batten, Christopher},
     513  booktitle={2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)},
     514  pages={40--52},
     515  year={2016},
     516  organization={IEEE}
     517}
    461518
    462519% --------------------------------------------------
  • doc/theses/thierry_delisle_PhD/thesis/text/conclusion.tex

    r9f99799 r7a0f798b  
    1313An MQMS scheduler, with its \proc-specific ready-queues, has poor load-balancing but perfect affinity, often resulting in significantly reduced communication.
    1414However, implementing fairness for an MQMS scheduler is difficult, since fairness requires \procs to be aware of each other's ready-queue progress, \ie communicated knowledge.
    15 % This challenge is made harder when comparing against MQMS schedulers (see Section\ref{sched}) which have very little inter-\proc communication.
    16 For balanced workloads with little or no data sharing (embarrassingly parallel), an MQMS scheduler is near optimal, \eg state-of-the-art work-stealing schedulers.
     15For balanced workloads with little or no data sharing (embarrassingly parallel), an MQMS scheduler, \eg a state-of-the-art work-stealing scheduler, is near optimal.
    1716For these kinds of balanced workloads, adding fairness must be low-cost to hide the communication costs needed for global ready-queue progress, or performance suffers.
     17While I was aware of these realities, I underestimated how little performance margin there is for communication.
     18Several of my attempts at building a fair scheduler compared poorly to work-stealing schedulers because of how thin the margin is.
    1819
    1920Second, the kernel locking, threading, and I/O in the Linux operating system offer very little flexibility and are not designed to facilitate user-level threading.
  • doc/theses/thierry_delisle_PhD/thesis/text/core.tex

    r9f99799 r7a0f798b  
    125125
    126126\subsection{Relaxed-FIFO}
    127 A different scheduling approach is to create a ``relaxed-FIFO'' queue, as in \todo{cite Trevor's paper}.
     127A different scheduling approach is to create a ``relaxed-FIFO'' queue, as in \cite{alistarh2018relaxed}.
    128128This approach forgoes any ownership between \gls{proc} and subqueue, and simply creates a pool of ready-queues from which \glspl{proc} pick.
    129129Scheduling is performed as follows:
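A minimal C sketch of this scheme, assuming a fixed pool of lock-protected subqueues; the names (@thread_desc@, @fifo_append@, @fifo_remove@) and the single-sample @pop@ are illustrative only, not the exact algorithm from \cite{alistarh2018relaxed} nor the \CFA implementation:

\begin{lstlisting}[language=C]
#include <pthread.h>
#include <stdlib.h>

#define NQUEUES 64

struct thread_desc;                     // opaque descriptor of a ready thread

struct subqueue {                       // one lock-protected FIFO subqueue
	pthread_mutex_t lock;
	struct thread_desc * head, * tail;
} pool[NQUEUES];                        // the shared pool of subqueues

extern void fifo_append( struct subqueue *, struct thread_desc * );  // assumed helper
extern struct thread_desc * fifo_remove( struct subqueue * );        // assumed helper, NULL if empty

void push( struct thread_desc * t ) {   // enqueue onto a uniformly random subqueue
	struct subqueue * q = &pool[ rand() % NQUEUES ];
	pthread_mutex_lock( &q->lock );
	fifo_append( q, t );
	pthread_mutex_unlock( &q->lock );
}

struct thread_desc * pop( void ) {      // sample random subqueues until one yields work;
	for ( ;; ) {                        // spins if no work anywhere, unlike a real scheduler
		struct subqueue * q = &pool[ rand() % NQUEUES ];
		pthread_mutex_lock( &q->lock );
		struct thread_desc * t = fifo_remove( q );
		pthread_mutex_unlock( &q->lock );
		if ( t ) return t;
	}
}
\end{lstlisting}

In the relaxed-FIFO approach, @pop@ instead samples a few subqueues and dequeues the oldest element among the samples, which is what bounds how far execution drifts from global \glsxtrshort{fifo} order.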
  • doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex

    r9f99799 r7a0f798b  
    166166                \label{fig:memcd:updt:vanilla:lat}
    167167        }
    168         \caption[Throughput and Latency results at different update rates (percentage of writes).]{Throughput and Latency results at different update rates (percentage of writes).\smallskip\newline \todo{Description}}
      168        \caption[Throughput and Latency results at different update rates (percentage of writes).]{Throughput and Latency results at different update rates (percentage of writes).\smallskip\newline On the left, throughput as target versus actual query rate.
      169        Target QPS is the query rate the clients attempt to maintain and actual QPS is the rate at which the server is able to respond.
      170        On the right, tail latency, \ie the 99th percentile of response latency, as a function of \emph{target} query rate.
      171        For throughput, higher is better; for tail latency, lower is better.
      172        Each series represents 15 independent runs; the dashed lines are the maximums of each series, the solid lines the medians, and the dotted lines the minimums.
      173        All runs have 15,360 client connections.}
    169174        \label{fig:memcd:updt}
    170175\end{figure}
     
    194199However, for the following experiments, NGINX is configured to let the master process decide the appropriate number of threads.
    195200
    196 
    197 % Like memcached, NGINX can be made to use multiple \glspl{kthrd}.
    198 % It has a very similar architecture to the memcached architecture described in Section~\ref{memcd:thrd}, where multiple \glspl{kthrd} each run a mostly independent network logic.
    199 % While it does not necessarily use a dedicated listening thread, each connection is arbitrarily assigned to one of the \newterm{worker} threads.
    200 % Each worker thread handles multiple connections exclusively, effectively dividing the connections into distinct sets.
    201 % Again, this is effectively the \emph{event-based server} approach.
    202 %
    203 % \cit{https://www.nginx.com/blog/inside-nginx-how-we-designed-for-performance-scale/}
    204 
    205 % NGINX 1.13.7 is an high-performance, \emph{full-service}, event-driven, with multiple operating-system processes or multiple kernel-threads within a process to handle blocking I/O.
    206 % It can also .
    207 % Concurrent connections are handled using a complex event-driven architecture.
    208 % The NGINX server runs a master process that performs operations such as reading configuration files, binding to ports, and controlling worker processes.
    209 % NGINX uses a disk-based cache for performance, and assigns a dedicated process to manage the cache.
    210 % This process, known as the \textit{cache manager}, is spun-off by the master process.
    211 % Additionally, there can be many \textit{worker processes}, each handling network connections, reading and writing disk files, and communicating with upstream servers, such as reverse proxies or databases.
    212 
    213 % A worker is a single-threaded process, running independently of other workers.
    214 % The worker process handles new incoming connections and processes them.
    215 % Workers communicate using shared memory for shared cache and session data, and other shared resources.
    216 % Each worker assigns
    217 % As in a typical event-driven architecture, the worker listens for events from the clients, and responds immediately without blocking.
    218 % Memory use in NGINX is very conservative, because it does not spin up a new process or thread per connection, like Apache~\cite{apache} or \CFA.
    219 % All operations are asynchronous -- implemented using event notifications, callback functions and fine-tuned timers.
    220 
    221201\subsection{\CFA webserver}
    222202The \CFA webserver is a straightforward thread-per-connection webserver, where a fixed number of \ats are created upfront.
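A hypothetical sketch of this architecture's shape (the names and the pthreads stand-in are illustrative, not the actual \CFA webserver code):

\begin{lstlisting}[language=C]
#include <pthread.h>
#include <sys/socket.h>

extern void handle_connection( int fd );        // assumed helper: serve the request, close fd

static void * worker( void * arg ) {
	int listen_fd = *(int *)arg;
	for ( ;; ) {
		int conn = accept( listen_fd, 0, 0 );   // all workers accept on the shared socket
		if ( conn < 0 ) continue;
		handle_connection( conn );              // one connection handled per thread at a time
	}
}

void start_workers( int listen_fd, int n ) {
	static int fd; fd = listen_fd;
	for ( int i = 0; i < n; i += 1 ) {          // fixed number of threads created upfront
		pthread_t t;
		pthread_create( &t, 0, worker, &fd );
	}
}
\end{lstlisting}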
  • doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex

    r9f99799 r7a0f798b  
    101101\caption[Cycle Benchmark : Pseudo Code]{Cycle Benchmark : Pseudo Code}
    102102\label{fig:cycle:code}
    103 %\end{figure}ll have a physical key so it's not urgent.
    104103\bigskip
    105 %\begin{figure}
    106104        \subfloat[][Throughput, 100 cycles per \proc]{
    107105                \resizebox{0.5\linewidth}{!}{
     
    401399
    402400Figures~\ref{fig:churn:jax} and \ref{fig:churn:nasus} show the results for the churn experiment on Intel and AMD, respectively.
    403 Looking at the left column on Intel, Figures~\ref{fig:churn:jax:ops} and \ref{fig:churn:jax:ns} show the results for 100 \ats for each \proc have, and all runtimes obtain fairly similar throughput for most \proc counts.
     401Looking at the left column on Intel, Figures~\ref{fig:churn:jax:ops} and \ref{fig:churn:jax:ns} show the results for 100 \ats for each \proc, and all runtimes obtain fairly similar throughput for most \proc counts.
    404402\CFA does very well on a single \proc but quickly loses its advantage over the other runtimes.
    405403As expected, it scales decently up to 48 \procs, drops from 48 to 72 \procs, and then plateaus.
     
    425423Libfibre follows very closely behind with basically the same performance and scaling.
    426424Tokio maintains effectively the same curve shapes as \CFA and libfibre, but it incurs extra costs for all \proc counts.
    427 % As a result it is slightly outperformed by \CFA and libfibre.
    428425While Go maintains overall similar results to the others, it again encounters significant variation at high \proc counts.
    429426This inexplicably results in super-linear scaling for some runs, \ie the scalability curve displays a negative slope.
     
    497494It is also possible to unpark to a third unrelated ready-queue, but without additional knowledge about the situation, it is likely to degrade performance.}
    498495The locality experiment includes two variations of the churn benchmark, where a data array is added.
    499 In both variations, before @V@ing the semaphore, each \at increments random cells inside the data array by calling a @work@ function.
     496In both variations, before @V@ing the semaphore, each \at calls a @work@ function which increments random cells inside the data array.
    500497In the noshare variation, the array is not passed on and each thread continuously accesses its private array.
    501498In the share variation, the array is passed to another thread via the semaphore's shadow-queue (each blocking thread can save a word of user data in its blocking node), transferring ownership of the array to the woken thread.
     
    506503In the noshare variation, unparking the \at on the local \proc is an appropriate choice since the data was last modified on that \proc.
    507504In the shared variation, unparking the \at on a remote \proc is an appropriate choice.
    508 \todo{PAB: I changed these sentences around.}
    509505
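To make the two variations concrete, here is a hedged C sketch of the benchmark loop; @work@, the @chan@ wrapper, and the constants are illustrative stand-ins, and the real share variation uses the semaphore's shadow-queue word rather than an explicit @payload@ field:

\begin{lstlisting}[language=C]
#include <semaphore.h>
#include <stdlib.h>

#define ARRAY_SIZE 4096
#define WORK_ITERS 64

void work( long * array ) {             // increment random cells of the data array
	for ( int i = 0; i < WORK_ITERS; i += 1 )
		array[ rand() % ARRAY_SIZE ] += 1;
}

void noshare_loop( sem_t * mine, sem_t * next, long * array ) {
	for ( ;; ) {
		sem_wait( mine );               // P: block until woken
		work( array );                  // always touch the thread-private array
		sem_post( next );               // V: wake the next thread
	}
}

struct chan { sem_t sem; long * payload; };  // payload mimics the shadow-queue word

void share_loop( struct chan * mine, struct chan * next ) {
	for ( ;; ) {
		sem_wait( &mine->sem );
		long * array = mine->payload;   // take ownership of the waker's array
		work( array );                  // touch data last modified by the waker
		next->payload = array;          // transfer ownership with the wakeup
		sem_post( &next->sem );
	}
}
\end{lstlisting}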
    510506The expectation for this benchmark is to see a performance inversion, where runtimes fare notably better in the variation which matches their unparking policy.
     
    720716This scenario is a harder case to handle because corrective measures must be taken even when work is available.
    721717Note, runtimes with preemption circumvent this problem by forcing the spinner to yield.
     718In \CFA, preemption was disabled as it only obfuscates the results.
     719I am not aware of a method to disable preemption in Go.
    722720
    723721In both variations, the experiment effectively measures how long it takes for all \ats to run once after a given synchronization point.
     
    763761The semaphore variation is denoted ``Park'', where the number of \ats dwindles down as the new leader is acknowledged.
    764762The yielding variation is denoted ``Yield''.
    765 The experiment is only run for many \procs, since scaling is not the focus of this experiment.
     763The experiment is only run for a few and for many \procs, since scaling is not the focus of this experiment.
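The following C sketch shows the shape of the benchmark's core loop under both variations; the counters and wake-up protocol are simplifications for illustration, not the actual benchmark source:

\begin{lstlisting}[language=C]
#include <semaphore.h>
#include <sched.h>
#include <stdatomic.h>

#define NTHREADS 192
sem_t sems[NTHREADS];
atomic_int arrivals = 0;                       // threads that have run this round
atomic_int leader   = 0;                       // index of the current leader

void thread_loop( int id, int park ) {
	for ( ;; ) {
		if ( atomic_load( &leader ) == id ) {
			// the leader spins until every other thread has run once
			while ( atomic_load( &arrivals ) < NTHREADS - 1 ) {}
			atomic_store( &arrivals, 0 );
			atomic_store( &leader, (id + 1) % NTHREADS );  // pass leadership on
			if ( park )                        // "Park" variation: wake the blocked threads
				for ( int t = 0; t < NTHREADS; t += 1 )
					if ( t != id ) sem_post( &sems[t] );
		} else {
			atomic_fetch_add( &arrivals, 1 );  // acknowledge the current leader
			if ( park ) sem_wait( &sems[id] ); // "Park": block on a semaphore
			else        sched_yield();         // "Yield": stay ready, yield the processor
		}
	}
}
\end{lstlisting}

The @park@ flag selects between the ``Park'' and ``Yield'' variations described above.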
    766764
    767765The first two columns show the results for the semaphore variation on Intel.
     
    771769Looking at the next two columns, the results for the yield variation on Intel, the story is very different.
    772770\CFA achieves better latencies, presumably because yielding involves no synchronization.
    773 \todo{PAB: what about \CFA preemption? How does that come into play for your scheduler?}
    774771Go does complete the experiment, but with drastically higher latency:
    775772latency at 2 \procs is $350\times$ higher than \CFA and $70\times$ higher at 192 \procs.
    776 This difference is because Go has a classic work-stealing scheduler, but it adds coarse-grain preemption\footnote{
    777 Preemption is done at the function prolog when the goroutine's stack is increasing;
    778 whereas \CFA uses fine-grain preemption between any two instructions.}
     773This difference is because Go has a classic work-stealing scheduler, but it adds coarse-grain preemption,
    779774which interrupts the spinning leader after a period.
    780775Neither Libfibre nor Tokio completes the experiment.
  • doc/theses/thierry_delisle_PhD/thesis/text/existing.tex

    r9f99799 r7a0f798b  
    3535\subsection{Explicitly Informed Dynamic Schedulers}
    3636While dynamic schedulers may not have an exhaustive list of dependencies for an \at, some information may be available about each \at, \eg expected duration, required resources, relative importance, \etc.
    37 When available, a scheduler can then use this information to direct the scheduling decisions. \cit{Examples of schedulers with more information}
    38 However, most programmers do not determine or even \emph{predict} this information;
     37When available, a scheduler can then use this information to direct the scheduling decisions.
     38For example, when scheduling in a cloud computing context, \ats will commonly have extra information that was manually entered, \eg caps on compute time or \io usage.
     39However, in the context of user-level threading, most programmers do not determine or even \emph{predict} this information;
    3940at best, the scheduler has only some imprecise information provided by the programmer, \eg, indicating an \at takes approximately 3--7 seconds to complete, rather than exactly 5 seconds.
    4041Providing this kind of information is a significant programmer burden, especially if the information does not scale with the number of \ats and their complexity.
     
    7879Several methods can be employed, but I believe these are less relevant for threads, which are generally explicit and more coarse-grained.
    7980
    80 \paragraph{Task Placement} Since modern computers rely heavily on cache hierarchies\cit{Do I need a citation for this}, migrating \ats from one core to another can be .  \cite{DBLP:journals/tpds/SquillanteL93}
    81 
    82 \todo{The survey is not great on this subject}
     81\paragraph{Task Placement} Another aspect of work stealing that has been studied extensively is the mapping between \at and \proc.
     82In its simplest form, work stealing assumes that all \procs are interchangeable and therefore the mapping between \at and \proc is not interesting.
      83However, in real-life architectures there are contexts where different \procs have different characteristics, which makes some mappings more effective than others.
      84A common example, where this is statically true, is architectures with \acrshort{numa}.
      85In these cases, it can be beneficial to make the scheduler cognizant of the topology~\cite{vikranth2013topology,min2011hierarchical}.
      86Another example is energy usage, where the scheduler is modified to optimize for energy efficiency in addition to, or instead of, performance~\cite{ribic2014energy,torng2016asymmetry}.
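As a hypothetical illustration of a topology-cognizant policy, a work-stealing victim selector might prefer \procs on the same \acrshort{numa} node; the topology map and @has_work@ are assumed helpers (a real implementation would query the topology, \eg via libnuma or hwloc):

\begin{lstlisting}[language=C]
#include <stdbool.h>

#define NPROCS 16
int numa_node_of[NPROCS];               // assumed topology map, filled at startup
extern bool has_work( int proc );       // assumed: does this proc's queue have work?

int pick_victim( int self ) {
	// first pass: prefer victims on the same NUMA node, so the stolen
	// work's data stays in nearby caches and memory
	for ( int p = 0; p < NPROCS; p += 1 )
		if ( p != self && numa_node_of[p] == numa_node_of[self] && has_work( p ) )
			return p;
	// second pass: fall back to victims on remote nodes
	for ( int p = 0; p < NPROCS; p += 1 )
		if ( p != self && has_work( p ) )
			return p;
	return -1;                          // nothing to steal anywhere
}
\end{lstlisting}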
    8387
    8488\paragraph{Complex Machine Architecture} Another aspect that has been examined is how well work stealing is applicable to different machine architectures.
     89This is arguably strongly related to Task Placement but extends into more heterogeneous architectures.
      90As \CFA offers no particular support for heterogeneous architectures, this is also an area that is less relevant to this thesis,
      91although it could be an interesting avenue for future work.
    8592
    8693\subsection{Theoretical Results}
     
    136143The scheduler may also temporarily adjust priorities after certain effects like the completion of I/O requests.
    137144
    138 In~\cite{russinovich2009windows}, Chapter 1 section ``Processes, Threads, and Jobs''\todo{Look up section number.} discusses the scheduling policy more in depth.
     145In~\cite{russinovich2009windows}, Chapter 1 section 2.3 ``Processes, Threads, and Jobs'' discusses the scheduling policy more in depth.
    139146Multicore scheduling is based on a combination of priorities and preferred \proc.
    140147Each \at is assigned an initial processor using a round-robin policy, called the \at's \newterm{ideal} \proc.
     
    152159\end{displayquote}
    153160
    154 \todo{load balancing}
     161There is very little documentation on the internals of this scheduler.
      162However, the documentation does describe a feature set that is very similar to the Windows and Linux OS schedulers.
     163Presumably, this means that the internals are also fairly similar overall.
    155164
    156165\subsection{User-Level Schedulers}
     
    208217While the documentation only gives limited insight into the scheduling and load balancing approach, \cite{apple:gcd2} suggests a fairly classic approach.
    209218Each \proc has a queue of \ats to run, called \newterm{blocks}, which are drained in \glsxtrshort{fifo} order.
    210 \todo{update: They seem to add the concept of dependent queues with clear ordering, where executing a block ends-up scheduling more blocks.
    211 In terms of semantics, these Dispatch Queues seem to be very similar to Intel\textregistered ~TBB \lstinline{execute()} and predecessor semantics.}
    212 
     219GCD also has secondary queues, called \newterm{Dispatch Queues}, with clear ordering, where executing a block ends up scheduling more blocks.
     220In terms of semantics, these Dispatch Queues seem to be very similar to Intel\textregistered ~TBB \lstinline{execute()} and predecessor semantics.
     221
     222The similarity of API and semantics between GCD and Intel\textregistered ~TBB suggests the underlying scheduling algorithms are similar.
    213223
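As a small, concrete illustration, the following uses the public GCD C API (with Apple's Blocks extension, \eg @clang -fblocks@); a serial Dispatch Queue drains its blocks in \glsxtrshort{fifo} order, and an executing block can itself schedule more blocks:

\begin{lstlisting}[language=C]
#include <dispatch/dispatch.h>
#include <stdio.h>

int main( void ) {
	// a serial queue executes its blocks one at a time, in FIFO order
	dispatch_queue_t q = dispatch_queue_create( "example.serial", DISPATCH_QUEUE_SERIAL );

	for ( int i = 0; i < 3; i += 1 ) {
		dispatch_async( q, ^{
			printf( "block %d\n", i );
			// an executing block schedules a further block on the same queue
			dispatch_async( q, ^{ printf( "follow-up to block %d\n", i ); } );
		} );
	}

	dispatch_sync( q, ^{} );            // drains the three initial blocks
	dispatch_sync( q, ^{} );            // drains the three follow-up blocks
	dispatch_release( q );
	return 0;
}
\end{lstlisting}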
    214224\paragraph{LibFibre}
  • doc/theses/thierry_delisle_PhD/thesis/text/front.tex

    r9f99799 r7a0f798b  
    6060\noindent
    6161        The following served on the Examining Committee for this thesis. The decision of the Examining Committee is by majority vote.
    62         \todo{External Examiners}
    63 \bigskip
    64 
    65 \noindent
    66 \begin{tabbing}
    67         Internal-External Member: \=  \kill % using longest text to define tab length
    68         External Examiner: \>  TBD \\
    69         \> TBD \\
     62\bigskip
     63
     64\noindent
     65\begin{tabbing}
     66        Internal-External Member: \=  \kill % using longest text to define tab length
     67        External Examiner: \>  Doug Lea \\
     68        \> Professor and Department Chair, Computer Science Department \\
     69        \> State University of New York at Oswego \\
    7070\end{tabbing}
    7171\bigskip
     
    9696\begin{tabbing}
    9797        Internal-External Member: \=  \kill % using longest text to define tab length
    98         Internal-External Member: \> TBD \\
    99         \> TBD \\
     98        Internal-External Member: \> Patrick Lam \\
     99        \> Associate Professor, Department of Electrical and Computer Engineering \\
    100100        \> University of Waterloo \\
    101101\end{tabbing}