Changeset 7a0f798b for doc/theses/thierry_delisle_PhD/thesis
- Timestamp:
- Sep 6, 2022, 4:05:00 PM (2 years ago)
- Branches:
- ADT, ast-experimental, master, pthread-emulation
- Children:
- a44514e
- Parents:
- 9f99799
- Location:
- doc/theses/thierry_delisle_PhD/thesis
- Files:
-
- 7 edited
Legend:
- Unmodified
- Added
- Removed
-
doc/theses/thierry_delisle_PhD/thesis/local.bib
r9f99799 r7a0f798b 458 458 } 459 459 460 460 % Trevor's relaxed FIFO list 461 @inproceedings{alistarh2018relaxed, 462 title={Relaxed schedulers can efficiently parallelize iterative algorithms}, 463 author={Alistarh, Dan and Brown, Trevor and Kopinsky, Justin and Nadiradze, Giorgi}, 464 booktitle={Proceedings of the 2018 ACM Symposium on Principles of Distributed Computing}, 465 pages={377--386}, 466 year={2018} 467 } 468 469 @article{zhuravlev2012survey, 470 title={Survey of energy-cognizant scheduling techniques}, 471 author={Zhuravlev, Sergey and Saez, Juan Carlos and Blagodurov, Sergey and Fedorova, Alexandra and Prieto, Manuel}, 472 journal={IEEE Transactions on Parallel and Distributed Systems}, 473 volume={24}, 474 number={7}, 475 pages={1447--1464}, 476 year={2012}, 477 publisher={IEEE} 478 } 479 480 @article{vikranth2013topology, 481 title={Topology aware task stealing for on-chip NUMA multi-core processors}, 482 author={Vikranth, BRWACRR and Wankar, Rajeev and Rao, C Raghavendra}, 483 journal={Procedia Computer Science}, 484 volume={18}, 485 pages={379--388}, 486 year={2013}, 487 publisher={Elsevier} 488 } 489 490 @inproceedings{min2011hierarchical, 491 title={Hierarchical work stealing on manycore clusters}, 492 author={Min, Seung-Jai and Iancu, Costin and Yelick, Katherine}, 493 booktitle={Fifth Conference on Partitioned Global Address Space Programming Models (PGAS11)}, 494 volume={625}, 495 year={2011}, 496 organization={Citeseer} 497 } 498 499 @article{ribic2014energy, 500 title={Energy-efficient work-stealing language runtimes}, 501 author={Ribic, Haris and Liu, Yu David}, 502 journal={ACM SIGARCH Computer Architecture News}, 503 volume={42}, 504 number={1}, 505 pages={513--528}, 506 year={2014}, 507 publisher={ACM New York, NY, USA} 508 } 509 510 @inproceedings{torng2016asymmetry, 511 title={Asymmetry-aware work-stealing runtimes}, 512 author={Torng, Christopher and Wang, Moyang and Batten, Christopher}, 513 booktitle={2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)}, 514 pages={40--52}, 515 year={2016}, 516 organization={IEEE} 517 } 461 518 462 519 % -------------------------------------------------- -
doc/theses/thierry_delisle_PhD/thesis/text/conclusion.tex
r9f99799 r7a0f798b 13 13 A MQMS scheduler, with its \proc-specific ready-queues, has poor load-balancing but perfect affinity often resulting in significantly reduced communication. 14 14 However, implementing fairness for an MQMS scheduler is difficult, since fairness requires \procs to be aware of each other's ready-queue progress, \ie communicated knowledge. 15 % This challenge is made harder when comparing against MQMS schedulers (see Section\ref{sched}) which have very little inter-\proc communication. 16 For balanced workloads with little or no data sharing (embarrassingly parallel), an MQMS scheduler is near optimal, \eg state-of-the-art work-stealing schedulers. 15 For balanced workloads with little or no data sharing (embarrassingly parallel), an MQMS scheduler, \eg a state-of-the-art work-stealing scheduler, is near optimal. 17 16 For these kinds of fair workloads, adding fairness must be low-cost to hide the communication costs needed for global ready-queue progress or performance suffers. 17 While I was aware of these realities, I underestimated how little performance margin there is for communication. 18 Several of my attempts at building a fair scheduler compared poorly to work-stealing schedulers because of how thin the margin is. 18 19 19 20 Second, the kernel locking, threading, and I/O in the Linux operating system offers very little flexibility, and are not designed to facilitate user-level threading. -
doc/theses/thierry_delisle_PhD/thesis/text/core.tex
r9f99799 r7a0f798b 125 125 126 126 \subsection{Relaxed-FIFO} 127 A different scheduling approach is to create a ``relaxed-FIFO'' queue, as in \ todo{cite Trevor's paper}.127 A different scheduling approach is to create a ``relaxed-FIFO'' queue, as in \cite{alistarh2018relaxed}. 128 128 This approach forgoes any ownership between \gls{proc} and subqueue, and simply creates a pool of ready-queues from which \glspl{proc} pick. 129 129 Scheduling is performed as follows: -
doc/theses/thierry_delisle_PhD/thesis/text/eval_macro.tex
r9f99799 r7a0f798b 166 166 \label{fig:memcd:updt:vanilla:lat} 167 167 } 168 \caption[Throughput and Latency results at different update rates (percentage of writes).]{Throughput and Latency results at different update rates (percentage of writes).\smallskip\newline \todo{Description}} 168 \caption[Throughput and Latency results at different update rates (percentage of writes).]{Throughput and Latency results at different update rates (percentage of writes).\smallskip\newline On the left, throughput as Desired vs Actual query rate. 169 Target QPS is the query rate that the clients are attempting to maintain and Actual QPS is the rate at which the server is able to respond. 170 On the right, tail latency, \ie 99th Percentile of the response latency as a function of \emph{desired} query rate. 171 For throughput, higher is better, for tail-latency, lower is better. 172 Each series represent 15 independent runs, the dashed lines are maximums of each series while the solid lines are the median and the dotted lines are the minimums.} 173 All runs have 15,360 client connections. 169 174 \label{fig:memcd:updt} 170 175 \end{figure} … … 194 199 However, for the following experiments, NGINX is configured to let the master process decided the appropriate number of threads. 195 200 196 197 % Like memcached, NGINX can be made to use multiple \glspl{kthrd}.198 % It has a very similar architecture to the memcached architecture described in Section~\ref{memcd:thrd}, where multiple \glspl{kthrd} each run a mostly independent network logic.199 % While it does not necessarily use a dedicated listening thread, each connection is arbitrarily assigned to one of the \newterm{worker} threads.200 % Each worker thread handles multiple connections exclusively, effectively dividing the connections into distinct sets.201 % Again, this is effectively the \emph{event-based server} approach.202 %203 % \cit{https://www.nginx.com/blog/inside-nginx-how-we-designed-for-performance-scale/}204 205 % NGINX 1.13.7 is an high-performance, \emph{full-service}, event-driven, with multiple operating-system processes or multiple kernel-threads within a process to handle blocking I/O.206 % It can also .207 % Concurrent connections are handled using a complex event-driven architecture.208 % The NGINX server runs a master process that performs operations such as reading configuration files, binding to ports, and controlling worker processes.209 % NGINX uses a disk-based cache for performance, and assigns a dedicated process to manage the cache.210 % This process, known as the \textit{cache manager}, is spun-off by the master process.211 % Additionally, there can be many \textit{worker processes}, each handling network connections, reading and writing disk files, and communicating with upstream servers, such as reverse proxies or databases.212 213 % A worker is a single-threaded process, running independently of other workers.214 % The worker process handles new incoming connections and processes them.215 % Workers communicate using shared memory for shared cache and session data, and other shared resources.216 % Each worker assigns217 % As in a typical event-driven architecture, the worker listens for events from the clients, and responds immediately without blocking.218 % Memory use in NGINX is very conservative, because it does not spin up a new process or thread per connection, like Apache~\cite{apache} or \CFA.219 % All operations are asynchronous -- implemented using event notifications, callback functions and fine-tuned timers.220 221 201 \subsection{\CFA webserver} 222 202 The \CFA webserver is a straightforward thread-per-connection webserver, where a fixed number of \ats are created upfront. -
doc/theses/thierry_delisle_PhD/thesis/text/eval_micro.tex
r9f99799 r7a0f798b 101 101 \caption[Cycle Benchmark : Pseudo Code]{Cycle Benchmark : Pseudo Code} 102 102 \label{fig:cycle:code} 103 %\end{figure}ll have a physical key so it's not urgent.104 103 \bigskip 105 %\begin{figure}106 104 \subfloat[][Throughput, 100 cycles per \proc]{ 107 105 \resizebox{0.5\linewidth}{!}{ … … 401 399 402 400 Figures~\ref{fig:churn:jax} and Figure~\ref{fig:churn:nasus} show the results for the churn experiment on Intel and AMD, respectively. 403 Looking at the left column on Intel, Figures~\ref{fig:churn:jax:ops} and \ref{fig:churn:jax:ns} show the results for 100 \ats for each \proc have, and all runtimes obtain fairly similar throughput for most \proc counts.401 Looking at the left column on Intel, Figures~\ref{fig:churn:jax:ops} and \ref{fig:churn:jax:ns} show the results for 100 \ats for each \proc, and all runtimes obtain fairly similar throughput for most \proc counts. 404 402 \CFA does very well on a single \proc but quickly loses its advantage over the other runtimes. 405 403 As expected, it scales decently up to 48 \procs, drops from 48 to 72 \procs, and then plateaus. … … 425 423 Libfibre follows very closely behind with basically the same performance and scaling. 426 424 Tokio maintains effectively the same curve shapes as \CFA and libfibre, but it incurs extra costs for all \proc counts. 427 % As a result it is slightly outperformed by \CFA and libfibre.428 425 While Go maintains overall similar results to the others, it again encounters significant variation at high \proc counts. 429 426 Inexplicably resulting in super-linear scaling for some runs, \ie the scalability curves displays a negative slope. … … 497 494 It is also possible to unpark to a third unrelated ready-queue, but without additional knowledge about the situation, it is likely to degrade performance.} 498 495 The locality experiment includes two variations of the churn benchmark, where a data array is added. 499 In both variations, before @V@ing the semaphore, each \at increments random cells inside the data array by calling a @work@ function.496 In both variations, before @V@ing the semaphore, each \at calls a @work@ function which increments random cells inside the data array. 500 497 In the noshare variation, the array is not passed on and each thread continuously accesses its private array. 501 498 In the share variation, the array is passed to another thread via the semaphore's shadow-queue (each blocking thread can save a word of user data in its blocking node), transferring ownership of the array to the woken thread. … … 506 503 In the noshare variation, unparking the \at on the local \proc is an appropriate choice since the data was last modified on that \proc. 507 504 In the shared variation, unparking the \at on a remote \proc is an appropriate choice. 508 \todo{PAB: I changed these sentences around.}509 505 510 506 The expectation for this benchmark is to see a performance inversion, where runtimes fare notably better in the variation which matches their unparking policy. … … 720 716 This scenario is a harder case to handle because corrective measures must be taken even when work is available. 721 717 Note, runtimes with preemption circumvent this problem by forcing the spinner to yield. 718 In \CFA preemption was disabled as it only obfuscates the results. 719 I am not aware of a method to disable preemption in Go. 722 720 723 721 In both variations, the experiment effectively measures how long it takes for all \ats to run once after a given synchronization point. … … 763 761 The semaphore variation is denoted ``Park'', where the number of \ats dwindles down as the new leader is acknowledged. 764 762 The yielding variation is denoted ``Yield''. 765 The experiment is only run for many \procs, since scaling is not the focus of this experiment.763 The experiment is only run for few and many \procs, since scaling is not the focus of this experiment. 766 764 767 765 The first two columns show the results for the semaphore variation on Intel. … … 771 769 Looking at the next two columns, the results for the yield variation on Intel, the story is very different. 772 770 \CFA achieves better latencies, presumably due to no synchronization with the yield. 773 \todo{PAB: what about \CFA preemption? How does that come into play for your scheduler?}774 771 Go does complete the experiment, but with drastically higher latency: 775 772 latency at 2 \procs is $350\times$ higher than \CFA and $70\times$ higher at 192 \procs. 776 This difference is because Go has a classic work-stealing scheduler, but it adds coarse-grain preemption\footnote{ 777 Preemption is done at the function prolog when the goroutine's stack is increasing; 778 whereas \CFA uses fine-grain preemption between any two instructions.} 773 This difference is because Go has a classic work-stealing scheduler, but it adds coarse-grain preemption 779 774 , which interrupts the spinning leader after a period. 780 775 Neither Libfibre or Tokio complete the experiment. -
doc/theses/thierry_delisle_PhD/thesis/text/existing.tex
r9f99799 r7a0f798b 35 35 \subsection{Explicitly Informed Dynamic Schedulers} 36 36 While dynamic schedulers may not have an exhaustive list of dependencies for a \ats, some information may be available about each \ats, \eg expected duration, required resources, relative importance, \etc. 37 When available, a scheduler can then use this information to direct the scheduling decisions. \cit{Examples of schedulers with more information} 38 However, most programmers do not determine or even \emph{predict} this information; 37 When available, a scheduler can then use this information to direct the scheduling decisions. 38 For example, when scheduling in a cloud computing context, \ats will commonly have extra information that was manually entered, \eg caps on compute time or \io usage. 39 However, in the context of user-level threading, most programmers do not determine or even \emph{predict} this information; 39 40 at best, the scheduler has only some imprecise information provided by the programmer, \eg, indicating a \ats takes approximately 3--7 seconds to complete, rather than exactly 5 seconds. 40 41 Providing this kind of information is a significant programmer burden especially if the information does not scale with the number of \ats and their complexity. … … 78 79 Several methods can be employed, but I believe these are less relevant for threads, which are generally explicit and more coarse grained. 79 80 80 \paragraph{Task Placement} Since modern computers rely heavily on cache hierarchies\cit{Do I need a citation for this}, migrating \ats from one core to another can be . \cite{DBLP:journals/tpds/SquillanteL93} 81 82 \todo{The survey is not great on this subject} 81 \paragraph{Task Placement} Another aspect of work stealing that has been studied extensively is the mapping between \at and \proc. 82 In its simplest form, work stealing assumes that all \procs are interchangeable and therefore the mapping between \at and \proc is not interesting. 83 However, in real-life architectures there are contexts where different \procs can have different characteristics, which makes some mapping more interesting than others. 84 An common example where this is statically true is architectures with \acrshort{numa}. 85 In these cases, it can be relevant to change the scheduler to be cognizent of the topology~\cite{vikranth2013topology,min2011hierarchical}. 86 Another example is energy usage, where the scheduler is modified to optimize for energy efficiency in addition/instead of performance~\cite{ribic2014energy,torng2016asymmetry}. 83 87 84 88 \paragraph{Complex Machine Architecture} Another aspect that has been examined is how well work stealing is applicable to different machine architectures. 89 This is arguably strongly related to Task Placement but extends into more heterogeneous architectures. 90 As \CFA offers no particular support for heterogeneous architecture, this is also a area that is less relevant to this thesis. 91 Althought it could be an interesting avenue for future work. 85 92 86 93 \subsection{Theoretical Results} … … 136 143 The scheduler may also temporarily adjust priorities after certain effects like the completion of I/O requests. 137 144 138 In~\cite{russinovich2009windows}, Chapter 1 section ``Processes, Threads, and Jobs''\todo{Look up section number.}discusses the scheduling policy more in depth.145 In~\cite{russinovich2009windows}, Chapter 1 section 2.3 ``Processes, Threads, and Jobs'' discusses the scheduling policy more in depth. 139 146 Multicore scheduling is based on a combination of priorities and preferred \proc. 140 147 Each \at is assigned an initial processor using a round-robin policy, called the \at's \newterm{ideal} \proc. … … 152 159 \end{displayquote} 153 160 154 \todo{load balancing} 161 There is very little documentation on the internals of this scheduler. 162 However, the documentation do describe a feature set that is very similar to the Windows and Linux OS schedulers. 163 Presumably, this means that the internals are also fairly similar overall. 155 164 156 165 \subsection{User-Level Schedulers} … … 208 217 While the documentation only gives limited insight into the scheduling and load balancing approach, \cite{apple:gcd2} suggests a fairly classic approach. 209 218 Each \proc has a queue of \ats to run, called \newterm{blocks}, which are drained in \glsxtrshort{fifo}. 210 \todo{update: They seem to add the concept of dependent queues with clear ordering, where executing a block ends-up scheduling more blocks. 211 In terms of semantics, these Dispatch Queues seem to be very similar to Intel\textregistered ~TBB \lstinline{execute()} and predecessor semantics.} 212 219 GCD also has secondary queues, called \newterm{Dispatch Queues} with clear ordering, where executing a block ends-up scheduling more blocks. 220 In terms of semantics, these Dispatch Queues seem to be very similar to Intel\textregistered ~TBB \lstinline{execute()} and predecessor semantics. 221 222 The similarity of API and semantics between GCD and Intel\textregistered ~TBB suggest the underlying scheduling algorithms are similar. 213 223 214 224 \paragraph{LibFibre} -
doc/theses/thierry_delisle_PhD/thesis/text/front.tex
r9f99799 r7a0f798b 60 60 \noindent 61 61 The following served on the Examining Committee for this thesis. The decision of the Examining Committee is by majority vote. 62 \todo{External Examiners} 63 \bigskip 64 65 \ noindent66 \begin{tabbing} 67 Internal-External Member: \= \kill % using longest text to define tab length68 External Examiner: \> TBD\\69 \> TBD\\62 \bigskip 63 64 \noindent 65 \begin{tabbing} 66 Internal-External Member: \= \kill % using longest text to define tab length 67 External Examiner: \> Doug Lea \\ 68 \> Professor and Department Chair, Computer Science Department \\ 69 \> State University of New York at Oswego \\ 70 70 \end{tabbing} 71 71 \bigskip … … 96 96 \begin{tabbing} 97 97 Internal-External Member: \= \kill % using longest text to define tab length 98 Internal-External Member: \> TBD\\99 \> TBD\\98 Internal-External Member: \> Patrick Lam \\ 99 \> Associate Professor, Department of Electrical and Computer Engineering \\ 100 100 \> University of Waterloo \\ 101 101 \end{tabbing}
Note: See TracChangeset
for help on using the changeset viewer.