Timestamp: Sep 12, 2022, 4:35:50 PM
Branches: ADT, ast-experimental, master, pthread-emulation
Children: fc96890
Parents: 4ab54c9
File: 1 edited
doc/theses/thierry_delisle_PhD/thesis/text/conclusion.tex
Building the \CFA runtime has been a challenging project.
The work was divided between high-level concurrency design and a user-level threading runtime (Master's thesis), and low-level support of the user-level runtime using OS kernel-threading and its (multiple) I/O subsystems (Ph.D. thesis).
Because I am the main developer for both components of this project, there is strong continuity across the design and implementation.
This continuity provides a consistent approach to advanced control flow and concurrency, with easier development, management and maintenance of the runtime in the future.

I believed my Master's work would provide the background to make the Ph.D. work reasonably straightforward.
However, I discovered two significant challenges.

First, modern symmetric multiprocessing CPUs have significant performance penalties for communication, often cache-related.
An SQMS scheduler (see Section~\ref{sched}), with its \proc-shared ready-queue, has perfect load-balancing but poor affinity, resulting in high communication across \procs.
An MQMS scheduler, with its \proc-specific ready-queues, has poor load-balancing but perfect affinity, often resulting in significantly reduced communication.
However, implementing fairness for an MQMS scheduler is difficult, since fairness requires \procs to be aware of each other's ready-queue progress, \ie communicated knowledge.
For balanced workloads with little or no data sharing, \ie embarrassingly parallel, an MQMS scheduler is near optimal, \eg a state-of-the-art work-stealing scheduler.
…
Several of my attempts at building a fair scheduler compared poorly to work-stealing schedulers because of the thin communication margin.

Second, the kernel locking, threading, and I/O in the Linux operating system offer very little flexibility and are not designed to facilitate user-level threading.
There are multiple concurrency aspects in Linux that require carefully following a strict procedure to achieve acceptable performance.
To be fair, many of these concurrency aspects were designed 30--40 years ago, when there were few multi-processor computers and concurrency knowledge was just developing.
Unfortunately, little has changed in the intervening years.

Also, my decision to use @io_uring@ was both positive and negative.
The positive is that @io_uring@ supports the panoply of I/O mechanisms in Linux;
hence, the \CFA runtime uses one I/O mechanism to provide non-blocking I/O, rather than using @select@ to handle TTY I/O, @epoll@ to handle network I/O, and managing a thread pool to handle disk I/O.
Merging all these different I/O mechanisms into a coherent scheduling implementation would require much more work than what is present in this thesis, as well as detailed knowledge of multiple I/O mechanisms.
The negative is that @io_uring@ is new and developing.
As a result, there is limited documentation, few places to find usage examples, and multiple errors that required workarounds.
…
It does not seem to reach deep into the kernel's handling of \io, and as such it must contend with the same realities that users of @epoll@ must contend with.
Specifically, in cases where @O_NONBLOCK@ behaves as desired, operations must still be retried.
Preserving the illusion of asynchronicity requires delegating these operations to kernel threads.
This requirement is also true of cases where @O_NONBLOCK@ does not prevent blocking.
Spinning up internal kernel threads to handle blocking scenarios is what developers already do outside of the kernel, and managing these threads adds a significant burden to the system.
Nonblocking I/O should not be handled in this way.
\section{Goals}
This work focuses on efficient and fair scheduling of the multiple CPUs, which are ubiquitous on all modern computers.
The levels of indirection to the CPUs are:
\begin{itemize}
…
The OS and library presentation of disk and network I/O, and many secondary library routines that directly and indirectly use these mechanisms.
\end{itemize}
The key aspect of all of these mechanisms is that control flow can block, which immediately hinders any level above from making scheduling decisions as a result.
Fundamentally, scheduling needs to understand all the mechanisms used by threads that affect their state changes.
…
Performance manifests in making efficient use of the underlying kernel threads that provide indirect access to the CPUs.

This thesis achieves its stated contributions by presenting:
\begin{enumerate}[leftmargin=*]
\item
…
These core algorithms are further extended with a low-latency idle-sleep mechanism, which allows the \CFA runtime to stay viable for workloads that do not consistently saturate the system.
\end{enumerate}
Finally, the complete scheduler is fairly simple with low-cost execution, meaning the total cost of scheduling during thread state changes is low.
\section{Future Work}
While the \CFA runtime achieves a better compromise between performance and fairness than other schedulers, I believe further improvements can be made to reduce or eliminate the few cases where performance does deteriorate.
Fundamentally, achieving performance and starvation freedom will always be goals with opposing needs, even outside of scheduling algorithms.

\subsection{Idle Sleep}
A difficult challenge, not fully addressed in this thesis, is idle-sleep.
While a correct and somewhat low-cost idle-sleep mechanism is presented, several of the benchmarks show notable performance degradation when too few \ats are present in the system.
The idle-sleep mechanism could therefore benefit from a reduction of spurious cases of sleeping.
Furthermore, this thesis did not present any heuristic for when \procs should be put to sleep and when \procs should be woken up.
While relaxed timestamps and topology awareness made notable performance improvements, neither of these techniques is used for the idle-sleep mechanism.
Here are opportunities where these techniques could be used:
\begin{itemize}
\item
…
\end{itemize}

\subsection{Hardware}
One challenge that needed to be overcome for this thesis is that modern x86-64 processors have very few tools to implement fairness.
\Glspl{proc} attempting to help each other inherently cause cache-coherence traffic.
However, as mentioned in Section~\ref{helping}, relaxed requirements mean this traffic is not necessarily productive.
…
Another option is to read multiple memory addresses and only wait for \emph{one of} these reads to retire.
This approach has a similar effect, where cache lines with more traffic are waited on less often.
In both of these examples, some care is needed to ensure that reads to an address \emph{sometimes} retire.

Note that this idea is similar to \newterm{Hardware Transactional Memory}~\cite{wiki:htm}, which allows groups of instructions to be aborted and rolled back if they encounter memory conflicts when being retired.
However, I believe this feature is generally aimed at large groups of instructions.
A more fine-grained approach may be more amenable by carefully picking which aspects of an algorithm require exact correctness and which do not.