\chapter{Conclusion}\label{conclusion}

Building the \CFA runtime has been a challenging project.
The work was divided between high-level concurrency design and a user-level threading runtime (Masters thesis), and low-level support of the user-level runtime using OS kernel-threading and its (multiple) I/O subsystems (Ph.D. thesis).
Because I am the main developer for both components of this project, there is strong continuity across the design and implementation.
This continuity provides a consistent approach to advanced control-flow and concurrency, with easier development, management and maintenance of the runtime in the future.

I believed my Masters work would provide the background to make the Ph.D. work reasonably straightforward.
However, I discovered two significant challenges.

First, modern symmetric multiprocessing CPUs have significant performance penalties for communication (often cache-related).
An SQMS scheduler (see Section~\ref{sched}), with its \proc-shared ready-queue, has perfect load-balancing but poor affinity, resulting in high communication across \procs.
An MQMS scheduler, with its \proc-specific ready-queues, has poor load-balancing but perfect affinity, often resulting in significantly reduced communication.
However, implementing fairness for an MQMS scheduler is difficult, since fairness requires \procs to be aware of each other's ready-queue progress, \ie communicated knowledge.
For balanced workloads with little or no data sharing (embarrassingly parallel), an MQMS scheduler, \eg a state-of-the-art work-stealing scheduler, is near optimal.
For these kinds of workloads, any added fairness must be cheap enough to hide the communication needed to track global ready-queue progress, or performance suffers.
While I was aware of these realities, I underestimated how little performance margin there is for communication.
Several of my attempts at building a fair scheduler compared poorly to work-stealing schedulers because of how thin this margin is.

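To make this trade-off concrete, the following sketch shows one possible MQMS pop policy in plain C.
It is illustrative only and not the \CFA scheduler: the queue layout, the probe period of 8, and the @starved@ threshold are invented for the example.
The point is that the single remote load of the timestamp of another \proc is exactly the communication whose cost must stay near zero.
\begin{lstlisting}
#include <stdatomic.h>

#define NPROCS 16

struct ready_queue {
	_Atomic unsigned long last_pop;        // timestamp of this queue's last successful pop
	// ... per-proc ring of ready threads elided ...
};
static struct ready_queue queues[NPROCS];

// Pick the queue proc `me` pops from: usually its own (perfect affinity), but
// periodically probe one other queue and help it if it appears starved.  The
// single remote atomic load below is the communication whose cost must stay
// near zero for this policy to compete with pure work-stealing.
static int choose_queue( int me, unsigned long now, unsigned long starved ) {
	static _Thread_local unsigned count = 0;
	if ( ++count % 8 != 0 ) return me;                   // most pops stay local
	int victim = (int)((me + count) % NPROCS);           // occasionally probe another proc
	unsigned long seen = atomic_load_explicit( &queues[victim].last_pop,
	                                           memory_order_relaxed );
	return ( now - seen > starved ) ? victim : me;       // help only if clearly starved
}
\end{lstlisting}
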
Second, the kernel locking, threading, and I/O in the Linux operating system offer very little flexibility and are not designed to facilitate user-level threading.
There are multiple concurrency aspects in Linux that require carefully following a strict procedure to achieve acceptable performance.
To be fair, many of these concurrency aspects were designed 30--40 years ago, when there were few multi-processor computers and concurrency knowledge was just developing.
It is unfortunate that little has changed in the intervening years.

Also, my decision to use @io_uring@ was both a positive and a negative.
The positive is that @io_uring@ supports the panoply of I/O mechanisms in Linux;
hence, the \CFA runtime uses one I/O mechanism to provide non-blocking I/O, rather than using @select@ to handle TTY I/O, @epoll@ to handle network I/O, and managing a thread pool to handle disk I/O.
Merging all these different I/O mechanisms into a coherent scheduling implementation would require much more work than is present in this thesis, as well as detailed knowledge of multiple I/O mechanisms.
The negative is that @io_uring@ is new and still developing.
As a result, there is limited documentation, few places to find usage examples, and multiple errors that required workarounds.

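As an illustration of this uniformity, the following minimal sketch (using the @liburing@ helper library, not the \CFA runtime's I/O subsystem) submits a single read and waits for its completion; the same submission/completion pattern covers sockets, pipes, and regular files alike.
\begin{lstlisting}
#include <liburing.h>

// Minimal, self-contained example: one read request submitted and reaped
// through a ring that could equally carry socket, pipe, or disk operations.
int read_once( int fd, void * buf, unsigned len ) {
	struct io_uring ring;
	if ( io_uring_queue_init( 8, &ring, 0 ) < 0 ) return -1;

	struct io_uring_sqe * sqe = io_uring_get_sqe( &ring );  // grab a submission entry
	io_uring_prep_read( sqe, fd, buf, len, 0 );             // describe the read
	io_uring_submit( &ring );                               // hand it to the kernel

	struct io_uring_cqe * cqe;
	io_uring_wait_cqe( &ring, &cqe );                       // block until it completes
	int res = cqe->res;                                     // bytes read, or -errno
	io_uring_cqe_seen( &ring, cqe );
	io_uring_queue_exit( &ring );
	return res;
}
\end{lstlisting}
A user-level runtime cannot, of course, afford to block a kernel thread in @io_uring_wait_cqe@ for every request; completions must instead be polled or batched on behalf of many \ats.
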
Given what I now know about @io_uring@, I would say it is insufficiently coupled with the Linux kernel to properly handle non-blocking I/O.
It does not seem to reach deep into the kernel's handling of \io, and as such it contends with the same realities that users of @epoll@ face.
Specifically, in cases where @O_NONBLOCK@ behaves as desired, operations must still be retried.
Preserving the illusion of asynchronicity requires delegating these operations to kernel threads.
The same is true of cases where @O_NONBLOCK@ does not prevent blocking.
Spinning up internal kernel threads to handle blocking scenarios is what developers already do outside of the kernel, and managing these threads adds a significant burden to the system.
Non-blocking I/O should not be handled in this way.

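The retry in question is the familiar pattern sketched below: with @O_NONBLOCK@ set, an operation that cannot proceed fails with @EAGAIN@ and must be re-issued once the file descriptor becomes ready, so some thread must own that retry loop.
The @wait_until_readable@ helper is hypothetical, standing in for whatever readiness mechanism (\eg @epoll@) parks the caller.
\begin{lstlisting}
#include <errno.h>
#include <unistd.h>

extern void wait_until_readable( int fd );   // hypothetical: park until epoll (or similar) says ready

// With O_NONBLOCK set on fd, a read that cannot proceed fails immediately with
// EAGAIN/EWOULDBLOCK and must be re-issued later.  Some thread has to own this
// retry loop -- this is the work being delegated.
ssize_t retrying_read( int fd, void * buf, size_t len ) {
	for ( ;; ) {
		ssize_t r = read( fd, buf, len );
		if ( r >= 0 ) return r;                          // success (or end of file)
		if ( errno == EAGAIN || errno == EWOULDBLOCK ) {
			wait_until_readable( fd );                   // wait for readiness, then retry
			continue;
		}
		if ( errno == EINTR ) continue;                  // interrupted: just retry
		return -1;                                       // real error
	}
}
\end{lstlisting}
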
\section{Goals}
This work focusses on efficient and fair scheduling of the multiple CPUs that are ubiquitous in all modern computers.
The levels of indirection to the CPUs are:
\begin{itemize}
\item
The \CFA presentation of concurrency through multiple high-level language constructs.
\item
The OS presentation of concurrency through multiple kernel threads within an application.
\item
The OS and library presentation of disk and network I/O, and many secondary library routines that directly and indirectly use these mechanisms.
\end{itemize}
The key aspect of all of these mechanisms is that control flow can block, which immediately hinders any level above from making scheduling decisions.
Fundamentally, scheduling needs to understand all the mechanisms used by threads that affect their state changes.

The underlying goal of this thesis is scheduling the complex hardware components that make up a computer to provide good utilization and fairness.
However, direct hardware scheduling is only possible in the OS.
Instead, this thesis performs arm's-length application scheduling of the hardware through a set of OS interfaces that indirectly manipulate those components.
This can quickly lead to tensions when the OS interface has different use cases in mind.

Because \CFA aims to increase the productivity and safety of C while maintaining its performance, a huge burden is placed on the \CFA runtime to achieve these goals.
Productivity and safety manifest in removing scheduling pitfalls from the efficient use of the threading runtime.
Performance manifests in making efficient use of the underlying kernel threads that provide indirect access to the CPUs.

This thesis achieves its stated contributions by presenting:
\begin{enumerate}[leftmargin=*]
\item
A scalable low-latency scheduler that offers improved starvation prevention (progress guarantee) compared to other state-of-the-art schedulers, including NUMA awareness.
\item
A core algorithm that provides increased fairness through helping, along with optimizations that virtually remove the cost of this fairness.
\item
An implementation of user-level \io blocking, incorporated into the scheduler, that achieves the same balance of performance and fairness as the scheduler itself.
\item
A low-latency idle-sleep mechanism that extends these core algorithms, allowing the \CFA runtime to stay viable for workloads that do not consistently saturate the system.
\end{enumerate}
Finally, the complete scheduler is fairly simple with low-cost execution, meaning the total cost of scheduling during thread state-changes is low.

\section{Future Work}
While the \CFA runtime achieves a better compromise, in terms of performance and fairness, than other schedulers, I believe further improvements can be made to reduce or eliminate the few cases where performance does deteriorate.
Fundamentally, performance and starvation freedom will always be goals with opposing needs, even outside of scheduling algorithms.

\subsection{Idle Sleep}
A difficult challenge, not fully addressed in this thesis, is idle sleep.
While a correct and somewhat low-cost idle-sleep mechanism is presented, several of the benchmarks show notable performance degradation when too few \ats are present in the system.
The idle-sleep mechanism could therefore benefit from a reduction in spurious cases of sleeping.
Furthermore, this thesis did not present any heuristic for when \procs should be put to sleep and when they should be woken up.
While relaxed timestamps and topology awareness made notable improvements in performance, neither of these techniques is used for the idle-sleep mechanism.

Here are opportunities where these techniques could be used:
\begin{itemize}
\item
The mechanism uses a hand-shake between notification and sleep to ensure that no \at is missed (a sketch of this hand-shake follows the list).
This hand-shake is critical for correctness when the last \proc goes to sleep, but could be relaxed when several \procs are awake.
\item
Furthermore, organizing the sleeping \procs as a LIFO stack makes sense to keep cold \procs as cold as possible, but it might be more appropriate to attempt to keep entire CPU sockets cold instead.
\end{itemize}
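
For concreteness, the following sketch shows the shape of such a hand-shake using an @eventfd@ per \proc.
The helpers @mark_idle@ and @ready_queue_nonempty@ are hypothetical stand-ins, the necessary memory fences are reduced to comments, and the code is not the runtime's implementation; it only shows why the ordering matters and where it could be relaxed while other \procs remain awake.
\begin{lstlisting}
#include <stdbool.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

extern bool ready_queue_nonempty( void );     // hypothetical: any ready task anywhere?
extern void mark_idle( int proc, bool idle ); // hypothetical: publish/retract this proc as idle

// Sleeper: advertise the sleeper, THEN re-check for work, THEN block.
void proc_sleep( int proc, int evfd ) {
	mark_idle( proc, true );                  // 1. publish "I am going to sleep"
	// (a fence is needed here so steps 1 and 2 are not reordered)
	if ( ready_queue_nonempty() ) {           // 2. last chance: did work arrive already?
		mark_idle( proc, false );
		return;
	}
	uint64_t val;
	read( evfd, &val, sizeof(val) );          // 3. block until a notifier writes the eventfd
	mark_idle( proc, false );
}

// Notifier: publish the work first, then wake a sleeper if one is advertised.
void notify_one( int idle_proc_evfd ) {
	uint64_t one = 1;
	write( idle_proc_evfd, &one, sizeof(one) );   // wakes a blocked reader
}
\end{lstlisting}
The relaxation suggested in the list amounts to skipping or cheapening step 2 while at least one other \proc is known to be awake.
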
However, using these techniques (relaxed timestamps and topology awareness) would require significant investigation.
For example, keeping a CPU socket cold might be appropriate for power-consumption reasons but can affect overall memory bandwidth.
The balance between these approaches is not obvious.
I am aware there is a host of low-power research that could be tapped here.

\subsection{Hardware}
One challenge that needed to be overcome for this thesis is that modern x86-64 processors have very few tools for implementing fairness.
\Glspl{proc} attempting to help each other inherently cause cache-coherence traffic.
However, as mentioned in Section~\ref{helping}, relaxed requirements mean this traffic is not necessarily productive.
In cases like this one, there is an opportunity to improve performance by extending the hardware.

Many different extensions are suitable here.
For example, when attempting to read remote timestamps for helping, it would be useful to allow cancelling the remote read if it incurs significant latency.
If the latency is due to a recent cache invalidation, it is unlikely that the timestamp is old, and hence unlikely that helping is needed.
As such, simply moving on without the result is likely to be acceptable.
Another option is to read multiple memory addresses and only wait for \emph{one of} these reads to retire.
This approach has a similar effect, where cache lines with more traffic are waited on less often.
In both of these examples, some care is needed to ensure that reads to an address \emph{eventually} retire.

Note, this idea is similar to \newterm{Hardware Transactional Memory}~\cite{wiki:htm}, which allows groups of instructions to be aborted and rolled back if they encounter memory conflicts when being retired.
However, I believe this feature is generally aimed at large groups of instructions.
A more fine-grained approach, carefully picking which aspects of an algorithm require exact correctness and which do not, may be more amenable.
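
For reference, the sketch below shows what the existing, coarse-grained model looks like with Intel's RTM intrinsics: a remote timestamp read wrapped in a transaction that is simply abandoned if it aborts.
This is only an approximation of the cancellable read described above, since RTM aborts on conflicts rather than on latency, and it assumes TSX-capable hardware and compilation with @-mrtm@.
\begin{lstlisting}
#include <immintrin.h>   // compile with -mrtm; requires TSX-capable hardware

// Wrap a remote timestamp read in a hardware transaction; if the transaction
// aborts (for example, the cache line is being written concurrently), give up
// and move on without the value -- a coarse stand-in for a cancellable read.
static inline int try_read_timestamp( volatile unsigned long * remote, unsigned long * out ) {
	unsigned status = _xbegin();
	if ( status == _XBEGIN_STARTED ) {
		*out = *remote;                      // rolled back if a conflict occurs
		_xend();
		return 1;                            // got a conflict-free value
	}
	return 0;                                // aborted: skip helping this time
}
\end{lstlisting}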
---|