\chapter{Conclusion}\label{conclusion}

Building the \CFA runtime has been a challenging project.
The work was divided between high-level concurrency design and a user-level threading runtime (Master's thesis), and low-level support of the user-level runtime using OS kernel-threading and its (multiple) I/O subsystems (Ph.D.\ thesis).
Because I am the main developer for both components of this project, there is strong continuity across the design and implementation.
This continuity provides a consistent approach to advanced control-flow and concurrency, and eases future development, management, and maintenance of the runtime.

I believed my Master's work would provide the background to make the Ph.D.\ work reasonably straightforward.
However, in doing so I discovered two unexpected challenges.
First, while modern symmetric multiprocessing CPUs have significant penalties for communicating across cores, state-of-the-art work-stealing schedulers are very effective at avoiding the need for communication in many common workloads.
This leaves very little space for adding fairness guarantees without notable performance penalties.
Second, the kernel locking, threading, and I/O in the Linux operating system offer very little flexibility of use.
There are multiple concurrency aspects in Linux that require carefully following a strict procedure in order to achieve acceptable performance.
To be fair, many of these concurrency aspects were designed 30--40 years ago, when there were few multi-processor computers and concurrency knowledge was just developing.
It is unfortunate that little has changed in the intervening years.

Also, my decision to use @io_uring@ was both a positive and a negative.
The positive is that @io_uring@ supports the panoply of I/O mechanisms in Linux;
hence, the \CFA runtime uses one I/O mechanism to provide non-blocking I/O, rather than using @select@ to handle TTY I/O, @epoll@ to handle network I/O, and a thread pool to handle disk I/O.
Merging all these different I/O mechanisms into a coherent scheduling implementation would require much more work than what is present in this thesis, as well as detailed knowledge of the I/O mechanisms in Linux.
The negative is that @io_uring@ is new and still developing.
As a result, there is limited documentation, few places to find usage examples, and multiple errors that required workarounds.
Given what I now know about @io_uring@, I would say it is insufficiently coupled with the Linux kernel to properly handle non-blocking I/O.
It does not seem to reach deep into the kernel's handling of \io, and as such it must contend with the same realities as users of @epoll@.
Specifically, even in cases where @O_NONBLOCK@ behaves as desired, operations must still be retried.
To preserve the illusion of asynchronicity, this requires delegating operations to kernel threads.
This is also true of cases where @O_NONBLOCK@ does not prevent blocking.
Spinning up internal kernel threads to handle blocking scenarios is what developers already do outside of the kernel, and managing these threads adds a significant burden to the system.
Non-blocking I/O should not be handled in this way.

\section{Goals}
This work focusses on efficient and fair scheduling of the multiple CPUs that are ubiquitous in all modern computers.
The levels of indirection to the CPUs are:
\begin{itemize}
\item
The \CFA presentation of concurrency through multiple high-level language constructs.
\item
The OS presentation of concurrency through multiple kernel threads within an application.
\item
The OS and library presentation of disk and network I/O, and many secondary library routines that directly and indirectly use these mechanisms.
\end{itemize}
The key aspect of all of these mechanisms is that control flow can block, which immediately hinders any level above from making scheduling decisions.
Fundamentally, scheduling needs to understand all the mechanisms used by threads that affect their state changes.

The underlying goal of this thesis is scheduling the complex hardware components that make up a computer to provide good utilization and fairness.
However, direct hardware scheduling is only possible in the OS.
Instead, this thesis performs arm's-length application scheduling of the hardware components through a set of OS interfaces that indirectly manipulate the hardware components.
This can quickly lead to tensions if the OS interface was built with different use cases in mind.

Because \CFA aims to increase the productivity and safety of C while maintaining its performance, a huge burden is placed on the \CFA runtime to achieve these goals.
Productivity and safety manifest in removing scheduling pitfalls from the efficient usage of the threading runtime.
Performance manifests in making efficient use of the underlying kernel threads that provide indirect access to the CPUs.

This thesis achieves its stated contributions by presenting:
\begin{enumerate}[leftmargin=*]
\item
A scalable low-latency scheduler that offers improved starvation prevention (progress guarantee) compared to other state-of-the-art schedulers, including NUMA awareness.
\item
A core scheduling algorithm that provides increased fairness through helping, as well as optimizations that virtually remove the cost of this fairness.
\item
An implementation of user-level \io blocking incorporated into the scheduler, which achieves the same performance and fairness balance as the scheduler itself.
\item
An extension of these core algorithms with a low-latency idle-sleep mechanism, which allows the \CFA runtime to stay viable for workloads that do not consistently saturate the system.
\end{enumerate}
Finally, the complete scheduler is fairly simple with low-cost execution, meaning the total cost of scheduling during thread state changes is low.

\section{Future Work}
While the \CFA runtime achieves a better compromise, in terms of performance and fairness, than other schedulers, I believe further improvements can be made to reduce or eliminate the few cases where performance does deteriorate.
Fundamentally, achieving performance and starvation freedom will always be goals with opposing needs, even outside of scheduling algorithms.

\subsection{Idle Sleep}
A difficult challenge, not fully addressed in this thesis, is idle sleep.
While a correct and somewhat low-cost idle-sleep mechanism is presented, several of the benchmarks show notable performance degradation when too few \ats are present in the system.
The idle-sleep mechanism could therefore benefit from a reduction in spurious cases of sleeping.
Furthermore, this thesis did not present any heuristic for when \procs should be put to sleep and when \procs should be woken up.
While relaxed timestamps and topology awareness made notable improvements in performance, neither of these techniques is used for the idle-sleep mechanism.

Here are opportunities where these techniques could be used:
\begin{itemize}
\item
The mechanism uses a hand-shake between notification and sleep to ensure that no \at is missed.
\item
The correctness of that hand-shake is critical when the last \proc goes to sleep, but could be relaxed when several \procs are awake.
\item
Furthermore, organizing the sleeping \procs as a LIFO stack makes sense to keep cold \procs as cold as possible, but it might be more appropriate to attempt to keep cold CPU sockets instead.
\end{itemize}
However, using these techniques would require significant investigation.
For example, keeping a CPU socket cold might be appropriate for power-consumption reasons but can affect overall memory bandwidth.
The balance between these approaches is not obvious.

\subsection{Hardware}
One challenge that needed to be overcome for this thesis is that modern x86-64 processors have very few tools to implement fairness.
\Glspl{proc} attempting to help each other inherently cause cache-coherence traffic.
However, as mentioned in Section~\ref{helping}, relaxed requirements mean this traffic is not necessarily productive.
In cases like this one, there is an opportunity to improve performance by extending the hardware.

Many different extensions are suitable here.
For example, when attempting to read remote timestamps for helping, it would be useful to allow cancelling the remote read if it leads to significant latency.
If the latency is due to a recent cache invalidation, it is unlikely the timestamp is old and that helping is needed.
As such, simply moving on without the result is likely to be acceptable.
Another option would be to read multiple memory addresses and only wait for \emph{one of} these reads to retire.
This approach has a similar effect, where cache lines with more traffic would be waited on less often.
In both of these examples, some care is needed to ensure that reads to an address \emph{eventually} retire.

Note, this idea is similar to \newterm{Hardware Transactional Memory}~\cite{HTM}, which allows groups of instructions to be aborted and rolled back if they encounter memory conflicts when being retired.
However, I believe this feature is generally aimed at large groups of instructions.
A more fine-grained approach may be possible by carefully picking which aspects of an algorithm require exact correctness and which do not.