\chapter{Conclusion}\label{conclusion}

Building the \CFA runtime has been an extremely challenging project.
The work was divided between high-level concurrency design and a user-level threading runtime (Masters thesis), and low-level support of the user-level runtime using OS kernel-threading and its (multiple) I/O subsystems (Ph.D. thesis).
Because I am the main developer for both components of this project, there is strong continuity across the design and implementation.
This continuity provides a consistent approach to advanced control-flow and concurrency, with easier development, management and maintenance of the runtime in the future.

I believed my Masters work would provide the background to make the Ph.D. work reasonably straightforward.
What I discovered is that interacting with kernel locking, threading, and I/O in the UNIX (Linux) operating system is extremely difficult.
There are multiple concurrency aspects in UNIX that are poorly designed, not only for user-level threading but also for kernel-level threading.
Basically, UNIX-based OSs are not very concurrency friendly.
To be fair, many of these concurrency aspects were designed 30--40 years ago, when there were few multi-processor computers and concurrency knowledge was just developing.
It is unfortunate that so little has changed in the intervening years.

Also, my decision to use @io_uring@ was both a positive and a negative.
The positive is that @io_uring@ supports the panoply of I/O mechanisms in UNIX;
hence, the \CFA runtime uses one I/O mechanism to provide non-blocking I/O, rather than using @select@ to handle TTY I/O, @epoll@ to handle network I/O, and some unknown mechanism to handle disk I/O.
It is unclear to me how I would have merged all these different I/O mechanisms into a coherent scheduling implementation.
The negative is that @io_uring@ is new and still developing.
As a result, there is limited documentation, few places to find usage examples, and multiple errors that required workarounds.
Given what I now know about @io_uring@, I would say it is insufficiently coupled with the Linux kernel to properly handle non-blocking I/O.
Specifically, spinning up internal kernel threads to handle blocking scenarios is what developers already do outside of the kernel.
Non-blocking I/O should not be handled in this way.

% While \gls{uthrding} is an old idea, going back to the first multi-processor computers, it was largely set aside in the 1990s because of the complexity in making it work well between applications and the operating system.
% Unfortunately, a large amount of that complexity still exists, making \gls{uthrding} a difficult task for library and programming-language implementers.
% For example, the introduction of thread-local storage and its usage in many C libraries causes the serially reusable problem~\cite{SeriallyReusable} for all \gls{uthrding} implementers.
% Specifically, if a \gls{kthrd} is preempted, it always restarts with the same thread-local storage;
% when a user thread is preempted, it can be restarted on another \gls{kthrd}, accessing the new thread-local storage, or worse, the previous thread-local storage.
% The latter case causes failures when an operation using the thread-local storage is assumed to be atomic at the kernel-threading level.
% If library implementers always used the pthreads interface to access threads, locks, and thread-local storage, language runtimes could interpose a \gls{uthrding} version of pthreads, switching the kind of threads, locks, and storage, and hence, make it safe.
% However, C libraries are currently filled with direct declarations of thread-local storage and low-level atomic instructions.
% In essence, every library developer is inventing their own threading mechanisms to solve their unique problem, independent from any standardized approaches.
% This state of affairs explains why concurrency is such a mess.
%
% To address the concurrency mess, new programming languages integrate threading into the language rather than using the operating-system supplied interface.
% The reason is that many sequential code-optimizations invalidate correctly written concurrent programs.
% While providing safe concurrent-code generation is essential, the underlying language runtime is still free to implement threading using kernel or user/kernel threading, \eg Java versus Go.
% In either case, the language/runtime manages all the details to simplify concurrency and increase safety.

\section{Goals}

The underlying goal of this thesis is scheduling the complex hardware components that make up a computer to provide good utilization and fairness.
However, direct hardware scheduling is only possible in the OS.
Instead, this thesis performs arms-length application scheduling of the hardware components through a complex set of OS interfaces that indirectly manipulate the hardware components.
Adding to the complexity, the OS is simultaneously running multiple applications and has its own goals for scheduling among them.
Hence, my work became a battle between two schedulers.

This work focusses on efficient and fair scheduling of the multiple CPUs, which are ubiquitous on all modern computers.
The levels of indirection to the CPUs are:
\begin{itemize}
\item
The \CFA presentation of concurrency through multiple high-level language constructs.
\item
The OS presentation of concurrency through multiple kernel threads within an application.
\item
The OS and library presentation of disk and network I/O, and many secondary library routines that directly and indirectly use these mechanisms.
\end{itemize}
The key aspect of all of these mechanisms is that control flow can block, which is the enemy of the scheduler.
Fundamentally, scheduling needs to understand all the mechanisms used by threads that affect their state changes.

Interestingly, there is another major hardware component that affects threading: memory.
How memory is organized, and when it is acquired and released, has a significant effect on thread scheduling.
To this end, I worked closely with another graduate student, Mubeen Zulfiqar, on the development of a new memory allocator for \CFA.
(See Mubeen's thesis~\cite{Zulfiqar22} for a discussion of how threading is managed in the \CFA memory-allocator.)

% An important aspect of this approach to threading is how threads are scheduled.
As \CFA aims to increase the productivity and safety of C while maintaining its performance, a huge burden is placed on the \CFA runtime to achieve these goals.
Productivity and safety manifest in removing scheduling pitfalls from the efficient usage of the threading runtime.
Performance manifests in making efficient use of the underlying kernel threads that provide indirect access to the CPUs.

This thesis achieves its stated contributions by presenting:
\begin{enumerate}[leftmargin=*]
\item
A scalable low-latency scheduler that offers improved starvation prevention (progress guarantee) compared to other state-of-the-art schedulers, including NUMA awareness.
\item
A core algorithm that provides increased fairness through helping, as well as optimizations that virtually remove the cost of this fairness.
\item
An implementation of user-level \io blocking incorporated into the scheduler, which achieves the same performance and fairness balance as the scheduler itself.
\item
A low-latency idle-sleep mechanism extending the core algorithms, which allows the \CFA runtime to stay viable for workloads that do not consistently saturate the system.
\end{enumerate}
Finally, the complete scheduler is fairly simple with low-cost execution, meaning the total cost of scheduling during thread state changes is low.

\section{Future Work}
While the \CFA runtime achieves a better compromise, in terms of performance and fairness, than other schedulers, I believe further improvements can be made to reduce or eliminate the few cases where performance does deteriorate.
Fundamentally, achieving performance and starvation freedom will always be opposing goals, even outside of scheduling algorithms.

\subsection{Idle Sleep}
A difficult challenge, not fully addressed in this thesis, is idle sleep.
While a correct and somewhat low-cost idle-sleep mechanism is presented, several of the benchmarks show notable performance degradation when too few \ats are present in the system.
The idle-sleep mechanism could therefore benefit from a reduction of spurious cases of sleeping.
Furthermore, this thesis did not present any heuristic for when \procs should be put to sleep and when \procs should be woken up.
While relaxed timestamps and topology awareness made notable improvements in performance, neither of these techniques is used for the idle-sleep mechanism.

Here are opportunities where these techniques could be used:
\begin{itemize}
\item
The mechanism uses a hand-shake between notification and sleep to ensure that no \at is missed.
\item
The correctness of that hand-shake is critical when the last \proc goes to sleep, but could be relaxed when several \procs are awake.
\item
Furthermore, organizing the sleeping \procs as a LIFO stack makes sense to keep cold \procs as cold as possible, but it might be more appropriate to attempt to keep cold CPU sockets instead.
\end{itemize}
However, using these techniques would require significant investigation.
For example, keeping a CPU socket cold might be appropriate for power-consumption reasons but can affect overall memory bandwidth.
The balance between these approaches is not obvious.

\subsection{Hardware}
One challenge that needed to be overcome for this thesis is that modern x86-64 processors have very few tools to implement fairness.
\Glspl{proc} attempting to help each other inherently cause cache-coherence traffic.
However, as mentioned in Section~\ref{helping}, relaxed requirements mean this traffic is not necessarily productive.
In cases like this one, there is an opportunity to improve performance by extending the hardware.

Many different extensions are suitable here.
For example, when attempting to read remote timestamps for helping, it would be useful to allow cancelling the remote read if it leads to significant latency.
If the latency is due to a recent cache invalidation, it is unlikely the timestamp is old and that helping is needed.
As such, simply moving on without the result is likely to be acceptable.
Another option would be to read multiple memory addresses and only wait for \emph{one of} these reads to retire.
This approach has a similar effect, where cache lines with more traffic would be waited on less often.
In both of these examples, some care is needed to ensure that reads to an address \emph{eventually} retire.

Note, this idea is similar to \newterm{Hardware Transactional Memory}~\cite{HTM}, which allows groups of instructions to be aborted and rolled back if they encounter memory conflicts when being retired.
However, I believe this feature is generally aimed at large groups of instructions.
A more fine-grained approach, carefully picking which aspects of an algorithm require exact correctness and which do not, may be more amenable.
---|