\chapter{Conclusion}\label{conclusion}

Building the \CFA runtime has been a challenging project.
The work was divided between high-level concurrency design and a user-level threading runtime (Master's thesis), and low-level support of the user-level runtime using OS kernel threading and its (multiple) I/O subsystems (Ph.D. thesis).
Because I am the main developer for both components of this project, there is strong continuity across the design and implementation.
This continuity provides a consistent approach to advanced control flow and concurrency, and eases future development, management, and maintenance of the runtime.

I believed my Master's work would provide the background to make the Ph.D. work reasonably straightforward.
However, I discovered two significant challenges.

First, modern symmetric multiprocessing CPUs have significant performance penalties for communication, often cache-related.
An SQMS scheduler (see Section~\ref{sched}), with its \proc-shared ready-queue, has perfect load-balancing but poor affinity, resulting in high communication across \procs.
An MQMS scheduler, with its \proc-specific ready-queues, has poor load-balancing but perfect affinity, often resulting in significantly reduced communication.
However, implementing fairness for an MQMS scheduler is difficult, since fairness requires \procs to be aware of each other's ready-queue progress, \ie communicated knowledge.
For balanced workloads with little or no data sharing, \ie embarrassingly parallel, an MQMS scheduler is near optimal, \eg a state-of-the-art work-stealing scheduler.
For these kinds of workloads, which are already balanced, any added fairness mechanism must be low-cost enough to hide the communication needed to track global ready-queue progress, or performance suffers.
While I was aware of these realities, I underestimated how little performance margin there is for communication.
Several of my attempts at building a fair scheduler compared poorly to work-stealing schedulers because of this thin communication margin.

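To make the contrast concrete, the following is a minimal C sketch of the two designs; the @queue@, @pop@, and @thread_desc@ helpers are illustrative assumptions, not the actual \CFA runtime code:
\begin{cfa}
#include <pthread.h>
#include <stddef.h>

typedef struct thread_desc thread_desc;     // opaque user-level thread
typedef struct {
	thread_desc * items[256];
	size_t head, tail;                      // free-running ring-buffer counters
} queue;
static thread_desc * pop( queue * q ) {
	return (q->head == q->tail) ? NULL : q->items[q->head++ % 256];
}

// SQMS: one shared ready-queue, perfect load-balancing, poor affinity.
typedef struct { pthread_mutex_t lock; queue q; } sqms_t;
thread_desc * sqms_next( sqms_t * s ) {
	pthread_mutex_lock( &s->lock );         // every proc contends on this lock,
	thread_desc * t = pop( &s->q );         // but any proc can run any ready thread
	pthread_mutex_unlock( &s->lock );
	return t;
}

// MQMS: per-proc ready-queues, perfect affinity, poor load-balancing.
typedef struct { queue q; } mqms_proc_t;
thread_desc * mqms_next( mqms_proc_t * self ) {
	return pop( &self->q );                 // no sharing on the fast path, but an
}                                           // empty queue idles while others are full
\end{cfa}
The fairness problem is visible in the sketch: @mqms_next@ never looks at other queues, so guaranteeing global progress requires adding exactly the cross-\proc communication the design avoids.
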
Second, the kernel locking, threading, and I/O in the Linux operating system offer very little flexibility and are not designed to facilitate user-level threading.
There are multiple concurrency aspects in Linux that require carefully following a strict procedure to achieve acceptable performance.
To be fair, many of these concurrency aspects were designed 30--40 years ago, when there were few multiprocessor computers and concurrency knowledge was just developing.
Unfortunately, little has changed in the intervening years.

Also, my decision to use @io_uring@ was both positive and negative.
The positive is that @io_uring@ supports the panoply of I/O mechanisms in Linux;
hence, the \CFA runtime uses one I/O mechanism to provide non-blocking I/O, rather than using @select@ to handle TTY I/O, @epoll@ to handle network I/O, and managing a thread pool to handle disk I/O.
Merging all these different \io mechanisms into a coherent scheduling implementation would require much more work than what is present in this thesis, as well as detailed knowledge of multiple I/O mechanisms.
The negative is that @io_uring@ is new and still developing.
As a result, there is limited documentation, few places to find usage examples, and multiple bugs that required workarounds.

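For illustration, the following is a minimal sketch of the @io_uring@ submission/completion cycle, written against the liburing helper library rather than the raw system calls:
\begin{cfa}
#include <liburing.h>

int read_once( int fd, char * buf, unsigned len ) {
	struct io_uring ring;
	if ( io_uring_queue_init( 8, &ring, 0 ) < 0 ) return -1;

	struct io_uring_sqe * sqe = io_uring_get_sqe( &ring );
	io_uring_prep_read( sqe, fd, buf, len, 0 );   // one interface for any fd type
	io_uring_submit( &ring );                     // hand the request to the kernel

	struct io_uring_cqe * cqe;
	io_uring_wait_cqe( &ring, &cqe );             // wait for completion, not readiness
	int res = cqe->res;                           // bytes read, or -errno on failure
	io_uring_cqe_seen( &ring, cqe );
	io_uring_queue_exit( &ring );
	return res;
}
\end{cfa}
The same prepare/submit/wait cycle applies whether @fd@ is a file, a socket, or a TTY, which is what allows a single I/O subsystem in the runtime.
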
Given what I now know about @io_uring@, I would say it is insufficiently coupled with the Linux kernel to properly handle non-blocking I/O.
It does not seem to reach deep into the kernel's handling of \io, and as such it must contend with the same realities as users of @epoll@.
Specifically, even in cases where @O_NONBLOCK@ behaves as desired, operations must still be retried.
Preserving the illusion of asynchronicity requires delegating these operations to kernel threads.
This requirement also holds in cases where @O_NONBLOCK@ does not prevent blocking.
Spinning up internal kernel threads to handle blocking scenarios is what developers already do outside of the kernel, and managing these threads adds a significant burden to the system.
Non-blocking I/O should not be handled in this way.

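The retry obligation is easy to see in a readiness-based sketch; @park_until_readable@ is a hypothetical runtime hook that blocks the user thread, not the kernel thread:
\begin{cfa}
#include <errno.h>
#include <unistd.h>

void park_until_readable( int fd );             // hypothetical runtime hook

// Minimal sketch of retrying under O_NONBLOCK (not the actual runtime code).
ssize_t nonblocking_read( int fd, void * buf, size_t len ) {
	for ( ;; ) {
		ssize_t r = read( fd, buf, len );
		if ( r >= 0 ) return r;                 // success or end-of-file
		if ( errno != EAGAIN && errno != EWOULDBLOCK )
			return -1;                          // a real error
		park_until_readable( fd );              // yield to the scheduler until
	}                                           // fd becomes readable, then retry
}
\end{cfa}
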
\section{Goals}
This work focuses on efficient and fair scheduling of the multiple CPUs, which are ubiquitous on all modern computers.
The levels of indirection to the CPUs are:
\begin{itemize}
\item
The \CFA presentation of concurrency through multiple high-level language constructs.
\item
The OS presentation of concurrency through multiple kernel threads within an application.
\item
The OS and library presentation of disk and network I/O, and many secondary library routines that directly and indirectly use these mechanisms.
\end{itemize}
The key aspect of all of these mechanisms is that control flow can block, which immediately hinders any level above from making scheduling decisions.
Fundamentally, scheduling needs to understand all the mechanisms used by threads that affect their state changes.

The underlying goal of this thesis is scheduling the complex hardware components that make up a computer to provide good utilization and fairness.
However, direct hardware scheduling is only possible in the OS.
Instead, this thesis performs arm's-length application scheduling of the hardware components through a set of OS interfaces that indirectly manipulate the hardware components.
This can quickly lead to tensions when the OS interface has different use cases in mind.

\CFA aims to increase the productivity and safety of C while maintaining its performance, which places a huge burden on the \CFA runtime to achieve these goals.
Productivity and safety manifest in removing scheduling pitfalls from the efficient usage of the threading runtime.
Performance manifests in making efficient use of the underlying kernel threads that provide indirect access to the CPUs.

This thesis achieves its stated contributions by presenting:
\begin{enumerate}[leftmargin=*]
\item
A scalable, low-latency, NUMA-aware scheduler that offers improved starvation prevention (a progress guarantee) compared to other state-of-the-art schedulers.
\item
The scheduler demonstrates a core algorithm that provides increased fairness through helping, as well as optimizations that virtually remove the cost of this fairness.
\item
An implementation of user-level \io blocking is incorporated into the scheduler, which achieves the same performance and fairness balance as the scheduler itself.
\item
These core algorithms are further extended with a low-latency idle-sleep mechanism, which allows the \CFA runtime to stay viable for workloads that do not consistently saturate the system.
\end{enumerate}
Finally, the complete scheduler is fairly simple with low-cost execution, meaning the total cost of scheduling during thread state changes is low.

\section{Future Work}
While the \CFA runtime achieves a better compromise between performance and fairness than other schedulers, I believe further improvements can be made to reduce or eliminate the few cases where performance deteriorates.
Fundamentally, performance and starvation freedom will always be goals with opposing needs, even outside of scheduling algorithms.

\subsection{Idle Sleep}
A difficult challenge, not fully addressed in this thesis, is idle sleep.
While a correct and somewhat low-cost idle-sleep mechanism is presented, several of the benchmarks show notable performance degradation when too few \ats are present in the system.
The idle-sleep mechanism could therefore benefit from a reduction in spurious cases of sleeping.
Furthermore, this thesis did not present any heuristic for when \procs should be put to sleep and when \procs should be woken up.
While relaxed timestamps and topology awareness made notable performance improvements, neither of these techniques is used for the idle-sleep mechanism.

Here are opportunities where these techniques could be used:
\begin{itemize}
\item
The mechanism uses a handshake between notification and sleep to ensure that no \at is missed (a minimal sketch of this handshake follows the list).
\item
The handshake correctness is critical when the last \proc goes to sleep, but could be relaxed when several \procs are awake.
\item
Furthermore, organizing the sleeping \procs as a LIFO stack makes sense to keep cold \procs as cold as possible, but it might be more appropriate to attempt to keep entire CPU sockets cold instead.
\end{itemize}
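The following sketch of such a notification/sleep handshake uses POSIX primitives and a hypothetical @work_available@ predicate rather than the actual \CFA runtime code; publishing the sleeper count before the final re-check is what ensures no \at is missed:
\begin{cfa}
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

bool work_available( void );                    // hypothetical ready-queue probe

static atomic_uint sleepers;                    // procs in (or entering) idle sleep
static pthread_mutex_t idle_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  idle_cnd = PTHREAD_COND_INITIALIZER;

void idle_sleep( void ) {                       // proc side
	pthread_mutex_lock( &idle_mtx );
	atomic_fetch_add( &sleepers, 1 );           // publish intent to sleep ...
	while ( ! work_available() )                // ... before the final re-check
		pthread_cond_wait( &idle_cnd, &idle_mtx );
	atomic_fetch_sub( &sleepers, 1 );
	pthread_mutex_unlock( &idle_mtx );
}

void notify_one( void ) {                       // unpark side, after readying a thread
	if ( atomic_load( &sleepers ) > 0 ) {       // skip the lock on the fast path
		pthread_mutex_lock( &idle_mtx );
		pthread_cond_signal( &idle_cnd );
		pthread_mutex_unlock( &idle_mtx );
	}
}
\end{cfa}
The second list item corresponds to weakening this handshake: when several \procs are awake, a missed signal only delays an \at rather than stranding it, so cheaper, less synchronized checks may suffice.
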
However, using these techniques would require significant investigation.
For example, keeping a CPU socket cold might be appropriate for power-consumption reasons, but can affect overall memory bandwidth.
The balance between these approaches is not obvious.
I am aware there is a host of low-power research that could be tapped here.

\subsection{Hardware}
One challenge that needed to be overcome for this thesis is that modern x86-64 processors have very few tools for implementing fairness.
\Glspl{proc} attempting to help each other inherently cause cache-coherence traffic.
However, as mentioned in Section~\ref{helping}, relaxed requirements mean this traffic is not necessarily productive.
In cases like this one, there is an opportunity to improve performance by extending the hardware.

Many different extensions are suitable here.
For example, when attempting to read remote timestamps for helping, it would be useful to be able to cancel the remote read if it leads to significant latency.
If the latency is due to a recent cache invalidation, it is unlikely the timestamp is stale and that helping is needed.
As such, simply moving on without the result is likely to be acceptable.
Another option is to read multiple memory addresses and only wait for \emph{one of} these reads to retire.
This approach has a similar effect, where cache lines with more traffic are waited on less often.
In both of these examples, some care is needed to ensure that reads to an address \emph{sometimes} retire.

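As a point of reference, today this remote read is an ordinary load that stalls for the full cache miss; the following hypothetical sketch shows where a cancellable or first-of-many load would slot into the helping check (the @proc_ts@ layout and @threshold@ parameter are illustrative assumptions):
\begin{cfa}
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
	_Atomic(uint64_t) ts;                       // last-progress timestamp, one per proc
} proc_ts;

// Decide whether a remote proc's ready-queue looks stuck and needs helping.
bool should_help( proc_ts * remote, uint64_t now, uint64_t threshold ) {
	// This relaxed load is the expensive part: it may stall on a cache miss
	// caused by the remote proc updating its own timestamp. A hardware
	// extension that abandons the load after a bounded wait would fit here,
	// since a freshly invalidated line implies a fresh timestamp.
	uint64_t ts = atomic_load_explicit( &remote->ts, memory_order_relaxed );
	return (now - ts) > threshold;              // help only if progress looks stalled
}
\end{cfa}
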
Note that this idea is similar to \newterm{Hardware Transactional Memory}~\cite{wiki:htm}, which allows groups of instructions to be aborted and rolled back if they encounter memory conflicts when being retired.
However, I believe this feature is generally aimed at large groups of instructions.
A more fine-grained approach may be more suitable, carefully picking which aspects of an algorithm require exact correctness and which do not.