\chapter{Conclusion}\label{conclusion}

Building the \CFA runtime has been a challenging project.
The work was divided between high-level concurrency design and a user-level threading runtime (Master's thesis), and low-level support of the user-level runtime using OS kernel threading and its (multiple) I/O subsystems (Ph.D. thesis).
Because I am the main developer for both components of this project, there is strong continuity across the design and implementation.
This continuity provides a consistent approach to advanced control flow and concurrency, and eases future development, management, and maintenance of the runtime.

I believed my Master's work would provide the background to make the Ph.D. work reasonably straightforward.
However, I discovered two significant challenges.

First, modern symmetric multiprocessing CPUs have significant performance penalties for communication, often cache-related.
An SQMS scheduler (see Section~\ref{sched}), with its \proc-shared ready-queue, has perfect load-balancing but poor affinity, resulting in high communication across \procs.
An MQMS scheduler, with its \proc-specific ready-queues, has poor load-balancing but perfect affinity, often resulting in significantly reduced communication.
However, implementing fairness for an MQMS scheduler is difficult, since fairness requires \procs to be aware of each other's ready-queue progress, \ie communicated knowledge.
For balanced workloads with little or no data sharing, \ie embarrassingly parallel, an MQMS scheduler is near optimal, \eg a state-of-the-art work-stealing scheduler.
For these kinds of workloads, any added fairness must be low-cost, or the communication needed to track global ready-queue progress erodes performance.
While I was aware of these realities, I underestimated how little performance margin there is for communication.
Several of my attempts at building a fair scheduler compared poorly to work-stealing schedulers because of this thin communication margin.

Second, the kernel locking, threading, and I/O in the Linux operating system offer very little flexibility and are not designed to facilitate user-level threading.
Multiple concurrency aspects in Linux require carefully following a strict procedure to achieve acceptable performance.
To be fair, many of these concurrency aspects were designed 30--40 years ago, when there were few multiprocessor computers and concurrency knowledge was just developing.
Unfortunately, little has changed in the intervening years.

Also, my decision to use @io_uring@ was both positive and negative.
The positive is that @io_uring@ supports the panoply of I/O mechanisms in Linux;
hence, the \CFA runtime uses one I/O mechanism to provide non-blocking I/O, rather than using @select@ to handle TTY I/O, @epoll@ to handle network I/O, and managing a thread pool to handle disk I/O.
Merging all these different \io mechanisms into a coherent scheduling implementation would require much more work than is present in this thesis, as well as detailed knowledge of multiple I/O mechanisms.
The negative is that @io_uring@ is new and still developing.
As a result, there is limited documentation, few places to find usage examples, and multiple bugs that required workarounds.

Given what I now know about @io_uring@, I would say it is insufficiently coupled with the Linux kernel to properly handle non-blocking I/O.
It does not seem to reach deep into the kernel's handling of \io, and as such it must contend with the same realities as users of @epoll@.
Specifically, even in cases where @O_NONBLOCK@ behaves as desired, operations must still be retried.
Preserving the illusion of asynchronicity requires delegating these operations to kernel threads.
This requirement also holds in cases where @O_NONBLOCK@ does not prevent blocking.
Spinning up internal kernel threads to handle blocking scenarios is what developers already do outside of the kernel, and managing these threads adds a significant burden to the system.
Nonblocking I/O should not be handled in this way.
Presumably, this is better handled by Windows's ``overlapped I/O''; however, porting \CFA to Windows is far beyond the scope of this work.

\section{Goals}
This work focuses on efficient and fair scheduling of the multiple CPUs, which are ubiquitous on all modern computers.
The levels of indirection to the CPUs are:
\begin{itemize}
\item
The \CFA presentation of concurrency through multiple high-level language constructs.
\item
The OS presentation of concurrency through multiple kernel threads within an application.
\item
The OS and library presentation of disk and network I/O, and many secondary library routines that directly and indirectly use these mechanisms.
\end{itemize}
The key aspect of all of these mechanisms is that control flow can block, which immediately hinders any level above from making scheduling decisions.
Fundamentally, scheduling needs to understand all the mechanisms used by threads that affect their state changes.

The underlying goal of this thesis is scheduling the complex hardware components that make up a computer to provide good utilization and fairness.
However, direct hardware scheduling is only possible in the OS.
Instead, this thesis performs arm's-length application scheduling of the hardware components through a set of OS interfaces that indirectly manipulate them.
This can quickly lead to tension when the OS interface was designed with different use cases in mind.

Because \CFA aims to increase the productivity and safety of C while maintaining its performance, a huge burden is placed on the \CFA runtime to achieve these goals.
Productivity and safety manifest in removing scheduling pitfalls from efficient use of the threading runtime.
Performance manifests in making efficient use of the underlying kernel threads that provide indirect access to the CPUs.

This thesis achieves its stated contributions by presenting:
\begin{enumerate}[leftmargin=*]
\item
A scalable low-latency scheduler that offers improved starvation prevention (progress guarantee) compared to other state-of-the-art schedulers, including NUMA awareness.
\item
A core algorithm that provides increased fairness through helping, as well as optimizations that virtually remove the cost of this fairness.
\item
An implementation of user-level \io blocking incorporated into the scheduler, which achieves the same balance of performance and fairness as the scheduler itself.
\item
A low-latency idle-sleep mechanism extending these core algorithms, which allows the \CFA runtime to stay viable for workloads that do not consistently saturate the system.
\end{enumerate}
Finally, the complete scheduler is fairly simple with low-cost execution, meaning the total cost of scheduling during thread state changes is low.

\section{Future Work}
While the \CFA runtime achieves a better compromise than other schedulers in terms of performance and fairness, I believe further improvements can be made to reduce or eliminate the few cases where performance deteriorates.
Fundamentally, performance and starvation freedom will always be goals with opposing needs, even outside of scheduling algorithms.

\subsection{Idle Sleep}
A difficult challenge, not fully addressed in this thesis, is idle sleep.
While a correct and somewhat low-cost idle-sleep mechanism is presented, several of the benchmarks show notable performance degradation when too few \ats are present in the system.
The idle-sleep mechanism could therefore benefit from a reduction in spurious sleeps.
Furthermore, this thesis did not present any heuristic for when \procs should be put to sleep and when they should be woken up.
While relaxed timestamps and topology awareness produced notable performance improvements, neither of these techniques is used for the idle-sleep mechanism.

Here are opportunities where these techniques could be used:
\begin{itemize}
\item
The mechanism uses a handshake between notification and sleep to ensure that no \at is missed.
\item
The handshake's correctness is critical when the last \proc goes to sleep, but could be relaxed when several \procs are awake.
\item
Organizing the sleeping \procs as a LIFO stack makes sense to keep cold \procs as cold as possible, but it might be more appropriate to attempt to keep entire CPU sockets cold instead.
\end{itemize}
However, using these techniques would require significant investigation.
For example, keeping a CPU socket cold might be appropriate for power-consumption reasons but can affect overall memory bandwidth.
The balance between these approaches is not obvious.
I am aware there is a host of low-power research that could be tapped here.

\subsection{CPU Workloads}
A performance consideration related to idle sleep is CPU utilization, \ie how easily the scheduler can keep every \proc doing useful work.
CPU utilization generally becomes an issue for workloads that are compute bound but where dependencies among \ats can prevent the scheduler from easily spreading work across the \procs.
Examining such workloads in the context of scheduling would be interesting.
However, such workloads are inherently more complex than the applications examined in this thesis, and as such warrant their own work.

\subsection{Hardware}
One challenge that needed to be overcome for this thesis is that modern x86-64 processors have very few tools to implement fairness.
\Glspl{proc} attempting to help each other inherently cause cache-coherence traffic.
However, as mentioned in Section~\ref{helping}, relaxed requirements mean this traffic is not necessarily productive.
In cases like this one, there is an opportunity to improve performance by extending the hardware.

Many different extensions are suitable here.
For example, when attempting to read remote timestamps for helping, it would be useful to be able to cancel the remote read if it leads to significant latency.
If the latency is due to a recent cache invalidation, it is unlikely the timestamp is old and that helping is needed.
As such, simply moving on without the result is likely to be acceptable.
Another option is to read multiple memory addresses and wait for only \emph{one of} these reads to retire.
This approach has a similar effect, where cache lines with more traffic are waited on less often.
In both of these examples, some care is needed to ensure that reads to an address \emph{sometimes} retire.

Note that this idea is similar to \newterm{Hardware Transactional Memory}~\cite{wiki:htm}, which allows groups of instructions to be aborted and rolled back if they encounter memory conflicts when being retired.
However, I believe this feature is generally aimed at large groups of instructions.
A more fine-grained approach, carefully picking which aspects of an algorithm require exact correctness and which do not, may be more amenable.