\chapter{Conclusion}\label{conclusion}

Building the \CFA runtime has been a challenging project.
The work was divided between high-level concurrency design and a user-level threading runtime (Masters thesis), and low-level support of the user-level runtime using OS kernel-threading and its (multiple) I/O subsystems (Ph.D. thesis).
Because I am the main developer for both components of this project, there is strong continuity across the design and implementation.
This continuity provides a consistent approach to advanced control-flow and concurrency, with easier development, management and maintenance of the runtime in the future.

I believed my Masters work would provide the background to make the Ph.D. work reasonably straightforward.
However, I discovered two significant challenges.

First, modern symmetric multiprocessing CPUs have significant performance penalties for communication (often cache related).
An SQMS scheduler (see Section~\ref{sched}), with its \proc-shared ready-queue, has perfect load-balancing but poor affinity, resulting in high communication across \procs.
An MQMS scheduler, with its \proc-specific ready-queues, has poor load-balancing but perfect affinity, often resulting in significantly reduced communication.
However, implementing fairness for an MQMS scheduler is difficult, since fairness requires \procs to be aware of each other's ready-queue progress, \ie communicated knowledge.
% This challenge is made harder when comparing against MQMS schedulers (see Section~\ref{sched}) which have very little inter-\proc communication.
For balanced workloads with little or no data sharing (embarrassingly parallel), an MQMS scheduler is near optimal, \eg state-of-the-art work-stealing schedulers.
For these kinds of workloads, any added fairness must be low-cost, otherwise the communication needed to track global ready-queue progress overwhelms its benefit and performance suffers.

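The contrast between the two designs is visible in their data structures alone. The following C sketch is purely illustrative (it is not the \CFA scheduler, and all names are invented): the SQMS variant funnels every operation through one lock, while the MQMS variant has a lock-free local fast path but no view of other \procs' queues.

```c
#include <pthread.h>
#include <stddef.h>

#define NPROC 4
#define QCAP  64

typedef struct { int id; } thread_desc;

/* SQMS: a single ready-queue shared by all processors.  The one lock is
   a communication hot-spot, but any processor can run any ready thread,
   giving perfect load-balancing. */
typedef struct {
	pthread_mutex_t lock;
	thread_desc * items[QCAP];
	size_t head, tail;
} sqms_queue;

void sqms_push( sqms_queue * q, thread_desc * t ) {
	pthread_mutex_lock( &q->lock );          /* every push/pop contends here */
	q->items[q->tail++ % QCAP] = t;
	pthread_mutex_unlock( &q->lock );
}

thread_desc * sqms_pop( sqms_queue * q ) {
	pthread_mutex_lock( &q->lock );
	thread_desc * t = (q->head == q->tail) ? NULL : q->items[q->head++ % QCAP];
	pthread_mutex_unlock( &q->lock );
	return t;
}

/* MQMS: one ready-queue per processor.  The fast path touches only local
   memory, giving good affinity, but a processor cannot see work piling up
   (or starving) on other queues without extra communication. */
typedef struct {
	thread_desc * items[QCAP];
	size_t head, tail;
} local_queue;

typedef struct {
	local_queue per_proc[NPROC];
} mqms_sched;

void mqms_push( mqms_sched * s, int proc, thread_desc * t ) {
	local_queue * q = &s->per_proc[proc];
	q->items[q->tail++ % QCAP] = t;          /* no sharing, no lock on this path */
}

thread_desc * mqms_pop( mqms_sched * s, int proc ) {
	local_queue * q = &s->per_proc[proc];
	return (q->head == q->tail) ? NULL : q->items[q->head++ % QCAP];
}
```

Any fairness added to the MQMS design must reintroduce some of the cross-\proc communication the design avoids, which is why it must be kept off this fast path.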
Second, the kernel locking, threading, and I/O mechanisms in the Linux operating system offer very little flexibility and are not designed to facilitate user-level threading.
There are multiple concurrency aspects in Linux that require carefully following a strict procedure to achieve acceptable performance.
To be fair, many of these concurrency aspects were designed 30--40 years ago, when there were few multi-processor computers and concurrency knowledge was just developing.
It is unfortunate that little has changed in the intervening years.

Also, my decision to use @io_uring@ was both positive and negative.
The positive is that @io_uring@ supports the panoply of I/O mechanisms in Linux;
hence, the \CFA runtime uses one I/O mechanism to provide non-blocking I/O, rather than using @select@ to handle TTY I/O, @epoll@ to handle network I/O, and managing a thread pool to handle disk I/O.
Merging all these different I/O mechanisms into a coherent scheduling implementation would require much more work than is present in this thesis, as well as detailed knowledge of multiple I/O mechanisms.
The negative is that @io_uring@ is new and still developing.
As a result, there is limited documentation, few places to find usage examples, and multiple errors that required workarounds.

Given what I now know about @io_uring@, I would say it is insufficiently coupled with the Linux kernel to properly handle non-blocking I/O.
It does not seem to reach deep into the kernel's handling of \io, and as such, it must contend with the same realities as users of @epoll@.
Specifically, even in cases where @O_NONBLOCK@ behaves as desired, operations must still be retried.
Preserving the illusion of asynchronicity requires delegating these operations to kernel threads.
This requirement is also true of cases where @O_NONBLOCK@ does not prevent blocking.
Spinning up internal kernel threads to handle blocking scenarios is what developers already do outside of the kernel, and managing these threads adds a significant burden to the system.
Non-blocking I/O should not be handled in this way.

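The retry obligation described above can be illustrated with a plain @read@ on a non-blocking file descriptor. This sketch is not the \CFA implementation; where it spins and yields, the real runtime parks the user thread and reschedules the operation when the I/O subsystem reports readiness.

```c
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>

/* Even when O_NONBLOCK behaves as desired, a read that finds no data
   fails with EAGAIN and must be retried later -- the kernel does not
   complete it in the background.  A user-level runtime must therefore
   park the thread and reissue the operation itself. */
static ssize_t nonblocking_read( int fd, char * buf, size_t len ) {
	for ( ;; ) {
		ssize_t n = read( fd, buf, len );
		if ( n >= 0 ) return n;               /* data, or 0 at EOF */
		if ( errno != EAGAIN && errno != EWOULDBLOCK ) return -1;
		/* EAGAIN: nothing ready yet.  A real runtime blocks the user
		   thread here and retries on readiness; this sketch just yields. */
		sched_yield();
	}
}
```

Hiding this loop behind kernel threads, as @io_uring@ ends up doing internally, simply relocates the burden rather than removing it.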
\section{Goals}
This work focusses on efficient and fair scheduling of the multiple CPUs that are ubiquitous on all modern computers.
The levels of indirection to the CPUs are:
\begin{itemize}
\item
The \CFA presentation of concurrency through multiple high-level language constructs.
\item
The OS presentation of concurrency through multiple kernel threads within an application.
\item
The OS and library presentation of disk and network I/O, and many secondary library routines that directly and indirectly use these mechanisms.
\end{itemize}
The key aspect of all of these mechanisms is that control flow can block, which immediately hinders any level above from making scheduling decisions as a result.
Fundamentally, scheduling needs to understand all the mechanisms used by threads that affect their state changes.

The underlying goal of this thesis is scheduling the complex hardware components that make up a computer to provide good utilization and fairness.
However, direct hardware scheduling is only possible in the OS.
Instead, this thesis performs arms-length application scheduling of the hardware components through a set of OS interfaces that indirectly manipulate the hardware components.
This indirection can quickly lead to tensions when the OS interface has different use cases in mind.

As \CFA aims to increase the productivity and safety of C while maintaining its performance, a huge burden is placed on the \CFA runtime to achieve these goals.
Productivity and safety manifest in removing scheduling pitfalls from the efficient usage of the threading runtime.
Performance manifests in making efficient use of the underlying kernel threads that provide indirect access to the CPUs.

This thesis achieves its stated contributions by presenting:
\begin{enumerate}[leftmargin=*]
\item
A scalable low-latency scheduler that offers improved starvation prevention (progress guarantee) compared to other state-of-the-art schedulers, including NUMA awareness.
\item
The scheduler demonstrates a core algorithm that provides increased fairness through helping, as well as optimizations that virtually remove the cost of this fairness.
\item
An implementation of user-level \io blocking is incorporated into the scheduler, which achieves the same performance and fairness balance as the scheduler itself.
\item
These core algorithms are further extended with a low-latency idle-sleep mechanism, which allows the \CFA runtime to stay viable for workloads that do not consistently saturate the system.
\end{enumerate}
Finally, the complete scheduler is fairly simple with low-cost execution, meaning the total cost of scheduling during thread state-changes is low.

\section{Future Work}
While the \CFA runtime achieves a better compromise, in terms of performance and fairness, than other schedulers, I believe further improvements can be made to reduce or eliminate the few cases where performance does deteriorate.
Fundamentally, performance and starvation freedom will always be goals with opposing needs, even outside of scheduling algorithms.

\subsection{Idle Sleep}
A difficult challenge, not fully addressed in this thesis, is idle-sleep.
While a correct and somewhat low-cost idle-sleep mechanism is presented, several of the benchmarks show notable performance degradation when too few \ats are present in the system.
The idle-sleep mechanism could therefore benefit from a reduction of spurious cases of sleeping.
Furthermore, this thesis did not present any heuristic for when \procs should be put to sleep and when \procs should be woken up.
While relaxed timestamps and topology awareness made notable improvements in performance, neither of these techniques is used for the idle-sleep mechanism.

Here are opportunities where these techniques could be used:
\begin{itemize}
\item
The mechanism uses a hand-shake between notification and sleep to ensure that no \at is missed.
\item
The hand-shake correctness is critical when the last \proc goes to sleep, but could be relaxed when several \procs are awake.
\item
Furthermore, organizing the sleeping \procs as a LIFO stack makes sense to keep cold \procs as cold as possible, but it might be more appropriate to attempt to keep CPU sockets cold instead.
\end{itemize}
However, using these techniques would require significant investigation.
For example, keeping a CPU socket cold might be appropriate for power-consumption reasons but can affect overall memory bandwidth.
The balance between these approaches is not obvious.
I am aware there is a host of low-power research that could be tapped here.

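The hand-shake mentioned above can be sketched with a mutex/condition pair standing in for the runtime's lower-level primitives (an illustrative reconstruction, not the \CFA code). The key ordering is that the sleeper re-checks the ready-queue after advertising itself, and the notifier publishes work before signalling, so no wake-up can be lost.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hand-shake between a processor going idle and a processor submitting
   work.  The sleeper (1) marks itself idle, (2) re-checks for work, and
   only then (3) sleeps; the notifier makes work visible before waking.
   Whichever way the two sides interleave, no notification is missed. */
typedef struct {
	pthread_mutex_t lock;
	pthread_cond_t  wake;
	int  ready_work;     /* stand-in for the ready-queue size */
	bool idle;           /* stand-in for the idle list */
} idle_gate;

void go_idle( idle_gate * g ) {
	pthread_mutex_lock( &g->lock );
	g->idle = true;                       /* step 1: advertise idleness */
	while ( g->ready_work == 0 )          /* step 2: re-check under lock */
		pthread_cond_wait( &g->wake, &g->lock );  /* step 3: sleep */
	g->idle = false;
	g->ready_work -= 1;                   /* consume one unit of work */
	pthread_mutex_unlock( &g->lock );
}

void notify_work( idle_gate * g ) {
	pthread_mutex_lock( &g->lock );
	g->ready_work += 1;                   /* publish work first ... */
	if ( g->idle )
		pthread_cond_signal( &g->wake );  /* ... then wake a sleeper */
	pthread_mutex_unlock( &g->lock );
}
```

The relaxation suggested above would weaken this ordering when several \procs are known to be awake, trading a small chance of a delayed wake-up for less synchronization on the common path.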
\subsection{Hardware}
One challenge that needed to be overcome for this thesis is that modern x86-64 processors have very few tools to implement fairness.
\Glspl{proc} attempting to help each other inherently cause cache-coherence traffic.
However, as mentioned in Section~\ref{helping}, relaxed requirements mean this traffic is not necessarily productive.
In cases like this one, there is an opportunity to improve performance by extending the hardware.

Many different extensions are suitable here.
For example, when attempting to read remote timestamps for helping, it would be useful to allow cancelling the remote read if it leads to significant latency.
If the latency is due to a recent cache invalidation, it is unlikely the timestamp is old and that helping is needed.
As such, simply moving on without the result is likely to be acceptable.
Another option is to read multiple memory addresses and only wait for \emph{one of} these reads to retire.
This approach has a similar effect, where cache lines with more traffic are waited on less often.
In both of these examples, some care is needed to ensure that reads to an address \emph{eventually} retire.

Note, this idea is similar to \newterm{Hardware Transactional Memory}~\cite{HTM}, which allows groups of instructions to be aborted and rolled back if they encounter memory conflicts when being retired.
However, I believe this feature is generally aimed at large groups of instructions.
A more fine-grained approach may be more suitable, carefully picking which aspects of an algorithm require exact correctness and which do not.
---|