\chapter{Conclusion}\label{conclusion}

Building the \CFA runtime has been a challenging project.
The work was divided between high-level concurrency design and a user-level threading runtime (Masters thesis), and low-level support of the user-level runtime using OS kernel-threading and its (multiple) I/O subsystems (Ph.D. thesis).
Because I am the main developer for both components of this project, there is strong continuity across the design and implementation.
This continuity provides a consistent approach to advanced control-flow and concurrency, with easier development, management and maintenance of the runtime in the future.

I believed my Masters work would provide the background to make the Ph.D.\ work reasonably straightforward.
However, in doing so I discovered two unexpected challenges.
First, while modern symmetric multiprocessing CPUs have significant penalties for communicating across cores, state-of-the-art work-stealing schedulers are very effective at avoiding the need for communication in many common workloads.
This leaves very little space for adding fairness guarantees without notable performance penalties.
Second, the kernel locking, threading, and I/O in the Linux operating system offer very little flexibility of use.
There are multiple concurrency aspects in Linux that require carefully following a strict procedure to achieve acceptable performance.
To be fair, many of these concurrency aspects were designed 30--40 years ago, when there were few multi-processor computers and concurrency knowledge was just developing.
It is unfortunate that little has changed in the intervening years.

Also, my decision to use @io_uring@ was both a positive and a negative.
The positive is that @io_uring@ supports the panoply of I/O mechanisms in Linux;
hence, the \CFA runtime uses one I/O mechanism to provide non-blocking I/O, rather than using @select@ to handle TTY I/O, @epoll@ to handle network I/O, and managing a thread pool to handle disk I/O.
Merging all these different I/O mechanisms into a coherent scheduling implementation would require much more work than what is present in this thesis, as well as detailed knowledge of the I/O mechanisms in Linux.
The negative is that @io_uring@ is new and developing.
As a result, there is limited documentation, few places to find usage examples, and multiple errors that required workarounds.
Given what I now know about @io_uring@, I would say it is insufficiently coupled with the Linux kernel to properly handle non-blocking I/O.
It does not seem to reach deep into the kernel's handling of \io, and as such it must contend with the same realities as users of @epoll@.
Specifically, in cases where @O_NONBLOCK@ behaves as desired, operations must still be retried.
To preserve the illusion of asynchronicity, this requires delegating operations to kernel threads.
This is also true of cases where @O_NONBLOCK@ does not prevent blocking.
Spinning up internal kernel threads to handle blocking scenarios is what developers already do outside of the kernel, and managing these threads adds significant burden to the system.
Nonblocking I/O should not be handled in this way.

\section{Goals}

The underlying goal of this thesis is scheduling the complex hardware components that make up a computer to provide good utilization and fairness.
However, direct hardware scheduling is only possible in the OS.
Instead, this thesis performs arms-length application scheduling of the hardware components through a complex set of OS interfaces that indirectly manipulate the hardware components.
Coupled with this, the OS is running multiple applications and has its own goals for scheduling among them.
Hence, my goal became the battle of two schedulers.

This work focusses on efficient and fair scheduling of the multiple CPUs, which are ubiquitous in all modern computers.
The levels of indirection to the CPUs are:
\begin{itemize}
\item
The \CFA presentation of concurrency through multiple high-level language constructs.
\item
The OS presentation of concurrency through multiple kernel threads within an application.
\item
The OS and library presentation of disk and network I/O, and many secondary library routines that directly and indirectly use these mechanisms.
\end{itemize}
The key aspect of all of these mechanisms is that control flow can block, which is the enemy of the scheduler.
Fundamentally, scheduling needs to understand all the mechanisms used by threads that affect their state changes.

Interestingly, there is another major hardware component that affects threading: memory.
How memory is organized, and when it is acquired and released, has a significant effect on thread scheduling.
To this end, I worked closely with another graduate student, Mubeen Zulfiqar, on the development of a new memory allocator for \CFA.
(See Mubeen's thesis~\cite{Zulfiqar22} for a discussion of how threading is managed in the \CFA memory-allocator.)

% An important aspect of this approach to threading is how threads are scheduled.
Because \CFA aims to increase the productivity and safety of C while maintaining its performance, a huge burden is placed on the \CFA runtime to achieve these goals.
Productivity and safety manifest in removing scheduling pitfalls from the efficient usage of the threading runtime.
Performance manifests in making efficient use of the underlying kernel threads that provide indirect access to the CPUs.

This thesis achieves its stated contributions by presenting:
\begin{enumerate}[leftmargin=*]
\item
A scalable low-latency scheduler that offers improved starvation prevention (progress guarantee) compared to other state-of-the-art schedulers, including NUMA awareness.
\item
A core algorithm that provides increased fairness through helping, as well as optimizations that virtually remove the cost of this fairness.
\item
An implementation of user-level \io blocking incorporated into the scheduler, which achieves the same performance and fairness balance as the scheduler itself.
\item
An extension of these core algorithms with a low-latency idle-sleep mechanism, which allows the \CFA runtime to stay viable for workloads that do not consistently saturate the system.
\end{enumerate}
Finally, the complete scheduler is fairly simple with low-cost execution, meaning the total cost of scheduling during thread state changes is low.

\section{Future Work}
While the \CFA runtime achieves a better compromise, in terms of performance and fairness, than other schedulers, I believe further improvements can be made to reduce or eliminate the few cases where performance does deteriorate.
Fundamentally, performance and starvation freedom will always be goals with opposing needs, even outside of scheduling algorithms.

\subsection{Idle Sleep}
A difficult challenge, not fully addressed in this thesis, is idle sleep.
While a correct and somewhat low-cost idle-sleep mechanism is presented, several of the benchmarks show notable performance degradation when too few \ats are present in the system.
The idle-sleep mechanism could therefore benefit from a reduction in spurious cases of sleeping.
Furthermore, this thesis did not present any heuristic for when \procs should be put to sleep and when \procs should be woken up.
While relaxed timestamps and topology awareness made notable improvements in performance, neither of these techniques is used for the idle-sleep mechanism.

Here are opportunities where these techniques could be used:
\begin{itemize}
\item
The mechanism uses a hand-shake between notification and sleep to ensure that no \at is missed.
\item
The correctness of that hand-shake is critical when the last \proc goes to sleep, but could be relaxed when several \procs are awake.
\item
Furthermore, organizing the sleeping \procs as a LIFO stack makes sense to keep cold \procs as cold as possible, but it might be more appropriate to attempt to keep cold CPU sockets instead.
\end{itemize}
However, using these techniques would require significant investigation.
For example, keeping a CPU socket cold might be appropriate for power-consumption reasons but can affect overall memory bandwidth.
The balance between these approaches is not obvious.

\subsection{Hardware}
One challenge that needed to be overcome for this thesis is that modern x86-64 processors have very few tools for implementing fairness.
\Glspl{proc} attempting to help each other inherently cause cache-coherence traffic.
However, as mentioned in Section~\ref{helping}, relaxed requirements mean this traffic is not necessarily productive.
In cases like this one, there is an opportunity to improve performance by extending the hardware.

Many different extensions are suitable here.
For example, when attempting to read remote timestamps for helping, it would be useful to allow cancelling the remote read if it leads to significant latency.
If the latency is due to a recent cache invalidation, it is unlikely the timestamp is old and that helping is needed.
As such, simply moving on without the result is likely to be acceptable.
Another option would be to read multiple memory addresses and only wait for \emph{one of} these reads to retire.
This approach has a similar effect, where cache-lines with more traffic would be waited on less often.
In both of these examples, some care is needed to ensure that reads to an address \emph{eventually} retire.

Note, this idea is similar to \newterm{Hardware Transactional Memory}~\cite{HTM}, which allows groups of instructions to be aborted and rolled back if they encounter memory conflicts when being retired.
However, I believe this feature is generally aimed at large groups of instructions.
A more fine-grained approach may be more amenable by carefully picking which aspects of an algorithm require exact correctness and which do not.