source: doc/theses/thierry_delisle_PhD/thesis/text/conclusion.tex @ ff56dd2e

Last change on this file since ff56dd2e was ddcaff6, checked in by Thierry Delisle <tdelisle@…>, 2 years ago

Last corrections to my thesis... hopefully

  • Property mode set to 100644
File size: 10.7 KB
Line 
1\chapter{Conclusion}\label{conclusion}
2
3Building the \CFA runtime has been a challenging project.
4The work was divided between high-level concurrency design and a user-level threading runtime (Masters' thesis), and low-level support of the user-level runtime using OS kernel threading and its (multiple) I/O subsystems (Ph.D. thesis).
5Because I am the main developer for both components of this project, there is strong continuity across the design and implementation.
6This continuity provides a consistent approach to advanced control flow and concurrency, with easier development, management and maintenance of the runtime in the future.
7
8I believed my Masters' work would provide the background to make the Ph.D. work reasonably straightforward.
9However, I discovered two significant challenges.
10
11First, modern symmetric multiprocessing CPUs have significant performance penalties for communication, often cache-related.
12An SQMS scheduler (see Section~\ref{sched}), with its \proc-shared ready-queue, has perfect load-balancing but poor affinity resulting in high communication across \procs.
13An MQMS scheduler, with its \proc-specific ready-queues, has poor load-balancing but perfect affinity often resulting in significantly reduced communication.
14However, implementing fairness for an MQMS scheduler is difficult, since fairness requires \procs to be aware of each other's ready-queue progress, \ie communicated knowledge.
15For balanced workloads with little or no data sharing, \ie embarrassingly parallel, an MQMS scheduler is near optimal, \eg a state-of-the-art work-stealing scheduler.
16For these kinds of fair workloads, adding fairness must be low-cost to hide the communication costs needed for global ready-queue progress or performance suffers.
17While I was aware of these realities, I underestimated how little performance margin there is for communication.
18Several of my attempts at building a fair scheduler compared poorly to work-stealing schedulers because of the thin communication margin.
19
20Second, the kernel locking, threading, and I/O in the Linux operating system offer very little flexibility and are not designed to facilitate user-level threading.
21There are multiple concurrency aspects in Linux that require carefully following a strict procedure to achieve acceptable performance.
22To be fair, many of these concurrency aspects were designed 30-40 years ago, when there were few multiprocessor computers and concurrency knowledge was just developing.
23Unfortunately, little has changed in the intervening years.
24
25Also, my decision to use @io_uring@ was both positive and negative.
26The positive is that @io_uring@ supports the panoply of I/O mechanisms in Linux;
27hence, the \CFA runtime uses one I/O mechanism to provide non-blocking I/O, rather than using @select@ to handle TTY I/O, @epoll@ to handle network I/O, and managing a thread pool to handle disk I/O.
28Merging all these different \io mechanisms into a coherent scheduling implementation would require much more work than what is present in this thesis, as well as detailed knowledge of multiple I/O mechanisms.
29The negative is that @io_uring@ is new and developing.
30As a result, there is limited documentation, few places to find usage examples, and multiple errors that required workarounds.
31
32Given what I now know about @io_uring@, I would say it is insufficiently coupled with the Linux kernel to properly handle non-blocking I/O.
33It does not seem to reach deep into the kernel's handling of \io, and as such it must contend with the same realities that users of @epoll@ must contend with.
34Specifically, in cases where @O_NONBLOCK@ behaves as desired, operations must still be retried.
35Preserving the illusion of asynchronicity requires delegating these operations to kernel threads.
36This requirement is also true of cases where @O_NONBLOCK@ does not prevent blocking.
37Spinning up internal kernel threads to handle blocking scenarios is what developers already do outside of the kernel, and managing these threads adds a significant burden to the system.
38Nonblocking I/O should not be handled in this way.
39Presumably, this is better handled by Windows's ``overlapped I/O'', however porting \CFA to Windows is far beyond the scope of this work.
40
41\section{Goals}
42This work focuses on efficient and fair scheduling of the multiple CPUs, which are ubiquitous on all modern computers.
43The levels of indirection to the CPUs are:
44\begin{itemize}
45\item
46The \CFA presentation of concurrency through multiple high-level language constructs.
47\item
48The OS presentation of concurrency through multiple kernel threads within an application.
49\item
50The OS and library presentation of disk and network I/O, and many secondary library routines that directly and indirectly use these mechanisms.
51\end{itemize}
52The key aspect of all of these mechanisms is that control flow can block, which immediately hinders any level above from making scheduling decisions as a result.
53Fundamentally, scheduling needs to understand all the mechanisms used by threads that affect their state changes.
54
55The underlying goal of this thesis is scheduling the complex hardware components that make up a computer to provide good utilization and fairness.
56However, direct hardware scheduling is only possible in the OS.
57Instead, this thesis is performing arms-length application scheduling of the hardware components through a set of OS interfaces that indirectly manipulate the hardware components.
58This can quickly lead to tensions when the OS interface has different use cases in mind.
59
60As \CFA aims to increase productivity and safety of C, while maintaining its performance, this places a huge burden on the \CFA runtime to achieve these goals.
61Productivity and safety manifest in removing scheduling pitfalls from the efficient usage of the threading runtime.
62Performance manifests in making efficient use of the underlying kernel threads that provide indirect access to the CPUs.
63
64This thesis achieves its stated contributions by presenting:
65\begin{enumerate}[leftmargin=*]
66\item
67A scalable low-latency scheduler that offers improved starvation prevention (progress guarantee) compared to other state-of-the-art schedulers, including NUMA awareness.
68\item
69The scheduler demonstrates a core algorithm that provides increased fairness through helping, as well as optimizations which virtually remove the cost of this fairness.
70\item
71An implementation of user-level \io blocking is incorporated into the scheduler, which achieves the same performance and fairness balance as the scheduler itself.
72\item
73These core algorithms are further extended with a low-latency idle-sleep mechanism, which allows the \CFA runtime to stay viable for workloads that do not consistently saturate the system.
74\end{enumerate}
75Finally, the complete scheduler is fairly simple with low-cost execution, meaning the total cost of scheduling during thread state changes is low.
76
77\section{Future Work}
78While the \CFA runtime achieves a better compromise than other schedulers, in terms of performance and fairness, I believe further improvements can be made to reduce or eliminate the few cases where performance does deteriorate.
79Fundamentally, achieving performance and starvation freedom will always be goals with opposing needs even outside of scheduling algorithms.
80
81\subsection{Idle Sleep}
82A difficult challenge, not fully addressed in this thesis, is idle sleep.
83While a correct and somewhat low-cost idle-sleep mechanism is presented, several of the benchmarks show notable performance degradation when too few \ats are present in the system.
84The idle sleep mechanism could therefore benefit from a reduction of spurious cases of sleeping.
85Furthermore, this thesis did not present any heuristic for when \procs should be put to sleep and when \procs should be woken up.
86While relaxed timestamps and topology awareness made notable performance improvements, neither of these techniques are used for the idle-sleep mechanism.
87
88Here are opportunities where these techniques could be used:
89\begin{itemize}
90\item
91The mechanism uses a handshake between notification and sleep to ensure that no \at is missed.
92\item
93The handshake correctness is critical when the last \proc goes to sleep but could be relaxed when several \procs are awake.
94\item
95Furthermore, organizing the sleeping \procs as a LIFO stack makes sense to keep cold \procs as cold as possible, but it might be more appropriate to attempt to keep cold CPU sockets instead.
96\end{itemize}
97However, using these techniques would require significant investigation.
98For example, keeping a CPU socket cold might be appropriate for power consumption reasons but can affect overall memory bandwidth.
99The balance between these approaches is not obvious.
100I am aware there is a host of low-power research that could be tapped here.
101
102\subsection{CPU Workloads}
103A performance consideration related to idle sleep is cpu utilization, \ie, how easy is it
104CPU utilization generally becomes an issue for workloads that are compute bound but where the dependencies among \ats can prevent the scheduler from easily.
105Examining such workloads in the context of scheduling would be interesting.
106However, such workloads are inherently more complex than applications examined in this thesis, and as such warrant it's own work.
107
108\subsection{Hardware}
109One challenge that needed to be overcome for this thesis is that the modern x86-64 processors have very few tools to implement fairness.
110\Glspl{proc} attempting to help each other inherently cause cache-coherence traffic.
111However, as mentioned in Section~\ref{helping}, relaxed requirements mean this traffic is not necessarily productive.
112In cases like this one, there is an opportunity to improve performance by extending the hardware.
113
114Many different extensions are suitable here.
115For example, when attempting to read remote timestamps for helping, it would be useful to allow cancelling the remote read if it leads to significant latency.
116If the latency is due to a recent cache invalidation, it is unlikely the timestamp is old and that helping is needed.
117As such, simply moving on without the result is likely to be acceptable.
118Another option is to read multiple memory addresses and only wait for \emph{one of} these reads to retire.
119This approach has a similar effect, where cache lines with more traffic are waited on less often.
120In both of these examples, some care is needed to ensure that reads to an address \emph{sometimes} retire.
121
122Note that this idea is similar to \newterm{Hardware Transactional Memory}~\cite{wiki:htm}, which allows groups of instructions to be aborted and rolled back if they encounter memory conflicts when being retired.
123However, I believe this feature is generally aimed at large groups of instructions.
124A more fine-grained approach may be more amenable by carefully picking which aspects of an algorithm require exact correctness and which do not.
Note: See TracBrowser for help on using the repository browser.