\chapter{Scheduling in practice}\label{practice}
The scheduling algorithm described in Chapter~\ref{core} addresses scheduling in a stable state.
However, it does not address problems that occur when the system changes state.
Indeed, the \CFA runtime supports expanding and shrinking the number of \procs, both manually and, to some extent, automatically.
Hence, the scheduling algorithm must support these transitions.

More precisely, \CFA supports adding \procs using the RAII object @processor@.
These objects can be created and destroyed at any time.
They are normally created as automatic stack variables, but this is not a requirement.

The consequence is that the scheduler and \io subsystems must support \procs coming in and out of existence.

---|
\section{Manual Resizing}
Manual resizing is expected to be a rare operation.
Programmers are mostly expected to resize clusters on startup or teardown.
Therefore, dynamically changing the number of \procs is an appropriate moment to allocate or free resources to match the new state.
As such, all internal arrays that are sized based on the number of \procs need to be \texttt{realloc}ed.
This also means that any references into these arrays, whether pointers or indexes, may need to be fixed when shrinking\footnote{Indexes may still need fixing when shrinking because some indexes are expected to refer to dense contiguous resources and there is no guarantee the resource being removed has the highest index.}.

There are no performance requirements, within reason, for resizing since it is expected to be rare.
However, this operation has strict correctness requirements since shrinking and idle sleep can easily lead to deadlocks.
It should also avoid as much as possible any effect on performance when the number of \procs remains constant.
This latter requirement prohibits naive solutions, like simply adding a global lock to the ready-queue arrays.

---|
\subsection{Read-Copy-Update}
One solution is to use the Read-Copy-Update\cite{wiki:rcu} pattern.
In this pattern, resizing is done by creating a copy of the internal data structures, updating the copy with the desired changes, and then attempting an Indiana Jones switch to replace the original with the copy.
This approach potentially has the advantage that it may not need any synchronization to do the switch.
However, there is a race where \procs could still use the previous, original, data structure after the copy is switched in.
This race not only requires some added memory-reclamation scheme, it also requires that operations made on the stale original version be eventually moved to the copy.

For linked lists, enqueuing is only somewhat problematic: \ats enqueued to the original queues need to be transferred to the new queues, which might not preserve ordering.
Dequeuing is more challenging.
Dequeuing from the original does not necessarily update the copy, which could lead to multiple \procs dequeuing the same \at.
Fixing this requires more synchronization or more indirection on the queues.

Another challenge is that the original must be kept until all \procs have witnessed the change.
This is a straightforward memory-reclamation challenge, but it does mean that every operation needs \emph{some} form of synchronization.
If each of these operations does need synchronization, then it is possible a simpler solution achieves the same performance.
Indeed, in addition to the classic challenge of memory reclamation, transferring the original data to the copy before reclaiming it poses additional challenges, especially merging subqueues while having a minimal impact on fairness and locality.

---|
\subsection{Readers-Writer Lock}
A simpler approach is to use a \newterm{Readers-Writer Lock}\cite{wiki:rwlock}, where resizing requires acquiring the lock as a writer while simply enqueuing or dequeuing \ats requires acquiring the lock as a reader.
Using a readers-writer lock solves the problem of dynamically resizing and leaves the challenge of finding or building a lock with sufficiently good read-side performance.
Since this is not a very complex challenge and an ad hoc solution is perfectly acceptable, building a readers-writer lock was the path taken.

To maximize reader scalability, the readers should not contend with each other when attempting to acquire and release the critical sections.
This effectively requires that each reader have its own piece of memory to mark as locked and unlocked.
Readers then acquire the lock by waiting for writers to finish the critical section and then acquiring their local spinlocks.
Writers acquire the global lock, so writers have mutual exclusion among themselves, and then acquire each of the local reader locks.
Acquiring all the local locks guarantees mutual exclusion between the readers and the writer, while the wait on the read side prevents readers from continuously starving the writer.
\todo{reference listings}

---|
\begin{lstlisting}
void read_lock() {
	// Step 1 : make sure no writers are in
	while( write_lock ) { Pause(); }

	// May need fence here

	// Step 2 : acquire our local lock
	while( atomic_xchg( tls.lock ) ) {
		Pause();
	}
}

void read_unlock() {
	tls.lock = false;
}
\end{lstlisting}

---|
\begin{lstlisting}
void write_lock() {
	// Step 1 : lock global lock
	while( atomic_xchg( write_lock ) ) {
		Pause();
	}

	// Step 2 : lock per-proc locks
	for( t : all_tls ) {
		while( atomic_xchg( t.lock ) ) {
			Pause();
		}
	}
}

void write_unlock() {
	// Step 1 : release local locks
	for( t : all_tls ) {
		t.lock = false;
	}

	// Step 2 : release global lock
	write_lock = false;
}
\end{lstlisting}
---|

\section{Idle-Sleep}
In addition to users manually changing the number of \procs, it is desirable to support ``removing'' \procs when there are not enough \ats for all the \procs to be useful.
While manual resizing is expected to be rare, the number of \ats is expected to vary much more, which means \procs may need to be ``removed'' for only short periods of time.
Furthermore, race conditions that spuriously lead to the impression that no \ats are ready are actually common in practice.
Therefore, resources associated with \procs should not be freed; instead, \procs are simply put into an idle state where the \gls{kthrd} is blocked until more \ats become ready.
This state is referred to as \newterm{Idle-Sleep}.

Idle sleep effectively encompasses several challenges.
First, some data structure needs to keep track of all \procs that are in idle sleep.
Because idle sleep can be spurious, this data structure has strict performance requirements, in addition to the strict correctness requirements.
Next, some tool must be used to block \glspl{kthrd}, \eg \texttt{pthread\_cond\_wait} or pthread semaphores.
The complexity here is to support \at parking and unparking, timers, \io operations and all other \CFA features with minimal complexity.
Finally, idle sleep also includes a heuristic to determine the appropriate number of \procs to be in idle sleep at any given time.
This third challenge is, however, outside the scope of this thesis because developing a general heuristic is involved enough to justify its own work.
The \CFA scheduler simply follows the ``Race-to-Idle''\cit{https://doi.org/10.1137/1.9781611973099.100} approach, where a sleeping \proc is woken any time an \at becomes ready and \procs go to idle sleep any time they run out of work.

---|
\section{Sleeping}
As usual, the cornerstone of any feature related to the kernel is the choice of system call.
In terms of blocking a \gls{kthrd} until some event occurs, the Linux kernel has many available options.

\subsection{\texttt{pthread\_mutex}/\texttt{pthread\_cond}}
The most classic option is to use some combination of \texttt{pthread\_mutex} and \texttt{pthread\_cond}.
These serve as straightforward mutual exclusion and synchronization tools and allow a \gls{kthrd} to wait on a \texttt{pthread\_cond} until signalled.
While this approach is generally perfectly appropriate for \glspl{kthrd} waiting on each other, \io operations do not signal \texttt{pthread\_cond}s.
For \io results to wake a \proc waiting on a \texttt{pthread\_cond} means that a different \gls{kthrd} must be woken up first, which can then signal the \proc.

---|
\subsection{\texttt{io\_uring} and Epoll}
An alternative is to flip the problem on its head and block waiting for \io, using \texttt{io\_uring} or even \texttt{epoll}.
This creates the inverse situation, where \io operations directly wake sleeping \procs but waking a \proc from a running \gls{kthrd} must use an indirect scheme.
This generally takes the form of creating a file descriptor, \eg a dummy file, a pipe or an event fd, and using that file descriptor when \procs need to wake each other.
This leads to additional complexity because there can be a race between these artificial \io operations and genuine \io operations.
If not handled correctly, this race can lead to the artificial files going out of sync.

---|
\subsection{Event FDs}
Another interesting approach is to use an event file descriptor\cit{eventfd}.
This Linux feature is a file descriptor that behaves like \io, \ie uses \texttt{read} and \texttt{write}, but also behaves like a semaphore.
Indeed, all reads and writes must use 64-bit values\footnote{On 64-bit Linux; a 32-bit Linux would use 32-bit values.}.
Writes add their value to the buffer, that is, arithmetic addition and not buffer append, and reads zero out the buffer and return the value it held\footnote{This is without the \texttt{EFD\_SEMAPHORE} flag. This flag changes the behaviour of \texttt{read} but is not needed for this work.}.
If a read is made while the buffer is already 0, the read blocks until a non-0 value is added.
---|
What makes this feature particularly interesting is that \texttt{io\_uring} supports the \texttt{IORING\_REGISTER\_EVENTFD} command, to register an event fd to a particular instance.
Once that instance is registered, any \io completion results in \texttt{io\_uring} writing to the event fd.
This means that a \proc waiting on the event fd can be \emph{directly} woken up by either other \procs or incoming \io.

---|
\begin{figure}
\centering
\input{idle1.pstex_t}
\caption[Basic Idle Sleep Data Structure]{Basic Idle Sleep Data Structure \smallskip\newline Each idle \proc is put onto a doubly-linked stack protected by a lock.
Each \proc has a private event fd.}
\label{fig:idle1}
\end{figure}


---|
\section{Tracking Sleepers}
Tracking which \procs are in idle sleep requires a data structure holding all the sleeping \procs, but more importantly it requires a concurrent \emph{handshake} so that no \at is stranded on a ready-queue with no active \proc.
The classic challenge is when a \at is made ready while a \proc is going to sleep: there is a race where the new \at may not see the sleeping \proc and the sleeping \proc may not see the ready \at.
Since \ats can be made ready by timers, \io operations or other events outside a cluster, this race can occur even if the \proc going to sleep is the only \proc awake.
As a result, improper handling of this race can lead to all \procs going to sleep and the system deadlocking.

Furthermore, the ``Race-to-Idle'' approach means that there may be contention on the data structure tracking sleepers.
Contention slowing down \procs attempting to sleep or wake up can be tolerated.
These \procs are not doing useful work and therefore not contributing to overall performance.
However, notifying, \ie checking if a \proc must be woken up and doing so if needed, can significantly affect overall performance and must be low cost.

---|
\subsection{Sleepers List}
Each cluster maintains a list of idle \procs, organized as a stack.
This ordering hopefully allows \procs at the tail to stay in idle sleep for extended periods of time.
Because of these unbalanced performance requirements, the algorithm tracking sleepers is designed to have idle \procs handle as much of the work as possible.
The idle \procs maintain the list of sleepers among themselves, and notifying a sleeping \proc takes as little work as possible.
This approach means that maintaining the list is fairly straightforward.
The list can simply use a single lock per cluster, and only \procs that are getting in and out of the idle state contend for that lock.

This approach also simplifies notification.
Indeed, \procs need to be notified when a new \at is readied, but they also must be notified during resizing, so the \gls{kthrd} can be joined.
This means that whichever entity removes idle \procs from the sleeper list must be able to do so in any order.
Using a simple lock over this data structure makes the removal much simpler than using a lock-free data structure.
The notification process then simply needs to wake up the desired idle \proc, using \texttt{pthread\_cond\_signal}, \texttt{write} on an fd, etc., and the \proc handles the rest.

---|
\subsection{Reducing Latency}
As mentioned in this section, \procs going idle for extremely short periods of time is likely in certain common scenarios.
Therefore, the latency of system calls to read from and write to the event fd can actually negatively affect overall performance in a notable way.
It is important to reduce the latency and contention of the notification as much as possible.
Figure~\ref{fig:idle1} shows the basic idle-sleep data structure.
For the notifiers, this data structure can cause contention on the lock, and the event fd system call can cause notable latency.

---|
\begin{figure}
\centering
\input{idle2.pstex_t}
\caption[Improved Idle Sleep Data Structure]{Improved Idle Sleep Data Structure \smallskip\newline An atomic pointer is added to the list, pointing to the event fd of the first \proc on the list.}
\label{fig:idle2}
\end{figure}
---|

The contention is mostly due to the lock on the list needing to be held to get to the head \proc.
That lock can be contended by \procs attempting to go to sleep, \procs waking or notification attempts.
The contention from the \procs attempting to go to sleep can be mitigated slightly by using \texttt{try\_acquire} instead, so the \procs simply continue searching for \ats if the lock is held.
This trick cannot be used for waking \procs since they are not in a state where they can run \ats.
However, it is worth noting that notification does not strictly require accessing the list or the head \proc.
Therefore, contention can be reduced notably by adding a pointer to the event fd of the first idle \proc and having notifiers avoid the lock entirely, as in Figure~\ref{fig:idle2}.
To avoid contention between the notifiers, instead of simply reading the atomic pointer, notifiers atomically exchange it to \texttt{null}, so only one notifier contends on the system call.
---|

\begin{figure}
\centering
\input{idle_state.pstex_t}
\caption[Idle-Sleep State Machine]{Idle-Sleep State Machine \smallskip\newline The flag beside each \proc's event fd is atomically exchanged between the \texttt{SEARCH}, \texttt{SLEEP} and \texttt{AWAKE} states.}
\label{fig:idle:state}
\end{figure}
---|

The next optimization is to avoid the latency of the event fd when possible.
This can be done by adding what is effectively a benaphore\cit{benaphore} in front of the event fd.
A simple three-state flag is added beside the event fd to avoid unnecessary system calls, as shown in Figure~\ref{fig:idle:state}.
The flag starts in state \texttt{SEARCH}, while the \proc is searching for \ats to run.
The \proc then confirms the sleep by atomically swapping the state to \texttt{SLEEP}.
If the previous state was still \texttt{SEARCH}, then the \proc does read the event fd.
Meanwhile, notifiers atomically exchange the state to \texttt{AWAKE}.
If the previous state was \texttt{SLEEP}, then the notifier must write to the event fd.
However, if the notification arrives almost immediately after the \proc marks itself idle, then both reads and writes on the event fd can be omitted, which reduces latency notably.
---|
This leads to the final data structure shown in Figure~\ref{fig:idle}.

\begin{figure}
\centering
\input{idle.pstex_t}
\caption[Low-Latency Idle Sleep Data Structure]{Low-Latency Idle Sleep Data Structure \smallskip\newline Each idle \proc is put onto a doubly-linked stack protected by a lock.
Each \proc has a private event fd with a benaphore in front of it.
The list also has an atomic pointer to the event fd and benaphore of the first \proc on the list.}
\label{fig:idle}
\end{figure}