\chapter{Scheduling in practice}\label{practice}
The scheduling algorithm described in Chapter~\ref{core} addresses scheduling in a stable state.
However, it does not address problems that occur when the system changes state.
Indeed, the \CFA runtime supports expanding and shrinking the number of \procs, both manually and, to some extent, automatically.
Hence, the scheduling algorithm must support these transitions.

More precisely, \CFA supports adding \procs using the RAII object @processor@.
These objects can be created at any time and can be destroyed at any time.
They are normally created as automatic stack variables, but this is not a requirement.
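
For example, a compute-heavy phase of a program can temporarily raise the number of \procs with block-scoped @processor@ objects. The sketch below assumes the default constructor attaches the new \proc to the current cluster; exact constructor details may differ.
\begin{lstlisting}
void parallel_phase() {
	processor extra[3];	// add 3 kernel threads for this phase

	// ... run ready threads with increased parallelism ...

}	// the processors are destroyed here, shrinking the cluster
\end{lstlisting}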

The consequence is that the scheduler and \io subsystems must support \procs coming in and out of existence.

\section{Manual Resizing}
Manual resizing is expected to be a rare operation.
Programmers are mostly expected to resize clusters on startup or teardown.
Therefore, dynamically changing the number of \procs is an appropriate moment to allocate or free resources to match the new state.
As such, all internal arrays that are sized based on the number of \procs need to be @realloc@ed.
This also means that any references into these arrays, whether pointers or indexes, may need to be fixed when shrinking\footnote{Indexes may still need fixing when shrinking because some indexes are expected to refer to dense contiguous resources, and there is no guarantee the resource being removed has the highest index.}.
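
For illustration, shrinking such a dense array might look like the following sketch, where the last element is moved into the vacated slot to keep the array contiguous; the type and field names are hypothetical.
\begin{lstlisting}
#include <stdlib.h>

// Hypothetical per-proc data, carrying its own index into the array.
typedef struct { size_t idx; /* ... */ } proc_data_t;

void remove_proc( proc_data_t ** array, size_t * count, size_t victim ) {
	size_t last = *count - 1;
	if( victim != last ) {
		(*array)[victim] = (*array)[last];	// move last element into the hole
		(*array)[victim].idx = victim;	// fix the index stored in the element
	}
	*count = last;
	*array = realloc( *array, last * sizeof(proc_data_t) );	// release memory
}
\end{lstlisting}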

There are no performance requirements, within reason, for resizing, since it is expected to be rare.
However, this operation has strict correctness requirements, since shrinking and idle sleep can easily lead to deadlocks.
It should also avoid, as much as possible, any effect on performance when the number of \procs remains constant.
This latter requirement prohibits naive solutions, like simply adding a global lock to the ready-queue arrays.

\subsection{Read-Copy-Update}
One solution is to use the Read-Copy-Update\cite{wiki:rcu} pattern.
In this pattern, resizing is done by creating a copy of the internal data structures, updating the copy with the desired changes, and then attempting an Indiana Jones switch to replace the original with the copy.
This approach potentially has the advantage that it may not need any synchronization to do the switch.
However, there is a race where \procs could still use the previous, original, data structure after the copy is switched in.
This race not only requires some added memory-reclamation scheme, it also requires that operations made on the stale original version eventually be moved to the copy.
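
The switch itself is the easy part; a rough sketch follows, where @ready_queues@ and @duplicate@ are hypothetical names and the pointer store would need to be atomic in real code. The difficulty lies entirely in what happens to the old version afterwards.
\begin{lstlisting}
// Rough sketch of the RCU-style switch; names are hypothetical.
ready_queues * resize( ready_queues ** global, size_t new_count ) {
	ready_queues * old = *global;
	ready_queues * copy = duplicate( old, new_count );	// copy and resize
	*global = copy;	// the switch: new readers now see the copy
	return old;	// stale version: must still be drained and reclaimed
}
\end{lstlisting}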

For linked lists, enqueuing is only somewhat problematic: \ats enqueued to the original queues need to be transferred to the new copy, which might not preserve ordering.
Dequeuing is more challenging.
Dequeuing from the original does not necessarily update the copy, which could lead to multiple \procs dequeuing the same \at.
Fixing this requires more synchronization or more indirection on the queues.

Another challenge is that the original must be kept until all \procs have witnessed the change.
This is a straightforward memory-reclamation challenge, but it does mean that every operation needs \emph{some} form of synchronization.
If each of these operations needs synchronization, then it is possible a simpler solution achieves the same performance.
Indeed, in addition to the classic challenge of memory reclamation, transferring the original data to the copy before reclaiming it poses further challenges, especially merging subqueues while having a minimal impact on fairness and locality.

\subsection{Readers-Writer Lock}
A simpler approach is to use a \newterm{Readers-Writer Lock}\cite{wiki:rwlock}, where resizing requires acquiring the lock as a writer, while simply enqueuing/dequeuing \ats requires acquiring the lock as a reader.
Using a readers-writer lock solves the problem of dynamically resizing and leaves the challenge of finding or building a lock with sufficiently good read-side performance.
Since this is not a very complex challenge, and an ad hoc solution is perfectly acceptable, building a readers-writer lock was the path taken.

To maximize reader scalability, the readers should not contend with each other when attempting to acquire and release the critical section.
This effectively requires that each reader have its own piece of memory to mark as locked and unlocked.
Readers then acquire the lock by first waiting for writers to finish the critical section and then acquiring their local spinlocks.
Writers acquire the global lock, guaranteeing mutual exclusion among writers, and then acquire each of the local reader locks.
Acquiring all the local locks guarantees mutual exclusion between the readers and the writer, while the wait on the read side prevents readers from continuously starving the writer.
The read side and write side of this lock are shown in the two listings below.

\begin{lstlisting}
void read_lock() {
	// Step 1 : wait until there are no writers
	while( write_lock ) { Pause(); }

	// A memory fence may be needed here to order Step 1 before Step 2

	// Step 2 : acquire our local lock
	while( atomic_xchg( tls.lock ) ) {
		Pause();
	}
}

void read_unlock() {
	tls.lock = false;
}
\end{lstlisting}

\begin{lstlisting}
void write_lock() {
	// Step 1 : acquire the global lock, excluding other writers
	while( atomic_xchg( write_lock ) ) {
		Pause();
	}

	// Step 2 : acquire the per-proc locks, excluding readers
	for t in all_tls {
		while( atomic_xchg( t.lock ) ) {
			Pause();
		}
	}
}

void write_unlock() {
	// Step 1 : release the per-proc locks
	for t in all_tls {
		t.lock = false;
	}

	// Step 2 : release the global lock
	write_lock = false;
}
\end{lstlisting}
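
With this lock, the scheduler fast path simply brackets each queue operation with the read side, while resizing takes the write side. A sketch of the intended usage, with hypothetical helper names, follows.
\begin{lstlisting}
// Sketch of intended usage; 'push' and 'resize_arrays' are placeholders.
void schedule( thread_desc * t ) {
	read_lock();
	push( ready_queue, t );	// common case: cheap and scalable
	read_unlock();
}

void cluster_grow( cluster * c ) {
	write_lock();
	resize_arrays( c, c->nprocs + 1 );	// rare case: full mutual exclusion
	write_unlock();
}
\end{lstlisting}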

\section{Idle-Sleep}
In addition to users manually changing the number of \procs, it is desirable to support ``removing'' \procs when there are not enough \ats for all the \procs to be useful.
While manual resizing is expected to be rare, the number of \ats is expected to vary much more, which means \procs may need to be ``removed'' for only short periods of time.
Furthermore, race conditions that spuriously lead to the impression that no \ats are ready are actually common in practice.
Therefore, resources associated with \procs should not be freed; instead, \procs should simply be put into an idle state where the \gls{kthrd} is blocked until more \ats become ready.
This state is referred to as \newterm{Idle-Sleep}.

Idle sleep effectively encompasses several challenges.
First, some data structure needs to keep track of all \procs that are in idle sleep.
Because idle sleep can be spurious, this data structure has strict performance requirements, in addition to the strict correctness requirements.
Next, some tool must be used to block \glspl{kthrd}, \eg @pthread_cond_wait@ or pthread semaphores.
The challenge here is to support \at parking and unparking, timers, \io operations and all other \CFA features with minimal complexity.
Finally, idle sleep also includes a heuristic to determine the appropriate number of \procs to be in idle sleep at any given time.
This third challenge is, however, outside the scope of this thesis because developing a general heuristic is involved enough to justify its own work.
The \CFA scheduler simply follows the ``Race-to-Idle''\cit{https://doi.org/10.1137/1.9781611973099.100} approach, where a sleeping \proc is woken any time a \at becomes ready and \procs go to idle sleep any time they run out of work.

\section{Sleeping}
As usual, the cornerstone of any feature related to the kernel is the choice of system call.
In terms of blocking a \gls{kthrd} until some event occurs, the Linux kernel has many available options:

\subsection{\lstinline{pthread_mutex}/\lstinline{pthread_cond}}
The most classic option is to use some combination of @pthread_mutex@ and @pthread_cond@.
These serve as straightforward mutual-exclusion and synchronization tools, and allow a \gls{kthrd} to wait on a @pthread_cond@ until signalled.
While this approach is generally perfectly appropriate for \glspl{kthrd} waiting on one another, \io operations do not signal @pthread_cond@s.
For \io results to wake a \proc waiting on a @pthread_cond@, a different \gls{kthrd} must be woken up first, which can then signal the \proc.
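
For reference, the sleep/notify pattern in question is the standard one; the ready-check predicate below is illustrative.
\begin{lstlisting}
#include <pthread.h>

extern int has_ready_ats( void );	// illustrative predicate

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

void proc_sleep() {
	pthread_mutex_lock( &lock );
	while( !has_ready_ats() )	// recheck to tolerate spurious wake-ups
		pthread_cond_wait( &cond, &lock );
	pthread_mutex_unlock( &lock );
}

void proc_wake() {
	pthread_mutex_lock( &lock );
	pthread_cond_signal( &cond );	// io completions cannot call this directly
	pthread_mutex_unlock( &lock );
}
\end{lstlisting}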

\subsection{\lstinline{io_uring} and Epoll}
An alternative is to flip the problem on its head and block waiting for \io, using @io_uring@ or even @epoll@.
This creates the inverse situation, where \io operations directly wake sleeping \procs, but waking a \proc from a running \gls{kthrd} must use an indirect scheme.
This generally takes the form of creating a file descriptor, \eg, a dummy file, a pipe or an event fd, and using that file descriptor when \procs need to wake each other.
This leads to additional complexity, because there can be a race between these artificial \io operations and genuine \io operations.
If not handled correctly, this can lead to the artificial files going out of sync.

\subsection{Event FDs}
Another interesting approach is to use an event file descriptor\cit{eventfd}.
This Linux feature is a file descriptor that behaves like \io, \ie, uses @read@ and @write@, but also behaves like a semaphore.
Indeed, all reads and writes must use 64-bit values\footnote{On 64-bit Linux; a 32-bit Linux would use 32-bit values.}.
Writes add their value to the buffer, that is, arithmetic addition and not buffer append, and reads zero out the buffer and return the accumulated value\footnote{
This is without the \lstinline{EFD_SEMAPHORE} flag. This flag changes the behavior of \lstinline{read} but is not needed for this work.}.
If a read is made while the buffer is already 0, the read blocks until a non-zero value is added.
What makes this feature particularly interesting is that @io_uring@ supports the @IORING_REGISTER_EVENTFD@ command, which registers an event fd with a particular instance.
Once an event fd is registered, any \io completion on that instance results in @io_uring@ writing to the event fd.
This means that a \proc waiting on the event fd can be \emph{directly} woken up by either other \procs or incoming \io.
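
Putting these pieces together requires little code. The sketch below uses the raw system-call interface for the registration and omits error handling; @ring_fd@ is assumed to be an existing @io_uring@ instance.
\begin{lstlisting}
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>

int evfd;	// in practice, one per proc

void setup( int ring_fd ) {
	evfd = eventfd( 0, 0 );	// counter starts at 0
	// Register the event fd so io completions on this ring write to it.
	syscall( __NR_io_uring_register, ring_fd,
	         IORING_REGISTER_EVENTFD, &evfd, 1 );
}

void go_to_sleep() {	// blocks until a proc or an io completion writes
	uint64_t val;
	read( evfd, &val, sizeof(val) );
}

void wake_sleeper() {	// wakes the blocked proc
	uint64_t one = 1;
	write( evfd, &one, sizeof(one) );
}
\end{lstlisting}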

\begin{figure}
\centering
\input{idle1.pstex_t}
\caption[Basic Idle Sleep Data Structure]{Basic Idle Sleep Data Structure \smallskip\newline Each idle \proc is put onto a doubly-linked stack protected by a lock.
Each \proc has a private event fd.}
\label{fig:idle1}
\end{figure}

\section{Tracking Sleepers}
Tracking which \procs are in idle sleep requires a data structure holding all the sleeping \procs, but more importantly, it requires a concurrent \emph{handshake} so that no \at is stranded on a ready queue with no active \proc.
The classic challenge occurs when a \at is made ready while a \proc is going to sleep: there is a race where the new \at may not see the sleeping \proc and the sleeping \proc may not see the ready \at.
Since \ats can be made ready by timers, \io operations or other events outside a cluster, this race can occur even if the \proc going to sleep is the only \proc awake.
As a result, improper handling of this race can lead to all \procs going to sleep and the system deadlocking.

Furthermore, the ``Race-to-Idle'' approach means that there may be contention on the data structure tracking sleepers.
Contention slowing down \procs attempting to sleep or wake up can be tolerated.
These \procs are not doing useful work and therefore not contributing to overall performance.
However, notifying, \ie checking if a \proc must be woken up and doing so if needed, can significantly affect overall performance and must be low cost.

\subsection{Sleepers List}
Each cluster maintains a list of idle \procs, organized as a stack.
This ordering hopefully allows \procs at the tail to stay in idle sleep for extended periods of time.
Because of these unbalanced performance requirements, the algorithm tracking sleepers is designed to have idle \procs handle as much of the work as possible.
The idle \procs maintain the list of sleepers among themselves, so notifying a sleeping \proc takes as little work as possible.
This approach means that maintaining the list is fairly straightforward.
The list can simply use a single lock per cluster, and only \procs that are getting in and out of the idle state contend for that lock.

This approach also simplifies notification.
Indeed, \procs need to be notified when a new \at is readied, but they must also be notified during resizing, so the \gls{kthrd} can be joined.
This means that whichever entity removes idle \procs from the sleepers list must be able to do so in any order.
Using a simple lock over this data structure makes the removal much simpler than using a lock-free data structure.
The notification process then simply needs to wake up the desired idle \proc, using @pthread_cond_signal@, @write@ on an fd, etc., and the \proc handles the rest.
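
A minimal sketch of this division of work follows; the type and helper names are illustrative.
\begin{lstlisting}
// Minimal sketch; the sleeping proc does all the list maintenance.
void idle_sleep( cluster * c, processor * p ) {
	lock( c->idle_lock );
	push( c->idles, p );
	unlock( c->idle_lock );

	block( p );	// e.g., read p's event fd

	lock( c->idle_lock );
	remove( c->idles, p );	// removal works in any order
	unlock( c->idle_lock );
}

void notify_one( cluster * c ) {
	lock( c->idle_lock );
	processor * p = head( c->idles );	// may be null
	unlock( c->idle_lock );
	if( p ) wake( p );	// e.g., write to p's event fd
}
\end{lstlisting}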

\subsection{Reducing Latency}
As mentioned above, \procs going idle for extremely short periods of time is likely in certain common scenarios.
Therefore, the latency of doing a system call to read from and write to the event fd can actually negatively affect overall performance in a notable way.
It is important to reduce the latency and contention of the notification as much as possible.
Figure~\ref{fig:idle1} shows the basic idle-sleep data structure.
For the notifiers, this data structure can cause contention on the lock, and the event-fd system call can cause notable latency.

\begin{figure}
\centering
\input{idle2.pstex_t}
\caption[Improved Idle Sleep Data Structure]{Improved Idle Sleep Data Structure \smallskip\newline An atomic pointer is added to the list, pointing to the event fd of the first \proc on the list.}
\label{fig:idle2}
\end{figure}

The contention is mostly due to the lock on the list needing to be held to get to the head \proc.
That lock can be contended by \procs attempting to go to sleep, \procs waking, or notification attempts.
The contention from the \procs attempting to go to sleep can be mitigated slightly by using @try_acquire@ instead, so the \procs simply continue searching for \ats if the lock is held.
This trick cannot be used for waking \procs, since they are not in a state where they can run \ats.
However, it is worth noting that notification does not strictly require accessing the list or the head \proc.
Therefore, contention can be reduced notably by having notifiers avoid the lock entirely and adding a pointer to the event fd of the first idle \proc, as shown in Figure~\ref{fig:idle2}.
To avoid contention among notifiers, instead of simply reading the atomic pointer, notifiers atomically exchange it to @null@, so only one notifier contends on the system call.
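
The notifier fast path then reduces to one atomic operation and, only for the notifier that wins the exchange, one system call. A sketch, where @fdp@ is an illustrative name for the atomic pointer, follows.
\begin{lstlisting}
// Notifier fast path; 'fdp' points to the event fd of the
// first idle proc, or is null.
void notify_one( cluster * c ) {
	int * fd = atomic_xchg( c->fdp, null );	// claim the head's event fd
	if( fd == null ) return;	// no sleeper, or another notifier won the race
	uint64_t one = 1;
	write( *fd, &one, sizeof(one) );	// wake the head proc
}
\end{lstlisting}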

\begin{figure}
\centering
\input{idle_state.pstex_t}
\caption[Idle-Sleep State Machine]{Idle-Sleep State Machine \smallskip\newline A three-state flag, \lstinline{SEARCH}, \lstinline{SLEEP} and \lstinline{AWAKE}, is added beside each event fd to avoid unnecessary system calls.}
\label{fig:idle:state}
\end{figure}

The next optimization is to avoid the latency of the event fd when possible.
This can be done by adding what is effectively a benaphore\cit{benaphore} in front of the event fd.
A simple three-state flag is added beside the event fd to avoid unnecessary system calls, as shown in Figure~\ref{fig:idle:state}.
The flag starts in state @SEARCH@, while the \proc is searching for \ats to run.
The \proc then confirms the sleep by atomically swapping the state to @SLEEP@.
If the previous state was still @SEARCH@, then the \proc does read the event fd.
Meanwhile, notifiers atomically exchange the state to @AWAKE@.
If the previous state was @SLEEP@, then the notifier must write to the event fd.
However, if the notification arrives almost immediately after the \proc marks itself idle, then both the read and the write on the event fd can be omitted, which reduces latency notably.
This leads to the final data structure shown in Figure~\ref{fig:idle}.
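
The handshake can be summarized with the following sketch, where @state@ and @evfd@ are illustrative per-\proc fields; each side pays for a system call only when it loses the exchange race.
\begin{lstlisting}
// Sketch of the benaphore handshake; names are illustrative.
void confirm_sleep( processor * p ) {
	if( atomic_xchg( p->state, SLEEP ) == SEARCH ) {
		// No notification arrived while searching: actually block.
		uint64_t val;
		read( p->evfd, &val, sizeof(val) );
	}
	// Otherwise the state was AWAKE: a notifier beat us, skip the read.
	p->state = SEARCH;	// reset before searching again
}

void notify( processor * p ) {
	if( atomic_xchg( p->state, AWAKE ) == SLEEP ) {
		// The proc committed to sleeping: the write is required.
		uint64_t one = 1;
		write( p->evfd, &one, sizeof(one) );
	}
	// Otherwise the proc is still searching and sees AWAKE: no system call.
}
\end{lstlisting}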

\begin{figure}
\centering
\input{idle.pstex_t}
\caption[Low-latency Idle Sleep Data Structure]{Low-latency Idle Sleep Data Structure \smallskip\newline Each idle \proc is put onto a doubly-linked stack protected by a lock.
Each \proc has a private event fd with a benaphore in front of it.
The list also has an atomic pointer to the event fd and benaphore of the first \proc on the list.}
\label{fig:idle}
\end{figure}