- Timestamp:
- Jul 25, 2022, 3:17:25 PM (3 years ago)
- Branches:
- ADT, ast-experimental, master, pthread-emulation, qualifiedEnum
- Children:
- b0d9ff7
- Parents:
- 4e2befe3, ffec1bf
Note: this is a merge changeset; the content shown below is the merged result for the edited file.
- File:
- 1 edited
doc/theses/thierry_delisle_PhD/thesis/text/practice.tex

\chapter{Scheduling in practice}\label{practice}
The scheduling algorithm described in Chapter~\ref{core} addresses scheduling in a stable state.
This chapter addresses problems that occur when the system state changes.
Indeed, the \CFA runtime supports expanding and shrinking the number of \procs, both manually and, to some extent, automatically.
These changes affect the scheduling algorithm, which must dynamically alter its behaviour.

In detail, \CFA supports adding \procs using the type @processor@, in both RAII and heap coding scenarios.
\begin{cfa}
{
	processor p[4];                        // 4 new kernel threads
	...                                    // execute on 4 processors
	processor * dp = new( processor, 6 );  // 6 new kernel threads
	...                                    // execute on 10 processors
	delete( dp );                          // delete 6 kernel threads
	...                                    // execute on 4 processors
}                                          // delete 4 kernel threads
\end{cfa}
Dynamically allocated processors can be deleted at any time, \ie their lifetime can exceed the block of creation.
The consequence is that the scheduler and \io subsystems must know when these \procs come in and out of existence and roll them into the appropriate scheduling algorithms.

\section{Manual Resizing}
Manual resizing is expected to be a rare operation.
Programmers normally create/delete processors on a cluster at startup/teardown.
Therefore, dynamically changing the number of \procs is an appropriate moment to allocate or free resources to match the new state.
As such, all internal scheduling arrays that are sized based on the number of \procs need to be @realloc@ed.
This requirement also means any references into these arrays, \eg pointers or indexes, may need to be updated if elements are moved for compaction or for any other reason.
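
To make this requirement concrete, the following minimal sketch, which is not the \CFA implementation, grows a hypothetical per-\proc bookkeeping array when a new \proc registers with a cluster; the names @cluster_data@ and @cluster_grow@ are illustrative only.
\begin{cfa}
#include <stdlib.h>

// hypothetical per-cluster bookkeeping, one slot per processor
struct cluster_data {
	unsigned nprocs;                 // current number of processors
	struct proc_slot {
		void * ready_queue;          // per-processor scheduling state
	} * slots;                       // array sized by nprocs
};

// grow the bookkeeping when a processor registers with the cluster;
// the caller holds whatever lock protects the cluster (discussed below)
int cluster_grow( struct cluster_data * c ) {
	struct proc_slot * tmp = realloc( c->slots, (c->nprocs + 1) * sizeof(*tmp) );
	if ( ! tmp ) return -1;                  // old array is still valid on failure
	c->slots = tmp;                          // any cached pointer into the old array is now stale
	c->slots[c->nprocs].ready_queue = 0;     // initialize the new slot
	c->nprocs += 1;
	return 0;
}
\end{cfa}
Because @realloc@ may move the array, any pointer previously taken into @slots@ must be refreshed after the call, which is exactly the reference-updating problem mentioned above.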

There are no performance requirements, within reason, for resizing since it is expected to be rare.
However, this operation has strict correctness requirements since updating and idle sleep can easily lead to deadlocks.
It should also avoid as much as possible any effect on performance when the number of \procs remains constant.
This latter requirement prohibits naive solutions, like simply adding a global lock to the ready-queue arrays.

\subsection{Read-Copy-Update}
One solution is to use the Read-Copy-Update pattern~\cite{wiki:rcu}.
In this pattern, resizing is done by creating a copy of the internal data structures, \eg see Figure~\ref{fig:base-ts2}, updating the copy with the desired changes, and then attempting an Indiana Jones Switch to replace the original with the copy.
This approach has the advantage that it may not need any synchronization to do the switch.
However, there is a race where \procs still use the original data structure after the copy is switched.
This race not only requires adding a memory-reclamation scheme, it also requires that operations made on the stale original version are eventually moved to the copy.

Specifically, the original data structure must be kept until all \procs have witnessed the change.
This requirement is the \newterm{memory reclamation challenge} and means every operation needs \emph{some} form of synchronization.
If all operations need synchronization, then the overall cost of this technique is likely to be similar to an uncontended lock approach.
In addition to the classic challenge of memory reclamation, transferring the original data to the copy before reclaiming it poses additional challenges, especially merging subqueues while having a minimal impact on fairness and locality.
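
For concreteness, the copy-and-switch step can be sketched as follows, assuming the ready queues are reachable through a single atomic pointer; @current_rqs@ and @rq_resize@ are illustrative names and not part of the \CFA runtime.
\begin{cfa}
#include <stdatomic.h>
#include <stdlib.h>

// hypothetical set of ready queues, published through a single atomic pointer
struct rq_set { unsigned nqueues; /* ... subqueue data ... */ };
static _Atomic(struct rq_set *) current_rqs;

// resize by copying the structure, updating the copy, then switching the pointer
struct rq_set * rq_resize( unsigned nqueues ) {
	struct rq_set * old = atomic_load( &current_rqs );
	struct rq_set * copy = malloc( sizeof(*copy) );
	if ( ! copy ) return NULL;
	*copy = *old;                        // duplicate the internal data structure
	copy->nqueues = nqueues;             // apply the desired change to the copy
	atomic_store( &current_rqs, copy );  // the "Indiana Jones" switch
	// problem: processors that loaded 'old' before the switch may still be
	// enqueuing/dequeuing on it, so 'old' can be neither freed nor drained here
	return old;                          // reclamation and transfer must be deferred
}
\end{cfa}
The comment marks exactly where the reclamation and transfer problems discussed above appear.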

For example, given a linked-list, having a node enqueued onto the original and new list is not necessarily a problem depending on the chosen list structure.
If the list supports arbitrary insertions, then inconsistencies in the tail pointer do not break the list;
however, ordering may not be preserved.
Furthermore, nodes enqueued to the original queues eventually need to be uniquely transferred to the new queues, which may further perturb ordering.
Dequeuing is more challenging when nodes appear on both lists because of pending reclamation: dequeuing a node from one list does not remove it from the other, nor is that node in the same place on the other list.
This situation can lead to multiple \procs dequeuing the same \at.
Fixing these challenges requires more synchronization or more indirection to the queues, plus coordinated searching to ensure unique elements.

\subsection{Readers-Writer Lock}
A simpler approach is to use a \newterm{Readers-Writer Lock}~\cite{wiki:rwlock}, where resizing requires acquiring the lock as a writer, while simply enqueueing/dequeuing \ats requires acquiring the lock as a reader.
Using a Readers-Writer lock solves the problem of dynamically resizing and leaves the challenge of finding or building a lock with sufficiently good read-side performance.
Since this approach is not a very complex challenge and an ad-hoc solution is perfectly acceptable, building a Readers-Writer lock was the path taken.

To maximize reader scalability, readers should not contend with each other when attempting to acquire and release a critical section.
Achieving this goal requires each reader to have its own memory to mark as locked and unlocked.
The read acquire possibly waits for a writer to finish the critical section and then acquires a reader's local spinlock.
The write acquire acquires the global lock, guaranteeing mutual exclusion among writers, and then acquires each of the local reader locks.
Acquiring all the local read locks guarantees mutual exclusion among the readers and the writer, while the wait on the read side prevents readers from continuously starving the writer.

Figure~\ref{f:SpecializedReadersWriterLock} shows the outline for this specialized readers-writer lock.
The lock is nonblocking, so both readers and writers spin while the lock is held.
\todo{finish explanation}

\begin{figure}
\begin{cfa}
void read_lock() {
	// Step 1 : make sure no writers in
	while write_lock { Pause(); }
	// Step 2 : acquire our local lock
	while atomic_xchg( tls.lock ) { Pause(); }
}
void read_unlock() {
	tls.lock = false;
}
void write_lock() {
	// Step 1 : lock global lock
	while atomic_xchg( write_lock ) { Pause(); }
	// Step 2 : lock per-proc locks
	for t in all_tls {
		while atomic_xchg( t.lock ) { Pause(); }
	}
}
void write_unlock() {
	// Step 1 : release local locks
	for t in all_tls { t.lock = false; }
	// Step 2 : release global lock
	write_lock = false;
}
\end{cfa}
\caption{Specialized Readers-Writer Lock}
\label{f:SpecializedReadersWriterLock}
\end{figure}
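
To make the outline concrete, the following C11-style sketch expresses the same idea, but is not the actual \CFA implementation: the per-\proc flag lives in a hypothetical fixed-size array @reader_locks@ indexed by a \proc id, standing in for the thread-local @tls.lock@ used in the figure.
\begin{cfa}
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_PROCS 128
static atomic_bool writer_lock;               // global flag held by the single writer
static atomic_bool reader_locks[MAX_PROCS];   // one flag per processor (illustrative layout)

void read_lock( unsigned id ) {
	// step 1: wait until no writer holds the global flag (prevents writer starvation)
	while ( atomic_load( &writer_lock ) ) {}
	// step 2: acquire this processor's local flag; if a writer won the race for it,
	// the reader simply spins here until the writer releases it
	while ( atomic_exchange( &reader_locks[id], true ) ) {}
}
void read_unlock( unsigned id ) {
	atomic_store( &reader_locks[id], false );
}
void write_lock( void ) {
	while ( atomic_exchange( &writer_lock, true ) ) {}             // mutual exclusion among writers
	for ( unsigned i = 0; i < MAX_PROCS; i += 1 )                  // then take every local flag
		while ( atomic_exchange( &reader_locks[i], true ) ) {}
}
void write_unlock( void ) {
	for ( unsigned i = 0; i < MAX_PROCS; i += 1 )
		atomic_store( &reader_locks[i], false );                   // release local flags
	atomic_store( &writer_lock, false );                           // release global flag
}
\end{cfa}
A reader therefore only touches its own flag in the common case, while a writer pays a cost proportional to the number of \procs, matching the expectation that resizing is rare.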

\section{Idle-Sleep}\label{idlesleep}
While manual resizing of \procs is expected to be rare, the number of \ats can vary significantly over an application's lifetime, which means there are times when there are too few or too many \procs.
For this work, it is the programmer's responsibility to manually create \procs, so if there are too few \procs, the application must address this issue.
This leaves the case of too many \procs, when there are not enough \ats for all the \procs to be useful.
These idle \procs cannot be removed because their lifetime is controlled by the application, and only the application knows when the number of \ats may increase or decrease.
While idle \procs can spin until work appears, this approach wastes energy, unnecessarily produces heat and prevents other applications from using the processor.
Therefore, idle \procs are put into an idle state, called \newterm{Idle-Sleep}, where the \gls{kthrd} is blocked until the scheduler deems it is needed.

Idle sleep effectively encompasses several challenges.
First, a data structure needs to keep track of all \procs that are in idle sleep.
Because idle sleep is spurious, this data structure has strict performance requirements, in addition to strict correctness requirements.
Next, some mechanism is needed to block \glspl{kthrd}, \eg @pthread_cond_wait@ or a semaphore.
The complexity here is to support \at parking and unparking, user-level locking, timers, \io operations, and all other \CFA features with minimal complexity.
Finally, the scheduler needs a heuristic to determine when to block and unblock an appropriate number of \procs.
However, this third challenge is outside the scope of this thesis because developing a general heuristic is complex enough to justify its own work.
Therefore, the \CFA scheduler simply follows the ``Race-to-Idle''~\cite{Albers12} approach, where a sleeping \proc is woken any time an \at becomes ready and \procs go to idle sleep anytime they run out of work.
An interesting sub-part of this heuristic is what to do with bursts of \ats that become ready.
Since waking up a sleeping \proc can have notable latency, it is possible that multiple \ats become ready while a single \proc is waking up.
This fact raises the question: if many \procs are available, how many should be woken?
If the ready \ats will run longer than the wake-up latency, waking one \proc per \at will offer maximum parallelisation.
If the ready \ats will run for a very short time, waking many \procs may be wasteful.
As mentioned, a heuristic to handle these complex cases is outside the scope of this thesis, so the behaviour of the scheduler in this particular case is left unspecified.
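
As an illustration only, the simple Race-to-Idle behaviour can be sketched as the following two hooks; @idle_count@, @wake_one@, @block_self@ and @ready_queue_nonempty@ are hypothetical helpers and not the \CFA code.
\begin{cfa}
#include <stdatomic.h>
#include <stdbool.h>

extern bool ready_queue_nonempty( void );   // assumed scheduler query
extern void wake_one( void );               // assumed: unblock one sleeping processor
extern void block_self( void );             // assumed: block the calling processor

static atomic_uint idle_count;              // number of processors currently idle

// called right after new work is pushed onto the ready queue
void on_unpark( void ) {
	if ( atomic_load( &idle_count ) > 0 )   // Race-to-Idle: always try to wake a sleeper
		wake_one();                         // how many to wake on a burst is left unspecified
}

// called by a processor whose search for work came up empty
void on_out_of_work( void ) {
	atomic_fetch_add( &idle_count, 1 );
	if ( ! ready_queue_nonempty() )         // last check before committing to sleep
		block_self();
	atomic_fetch_sub( &idle_count, 1 );
}
\end{cfa}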

\section{Sleeping}
As usual, the corner-stone of any feature related to the kernel is the choice of system call.
In terms of blocking a \gls{kthrd} until some event occurs, the Linux kernel has many available options.

\subsection{\lstinline{pthread_mutex}/\lstinline{pthread_cond}}
The classic option is to use some combination of the pthread mutual exclusion and synchronization locks, allowing a safe park/unpark of a \gls{kthrd} to/from a @pthread_cond@.
While this approach works for \glspl{kthrd} waiting among themselves, \io operations do not provide a mechanism to signal @pthread_cond@s.
For \io results to wake a \proc waiting on a @pthread_cond@, a different \gls{kthrd} must be woken up first, which then signals the \proc.

\subsection{\lstinline{io_uring} and Epoll}
An alternative is to flip the problem on its head and block waiting for \io, using @io_uring@ or @epoll@.
This creates the inverse situation, where \io operations directly wake sleeping \procs but waking blocked \procs must use an indirect scheme.
This generally takes the form of creating a file descriptor, \eg a dummy file, a pipe, or an event fd, and using that file descriptor when \procs need to wake each other.
This leads to additional complexity because there can be a race between these artificial \io and genuine \io operations.
If not handled correctly, this can lead to artificial files getting delayed too long behind genuine files, resulting in longer latency.

\subsection{Event FDs}
Another interesting approach is to use an event file descriptor\cit{eventfd}.
This Linux feature is a file descriptor that behaves like \io, \ie, uses @read@ and @write@, but also behaves like a semaphore.
Indeed, all reads and writes must use 64-bit values.
Writes \emph{add} their values to a buffer, using arithmetic addition versus buffer append, and reads zero out the buffer and return the value accumulated so far.\footnote{
This behaviour is without the \lstinline{EFD_SEMAPHORE} flag, which changes the behaviour of \lstinline{read} but is not needed for this work.}
If a read is made while the buffer is already 0, the read blocks until a non-0 value is added.
What makes this feature particularly interesting is that @io_uring@ supports the @IORING_REGISTER_EVENTFD@ command to register an event @fd@ to a particular instance.
Once that instance is registered, any \io completion results in @io_uring@ writing to the event @fd@.
This means that a \proc waiting on the event @fd@ can be \emph{directly} woken up by either other \procs or incoming \io.
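
The semantics above can be demonstrated with a few lines of standalone C, shown here purely for illustration and with error handling omitted.
\begin{cfa}
#include <sys/eventfd.h>   // eventfd
#include <unistd.h>        // read, write, close
#include <stdint.h>
#include <stdio.h>

int main( void ) {
	int efd = eventfd( 0, 0 );        // counter starts at 0
	uint64_t v = 2;
	write( efd, &v, sizeof(v) );      // counter becomes 2
	v = 3;
	write( efd, &v, sizeof(v) );      // counter becomes 5 (addition, not append)
	uint64_t out;
	read( efd, &out, sizeof(out) );   // returns 5 and resets the counter to 0
	printf( "%llu\n", (unsigned long long)out );
	// a further read would now block until another write makes the counter non-zero
	close( efd );
}
\end{cfa}
It is this combination of \io-style blocking with semaphore-style counting that makes the event @fd@ usable both by other \procs and by @io_uring@ completions.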

\section{Tracking Sleepers}
Tracking which \procs are in idle sleep requires a data structure holding all the sleeping \procs, but more importantly it requires a concurrent \emph{handshake} so that no \at is stranded on a ready-queue with no active \proc.
The classic challenge occurs when an \at is made ready while a \proc is going to sleep: there is a race where the new \at may not see the sleeping \proc and the sleeping \proc may not see the ready \at.
Since \ats can be made ready by timers, \io operations, or other events outside a cluster, this race can occur even if the \proc going to sleep is the only \proc awake.
As a result, improper handling of this race leads to all \procs going to sleep when there are ready \ats, and the system deadlocks.

The handshake closing the race is done with both the notifier and the idle \proc executing two ordered steps.
The notifier first makes sure the newly ready \at is visible to \procs searching for \ats, and then attempts to notify an idle \proc.
On the other side, \procs make themselves visible as idle \procs and then search for any \ats they may have missed.
Unlike regular work-stealing, this search must be exhaustive to make sure no pre-existing \at is missed.
These steps from both sides guarantee that if the search misses a newly ready \at, then the notifier is guaranteed to see at least one idle \proc.
Conversely, if the notifier does not see any idle \proc, then a \proc is guaranteed to find the new \at in its exhaustive search.

Furthermore, the ``Race-to-Idle'' approach means that there may be contention on the data structure tracking sleepers.
Contention can be tolerated for \procs attempting to sleep or wake up because these \procs are not doing useful work, and therefore, not contributing to overall performance.
However, notifying, \ie checking if a \proc must be woken up and doing so if needed, can significantly affect overall performance and must be low cost.

\subsection{Sleepers List}
Each cluster maintains a list of idle \procs, organized as a stack.
This ordering allows \procs at the tail to stay in idle sleep for extended periods of time, while those at the head of the list wake up for bursts of activity.
Because of these unbalanced performance requirements, the algorithm tracking sleepers is designed to have the idle \procs handle as much of the work as possible.
The idle \procs maintain the stack of sleepers among themselves, and notifying a sleeping \proc takes as little work as possible.
This approach means that maintaining the list is fairly straightforward.
The list can simply use a single lock per cluster, and only \procs that are getting in and out of the idle state contend for that lock.

This approach also simplifies notification.
Indeed, \procs not only need to be notified when a new \at is readied, but also must be notified during manual resizing, so the \gls{kthrd} can be joined.
These requirements mean whichever entity removes idle \procs from the sleeper list must be able to do so in any order.
Using a simple lock over this data structure makes the removal much simpler than using a lock-free data structure.
The single lock also means the notification process simply needs to wake up the desired idle \proc, using @pthread_cond_signal@, @write@ on an @fd@, \etc, and the \proc handles the rest.
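
The ordered steps of the handshake, combined with the single-lock sleeper list, can be sketched as follows; this is an illustration rather than the \CFA code, and @publish@, @exhaustive_search@, @peek_idle@, @push_idle@, @remove_idle@, @wake@ and @block_self@ are hypothetical helpers.
\begin{cfa}
#include <pthread.h>
#include <stdbool.h>

struct proc;                                  // a processor (kernel thread)

extern pthread_mutex_t idle_lock;             // per-cluster lock protecting the idle list
extern void publish( void * work );           // assumed: make the new work globally visible
extern bool exhaustive_search( void );        // assumed: search every subqueue for work
extern struct proc * peek_idle( void );       // assumed: head of the idle stack, or NULL
extern void push_idle( struct proc * p );     // assumed: push the caller onto the idle stack
extern void remove_idle( struct proc * p );   // assumed: remove the caller from the idle stack
extern void wake( struct proc * p );          // assumed: wake one processor (event fd write, ...)
extern void block_self( struct proc * p );    // assumed: block on this processor's event fd

// notifier side: step 1 publish the work, step 2 look for a sleeper
void notify_one( void * work ) {
	publish( work );                          // step 1: the new work is visible first
	pthread_mutex_lock( &idle_lock );
	struct proc * p = peek_idle();            // step 2: then check for an idle processor
	pthread_mutex_unlock( &idle_lock );
	if ( p ) wake( p );                       // the woken processor removes itself from the list
}

// sleeper side: step 1 register as idle, step 2 search exhaustively
void go_to_sleep( struct proc * self ) {
	pthread_mutex_lock( &idle_lock );
	push_idle( self );                        // step 1: become visible as idle first
	pthread_mutex_unlock( &idle_lock );
	if ( ! exhaustive_search() )              // step 2: then recheck every queue for missed work
		block_self( self );                   // wake-ups are counted, so a lost race is not a lost wake-up
	pthread_mutex_lock( &idle_lock );
	remove_idle( self );                      // leave the idle list before resuming work
	pthread_mutex_unlock( &idle_lock );
}
\end{cfa}
The essential property is the ordering: work is published before the idle list is checked, and a \proc registers as idle before its exhaustive search, so at least one side of the race always observes the other.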

\subsection{Reducing Latency}
As mentioned in this section, \procs going to sleep for extremely short periods of time is likely in certain scenarios.
Therefore, the latency of doing a system call to read from and write to an event @fd@ can negatively affect overall performance in a notable way.
Hence, it is important to reduce the latency and contention of the notification as much as possible.
Figure~\ref{fig:idle1} shows the basic idle-sleep data structure.
For the notifiers, this data structure can cause contention on the lock, and the event @fd@ syscall can cause notable latency.

\begin{figure}
\centering
\input{idle1.pstex_t}
\caption[Basic Idle-Sleep Data Structure]{Basic Idle-Sleep Data Structure \smallskip\newline Each idle \proc is put onto a doubly-linked stack protected by a lock.
Each \proc has a private event \lstinline{fd}.}
\label{fig:idle1}
\end{figure}
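
For illustration, the structure in Figure~\ref{fig:idle1} corresponds roughly to the following declarations; the names are hypothetical and the real \CFA types contain additional fields.
\begin{cfa}
#include <pthread.h>
#include <sys/eventfd.h>

// one per processor
struct idle_proc {
	int idle_fd;                    // private event fd this processor blocks on
	struct idle_proc * next;        // doubly-linked stack links
	struct idle_proc * prev;
};

// one per cluster
struct idle_list {
	pthread_mutex_t lock;           // protects the stack of idle processors
	struct idle_proc * head;        // most recently idled processor, woken first
};

void idle_proc_init( struct idle_proc * p ) {
	p->idle_fd = eventfd( 0, 0 );
	p->next = p->prev = 0;
}
\end{cfa}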

Contention occurs because the idle-list lock must be held to access the idle list, \eg by \procs attempting to go to sleep, \procs waking, or notification attempts.
The contention from the \procs attempting to go to sleep can be mitigated slightly by using @try_acquire@, so the \procs simply busy wait again, searching for \ats, if the lock is held.
This trick cannot be used when waking \procs, since the waker needs to return immediately to what it was doing.

Interestingly, general notification, \ie waking any idle processor versus a specific one, does not strictly require modifying the list.
Here, contention can be reduced notably by having notifiers avoid the lock entirely by adding a pointer to the event @fd@ of the first idle \proc, as in Figure~\ref{fig:idle2}.
To avoid contention among notifiers, notifiers atomically exchange the pointer with @NULL@.
The first notifier succeeds on the exchange and obtains the @fd@ of an idle \proc;
hence, only one notifier contends on the system call.
This notifier writes to the @fd@ to wake a \proc.
The woken \proc then updates the atomic pointer, while it is updating the head of the list, as it removes itself from the list.
Notifiers that obtain a @NULL@ in the exchange simply move on, knowing that another notifier is already waking a \proc.
This behaviour is equivalent to having multiple notifiers write to the @fd@, since reads consume all previous writes.
Note that with and without this atomic pointer, bursts of notifications can lead to an unspecified number of \procs being woken up, depending on how the arrival of notifications compares with the latency of \procs waking up.
As mentioned in Section~\ref{idlesleep}, there is no optimal approach to handle these bursts.
It is therefore difficult to justify the cost of any extra synchronization here.
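
A sketch of this notifier fast path, assuming a cluster-wide atomic pointer to the event @fd@ of the head \proc (here called @idle_fd_ptr@, an illustrative name), could look as follows; @republish_head@ is likewise hypothetical.
\begin{cfa}
#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>

// event fd of the processor at the head of the idle list, or NULL if none is idle
static _Atomic(int *) idle_fd_ptr;

// notifier fast path: no lock, and at most one write system call per group of notifiers
void notify_any( void ) {
	int * fd = atomic_exchange( &idle_fd_ptr, NULL );
	if ( fd == NULL ) return;          // another notifier is already waking a processor
	uint64_t one = 1;
	write( *fd, &one, sizeof(one) );   // wake the head processor
}

// the woken (or a newly idle) processor republishes the head's fd while holding the list lock
void republish_head( int * head_fd ) {
	atomic_store( &idle_fd_ptr, head_fd );   // head_fd is NULL if the list is now empty
}
\end{cfa}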

\begin{figure}[t]
\centering
\input{idle2.pstex_t}
\caption[Improved Idle-Sleep Data Structure]{Improved Idle-Sleep Data Structure \smallskip\newline An atomic pointer is added to the list, pointing to the event \lstinline{fd} of the first \proc on the list.}
\label{fig:idle2}
\end{figure}

The next optimization is to avoid the latency of the event @fd@, which can be done by adding what is effectively a binary benaphore\cit{benaphore} in front of the event @fd@.
The benaphore over the event @fd@ logically provides a three-state flag to avoid unnecessary system calls, where the states are expressed explicitly in Figure~\ref{fig:idle:state}.
A \proc begins its idle sleep by adding itself to the idle list before searching for an \at.
In the process of adding itself to the idle list, it sets the state flag to @SEARCH@.
If no \ats can be found during the search, the \proc then confirms it is going to sleep by atomically swapping the state to @SLEEP@.
If the previous state is still @SEARCH@, then the \proc does read the event @fd@.
Meanwhile, notifiers atomically exchange the state to the @AWAKE@ state.
If the previous state is @SLEEP@, then the notifier must write to the event @fd@.
However, if the notify arrives almost immediately after the \proc marks itself idle, then both reads and writes on the event @fd@ can be omitted, which reduces latency notably.
These extensions lead to the final data structure shown in Figure~\ref{fig:idle}.
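
The state transitions can be sketched with C11 atomics as follows; @idle_flag@, @confirm_sleep@ and @notify_proc@ are illustrative names and not the \CFA implementation.
\begin{cfa}
#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>

enum { SEARCH, SLEEP, AWAKE };         // benaphore states in front of the event fd

struct idle_flag {
	atomic_int flag;                   // set to SEARCH when the processor registers as idle
	int idle_fd;                       // private event fd of this processor
};

// sleeper side: called after registering on the idle list and failing to find work
void confirm_sleep( struct idle_flag * s ) {
	int prev = atomic_exchange( &s->flag, SLEEP );
	if ( prev == SEARCH ) {                       // no notification arrived during the search
		uint64_t buf;
		read( s->idle_fd, &buf, sizeof(buf) );    // block until a notifier writes
	}                                             // otherwise prev == AWAKE: skip the system call
}

// notifier side: called with the state of the processor chosen to be woken
void notify_proc( struct idle_flag * s ) {
	int prev = atomic_exchange( &s->flag, AWAKE );
	if ( prev == SLEEP ) {                        // the processor is, or soon will be, blocked on the fd
		uint64_t one = 1;
		write( s->idle_fd, &one, sizeof(one) );
	}                                             // otherwise the processor is still searching: no system call
}
\end{cfa}
In the fast path, where the notification arrives while the \proc is still searching, neither side performs a system call.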

\begin{figure}
\centering
\input{idle_state.pstex_t}
\caption[Improved Idle-Sleep Latency]{Improved Idle-Sleep Latency \smallskip\newline A three-state flag is added to the event \lstinline{fd}.}
\label{fig:idle:state}
\end{figure}

\begin{figure}
\centering
\input{idle.pstex_t}
\caption[Low-latency Idle-Sleep Data Structure]{Low-latency Idle-Sleep Data Structure \smallskip\newline Each idle \proc is put onto a doubly-linked stack protected by a lock.
Each \proc has a private event \lstinline{fd} with a benaphore in front of it.
The list also has an atomic pointer to the event \lstinline{fd} and benaphore of the first \proc on the list.}
\label{fig:idle}
\end{figure}
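
Putting the pieces together, the final structure in Figure~\ref{fig:idle} corresponds roughly to the following illustrative declarations, combining the idle stack, the per-\proc benaphore and event @fd@, and the cluster-wide atomic pointer; the names are hypothetical.
\begin{cfa}
#include <pthread.h>
#include <stdatomic.h>

// per-processor state: the earlier idle_proc extended with the benaphore flag
struct idle_proc {
	atomic_int flag;                      // SEARCH, SLEEP or AWAKE benaphore in front of the fd
	int idle_fd;                          // private event fd read by the sleeping processor
	struct idle_proc * next;              // doubly-linked stack links
	struct idle_proc * prev;
};

// per-cluster state: the earlier idle_list extended with the notifier fast-path pointer
struct idle_list {
	pthread_mutex_t lock;                 // taken only by processors entering or leaving idle
	struct idle_proc * head;              // stack of idle processors
	_Atomic(struct idle_proc *) first;    // lock-free pointer used by notifiers, NULL if none idle
};
\end{cfa}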