- Timestamp:
- Sep 9, 2022, 8:17:53 PM (2 years ago)
- Branches:
- ADT, ast-experimental, master, pthread-emulation
- Children:
- 5548175
- Parents:
- 1260224
- File:
-
- 1 edited
Legend:
- Unmodified
- Added
- Removed
-
doc/theses/thierry_delisle_PhD/thesis/text/practice.tex
r1260224 r3e8dacc 16 16 } // delete 4 kernel threads 17 17 \end{cfa} 18 Dynamically allocated processors can be deleted a nany time, \ie their lifetime exceeds the block of creation.18 Dynamically allocated processors can be deleted at any time, \ie their lifetime exceeds the block of creation. 19 19 The consequence is that the scheduler and \io subsystems must know when these \procs come in and out of existence and roll them into the appropriate scheduling algorithms. 20 20 21 21 \section{Manual Resizing} 22 22 Manual resizing is expected to be a rare operation. 23 Programmers normally create/delete processors on a cluster sat startup/teardown.23 Programmers normally create/delete processors on a cluster at startup/teardown. 24 24 Therefore, dynamically changing the number of \procs is an appropriate moment to allocate or free resources to match the new state. 25 25 As such, all internal scheduling arrays that are sized based on the number of \procs need to be @realloc@ed. 26 This requirement also means any references into these arrays, \eg pointers or indexes, may need to be updated if elements are moved for compaction or forany other reason.26 This requirement also means any references into these arrays, \eg pointers or indexes, may need to be updated if elements are moved for compaction or any other reason. 27 27 28 28 There are no performance requirements, within reason, for resizing since it is expected to be rare. 29 29 However, this operation has strict correctness requirements since updating and idle sleep can easily lead to deadlocks. 30 It should also avoid as much as possible any effect on performance when the number of \procs remain constant.30 It should also avoid as much as possible any effect on performance when the number of \procs remains constant. 31 31 This later requirement prohibits naive solutions, like simply adding a global lock to the ready-queue arrays. 32 32 33 33 \subsection{Read-Copy-Update} 34 34 One solution is to use the Read-Copy-Update pattern~\cite{wiki:rcu}. 35 In this pattern, resizing is done by creating a copy of the internal data structures, \eg see Figure~\ref{fig:base-ts2}, updating the copy with the desired changes, and then attempt an Indiana Jones Switch to replace the original with the copy.35 In this pattern, resizing is done by creating a copy of the internal data structures, \eg see Figure~\ref{fig:base-ts2}, updating the copy with the desired changes, and then attempting an Indiana Jones Switch to replace the original with the copy. 36 36 This approach has the advantage that it may not need any synchronization to do the switch. 37 37 However, there is a race where \procs still use the original data structure after the copy is switched. 38 This race not only requires adding a memory-reclamation scheme, it also requires that operations made on the stale original version are eventually moved to the copy.38 This race not only requires adding a memory-reclamation scheme, but it also requires that operations made on the stale original version are eventually moved to the copy. 39 39 40 40 Specifically, the original data structure must be kept until all \procs have witnessed the change. … … 42 42 If all operations need synchronization, then the overall cost of this technique is likely to be similar to an uncontended lock approach. 43 43 In addition to the classic challenge of memory reclamation, transferring the original data to the copy before reclaiming it poses additional challenges. 44 Especially merging sub queues while having a minimal impact on fairness and locality.45 46 For example, given a linked -list, having a node enqueued onto the original and new list is not necessarily a problem depending on the chosen list structure.44 Especially merging sub-queues while having a minimal impact on fairness and locality. 45 46 For example, given a linked list, having a node enqueued onto the original and new list is not necessarily a problem depending on the chosen list structure. 47 47 If the list supports arbitrary insertions, then inconsistencies in the tail pointer do not break the list; 48 48 however, ordering may not be preserved. … … 58 58 59 59 To maximize reader scalability, readers should not contend with each other when attempting to acquire and release a critical section. 60 To achieve this goal requires each reader tohave its own memory to mark as locked and unlocked.61 The read 62 The write acquires the global lock, guaranteeing mutual exclusion among writers, and then acquires each of the local reader locks.60 Achieving this goal requires that each reader have its own memory to mark as locked and unlocked. 61 The read-acquire possibly waits for a writer to finish the critical section and then acquires a reader's local spinlock. 62 The writer acquires the global lock, guaranteeing mutual exclusion among writers, and then acquires each of the local reader locks. 63 63 Acquiring all the local read locks guarantees mutual exclusion among the readers and the writer, while the wait on the read side prevents readers from continuously starving the writer. 64 64 Figure~\ref{f:SpecializedReadersWriterLock} shows the outline for this specialized readers-writer lock. 65 65 The lock in nonblocking, so both readers and writers spin while the lock is held. 66 This very wide sharding strategy means that readers have very good locality , since they only ever need to access two memory location.66 This very wide sharding strategy means that readers have very good locality since they only ever need to access two memory locations. 67 67 68 68 \begin{figure} … … 98 98 \section{Idle-Sleep}\label{idlesleep} 99 99 While manual resizing of \procs is expected to be rare, the number of \ats can vary significantly over an application's lifetime, which means there are times when there are too few or too many \procs. 100 For this work, it is the program er's responsibility to manually create \procs, so if there are too few \procs, the application must address this issue.100 For this work, it is the programmer's responsibility to manually create \procs, so if there are too few \procs, the application must address this issue. 101 101 This leaves too many \procs when there are not enough \ats for all the \procs to be useful. 102 102 These idle \procs cannot be removed because their lifetime is controlled by the application, and only the application knows when the number of \ats may increase or decrease. … … 114 114 115 115 An interesting sub-part of this heuristic is what to do with bursts of \ats that become ready. 116 Since waking up a sleeping \proc can have notable latency, it is possible multiple \atsbecome ready while a single \proc is waking up.117 This fact sbegs the question, if many \procs are available, how many should be woken?118 If the ready \ats will run longer than the wake-up latency, waking one \proc per \at will offer maximum paralleli sation.116 Since waking up a sleeping \proc can have notable latency, multiple \ats may become ready while a single \proc is waking up. 117 This fact begs the question, if many \procs are available, how many should be woken? 118 If the ready \ats will run longer than the wake-up latency, waking one \proc per \at will offer maximum parallelization. 119 119 If the ready \ats will run for a short very short time, waking many \procs may be wasteful. 120 120 As mentioned, a heuristic to handle these complex cases is outside the scope of this thesis, the behaviour of the scheduler in this particular case is left unspecified. 121 121 122 122 \section{Sleeping} 123 As usual, the corner -stone of any feature related to the kernel is the choice of system call.123 As usual, the cornerstone of any feature related to the kernel is the choice of system call. 124 124 In terms of blocking a \gls{kthrd} until some event occurs, the Linux kernel has many available options. 125 125 … … 127 127 The classic option is to use some combination of the pthread mutual exclusion and synchronization locks, allowing a safe \park/\unpark of a \gls{kthrd} to/from a @pthread_cond@. 128 128 While this approach works for \glspl{kthrd} waiting among themselves, \io operations do not provide a mechanism to signal @pthread_cond@s. 129 For \io results to wake a \proc waiting on a @pthread_cond@ means a different \gls pl{kthrd} must be woken up first, which then signals the \proc.129 For \io results to wake a \proc waiting on a @pthread_cond@ means a different \gls{kthrd} must be woken up first, which then signals the \proc. 130 130 131 131 \subsection{\lstinline{io_uring} and Epoll} … … 139 139 Another interesting approach is to use an event file descriptor\cite{MAN:eventfd}. 140 140 This Linux feature is a file descriptor that behaves like \io, \ie, uses @read@ and @write@, but also behaves like a semaphore. 141 Indeed, all reads and writes must use aword-sized values, \ie 64 or 32 bits.142 Writes \emph{add} their values to a buffer using arithmetic addition versus buffer append, and reads zero 141 Indeed, all reads and writes must use word-sized values, \ie 64 or 32 bits. 142 Writes \emph{add} their values to a buffer using arithmetic addition versus buffer append, and reads zero-out the buffer and return the buffer values so far.\footnote{ 143 143 This behaviour is without the \lstinline{EFD_SEMAPHORE} flag, which changes the behaviour of \lstinline{read} but is not needed for this work.} 144 144 If a read is made while the buffer is already 0, the read blocks until a non-0 value is added. … … 148 148 149 149 \section{Tracking Sleepers} 150 Tracking which \procs are in idle sleep requires a data structure holding all the sleeping \procs, but more importantly it requires a concurrent \emph{handshake} so that no \at is stranded on a ready-queue with no active \proc.150 Tracking which \procs are in idle sleep requires a data structure holding all the sleeping \procs, but more importantly, it requires a concurrent \emph{handshake} so that no \at is stranded on a ready queue with no active \proc. 151 151 The classic challenge occurs when a \at is made ready while a \proc is going to sleep: there is a race where the new \at may not see the sleeping \proc and the sleeping \proc may not see the ready \at. 152 152 Since \ats can be made ready by timers, \io operations, or other events outside a cluster, this race can occur even if the \proc going to sleep is the only \proc awake. … … 154 154 155 155 The handshake closing the race is done with both the notifier and the idle \proc executing two ordered steps. 156 The notifier first make sure the newly ready \at is visible to \procs searching for \ats, and then attemptto notify an idle \proc.156 The notifier first makes sure the newly ready \at is visible to \procs searching for \ats, and then attempts to notify an idle \proc. 157 157 On the other side, \procs make themselves visible as idle \procs and then search for any \ats they may have missed. 158 158 Unlike regular work-stealing, this search must be exhaustive to make sure that pre-existing \at is missed. 159 159 These steps from both sides guarantee that if the search misses a newly ready \at, then the notifier is guaranteed to see at least one idle \proc. 160 Convers ly, if the notifier does not see any idle \proc, then a \proc is guaranteed to find the new \at in its exhaustive search.160 Conversely, if the notifier does not see any idle \proc, then a \proc is guaranteed to find the new \at in its exhaustive search. 161 161 162 162 Furthermore, the ``Race-to-Idle'' approach means that there may be contention on the data structure tracking sleepers. 163 Contention can be tolerated for \procs attempting to sleep or wake -up because these \procs are not doing useful work, and therefore, not contributing to overall performance.163 Contention can be tolerated for \procs attempting to sleep or wake up because these \procs are not doing useful work, and therefore, not contributing to overall performance. 164 164 However, notifying, checking if a \proc must be woken-up, and doing so if needed, can significantly affect overall performance and must be low cost. 165 165 166 166 \subsection{Sleepers List} 167 167 Each cluster maintains a list of idle \procs, organized as a stack. 168 This ordering allows \procs at the tail to stay in idle sleep for extended period of times while those at the head of the list wake-up for bursts of activity.168 This ordering allows \procs at the tail to stay in idle sleep for extended periods while those at the head of the list wake up for bursts of activity. 169 169 Because of unbalanced performance requirements, the algorithm tracking sleepers is designed to have idle \procs handle as much of the work as possible. 170 170 The idle \procs maintain the stack of sleepers among themselves and notifying a sleeping \proc takes as little work as possible. … … 173 173 174 174 This approach also simplifies notification. 175 Indeed, \procs not only need to be notif ywhen a new \at is readied, but also must be notified during manual resizing, so the \gls{kthrd} can be joined.175 Indeed, \procs not only need to be notified when a new \at is readied, but also must be notified during manual resizing, so the \gls{kthrd} can be joined. 176 176 These requirements mean whichever entity removes idle \procs from the sleeper list must be able to do so in any order. 177 177 Using a simple lock over this data structure makes the removal much simpler than using a lock-free data structure. 178 The single lock also means the notification process simply needs to wake -up the desired idle \proc, using @pthread_cond_signal@, @write@ on an @fd@, \etc, and the \proc handles the rest.178 The single lock also means the notification process simply needs to wake up the desired idle \proc, using @pthread_cond_signal@, @write@ on an @fd@, \etc, and the \proc handles the rest. 179 179 180 180 \subsection{Reducing Latency} 181 As mentioned in this section, \procs going to sleep for extremely short periods of timeis likely in certain scenarios.182 Therefore, the latency of doing a system call to read from and writ ing to an event @fd@ can negatively affect overall performance in a notable way.181 As mentioned in this section, \procs going to sleep for extremely short periods is likely in certain scenarios. 182 Therefore, the latency of doing a system call to read from and write to an event @fd@ can negatively affect overall performance notably. 183 183 Hence, it is important to reduce latency and contention of the notification as much as possible. 184 184 Figure~\ref{fig:idle1} shows the basic idle-sleep data structure. … … 205 205 The woken \proc then updates the atomic pointer, while it is updating the head of the list, as it removes itself from the list. 206 206 Notifiers that obtained a @NULL@ in the exchange simply move on knowing that another notifier is already waking a \proc. 207 This behaviour is equivalent to having multiple notifier write to the @fd@ since reads consume all previous writes.208 Note that with and without this atomic pointer, bursts of notification can lead to an unspecified number of \procs being woken up, depending on how the arrival notification compares with tthe latency of \procs waking up.207 This behaviour is equivalent to having multiple notifiers write to the @fd@ since reads consume all previous writes. 208 Note that with and without this atomic pointer, bursts of notification can lead to an unspecified number of \procs being woken up, depending on how the arrival notification compares with the latency of \procs waking up. 209 209 As mentioned in section~\ref{idlesleep}, there is no optimal approach to handle these bursts. 210 210 It is therefore difficult to justify the cost of any extra synchronization here. … … 218 218 219 219 The next optimization is to avoid the latency of the event @fd@, which can be done by adding what is effectively a binary benaphore\cite{schillings1996engineering} in front of the event @fd@. 220 The benaphore over the event @fd@ logically provides a three state flag to avoid unnecessary system calls, where the states are expressed explicitin Figure~\ref{fig:idle:state}.221 A \proc begins its idle sleep by adding itself to the idle list before searching for a n\at.220 The benaphore over the event @fd@ logically provides a three-state flag to avoid unnecessary system calls, where the states are expressed explicitly in Figure~\ref{fig:idle:state}. 221 A \proc begins its idle sleep by adding itself to the idle list before searching for a \at. 222 222 In the process of adding itself to the idle list, it sets the state flag to @SEARCH@. 223 223 If no \ats can be found during the search, the \proc then confirms it is going to sleep by atomically swapping the state to @SLEEP@. 224 224 If the previous state is still @SEARCH@, then the \proc does read the event @fd@. 225 Meanwhile, notifiers atomically exchange the state to @AWAKE@ state.225 Meanwhile, notifiers atomically exchange the state to the @AWAKE@ state. 226 226 If the previous state is @SLEEP@, then the notifier must write to the event @fd@. 227 227 However, if the notify arrives almost immediately after the \proc marks itself idle, then both reads and writes on the event @fd@ can be omitted, which reduces latency notably. 228 These extensions lead sto the final data structure shown in Figure~\ref{fig:idle}.228 These extensions lead to the final data structure shown in Figure~\ref{fig:idle}. 229 229 230 230 \begin{figure} 231 231 \centering 232 232 \input{idle_state.pstex_t} 233 \caption[Improved Idle-Sleep Latency]{Improved Idle-Sleep Latency \smallskip\newline A three 233 \caption[Improved Idle-Sleep Latency]{Improved Idle-Sleep Latency \smallskip\newline A three-state flag is added to the event \lstinline{fd}.} 234 234 \label{fig:idle:state} 235 235 \end{figure}
Note: See TracChangeset
for help on using the changeset viewer.