% Source: doc/theses/thierry_delisle_PhD/thesis/text/practice.tex @ d677355
% Last change (Peter A. Buhr): proofread chapter practice, adjust figures, small changes in other files
\chapter{Scheduling in practice}\label{practice}
The scheduling algorithm described in Chapter~\ref{core} addresses scheduling in a stable state.
This chapter addresses problems that occur when the system state changes.
Indeed, the \CFA runtime supports expanding and shrinking the number of \procs, both manually and, to some extent, automatically.
These changes affect the scheduling algorithm, which must dynamically alter its behaviour.

In detail, \CFA supports adding \procs using the type @processor@, in both RAII and heap coding scenarios.
\begin{cfa}
{
	processor p[4]; // 4 new kernel threads
	... // execute on 4 processors
	processor * dp = new( processor, 6 ); // 6 new kernel threads
	... // execute on 10 processors
	delete( dp );   // delete 6 kernel threads
	... // execute on 4 processors
} // delete 4 kernel threads
\end{cfa}
Dynamically allocated processors can be deleted at any time, \ie their lifetime exceeds the block of creation.
The consequence is that the scheduler and \io subsystems must know when these \procs come in and out of existence and roll them into the appropriate scheduling algorithms.

\section{Manual Resizing}
Manual resizing is expected to be a rare operation.
Programmers normally create/delete processors on a cluster at startup/teardown.
Therefore, dynamically changing the number of \procs is an appropriate moment to allocate or free resources to match the new state.
As such, all internal scheduling arrays that are sized based on the number of \procs need to be @realloc@ed.
This requirement also means any references into these arrays, \eg pointers or indexes, may need to be updated if elements are moved for compaction or any other reason.
% \footnote{Indexes may still need fixing when shrinking because some indexes are expected to refer to dense contiguous resources and there is no guarantee the resource being removed has the highest index.}

There are no performance requirements, within reason, for resizing since it is expected to be rare.
However, this operation has strict correctness requirements since updating and idle sleep can easily lead to deadlocks.
It should also avoid, as much as possible, any effect on performance when the number of \procs remains constant.
This latter requirement prohibits naive solutions, like simply adding a global lock to the ready-queue arrays.

One solution is to use the Read-Copy-Update pattern~\cite{wiki:rcu}.
In this pattern, resizing is done by creating a copy of the internal data structures (\eg see Figure~\ref{fig:base-ts2}), updating the copy with the desired changes, and then attempting an Indiana Jones Switch to replace the original with the copy.
This approach has the advantage that it may not need any synchronization to do the switch.
However, there is a race where \procs still use the original data structure after the copy is switched.
This race not only requires adding a memory-reclamation scheme, it also requires that operations made on the stale original version are eventually moved to the copy.

Specifically, the original data structure must be kept until all \procs have witnessed the change.
This requirement is the \newterm{memory reclamation challenge} and means every operation needs \emph{some} form of synchronization.
If all operations need synchronization, then the overall cost of this technique is likely to be similar to an uncontended lock approach.
In addition to the classic challenge of memory reclamation, transferring the original data to the copy before reclaiming it poses additional challenges.
Merging subqueues while having a minimal impact on fairness and locality is especially difficult.

For example, given a linked-list, having a node enqueued onto the original and new list is not necessarily a problem depending on the chosen list structure.
If the list supports arbitrary insertions, then inconsistencies in the tail pointer do not break the list;
however, ordering may not be preserved.
Furthermore, nodes enqueued to the original queues eventually need to be uniquely transferred to the new queues, which may further perturb ordering.
Dequeuing is more challenging when nodes appear on both lists because of pending reclamation: dequeuing a node from one list does not remove it from the other, nor is that node in the same place on the other list.
This situation can lead to multiple \procs dequeuing the same \at.
Fixing these challenges requires more synchronization or more indirection to the queues, plus coordinated searching to ensure unique elements.

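The copy-and-switch step of the Read-Copy-Update pattern, and the reclamation problem it leaves behind, can be sketched in C11 as follows.
This is only an illustration under stated assumptions, not the \CFA runtime's code: @struct table@, @current@, @table_new@ and @table_resize@ are hypothetical names, a single resizer is assumed, and the sketch deliberately does not solve the question of when the stale original can be freed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical stand-in for an internal array sized by the number of procs.
struct table {
	size_t count;   // number of slots
	int    slots[]; // one slot per proc
};

// The version that readers currently see.
static _Atomic(struct table *) current;

// Allocate a table with `count` zeroed slots (NULL on allocation failure).
struct table * table_new( size_t count ) {
	struct table * t = calloc( 1, sizeof(struct table) + count * sizeof(int) );
	if ( t ) t->count = count;
	return t;
}

// Copy the published table, size the copy as requested, then switch pointers.
// Assumes one resizer at a time and an already published table.
// Returns the stale original: it may only be freed once no reader can still
// hold a pointer to it, which this sketch does not track.
struct table * table_resize( size_t count ) {
	struct table * old = atomic_load( &current );
	assert( old != NULL );
	struct table * copy = table_new( count );
	if ( copy == NULL ) return NULL;
	size_t keep = old->count < count ? old->count : count;
	memcpy( copy->slots, old->slots, keep * sizeof(int) );
	atomic_store( &current, copy );          // the switch
	return old;
}
```

A reader that loaded @current@ just before the switch keeps operating on the returned original, which is exactly the race described above: its updates land in the stale version and the memory cannot yet be released.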
\subsection{Readers-Writer Lock}
A simpler approach is to use a \newterm{Readers-Writer Lock}~\cite{wiki:rwlock}, where the resizing requires acquiring the lock as a writer while simply enqueueing/dequeuing \ats requires acquiring the lock as a reader.
Using a Readers-Writer lock solves the problem of dynamically resizing and leaves the challenge of finding or building a lock with sufficiently good read-side performance.
Since this approach is not a very complex challenge and an ad-hoc solution is perfectly acceptable, building a Readers-Writer lock was the path taken.

To maximize reader scalability, readers should not contend with each other when attempting to acquire and release a critical section.
Achieving this goal requires each reader to have its own memory to mark as locked and unlocked.
The read acquire possibly waits for a writer to finish the critical section and then acquires a reader's local spinlock.
The write acquire acquires the global lock, guaranteeing mutual exclusion among writers, and then acquires each of the local reader locks.
Acquiring all the local read locks guarantees mutual exclusion among the readers and the writer, while the wait on the read side prevents readers from continuously starving the writer.

Figure~\ref{f:SpecializedReadersWriterLock} shows the outline for this specialized readers-writer lock.
The lock is nonblocking, so both readers and writers spin while the lock is held.
\todo{finish explanation}

\begin{figure}
\begin{cfa}
void read_lock() {
	// Step 1 : make sure no writers in
	while write_lock { Pause(); }
	// Step 2 : acquire our local lock
	while atomic_xchg( tls.lock ) { Pause(); }
}
void read_unlock() {
	tls.lock = false;
}
void write_lock()  {
	// Step 1 : lock global lock
	while atomic_xchg( write_lock ) { Pause(); }
	// Step 2 : lock per-proc locks
	for t in all_tls {
		while atomic_xchg( t.lock ) { Pause(); }
	}
}
void write_unlock() {
	// Step 1 : release local locks
	for t in all_tls { t.lock = false; }
	// Step 2 : release global lock
	write_lock = false;
}
\end{cfa}
\caption{Specialized Readers-Writer Lock}
\label{f:SpecializedReadersWriterLock}
\end{figure}

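The outline of the specialized readers-writer lock can be turned into compilable C11 roughly as follows.
This is a minimal sketch, not the runtime's implementation: it assumes a hypothetical fixed bound @MAX_READERS@ and an explicit reader index in place of thread-local storage, it omits the @Pause()@ hint, and it does not pad the per-reader flags onto separate cache lines, which a real implementation would likely do to keep readers from contending.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_READERS 8   // hypothetical fixed bound on the number of procs

struct rw_lock {
	atomic_bool writer;               // global lock, serializes writers
	atomic_bool reader[MAX_READERS];  // one local lock per reader
};

void read_lock( struct rw_lock * l, int id ) {
	assert( id >= 0 && id < MAX_READERS );
	// Step 1: wait until no writer holds or is acquiring the lock
	while ( atomic_load( &l->writer ) ) {}
	// Step 2: acquire this reader's local lock
	while ( atomic_exchange( &l->reader[id], true ) ) {}
}

void read_unlock( struct rw_lock * l, int id ) {
	assert( id >= 0 && id < MAX_READERS );
	atomic_store( &l->reader[id], false );
}

void write_lock( struct rw_lock * l ) {
	// Step 1: acquire the global lock, excluding other writers
	while ( atomic_exchange( &l->writer, true ) ) {}
	// Step 2: acquire every local lock, excluding all readers
	for ( int i = 0; i < MAX_READERS; i++ ) {
		while ( atomic_exchange( &l->reader[i], true ) ) {}
	}
}

void write_unlock( struct rw_lock * l ) {
	// Step 1: release the local locks
	for ( int i = 0; i < MAX_READERS; i++ ) {
		atomic_store( &l->reader[i], false );
	}
	// Step 2: release the global lock
	atomic_store( &l->writer, false );
}
```

In this sketch a reader only ever touches the shared @writer@ flag with a load, so in the common case where no resize is in progress, readers write only to their own flag.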
While manual resizing of \procs is expected to be rare, the number of \ats can vary significantly over an application's lifetime, which means there are times when there are too few or too many \procs.
For this work, it is the programmer's responsibility to manually create \procs, so if there are too few \procs, the application must address this issue.
This leaves the case of too many \procs, when there are not enough \ats for all the \procs to be useful.
These idle \procs cannot be removed because their lifetime is controlled by the application, and only the application knows when the number of \ats may increase or decrease.
While idle \procs can spin until work appears, this approach wastes processor time (taken from other applications) and energy, and generates heat.
Therefore, idle \procs are put into an idle state, called \newterm{Idle-Sleep}, where the \gls{kthrd} is blocked until the scheduler deems it is needed.

Idle sleep effectively encompasses several challenges.
First, a data structure needs to keep track of all \procs that are in idle sleep.
Because idle sleep is spurious, this data structure has strict performance requirements, in addition to strict correctness requirements.
Next, some mechanism is needed to block \glspl{kthrd}, \eg @pthread_cond_wait@ on a pthread condition variable.
The difficulty here is to support \at parking and unparking, user-level locking, timers, \io operations, and all other \CFA features with minimal complexity.
Finally, the scheduler needs a heuristic to determine when to block and unblock an appropriate number of \procs.
However, this third challenge is outside the scope of this thesis because developing a general heuristic is complex enough to justify its own work.
Therefore, the \CFA scheduler simply follows the ``Race-to-Idle''~\cite{Albers12} approach where a sleeping \proc is woken any time a \at becomes ready and \procs go to idle sleep any time they run out of work.

As usual, the cornerstone of any feature related to the kernel is the choice of system call.
In terms of blocking a \gls{kthrd} until some event occurs, the Linux kernel has many available options.

The classic option is to use some combination of the pthread mutual exclusion and synchronization locks, allowing a safe park/unpark of a \gls{kthrd} to/from a @pthread_cond@.
While this approach works for \glspl{kthrd} waiting among themselves, \io operations do not provide a mechanism to signal @pthread_cond@s.
For \io results to wake a \proc waiting on a @pthread_cond@, a different \gls{kthrd} must be woken up first, which then signals the \proc.

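The classic park/unpark scheme built on a pthread mutex and condition variable might look like the following sketch.
The names @struct parker@, @park@ and @unpark@ are hypothetical; the @notified@ flag is an assumption added so that a wake-up delivered before the \gls{kthrd} blocks is not lost and so that spurious returns from @pthread_cond_wait@ are tolerated.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical per-kernel-thread parking spot.
struct parker {
	pthread_mutex_t lock;
	pthread_cond_t  cond;
	bool            notified;  // set by unpark, consumed by park
};

// Block the calling kernel thread until unpark has been called.
void park( struct parker * p ) {
	assert( p != NULL );
	pthread_mutex_lock( &p->lock );
	while ( !p->notified ) {               // loop guards against spurious wake-ups
		pthread_cond_wait( &p->cond, &p->lock );
	}
	p->notified = false;                   // consume the notification
	pthread_mutex_unlock( &p->lock );
}

// Wake the kernel thread parked on p, or let its next park return immediately.
void unpark( struct parker * p ) {
	assert( p != NULL );
	pthread_mutex_lock( &p->lock );
	p->notified = true;
	pthread_cond_signal( &p->cond );
	pthread_mutex_unlock( &p->lock );
}
```

Only another \gls{kthrd} can call @unpark@, which is the limitation noted above: a completed \io operation has no way to signal the condition variable directly.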
\subsection{\lstinline{io_uring} and Epoll}
An alternative is to flip the problem on its head and block waiting for \io, using @io_uring@ or @epoll@.
This creates the inverse situation, where \io operations directly wake sleeping \procs but waking blocked \procs must use an indirect scheme.
This generally takes the form of creating a file descriptor, \eg a dummy file, pipe, or event fd, and using that file descriptor when \procs need to wake each other.
This leads to additional complexity because there can be a race between these artificial \io operations and genuine \io operations.
If not handled correctly, this can lead to artificial files being delayed too long behind genuine files, resulting in longer latency.

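As an illustration of the indirect wake-up scheme, the sketch below blocks in @epoll_wait@ and uses an artificial file descriptor for \proc-to-\proc wake-ups.
It is a sketch under assumptions: @struct io_sleeper@ and its functions are hypothetical names, an event fd is chosen as the artificial descriptor, and genuine \io descriptors would be added to the same epoll instance by code not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

// Hypothetical per-proc sleeping state.
struct io_sleeper {
	int epfd;    // epoll instance the proc blocks on
	int wakefd;  // artificial descriptor used only for wake-ups
};

// Create the epoll instance and register the artificial descriptor.
int sleeper_init( struct io_sleeper * s ) {
	assert( s != NULL );
	s->epfd = epoll_create1( 0 );
	s->wakefd = eventfd( 0, EFD_NONBLOCK );
	if ( s->epfd < 0 || s->wakefd < 0 ) return -1;
	struct epoll_event ev = { .events = EPOLLIN, .data = { .fd = s->wakefd } };
	return epoll_ctl( s->epfd, EPOLL_CTL_ADD, s->wakefd, &ev );
}

// Called by another proc: makes the artificial descriptor readable.
int sleeper_wake( struct io_sleeper * s ) {
	uint64_t one = 1;
	return write( s->wakefd, &one, sizeof(one) ) == (ssize_t)sizeof(one) ? 0 : -1;
}

// Block until something is ready.
// Returns 1 for an artificial wake-up, 0 for genuine I/O, -1 on error.
int sleeper_wait( struct io_sleeper * s ) {
	struct epoll_event ev;
	if ( epoll_wait( s->epfd, &ev, 1, -1 ) != 1 ) return -1;
	if ( ev.data.fd != s->wakefd ) return 0;
	uint64_t v;                            // drain so the next wait blocks again
	if ( read( s->wakefd, &v, sizeof(v) ) != (ssize_t)sizeof(v) ) return -1;
	return 1;
}
```

Because this sketch asks @epoll_wait@ for one event at a time, an artificial wake-up can sit behind genuine \io events that are already ready, which is one way the latency concern described above can appear.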
\subsection{Event FDs}
Another interesting approach is to use an event file descriptor\cit{eventfd}.
This Linux feature is a file descriptor that behaves like \io, \ie, uses @read@ and @write@, but also behaves like a semaphore.
Indeed, all reads and writes must transfer a single 64-bit (8-byte) value.
Writes \emph{add} their values to a buffer using arithmetic addition versus buffer append, and reads zero out the buffer and return the buffer values so far.\footnote{
This behaviour is without the \lstinline{EFD_SEMAPHORE} flag, which changes the behaviour of \lstinline{read} but is not needed for this work.}
If a read is made while the buffer is already 0, the read blocks until a non-0 value is added.
What makes this feature particularly interesting is that @io_uring@ supports the @IORING_REGISTER_EVENTFD@ command to register an event @fd@ with a particular @io_uring@ instance.
Once the event @fd@ is registered, any \io completion on that instance results in @io_uring@ writing to the event @fd@.
This means that a \proc waiting on the event @fd@ can be \emph{directly} woken up by either other \procs or incoming \io.

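The add-on-write and reset-on-read behaviour of an event @fd@ can be demonstrated with a few lines of C on Linux.
The helper names @efd_post@ and @efd_wait@ are hypothetical wrappers introduced only for this sketch; the descriptor is assumed to be created without the @EFD_SEMAPHORE@ and @EFD_NONBLOCK@ flags.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

// Add n to the event fd's counter. Returns 0 on success, -1 on error.
int efd_post( int fd, uint64_t n ) {
	assert( n != UINT64_MAX );             // the kernel rejects this value
	return write( fd, &n, sizeof(n) ) == (ssize_t)sizeof(n) ? 0 : -1;
}

// Block until the counter is non-zero, then return it and reset it to zero.
// Returns 0 if the read fails.
uint64_t efd_wait( int fd ) {
	uint64_t v = 0;
	if ( read( fd, &v, sizeof(v) ) != (ssize_t)sizeof(v) ) return 0;
	return v;
}
```

For example, after creating the descriptor with @eventfd( 0, 0 )@, posting 2 and then 3 makes a single @efd_wait@ return 5, and a second @efd_wait@ would block until another post arrives.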
\section{Tracking Sleepers}
Tracking which \procs are in idle sleep requires a data structure holding all the sleeping \procs, but more importantly it requires a concurrent \emph{handshake} so that no \at is stranded on a ready-queue with no active \proc.
The classic challenge occurs when a \at is made ready while a \proc is going to sleep: there is a race where the new \at may not see the sleeping \proc and the sleeping \proc may not see the ready \at.
Since \ats can be made ready by timers, \io operations, or other events outside a cluster, this race can occur even if the \proc going to sleep is the only \proc awake.
As a result, improper handling of this race leads to all \procs going to sleep when there are ready \ats and the system deadlocks.

Furthermore, the ``Race-to-Idle'' approach means that there may be contention on the data structure tracking sleepers.
Contention can be tolerated for \procs attempting to sleep or wake up because these \procs are not doing useful work, and therefore, not contributing to overall performance.
However, notifying, \ie checking if a \proc must be woken up and doing so if needed, can significantly affect overall performance and must be low cost.

\subsection{Sleepers List}
Each cluster maintains a list of idle \procs, organized as a stack.
This ordering allows \procs at the head of the list to stay constantly active and those at the tail to stay in idle sleep for extended periods of time.
Because of unbalanced performance requirements, the algorithm tracking sleepers is designed to have idle \procs handle as much of the work as possible.
The idle \procs maintain the stack of sleepers among themselves and notifying a sleeping \proc takes as little work as possible.
This approach means that maintaining the list is fairly straightforward.
The list can simply use a single lock per cluster and only \procs that are getting in and out of the idle state contend for that lock.

This approach also simplifies notification.
Indeed, \procs not only need to be notified when a new \at is readied, but also must be notified during manual resizing, so the \gls{kthrd} can be joined.
These requirements mean whichever entity removes idle \procs from the sleeper list must be able to do so in any order.
Using a simple lock over this data structure makes the removal much simpler than using a lock-free data structure.
The single lock also means the notification process simply needs to wake up the desired idle \proc, using @pthread_cond_signal@, @write@ on an @fd@, \etc, and the \proc handles the rest.

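A lock-protected, doubly-linked stack that supports removal in any order can be sketched as follows.
The types and function names are hypothetical, and a pthread mutex stands in for whatever lock the runtime actually uses; the point is only that, under a single lock, pushing at the head and unlinking an arbitrary element are both a handful of pointer updates.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

// Hypothetical list node embedded in each proc.
struct idle_proc {
	struct idle_proc * prev, * next;
};

// Hypothetical per-cluster idle list: a stack with one lock.
struct idle_list {
	pthread_mutex_t lock;
	struct idle_proc * head;   // most recently idled proc
};

// Called by a proc entering idle sleep: push itself on top of the stack.
void idle_push( struct idle_list * l, struct idle_proc * p ) {
	assert( l != NULL && p != NULL );
	pthread_mutex_lock( &l->lock );
	p->prev = NULL;
	p->next = l->head;
	if ( l->head ) l->head->prev = p;
	l->head = p;
	pthread_mutex_unlock( &l->lock );
}

// Unlink a proc from anywhere in the list, e.g. on wake-up or during resizing.
void idle_remove( struct idle_list * l, struct idle_proc * p ) {
	assert( l != NULL && p != NULL );
	pthread_mutex_lock( &l->lock );
	if ( p->prev ) p->prev->next = p->next;
	else l->head = p->next;                // p was the top of the stack
	if ( p->next ) p->next->prev = p->prev;
	p->prev = p->next = NULL;
	pthread_mutex_unlock( &l->lock );
}
```

The same removal on a lock-free list would need to cope with concurrent neighbours being unlinked, which is the complexity the single lock avoids.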
\subsection{Reducing Latency}
As mentioned in this section, \procs going to sleep for extremely short periods of time is likely in certain scenarios.
Therefore, the latency of the system calls to read from and write to an event @fd@ can negatively affect overall performance in a notable way.
Hence, it is important to reduce latency and contention of the notification as much as possible.
Figure~\ref{fig:idle1} shows the basic idle-sleep data structure.
For the notifiers, this data structure can cause contention on the lock and the event @fd@ syscall can cause notable latency.

\begin{figure}
	\centering
	\input{idle1.pstex_t}
	\caption[Basic Idle Sleep Data Structure]{Basic Idle Sleep Data Structure \smallskip\newline Each idle \proc is put onto a doubly-linked stack protected by a lock.
	Each \proc has a private event \lstinline{fd}.}
	\label{fig:idle1}
\end{figure}

Contention occurs because the idle-list lock must be held to access the idle list, \eg by \procs attempting to go to sleep, \procs waking, or notification attempts.
The contention from the \procs attempting to go to sleep can be mitigated slightly by using @try_acquire@, so the \procs simply continue busy waiting, searching for \ats, if the lock is held.
This trick cannot be used when waking \procs since the waker needs to return immediately to what it was doing.
Interestingly, general notification, \ie waking any idle processor versus a specific one, does not strictly require modifying the list.
Here, contention can be reduced notably by having notifiers avoid the lock entirely by adding a pointer to the event @fd@ of the first idle \proc, as in Figure~\ref{fig:idle2}.
To avoid contention among notifiers, each notifier atomically exchanges this pointer with @NULL@, so only one notifier contends on the system call.
\todo{Expand explanation of how a notification works.}

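The lock-free notification path can be sketched in C11 as below.
The variable @first_idle_fd@ and the functions @publish_first@ and @notify_one@ are hypothetical names, and the sketch assumes that idle \procs republish the pointer whenever the head of the idle list changes, while holding the idle-list lock (not shown).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

// Hypothetical atomic pointer to the event fd of the first idle proc,
// or NULL when no proc is idle or a notifier has already claimed it.
static _Atomic(int *) first_idle_fd;

// Idle-proc side: called with the idle-list lock held in this sketch.
void publish_first( int * fd ) {
	atomic_store( &first_idle_fd, fd );
}

// Notifier side: never takes the idle-list lock.
// Returns 1 if a proc was woken, 0 if there was nothing to claim, -1 on error.
int notify_one( void ) {
	int * fd = atomic_exchange( &first_idle_fd, (int *)NULL );
	if ( fd == NULL ) return 0;            // no sleeper, or another notifier won
	assert( *fd >= 0 );
	uint64_t one = 1;
	return write( *fd, &one, sizeof(one) ) == (ssize_t)sizeof(one) ? 1 : -1;
}
```

Because the exchange hands the pointer to exactly one notifier, concurrent notifiers that lose the race return without making a system call.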
\begin{figure}
	\centering
	\input{idle2.pstex_t}
	\caption[Improved Idle-Sleep Data Structure]{Improved Idle-Sleep Data Structure \smallskip\newline An atomic pointer is added to the list pointing to the Event FD of the first \proc on the list.}
	\label{fig:idle2}
\end{figure}

The next optimization is to avoid the latency of the event @fd@, which can be done by adding what is effectively a benaphore\cit{benaphore} in front of the event @fd@.
A simple three-state flag is added beside the event @fd@ to avoid unnecessary system calls, as shown in Figure~\ref{fig:idle:state}.
In Topological Work Stealing (see Section~\ref{s:TopologicalWorkStealing}), a \proc without \ats begins searching by setting the state flag to @SEARCH@.
If no \ats can be found to steal, the \proc then confirms it is going to sleep by atomically swapping the state to @SLEEP@.
If the previous state is still @SEARCH@, then the \proc does read the event @fd@.
Meanwhile, notifiers atomically exchange the state to @AWAKE@.
If the previous state is @SLEEP@, then the notifier must write to the event @fd@.
However, if the notification arrives before the \proc swaps the state to @SLEEP@, the notifier sees @SEARCH@ and the \proc sees @AWAKE@ as the previous state, so both the read and the write on the event @fd@ can be omitted, which reduces latency notably.
These extensions lead to the final data structure shown in Figure~\ref{fig:idle}.
\todo{You never talk about the Benaphore. What is its purpose and when is it used?}

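The three-state protocol can be sketched in C11 as follows.
All names are hypothetical, and one detail is an assumption rather than something the text states: the confirm step uses a compare-and-swap from @SEARCH@ to @SLEEP@, so that an @AWAKE@ notification is never overwritten, whereas the text only says the state is atomically swapped.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

enum { SEARCH, SLEEP, AWAKE };

// Hypothetical per-proc idle state: the flag sits in front of the event fd.
struct idle_flag {
	atomic_int state;
	int        evfd;
};

// Proc: about to look for work to steal.
void idle_search( struct idle_flag * f ) {
	atomic_store( &f->state, SEARCH );
}

// Proc: found nothing, try to commit to sleeping.
// Returns true if the proc must now block, false if it was already notified.
bool idle_confirm( struct idle_flag * f ) {
	int expected = SEARCH;
	return atomic_compare_exchange_strong( &f->state, &expected, SLEEP );
}

// Proc: block on the event fd until a notifier writes to it.
void idle_block( struct idle_flag * f ) {
	uint64_t v = 0;
	ssize_t n = read( f->evfd, &v, sizeof(v) );
	assert( n == (ssize_t)sizeof(v) && v > 0 );
	(void)n;
}

// Notifier: returns 1 if a write was needed, 0 if it was skipped, -1 on error.
int idle_notify( struct idle_flag * f ) {
	if ( atomic_exchange( &f->state, AWAKE ) != SLEEP ) return 0;
	uint64_t one = 1;
	return write( f->evfd, &one, sizeof(one) ) == (ssize_t)sizeof(one) ? 1 : -1;
}
```

When the notification lands during the search phase, @idle_notify@ returns 0 and @idle_confirm@ returns false, so neither side makes a system call; only when the \proc has committed to @SLEEP@ does the notifier pay for the @write@ and the \proc for the @read@.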
\begin{figure}
	\centering
	\input{idle_state.pstex_t}
	\caption[Improved Idle-Sleep Latency]{Improved Idle-Sleep Latency \smallskip\newline A three-state flag is added to the event \lstinline{fd}.}
	\label{fig:idle:state}
\end{figure}

\begin{figure}
	\centering
	\input{idle.pstex_t}
	\caption[Low-latency Idle Sleep Data Structure]{Low-latency Idle Sleep Data Structure \smallskip\newline Each idle \proc is put onto a doubly-linked stack protected by a lock.
	Each \proc has a private event \lstinline{fd} with a benaphore in front of it.
	The list also has an atomic pointer to the event \lstinline{fd} and benaphore of the first \proc on the list.}
	\label{fig:idle}
\end{figure}