Context Navigation

source: doc/theses/thierry_delisle_PhD/thesis/text/practice.tex @ 7a37fcb1

Last change on this file since 7a37fcb1 was ddcaff6, checked in by Thierry Delisle <tdelisle@…>, 2 years ago
Last corrections to my thesis... hopefully
Property mode set to `100644`
File size: 19.7 KB

Line
1	\chapter{Scheduling in practice}\label{practice}
2	The scheduling algorithm described in Chapter~\ref{core} addresses scheduling in a stable state.
3	This chapter addresses problems that occur when the system state changes.
4	Indeed the \CFA runtime supports expanding and shrinking the number of \procs, both manually and, to some extent, automatically.
5	These changes affect the scheduling algorithm, which must dynamically alter its behaviour.
6
7	Specifically, \CFA supports adding \procs using the type @processor@, in both RAII and heap coding scenarios.
8	\begin{cfa}
9	{
10	processor p[4]; // 4 new kernel threads
11	... // execute on 4 processors
12	processor * dp = new( processor, 6 ); // 6 new kernel threads
13	... // execute on 10 processors
14	delete( dp ); // delete 6 kernel threads
15	... // execute on 4 processors
16	} // delete 4 kernel threads
17	\end{cfa}
18	Dynamically allocated processors can be deleted at any time, \ie their lifetime exceeds the block of creation.
19	The consequence is that the scheduler and \io subsystems must know when these \procs come in and out of existence and roll them into the appropriate scheduling algorithms.
20
21	\section{Manual Resizing}
22	Manual resizing is expected to be a rare operation.
23	Programmers normally create/delete processors on a cluster at startup/teardown.
24	Therefore, dynamically changing the number of \procs is an appropriate moment to allocate or free resources to match the new state.
25	As such, all internal scheduling arrays that are sized based on the number of \procs need to be @realloc@ed.
26	This requirement also means any references into these arrays, \eg pointers or indexes, may need to be updated if elements are moved for compaction or any other reason.
27
28	There are no performance requirements, within reason, for act of resizing itself, since it is expected to be rare.
29	However, this operation has strict correctness requirements, since updating and idle sleep can easily lead to deadlocks.
30	The resizing mechanism should also avoid, as much as possible any effect on performance when the number of \procs remains constant.
31	This last requirement prohibits naive solutions, like simply adding a global lock to the ready-queue arrays.
32
33	\subsection{Read-Copy-Update}
34	One solution is to use the Read-Copy-Update pattern~\cite{wiki:rcu}.
35	This is a very common pattern that avoids large critical sections, which is why it is worth mentioning.
36	In this pattern, resizing is done by creating a copy of the internal data structures, \eg see Figure~\ref{fig:base-ts2}, updating the copy with the desired changes, and then attempting an Indiana Jones Switch to replace the original with the copy.
37	This approach has the advantage that it may not need any synchronization to do the switch, depending on how reclamation of the original copy is handled.
38	However, there is a race where \procs still use the original data structure after the copy is switched.
39	This race not only requires adding a memory-reclamation scheme, but it also requires that operations made on the stale original version are eventually moved to the copy.
40
41	Specifically, the original data structure must be kept until all \procs have witnessed the change.
42	This requirement is the \newterm{memory reclamation challenge} and means every operation needs \emph{some} form of synchronization.
43	If all operations need synchronization, then the overall cost of this technique is likely to be similar to an uncontended lock approach.
44	In addition to the classic challenge of memory reclamation, transferring the original data to the copy before reclaiming it poses additional challenges.
45	Especially merging sub-queues while having a minimal impact on fairness and locality.
46
47	For example, given a linked list, having a node enqueued onto the original and new list is not necessarily a problem depending on the chosen list structure.
48	If the list supports arbitrary insertions, then inconsistencies in the tail pointer do not break the list;
49	however, ordering may not be preserved.
50	Furthermore, nodes enqueued to the original queues eventually need to be uniquely transferred to the new queues, which may further perturb ordering.
51	Dequeuing is more challenging when nodes appear on both lists because of pending reclamation: dequeuing a node from one list does not remove it from the other nor is that node in the same place on the other list.
52	This situation can lead to multiple \procs dequeuing the same \at.
53	Fixing these challenges requires more synchronization or more indirection to the queues, plus coordinated searching to ensure unique elements.
54
55	\subsection{Readers-Writer Lock}
56	A simpler approach is to use a \newterm{Readers-Writer Lock}~\cite{wiki:rwlock}, where the resizing requires acquiring the lock as a writer while simply enqueueing/dequeuing \ats requires acquiring the lock as a reader.
57	Using a Readers-Writer lock solves the problem of dynamically resizing and leaves the challenge of finding or building a lock with sufficient good read-side performance.
58	Since this approach is not a very complex challenge and an ad hoc solution is perfectly acceptable, building a Readers-Writer lock was the path taken.
59
60	To maximize reader scalability, readers should not contend with each other when attempting to acquire and release a critical section.
61	Achieving this goal requires that each reader have its own memory to mark as locked and unlocked.
62	The read-acquire possibly waits for a writer to finish the critical section and then acquires a reader's local spinlock.
63	The writer acquires the global lock, guaranteeing mutual exclusion among writers, and then acquires each of the local reader locks.
64	Acquiring all the local read-locks guarantees mutual exclusion among the readers and the writer, while the wait on the read side prevents readers from continuously starving the writer.
65	Figure~\ref{f:SpecializedReadersWriterLock} shows the outline for this specialized readers-writer lock.
66	The lock is nonblocking, so both readers and writers spin while the lock is held.
67	This very wide sharding strategy means that readers have very good locality since they only ever need to access two memory locations.
68
69	\begin{figure}
70	\begin{cfa}
71	void read_lock() {
72	// Step 1 : make sure no writers in
73	while write_lock { Pause(); }
74	// Step 2 : acquire our local lock
75	while atomic_xchg( tls.lock ) { Pause(); }
76	}
77	void read_unlock() {
78	tls.lock = false;
79	}
80	void write_lock() {
81	// Step 1 : lock global lock
82	while atomic_xchg( write_lock ) { Pause(); }
83	// Step 2 : lock per-proc locks
84	for t in all_tls {
85	while atomic_xchg( t.lock ) { Pause(); }
86	}
87	}
88	void write_unlock() {
89	// Step 1 : release local locks
90	for t in all_tls { t.lock = false; }
91	// Step 2 : release global lock
92	write_lock = false;
93	}
94	\end{cfa}
95	\caption{Specialized Readers-Writer Lock}
96	\label{f:SpecializedReadersWriterLock}
97	\end{figure}
98
99	\section{Idle-Sleep}\label{idlesleep}
100	While manual resizing of \procs is expected to be rare, the number of \ats can vary significantly over an application's lifetime, which means there are times when there are too few or too many \procs.
101	For this work, it is the application programmer's responsibility to manually create \procs, so if there are too few \procs, the application must address this issue.
102	This leaves too many \procs when there are not enough \ats for all the \procs to be useful.
103	These idle \procs cannot be removed because their lifetime is controlled by the application, and only the application knows when the number of \ats may increase or decrease.
104	While idle \procs can spin until work appears, this approach wastes energy, unnecessarily produces heat and prevents other applications from using the \gls{hthrd}.
105	Therefore, idle \procs are put into an idle state, called \newterm{Idle-Sleep}, where the \gls{kthrd} is blocked until the scheduler deems it is needed.
106
107	Idle sleep effectively encompasses several challenges.
108	First, a data structure needs to keep track of all \procs that are in idle sleep.
109	Because idle sleep is spurious, this data structure has strict performance requirements, in addition to strict correctness requirements.
110	Next, some mechanism is needed to block \glspl{kthrd}, \eg @pthread_cond_wait@ or a pthread semaphore.
111	The complexity here is to support user-level locking, timers, \io operations, and all other \CFA features with minimal complexity.
112	Finally, the scheduler needs a heuristic to determine when to block and unblock an appropriate number of \procs.
113	However, this third challenge is outside the scope of this thesis because developing a general heuristic is complex enough to justify its own work.
114	Therefore, the \CFA scheduler simply follows the ``Race-to-Idle''~\cite{Albers12} approach where a sleeping \proc is woken any time a \at becomes ready and \procs go to idle sleep anytime they run out of work.
115
116	An interesting subpart of this heuristic is what to do with bursts of \ats that become ready.
117	Since waking up a sleeping \proc can have notable latency, multiple \ats may become ready while a single \proc is waking up.
118	This fact raises the question: if many \procs are available, how many should be woken?
119	If the ready \ats will run longer than the wake-up latency, waking one \proc per \at will offer maximum parallelization.
120	If the ready \ats will run for a very short time, waking many \procs may be wasteful.
121	As mentioned, since a heuristic to handle these complex cases is outside the scope of this thesis, so the behaviour of the scheduler in this particular case is left unspecified.
122
123	\section{Sleeping}
124	As usual, the cornerstone of any feature related to the kernel is the choice of system call.
125	In terms of blocking a \gls{kthrd} until some event occurs, the Linux kernel has many available options.
126
127	\subsection{\lstinline{pthread_mutex}/\lstinline{pthread_cond}}
128	The classic option is to use some combination of the pthread mutual exclusion and synchronization locks, allowing a safe \park/\unpark of a \gls{kthrd} to/from a @pthread_cond@.
129	While this approach works for \glspl{kthrd} waiting among themselves, \io operations do not provide a mechanism to signal @pthread_cond@s.
130	For \io results to wake a \proc waiting on a @pthread_cond@ means a different \gls{kthrd} must be woken up first, which then signals the \proc.
131
132	\subsection{\lstinline{io_uring} and Epoll}
133	An alternative is to flip the problem on its head and block waiting for \io, using @io_uring@ or @epoll@.
134	This creates the inverse situation, where \io operations directly wake sleeping \procs but waking blocked \procs must use an indirect scheme.
135	This generally takes the form of creating a file descriptor, \eg, dummy file, pipe, or event fd, and using that file descriptor when \procs need to wake each other.
136	This leads to additional complexity because there can be a race between these artificial \io and genuine \io operations.
137	If not handled correctly, this can lead to artificial files getting delayed too long behind genuine files, resulting in longer latency.
138
139	\subsection{Event FDs}
140	Another interesting approach is to use an event file descriptor\cite{MAN:eventfd}.
141	This Linux feature is a file descriptor that behaves like \io, \ie, uses @read@ and @write@, but also behaves like a semaphore.
142	Indeed, all reads and writes must use word-sized values, \ie 64 or 32 bits.
143	Writes \emph{add} their values to a buffer using arithmetic addition versus buffer append, and reads zero-out the buffer and return the buffer values so far.\footnote{
144	This behaviour is without the \lstinline{EFD_SEMAPHORE} flag, which changes the behaviour of \lstinline{read} but is not needed for this work.}
145	If a read is made while the buffer is already 0, the read blocks until a non-0 value is added.
146	What makes this feature particularly interesting is that @io_uring@ supports the @IORING_REGISTER_EVENTFD@ command to register an event @fd@ to a particular instance.
147	Once that instance is registered, any \io completion results in @io_uring@ writing to the event @fd@.
148	This means that a \proc waiting on the event @fd@ can be \emph{directly} woken up by either other \procs or incoming \io.
149
150	\section{Tracking Sleepers}
151	Tracking which \procs are in idle sleep requires a data structure holding all the sleeping \procs, but more importantly, it requires a concurrent \emph{handshake} so that no \at is stranded on a ready queue with no active \proc.
152	The classic challenge occurs when a \at is made ready while a \proc is going to sleep: there is a race where the new \at may not see the sleeping \proc and the sleeping \proc may not see the ready \at.
153	Since \ats can be made ready by timers, \io operations, or other events outside a cluster, this race can occur even if the \proc going to sleep is the only \proc awake.
154	As a result, improper handling of this race leads to all \procs going to sleep when there are ready \ats and the system deadlocks.
155
156	The handshake closing the race is done with both the notifier and the idle \proc executing two ordered steps.
157	The notifier first makes sure the newly ready \at is visible to \procs searching for \ats, and then attempts to notify an idle \proc.
158	On the other side, \procs make themselves visible as idle \procs and then search for any \ats they may have missed.
159	Unlike regular work-stealing, this search must be exhaustive to make sure that no pre-existing \at is missed.
160	These steps from both sides guarantee that if the search misses a newly ready \at, then the notifier is guaranteed to see at least one idle \proc.
161	Conversely, if the notifier does not see any idle \proc, then a \proc is guaranteed to find the new \at in its exhaustive search.
162
163	Furthermore, the ``Race-to-Idle'' approach means that there may be contention on the data structure tracking sleepers.
164	Contention can be tolerated for \procs attempting to sleep or wake up because these \procs are not doing useful work, and therefore, not contributing to overall performance.
165	However, notifying, checking if a \proc must be woken-up, and doing so if needed, can significantly affect overall performance and must be low cost.
166
167	\subsection{Sleepers List}
168	Each cluster maintains a list of idle \procs, organized as a stack.
169	This ordering allows \procs at the tail to stay in idle sleep for extended periods while those at the head of the list wake up for bursts of activity.
170	Because of unbalanced performance requirements, the algorithm tracking sleepers is designed to have idle \procs handle as much of the work as possible.
171	The idle \procs maintain the stack of sleepers among themselves and notifying a sleeping \proc takes as little work as possible.
172	This approach means that maintaining the list is fairly straightforward.
173	The list can simply use a single lock per cluster and only \procs that are getting in and out of the idle state contend for that lock.
174
175	This approach also simplifies notification.
176	Indeed, \procs not only need to be notified when a new \at is readied, but must also be notified during manual resizing, so the \gls{kthrd} can be joined.
177	These requirements mean whichever entity removes idle \procs from the sleeper list must be able to do so in any order.
178	Using a simple lock over this data structure makes the removal much simpler than using a lock-free data structure.
179	The single lock also means the notification process simply needs to wake up the desired idle \proc, using @pthread_cond_signal@, @write@ on an @fd@, \etc, and the \proc handles the rest.
180
181	\subsection{Reducing Latency}
182	As mentioned in this section, \procs going to sleep for extremely short periods is likely in certain scenarios.
183	Therefore, the latency of doing a system call to read from and write to an event @fd@ can negatively affect overall performance notably.
184	Hence, it is important to reduce latency and contention of the notification as much as possible.
185	Figure~\ref{fig:idle1} shows the basic idle-sleep data structure.
186	For the notifiers, this data structure can cause contention on the lock and the event @fd@ syscall can cause notable latency.
187
188	\begin{figure}
189	\centering
190	\input{idle1.pstex_t}
191	\caption[Basic Idle Sleep Data Structure]{Basic Idle Sleep Data Structure \smallskip\newline Each idle \proc is put onto a doubly-linked stack protected by a lock.
192	Each \proc has a private event \lstinline{fd}.}
193	\label{fig:idle1}
194	\end{figure}
195
196	Contention occurs because the idle-list lock must be held to access the idle list, \eg by \procs attempting to go to sleep, \procs waking, or notification attempts.
197	The contention from the \procs attempting to go to sleep can be mitigated slightly by using @try_acquire@, so the \procs simply busy wait again searching for \ats if the lock is held.
198	This trick cannot be used when waking \procs since the waker needs to return immediately to what it was doing.
199
200	Interestingly, general notification, \ie waking any idle processor versus a specific one, does not strictly require modifying the list.
201	Here, contention can be reduced notably by having notifiers avoid the lock entirely by adding a pointer to the event @fd@ of the first idle \proc, as in Figure~\ref{fig:idle2}.
202	To avoid contention among notifiers, notifiers atomically exchange the pointer with @NULL@.
203	The first notifier succeeds on the exchange and obtains the @fd@ of an idle \proc;
204	hence, only one notifier contends on the system call.
205	This notifier writes to the @fd@ to wake a \proc.
206	The woken \proc then updates the atomic pointer, while it is updating the head of the list, as it removes itself from the list.
207	Notifiers that obtained a @NULL@ in the exchange simply move on knowing that another notifier is already waking a \proc.
208	This behaviour is equivalent to having multiple notifiers write to the @fd@ since reads consume all previous writes.
209	Note that with and without this atomic pointer, bursts of notification can lead to an unspecified number of \procs being woken up, depending on how the arrival notification compares with the latency of \procs waking up.
210	As mentioned in section~\ref{idlesleep}, there is no optimal approach to handle these bursts.
211	It is therefore difficult to justify the cost of any extra synchronization here.
212
213	\begin{figure}[t]
214	\centering
215	\input{idle2.pstex_t}
216	\caption[Improved Idle-Sleep Data Structure]{Improved Idle-Sleep Data Structure \smallskip\newline An atomic pointer is added to the list pointing to the Event FD of the first \proc on the list.}
217	\label{fig:idle2}
218	\end{figure}
219
220	The next optimization is to avoid the latency of the event @fd@, which can be done by adding what is effectively a binary benaphore\cite{schillings1996engineering} in front of the event @fd@.
221	The benaphore over the event @fd@ logically provides a three-state flag to avoid unnecessary system calls, where the states are expressed explicitly in Figure~\ref{fig:idle:state}.
222	A \proc begins its idle sleep by adding itself to the idle list before searching for a \at.
223	In the process of adding itself to the idle list, it sets the state flag to @SEARCH@.
224	If no \ats can be found during the search, the \proc then confirms it is going to sleep by atomically swapping the state to @SLEEP@.
225	If the previous state is still @SEARCH@, then the \proc does read the event @fd@.
226	Meanwhile, notifiers atomically exchange the state to the @AWAKE@ state.
227	If the previous state is @SLEEP@, then the notifier must write to the event @fd@.
228	However, if the notify arrives almost immediately after the \proc marks itself idle, then both reads and writes on the event @fd@ can be omitted, which reduces latency notably.
229	These extensions lead to the final data structure shown in Figure~\ref{fig:idle}.
230
231	\begin{figure}
232	\centering
233	\input{idle_state.pstex_t}
234	\caption[Improved Idle-Sleep Latency]{Improved Idle-Sleep Latency \smallskip\newline A three-state flag is added to the event \lstinline{fd}.}
235	\label{fig:idle:state}
236	\end{figure}
237
238	\begin{figure}
239	\centering
240	\input{idle.pstex_t}
241	\caption[Low-latency Idle Sleep Data Structure]{Low-latency Idle Sleep Data Structure \smallskip\newline Each idle \proc is put unto a doubly-linked stack protected by a lock.
242	Each \proc has a private event \lstinline{fd} with a benaphore in front of it.
243	The list also has an atomic pointer to the event \lstinline{fd} and benaphore of the first \proc on the list.}
244	\label{fig:idle}
245	\end{figure}

Note: See TracBrowser for help on using the repository browser.

Download in other formats: