\chapter{Scheduling in practice}\label{practice}
The scheduling algorithm described in Chapter~\ref{core} addresses scheduling in a stable state.
However, it does not address problems that occur when the system changes state.
Indeed, the \CFA runtime supports expanding and shrinking the number of \procs, both manually and, to some extent, automatically.
This entails that the scheduling algorithm must support these transitions.

More precisely, \CFA supports adding \procs using the RAII object @processor@.
These objects can be created and destroyed at any time.
They are normally created as automatic stack variables, but this is not a requirement.

The consequence is that the scheduler and \io subsystems must support \procs coming in and out of existence.

\section{Manual Resizing}
Manual resizing is expected to be a rare operation.
Programmers are mostly expected to resize clusters on startup or teardown.
Therefore, dynamically changing the number of \procs is an appropriate moment to allocate or free resources to match the new state.
As such, all internal arrays that are sized based on the number of \procs need to be \texttt{realloc}ed.
This also means that any references into these arrays, whether pointers or indexes, may need to be fixed when shrinking\footnote{Indexes may still need fixing when shrinking because some indexes are expected to refer to dense contiguous resources and there is no guarantee the resource being removed has the highest index.}.
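
As a hypothetical illustration of this index fix-up, the following sketch removes one \proc from a dense array by moving the last element into the freed slot; the \texttt{idx} field and \texttt{remove\_proc} helper are illustrative names, not the actual \CFA runtime names.

\begin{lstlisting}
#include <stdlib.h>

// Hypothetical sketch: shrink a dense array of procs by one element.
// The freed slot is filled by the last element, so only that element's
// stored index needs fixing; the array is then shrunk with realloc.
struct processor { unsigned idx; /* ... */ };

struct processor ** remove_proc( struct processor ** procs, unsigned * count, unsigned victim ) {
	procs[ victim ] = procs[ *count - 1 ];   // move the last proc into the hole
	procs[ victim ]->idx = victim;           // fix the moved proc's index
	*count -= 1;
	return realloc( procs, *count * sizeof( struct processor * ) );
}
\end{lstlisting}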

There are no performance requirements, within reason, for resizing since it is expected to be rare.
However, this operation has strict correctness requirements, since the combination of shrinking and idle sleep can easily lead to deadlocks.
It should also avoid, as much as possible, any effect on performance when the number of \procs remains constant.
This latter requirement prohibits naive solutions, like simply adding a global lock to the ready-queue arrays.

\subsection{Read-Copy-Update}
One solution is to use the Read-Copy-Update\cite{wiki:rcu} pattern.
In this pattern, resizing is done by creating a copy of the internal data structures, updating the copy with the desired changes, and then attempting an Indiana Jones Switch to replace the original with the copy.
This approach potentially has the advantage that it may not need any synchronization to do the switch.
However, there is a race where \procs could still use the previous, original, data structure after the copy was switched in.
This race not only requires some added memory-reclamation scheme, it also requires that operations made on the stale original version be eventually moved to the copy.

For linked lists, enqueuing is only somewhat problematic: \ats enqueued to the original queues need to be transferred to the copy, which might not preserve ordering.
Dequeuing is more challenging.
Dequeuing from the original will not necessarily update the copy, which could lead to multiple \procs dequeuing the same \at.
Fixing this requires more synchronization or more indirection on the queues.

Another challenge is that the original must be kept until all \procs have witnessed the change.
This is a straightforward memory-reclamation challenge, but it does mean that every operation needs \emph{some} form of synchronization.
If each of these operations does need synchronization, then it is possible a simpler solution achieves the same performance.
In addition to the classic challenge of memory reclamation, transferring the original data to the copy before reclaiming it poses further challenges, especially merging subqueues while having a minimal impact on fairness and locality.

\subsection{Readers-Writer Lock}
A simpler approach would be to use a \newterm{Readers-Writer Lock}\cite{wiki:rwlock}, where resizing requires acquiring the lock as a writer while simply enqueuing/dequeuing \ats requires acquiring the lock as a reader.
Using a Readers-Writer lock solves the problem of dynamically resizing and leaves the challenge of finding or building a lock with sufficiently good read-side performance.
Since this is not a very complex challenge and an ad-hoc solution is perfectly acceptable, building a Readers-Writer lock was the path taken.

To maximize reader scalability, the readers should not contend with each other when attempting to acquire and release the critical section.
This effectively requires that each reader have its own piece of memory to mark as locked and unlocked.
Readers then acquire the lock by waiting for writers to finish the critical section and then acquiring their local spinlocks.
Writers acquire the global lock, so writers have mutual exclusion among themselves, and then acquire each of the local reader locks.
Acquiring all the local locks guarantees mutual exclusion between the readers and the writer, while the wait on the read side prevents readers from continuously starving the writer.
The listings below sketch the read-side and write-side acquire and release operations.
\todo{reference listings}

\begin{lstlisting}
bool write_flag;   // global writer flag, set while a writer holds the lock

void read_lock() {
	// Step 1 : wait until no writer holds the lock
	while( write_flag ) { Pause(); }

	// A memory fence may be needed here to order the two steps

	// Step 2 : acquire our local (per-proc) lock
	while( atomic_xchg( tls.lock ) ) {
		Pause();
	}
}

void read_unlock() {
	// Only the local lock needs to be released
	tls.lock = false;
}
\end{lstlisting}

\begin{lstlisting}
void write_lock()  {
	// Step 1 : acquire the global writer flag, giving mutual exclusion among writers
	while( atomic_xchg( write_flag ) ) {
		Pause();
	}

	// Step 2 : acquire the per-proc reader locks
	for t in all_tls {
		while( atomic_xchg( t.lock ) ) {
			Pause();
		}
	}
}

void write_unlock() {
	// Step 1 : release the per-proc reader locks
	for t in all_tls {
		t.lock = false;
	}

	// Step 2 : release the global writer flag
	write_flag = false;
}
\end{lstlisting}
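
As a usage sketch, frequent ready-queue operations take the read side while rare resizing operations take the write side; the \texttt{push} and \texttt{resize\_arrays} helpers below are hypothetical placeholders for the actual scheduler operations.

\begin{lstlisting}
// Hypothetical usage sketch: readers are the frequent, scalable path.
void schedule( struct cluster * cl, struct thread * thrd ) {
	read_lock();                    // only the local per-proc lock is touched
	push( cl, thrd );               // hypothetical ready-queue enqueue
	read_unlock();
}

// Resizing is rare, so the expensive write side is acceptable.
void add_proc( struct cluster * cl ) {
	write_lock();                   // excludes all readers and other writers
	resize_arrays( cl, cl->nprocs + 1 );   // hypothetical realloc of internal arrays
	write_unlock();
}
\end{lstlisting}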

\section{Idle-Sleep}
In addition to users manually changing the number of \procs, it is desirable to support ``removing'' \procs when there are not enough \ats for all the \procs to be useful.
While manual resizing is expected to be rare, the number of \ats is expected to vary much more, which means \procs may need to be ``removed'' for only short periods of time.
Furthermore, race conditions that spuriously lead to the impression that no \ats are ready are actually common in practice.
Therefore, resources associated with \procs should not be freed; instead, \procs are simply put into an idle state where the \gls{kthrd} is blocked until more \ats become ready.
This state is referred to as \newterm{Idle-Sleep}.

Idle sleep effectively encompasses several challenges.
First, some data structure needs to keep track of all \procs that are in idle sleep.
Because idle sleep can be spurious, this data structure has strict performance requirements in addition to the strict correctness requirements.
Next, some tool must be used to block \glspl{kthrd}, \eg \texttt{pthread\_cond\_wait} or pthread semaphores.
The challenge here is to support \at parking and unparking, timers, \io operations and all other \CFA features with minimal complexity.
Finally, idle sleep also includes a heuristic to determine the appropriate number of \procs to be in idle sleep at any given time.
This third challenge is however outside the scope of this thesis because developing a general heuristic is involved enough to justify its own work.
The \CFA scheduler simply follows the ``Race-to-Idle''\cit{https://doi.org/10.1137/1.9781611973099.100} approach, where a sleeping \proc is woken any time an \at becomes ready and \procs go to idle sleep any time they run out of work.
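
As a rough illustration of this policy, and not the actual \CFA code, readying an \at always wakes a sleeping \proc if one exists, while a \proc that finds no work parks immediately; all helper names are hypothetical.

\begin{lstlisting}
// Hypothetical sketch of the Race-to-Idle policy.
void on_ready( struct cluster * cl, struct thread * thrd ) {
	push( cl, thrd );               // make the thread ready
	if( has_idle( cl ) ) {
		notify_one( cl );           // always wake a sleeping proc, if any
	}
}

void on_out_of_work( struct cluster * cl, struct processor * proc ) {
	idle_sleep( cl, proc );         // no work found: block immediately
}
\end{lstlisting}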
114
115\section{Sleeping}
116As usual, the corner-stone of any feature related to the kernel is the choice of system call.
117In terms of blocking a \gls{kthrd} until some event occurs the linux kernel has many available options:
118
119\paragraph{\texttt{pthread\_mutex}/\texttt{pthread\_cond}}
120The most classic option is to use some combination of \texttt{pthread\_mutex} and \texttt{pthread\_cond}.
121These serve as straight forward mutual exclusion and synchronization tools and allow a \gls{kthrd} to wait on a \texttt{pthread\_cond} until signalled.
122While this approach is generally perfectly appropriate for \glspl{kthrd} waiting after eachother, \io operations do not signal \texttt{pthread\_cond}s.
123For \io results to wake a \proc waiting on a \texttt{pthread\_cond} means that a different \glspl{kthrd} must be woken up first, and then the \proc can be signalled.
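
A minimal sketch of this classic approach is shown below, assuming a hypothetical per-\proc sleep state; it illustrates only the blocking pattern, not the actual runtime code.

\begin{lstlisting}
#include <pthread.h>
#include <stdbool.h>

// Hypothetical per-proc sleep state, for illustration only.
struct idle_state {
	pthread_mutex_t lock;
	pthread_cond_t  cond;
	bool            ready;   // set when work becomes available
};

void proc_sleep( struct idle_state * st ) {
	pthread_mutex_lock( &st->lock );
	while( !st->ready ) {                       // guard against spurious wake-ups
		pthread_cond_wait( &st->cond, &st->lock );
	}
	st->ready = false;
	pthread_mutex_unlock( &st->lock );
}

void proc_wake( struct idle_state * st ) {
	pthread_mutex_lock( &st->lock );
	st->ready = true;
	pthread_cond_signal( &st->cond );           // wake one sleeping kernel thread
	pthread_mutex_unlock( &st->lock );
}
\end{lstlisting}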
124
125\subsection{\texttt{io\_uring} and Epoll}
126An alternative is to flip the problem on its head and block waiting for \io, using \texttt{io\_uring} or even \texttt{epoll}.
127This creates the inverse situation, where \io operations directly wake sleeping \procs but waking \proc from a running \gls{kthrd} must use an indirect scheme.
128This generally takes the form of creating a file descriptor, \eg, a dummy file, a pipe or an event fd, and using that file descriptor when \procs need to wake eachother.
129This leads to additional complexity because there can be a race between these artificial \io operations and genuine \io operations.
130If not handled correctly, this can lead to the artificial files going out of sync.
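
For example, a \proc blocked in \texttt{epoll\_wait} can be woken by another \proc writing one byte to a pipe registered alongside the genuine \io file descriptors; the sketch below illustrates this indirect scheme under those assumptions, with error handling omitted.

\begin{lstlisting}
#include <sys/epoll.h>
#include <unistd.h>

int wake_pipe[2];   // [0] is registered with epoll, [1] is written to for wake-ups

void setup( int epfd ) {
	pipe( wake_pipe );
	struct epoll_event ev = { .events = EPOLLIN, .data.fd = wake_pipe[0] };
	epoll_ctl( epfd, EPOLL_CTL_ADD, wake_pipe[0], &ev );   // artificial wake-up fd
}

void proc_block( int epfd ) {
	struct epoll_event ev;
	epoll_wait( epfd, &ev, 1, -1 );          // blocks until genuine io or a wake-up
	if( ev.data.fd == wake_pipe[0] ) {
		char buf;
		read( wake_pipe[0], &buf, 1 );       // drain the artificial wake-up
	}
}

void proc_wake( void ) {
	write( wake_pipe[1], "", 1 );            // wake one blocked proc
}
\end{lstlisting}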

\subsection{Event FDs}
Another interesting approach is to use an event file descriptor\cit{eventfd}.
This Linux feature is a file descriptor that behaves like \io, \ie, it uses \texttt{read} and \texttt{write}, but it also behaves like a semaphore.
Indeed, all reads and writes must use 64-bit values\footnote{On 64-bit Linux; a 32-bit Linux would use 32-bit values.}.
Writes add their values to the buffer, that is, arithmetic addition and not buffer append, and reads zero out the buffer and return the buffer values so far\footnote{This is without the \texttt{EFD\_SEMAPHORE} flag. This flag changes the behavior of \texttt{read} but is not needed for this work.}.
If a read is made while the buffer is already 0, the read blocks until a non-0 value is added.
What makes this feature particularly interesting is that \texttt{io\_uring} supports the \texttt{IORING\_REGISTER\_EVENTFD} command, which registers an event fd with a particular instance.
Once that instance is registered, any \io completion results in \texttt{io\_uring} writing to the event FD.
This means that a \proc waiting on the event FD can be \emph{directly} woken up by either other \procs or incoming \io.
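
The sketch below illustrates the mechanism, assuming liburing's \texttt{io\_uring\_register\_eventfd} wrapper and hypothetical \texttt{proc\_sleep}/\texttt{proc\_wake} helpers; error handling is omitted.

\begin{lstlisting}
#include <sys/eventfd.h>
#include <liburing.h>
#include <stdint.h>
#include <unistd.h>

struct io_uring ring;
int efd;

void setup( void ) {
	efd = eventfd( 0, 0 );                     // counter starts at 0
	io_uring_queue_init( 128, &ring, 0 );
	io_uring_register_eventfd( &ring, efd );   // io completions now write to the event fd
}

void proc_sleep( void ) {
	uint64_t val;
	read( efd, &val, sizeof(val) );            // blocks until io completes or another proc writes
}

void proc_wake( void ) {
	uint64_t one = 1;
	write( efd, &one, sizeof(one) );           // adds 1 to the counter, unblocking the reader
}
\end{lstlisting}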
141
142\begin{figure}
143        \centering
144        \input{idle1.pstex_t}
145        \caption[Basic Idle Sleep Data Structure]{Basic Idle Sleep Data Structure \smallskip\newline Each idle \proc is put unto a doubly-linked stack protected by a lock.
146        Each \proc has a private event FD.}
147        \label{fig:idle1}
148\end{figure}


\section{Tracking Sleepers}
Tracking which \procs are in idle sleep requires a data structure holding all the sleeping \procs, but more importantly, it requires a concurrent \emph{handshake} so that no \at is stranded on a ready queue with no active \proc.
The classic challenge is when a \at is made ready while a \proc is going to sleep: there is a race where the new \at may not see the sleeping \proc and the sleeping \proc may not see the ready \at.
Since \ats can be made ready by timers, \io operations or other events outside a cluster, this race can occur even if the \proc going to sleep is the only \proc awake.
As a result, improper handling of this race can lead to all \procs going to sleep and the system deadlocking.
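
The usual way to close this race is a last-check handshake: a \proc publishes itself as idle \emph{before} re-checking the ready queue, while a notifier publishes the \at before checking for sleepers. The sketch below, with hypothetical helpers, only shows this ordering and is not the exact \CFA code.

\begin{lstlisting}
// Hypothetical sketch of the last-check handshake.
void go_idle( struct cluster * cl, struct processor * proc ) {
	push_idle( cl, proc );          // 1. publish the intent to sleep
	if( !ready_queue_empty( cl ) ) {
		remove_idle( cl, proc );    // 2. work appeared concurrently: cancel the sleep
		return;
	}
	block( proc );                  // 3. safe to block: any later ready at sees this proc
}

void make_ready( struct cluster * cl, struct thread * thrd ) {
	push( cl, thrd );               // 1. publish the ready at
	notify_one( cl );               // 2. then wake a sleeper, if any
}
\end{lstlisting}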

Furthermore, the ``Race-to-Idle'' approach means that there may be contention on the data structure tracking sleepers.
Contention slowing down \procs attempting to sleep or wake up can be tolerated.
These \procs are not doing useful work and therefore are not contributing to overall performance.
However, notifying, \ie checking if a \proc must be woken up and doing so if needed, can significantly affect overall performance and must be low cost.

\subsection{Sleepers List}
Each cluster maintains a list of idle \procs, organized as a stack.
This ordering hopefully allows \procs at the tail to stay in idle sleep for extended periods of time.
Because of these unbalanced performance requirements, the algorithm tracking sleepers is designed to have idle \procs handle as much of the work as possible.
The idle \procs maintain the list of sleepers among themselves, and notifying a sleeping \proc takes as little work as possible.
This approach means that maintaining the list is fairly straightforward.
The list can simply use a single lock per cluster, and only \procs that are getting in and out of the idle state will contend for that lock.

This approach also simplifies notification.
Indeed, \procs need to be notified when a new \at is readied, but they also must be notified during resizing, so the \gls{kthrd} can be joined.
This means that whichever entity removes idle \procs from the sleeper list must be able to do so in any order.
Using a simple lock over this data structure makes the removal much simpler than using a lock-free data structure.
The notification process then simply needs to wake up the desired idle \proc, using \texttt{pthread\_cond\_signal}, \texttt{write} on an fd, etc., and the \proc will handle the rest.
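
A minimal sketch of such a lock-protected idle list is shown below; the field names are hypothetical and a \texttt{pthread\_mutex} stands in for the cluster lock.

\begin{lstlisting}
#include <pthread.h>
#include <stddef.h>

// Hypothetical layout, for illustration only.
struct processor {
	struct processor * idle_next, * idle_prev;
	int                idle_fd;               // private event fd
};
struct cluster {
	pthread_mutex_t    idle_lock;             // single lock per cluster
	struct processor * idle_head;             // top of the stack of sleeping procs
};

void push_idle( struct cluster * cl, struct processor * proc ) {
	pthread_mutex_lock( &cl->idle_lock );
	proc->idle_next = cl->idle_head;          // push onto the top of the stack
	proc->idle_prev = NULL;
	if( cl->idle_head ) cl->idle_head->idle_prev = proc;
	cl->idle_head = proc;
	pthread_mutex_unlock( &cl->idle_lock );
}

void remove_idle( struct cluster * cl, struct processor * proc ) {
	pthread_mutex_lock( &cl->idle_lock );
	// Doubly linked, so removal works in any order, as needed for resizing.
	if( proc->idle_prev ) proc->idle_prev->idle_next = proc->idle_next;
	else cl->idle_head = proc->idle_next;
	if( proc->idle_next ) proc->idle_next->idle_prev = proc->idle_prev;
	pthread_mutex_unlock( &cl->idle_lock );
}
\end{lstlisting}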

\subsection{Reducing Latency}
As mentioned in this section, \procs going idle for extremely short periods of time is likely in certain common scenarios.
Therefore, the latency of doing a system call to read from and write to the event fd can actually negatively affect overall performance in a notable way.
It is important to reduce the latency and contention of the notification as much as possible.
Figure~\ref{fig:idle1} shows the basic idle-sleep data structure.
For the notifiers, this data structure can cause contention on the lock, and the event fd syscall can cause notable latency.

\begin{figure}
	\centering
	\input{idle2.pstex_t}
	\caption[Improved Idle Sleep Data Structure]{Improved Idle Sleep Data Structure \smallskip\newline An atomic pointer is added to the list, pointing to the Event FD of the first \proc on the list.}
	\label{fig:idle2}
\end{figure}

The contention is mostly due to the lock on the list needing to be held to get to the head \proc.
That lock can be contended by \procs attempting to go to sleep, \procs waking, or notification attempts.
The contention from the \procs attempting to go to sleep can be mitigated slightly by using \texttt{try\_acquire} instead, so the \procs simply continue searching for \ats if the lock is held.
This trick cannot be used for waking \procs since they are not in a state where they can run \ats.
However, it is worth noting that notification does not strictly require accessing the list or the head \proc.
Therefore, contention can be reduced notably by having notifiers avoid the lock entirely and instead use a pointer to the event fd of the first idle \proc, as in Figure~\ref{fig:idle2}.
To avoid contention between the notifiers, instead of simply reading the atomic pointer, notifiers atomically exchange it to \texttt{null}, so only one notifier will contend on the system call.
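
A sketch of the resulting notification fast path, using C11 atomics and a hypothetical \texttt{idle\_fd\_ptr} standing in for the atomic pointer of Figure~\ref{fig:idle2}, is shown below.

\begin{lstlisting}
#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>

// Hypothetical: the cluster's atomic pointer to the first idle proc's event fd.
int * _Atomic idle_fd_ptr;

void notify_one( void ) {
	// Claim the published fd; NULL means no sleeper, or another notifier already won.
	int * fdp = atomic_exchange( &idle_fd_ptr, NULL );
	if( fdp == NULL ) return;

	uint64_t one = 1;
	write( *fdp, &one, sizeof(one) );   // wake the head proc; it republishes the pointer
}
\end{lstlisting}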

\begin{figure}
	\centering
	\input{idle_state.pstex_t}
	\caption[Idle Sleep State Flag]{Idle Sleep State Flag \smallskip\newline A three-state flag (\texttt{SEARCH}, \texttt{SLEEP}, \texttt{AWAKE}) is added beside each event FD to avoid unnecessary system calls.}
	\label{fig:idle:state}
\end{figure}

The next optimization that can be done is to avoid the latency of the event fd when possible.
This can be done by adding what is effectively a benaphore\cit{benaphore} in front of the event fd.
A simple three-state flag is added beside the event fd to avoid unnecessary system calls, as shown in Figure~\ref{fig:idle:state}.
The flag starts in state \texttt{SEARCH}, while the \proc is searching for \ats to run.
The \proc then confirms the sleep by atomically swapping the state to \texttt{SLEEP}.
If the previous state was still \texttt{SEARCH}, then the \proc does read the event fd.
Meanwhile, notifiers atomically exchange the state to \texttt{AWAKE}.
If the previous state was \texttt{SLEEP}, then the notifier must write to the event fd.
However, if the notification arrives almost immediately after the \proc marks itself idle, then both the read and the write on the event fd can be omitted, which reduces latency notably.
This leads to the final data structure shown in Figure~\ref{fig:idle}.
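
The following sketch, using C11 atomics and hypothetical helper names, illustrates how the flag elides the system calls on the fast path; it is a simplification of the mechanism rather than the exact runtime code.

\begin{lstlisting}
#include <stdatomic.h>
#include <stdint.h>
#include <unistd.h>

enum { SEARCH, SLEEP, AWAKE };

struct idle_fd {
	_Atomic int state;   // the benaphore-style flag
	int         fd;      // the event fd behind it
};

void proc_sleep( struct idle_fd * i ) {
	// The flag was set to SEARCH when the proc started looking for work (not shown).
	if( atomic_exchange( &i->state, SLEEP ) == SEARCH ) {
		uint64_t val;
		read( i->fd, &val, sizeof(val) );   // no notifier raced in: really block
	}
	// Otherwise a notifier already stored AWAKE, so the read is skipped entirely.
}

void proc_notify( struct idle_fd * i ) {
	if( atomic_exchange( &i->state, AWAKE ) == SLEEP ) {
		uint64_t one = 1;
		write( i->fd, &one, sizeof(one) );  // the proc is (or will be) blocked in read
	}
	// Otherwise the proc was still searching, so both system calls are avoided.
}
\end{lstlisting}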

\begin{figure}
	\centering
	\input{idle.pstex_t}
	\caption[Low-latency Idle Sleep Data Structure]{Low-latency Idle Sleep Data Structure \smallskip\newline Each idle \proc is put onto a doubly-linked stack protected by a lock.
	Each \proc has a private event FD with a benaphore in front of it.
	The list also has an atomic pointer to the event fd and benaphore of the first \proc on the list.}
	\label{fig:idle}
\end{figure}