source: doc/theses/thierry_delisle_PhD/thesis/text/io.tex

Last change on this file was ddcaff6, checked in by Thierry Delisle <tdelisle@…>, 20 months ago
Last corrections to my thesis... hopefully
1	\chapter{User Level \io}\label{userio}
2	As mentioned in Section~\ref{prev:io}, user-level \io requires multiplexing the \io operations of many \ats onto fewer \glspl{proc} using asynchronous \io operations.
3	I/O operations, among others, generally block the \gls{kthrd} when the operation needs to wait for unavailable resources.
When using \gls{uthrding}, this results in the \proc blocking rather than the \at, hindering parallelism and potentially causing deadlocks (see Section~\ref{prev:io}).
5	Different operating systems offer various forms of asynchronous operations and, as mentioned in Chapter~\ref{intro}, this work is exclusively focused on the Linux operating system.
6
7	\section{Kernel Interface}
8	Since this work fundamentally depends on operating-system support, the first step of this design is to discuss the available interfaces and pick one (or more) as the foundation for the non-blocking \io subsystem in this work.
9
10	\subsection{\lstinline{O_NONBLOCK}}\label{ononblock}
In Linux, files can be opened with the flag @O_NONBLOCK@~\cite{MAN:open} (or @SOCK_NONBLOCK@~\cite{MAN:accept}, the equivalent for sockets) to use the file descriptors in ``nonblocking mode''.
12	In this mode, ``Neither the @open()@ nor any subsequent \io operations on the [opened file descriptor] will cause the calling process to wait''~\cite{MAN:open}.
13	This feature can be used as the foundation for the non-blocking \io subsystem.
14	However, for the subsystem to know when an \io operation completes, @O_NONBLOCK@ must be used in conjunction with a system call that monitors when a file descriptor becomes ready, \ie, the next \io operation on it does not cause the process to wait.\footnote{
15	In this context, ready means \emph{some} operation can be performed without blocking.
16	It does not mean an operation returning \lstinline{EAGAIN} succeeds on the next try.
17	For example, a ready read may only return a subset of requested bytes and the read must be issued again for the remaining bytes, at which point it may return \lstinline{EAGAIN}.}
18	This mechanism is also crucial in determining when all \ats are blocked and the application \glspl{kthrd} can now block.
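The retry behaviour described in the footnote can be sketched as follows.
This is a minimal illustration, not part of the \CFA runtime: the helper names @set_nonblocking@ and @try_read@ are hypothetical, and only the standard @fcntl@ and @read@ calls are assumed.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Switch an open descriptor to non-blocking mode; 0 on success, -1 on error.
   (Hypothetical helper, named for this sketch only.) */
int set_nonblocking(int fd) {
    assert(fd >= 0);
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags == -1) return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Attempt one read without waiting.
   Returns the byte count (0 is end-of-file), -1 if the descriptor is not
   ready (EAGAIN: wait for readiness, then retry), or -2 on any other error. */
long try_read(int fd, void *buf, size_t len) {
    ssize_t n = read(fd, buf, len);
    if (n >= 0) return (long)n;
    return (errno == EAGAIN || errno == EWOULDBLOCK) ? -1 : -2;
}
```

On a pipe holding three bytes, a request for eight bytes returns three, and the follow-up request for the remainder returns the not-ready code, which is exactly the point where a readiness-monitoring call is needed.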
19
20	There are three options to monitor file descriptors (FD) in Linux:\footnote{
21	For simplicity, this section omits \lstinline{pselect} and \lstinline{ppoll}.
22	The difference between these system calls and \lstinline{select} and \lstinline{poll}, respectively, is not relevant for this discussion.}
23	@select@~\cite{MAN:select}, @poll@~\cite{MAN:poll} and @epoll@~\cite{MAN:epoll}.
24	All three of these options offer a system call that blocks a \gls{kthrd} until at least one of many file descriptors becomes ready.
25	The group of file descriptors being waited on is called the \newterm{interest set}.
26
27	\paragraph{\lstinline{select}} is the oldest of these options, and takes as input a contiguous array of bits, where each bit represents a file descriptor of interest.
28	Hence, the array length must be as long as the largest FD currently of interest.
29	On return, it outputs the set modified in-place to identify which of the file descriptors changed state.
30	This destructive change means selecting in a loop requires re-initializing the array for each iteration.
31	Another limitation of @select@ is that calls from different \glspl{kthrd} sharing FDs are independent.
32	Hence, if one \gls{kthrd} is managing the select calls, other threads can only add/remove to/from the manager's interest set through synchronized calls to update the interest set.
33	However, these changes are only reflected when the manager makes its next call to @select@.
34	Note, it is possible for the manager thread to never unblock if its current interest set never changes, \eg the sockets/pipes/TTYs it is waiting on never get data again.
35	Often the I/O manager has a timeout, polls, or is sent a signal on changes to mitigate this problem.
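The destructive update of the bit array can be made concrete with a small sketch.
The function names @wait_readable_select@ and @select_usage@ are hypothetical; the point is only that the set is rebuilt inside every call because @select@ overwrites it.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Wait until at least one of the n descriptors is readable or the timeout
   expires. ready[i] is set to 1 for each readable descriptor.
   Returns the number of ready descriptors, 0 on timeout, -1 on error. */
int wait_readable_select(const int *fds, int n, int timeout_ms, int *ready) {
    fd_set set;
    int maxfd = -1;
    /* select overwrites its sets, so the bit array is rebuilt on every call. */
    FD_ZERO(&set);
    for (int i = 0; i < n; i++) {
        assert(fds[i] >= 0 && fds[i] < FD_SETSIZE);
        FD_SET(fds[i], &set);
        if (fds[i] > maxfd) maxfd = fds[i];
    }
    struct timeval tv = { timeout_ms / 1000, (timeout_ms % 1000) * 1000 };
    int rc = select(maxfd + 1, &set, NULL, NULL, &tv);
    if (rc < 0) return -1;
    for (int i = 0; i < n; i++) ready[i] = FD_ISSET(fds[i], &set) ? 1 : 0;
    return rc;
}

/* Usage: a pipe holding one byte is reported ready on two calls in a row.
   Returns 0 if both calls agree (error paths skip cleanup for brevity). */
int select_usage(void) {
    int p[2], ready[1];
    if (pipe(p) != 0 || write(p[1], "x", 1) != 1) return -1;
    int first = wait_readable_select(&p[0], 1, 0, ready);
    int second = wait_readable_select(&p[0], 1, 0, ready);
    close(p[0]);
    close(p[1]);
    return (first == 1 && second == 1 && ready[0] == 1) ? 0 : -1;
}
```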
36
37	\paragraph{\lstinline{poll}} is the next oldest option, and takes as input an array of structures containing the FD numbers rather than their position in an array of bits, allowing a more compact input for interest sets that contain widely spaced FDs.
38	For small interest sets with densely packed FDs, the @select@ bit mask can take less storage, and hence, copy less information into the kernel.
39	However, @poll@ is non-destructive, so the array of structures does not have to be re-initialized on every call.
40	Like @select@, @poll@ suffers from the limitation that the interest set cannot be changed by other \glspl{kthrd}, while a manager thread is blocked in @poll@.
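For contrast with @select@, the following sketch reuses one array of @struct pollfd@ across calls.
The helpers @init_interest@, @first_readable@ and @poll_usage@ are hypothetical names; only the standard @poll@ call is assumed.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/* Build the interest set once: one entry per descriptor, watching for input. */
void init_interest(struct pollfd *set, const int *fds, int n) {
    for (int i = 0; i < n; i++) {
        set[i].fd = fds[i];
        set[i].events = POLLIN;
        set[i].revents = 0;
    }
}

/* Wait on the interest set and return the index of the first readable entry,
   or -1 on timeout or error. The kernel only overwrites revents, so the same
   array can be passed again without being rebuilt. */
int first_readable(struct pollfd *set, int n, int timeout_ms) {
    assert(n > 0);
    if (poll(set, (nfds_t)n, timeout_ms) <= 0) return -1;
    for (int i = 0; i < n; i++)
        if (set[i].revents & POLLIN) return i;
    return -1;
}

/* Usage: two pipes, data arrives on the second one only.
   Returns 0 on the expected outcome (error paths skip cleanup for brevity). */
int poll_usage(void) {
    int a[2], b[2];
    if (pipe(a) != 0 || pipe(b) != 0) return -1;
    int fds[2] = { a[0], b[0] };
    struct pollfd set[2];
    init_interest(set, fds, 2);
    int idle = first_readable(set, 2, 0);      /* nothing written yet */
    int wrote = write(b[1], "x", 1) == 1;
    int hit = first_readable(set, 2, 0);       /* same array, not rebuilt */
    close(a[0]); close(a[1]); close(b[0]); close(b[1]);
    return (idle == -1 && wrote && hit == 1) ? 0 : -1;
}
```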
41
42	\paragraph{\lstinline{epoll}} follows after @poll@, and places the interest set in the kernel rather than the application, where it is managed by an internal \gls{kthrd}.
43	There are two separate functions: one to add to the interest set and another to check for FDs with state changes.
44	This dynamic capability is accomplished by creating an \emph{epoll instance} with a persistent interest set, which is used across multiple calls.
45	As the interest set is augmented, the changes become implicitly part of the interest set for a blocked manager \gls{kthrd}.
46	This capability significantly reduces synchronization between \glspl{kthrd} and the manager calling @epoll@.
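The split between updating the interest set and waiting on it can be sketched as below.
The names @interest_add@, @wait_one@ and @epoll_usage@ are hypothetical; the sketch runs on a single thread, so it only shows that the set persists and grows between waits, not the cross-thread update itself.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Add fd to the persistent interest set of an epoll instance. The kernel
   permits this call from any thread, including while another thread is
   blocked in epoll_wait on the same instance. 0 on success, -1 on error. */
int interest_add(int epfd, int fd) {
    assert(epfd >= 0 && fd >= 0);
    struct epoll_event ev = { .events = EPOLLIN, .data = { .fd = fd } };
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Wait for one ready descriptor; returns it, or -1 on timeout or error. */
int wait_one(int epfd, int timeout_ms) {
    struct epoll_event ev;
    return epoll_wait(epfd, &ev, 1, timeout_ms) == 1 ? ev.data.fd : -1;
}

/* Usage: the interest set is extended after the first wait, and the next
   wait on the same instance reports the newly added descriptor.
   Returns 0 on the expected outcome (error paths skip cleanup for brevity). */
int epoll_usage(void) {
    int epfd = epoll_create1(0);
    int a[2], b[2];
    if (epfd < 0 || pipe(a) != 0 || pipe(b) != 0) return -1;
    int ok = interest_add(epfd, a[0]) == 0;
    ok = ok && wait_one(epfd, 0) == -1;          /* nothing to read yet */
    ok = ok && interest_add(epfd, b[0]) == 0;    /* set grows, instance is reused */
    ok = ok && write(b[1], "x", 1) == 1;
    ok = ok && wait_one(epfd, 0) == b[0];
    close(a[0]); close(a[1]); close(b[0]); close(b[1]); close(epfd);
    return ok ? 0 : -1;
}
```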
47
48	However, all three of these I/O systems have limitations.
49	The @man@ page for @O_NONBLOCK@ mentions that ``[@O_NONBLOCK@] has no effect for regular files and block devices'', which means none of these three system calls are viable multiplexing strategies for these types of \io operations.
Furthermore, TTYs (FDs connected to standard input and output) can also be tricky to use since they can take different forms based on how the command is executed.
51	For example, @epoll@ rejects FDs pointing to regular files or block devices, which includes @stdin@ when using shell redirections~\cite[\S~3.6]{MAN:bash}, but does not reject shell pipelines~\cite[\S~3.2.3]{MAN:bash}, which includes pipelines into @stdin@.
52	Finally, none of these are useful solutions for multiplexing \io operations that do not have a corresponding file descriptor and can be awkward for operations using multiple file descriptors.
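The rejection of regular files by @epoll@ can be observed directly, since @epoll_ctl@ fails with @EPERM@ for descriptors that do not support it.
The sketch below uses hypothetical helper names (@epoll_supports@, @epoll_support_usage@) and assumes a writable temporary directory for @tmpfile@.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Report whether fd can be monitored by epoll: 1 if it was accepted (and then
   removed again), 0 if the kernel rejected it with EPERM, -1 on other errors. */
int epoll_supports(int epfd, int fd) {
    struct epoll_event ev = { .events = EPOLLIN, .data = { .fd = fd } };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == 0) {
        epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
        return 1;
    }
    return errno == EPERM ? 0 : -1;
}

/* Usage: compare a pipe with an anonymous temporary (regular) file.
   Returns 0 when the pipe is accepted and the regular file is rejected
   (error paths skip cleanup for brevity). */
int epoll_support_usage(void) {
    int epfd = epoll_create1(0);
    int p[2];
    FILE *f = tmpfile();
    if (epfd < 0 || f == NULL || pipe(p) != 0) return -1;
    int pipe_ok = epoll_supports(epfd, p[0]);
    int file_ok = epoll_supports(epfd, fileno(f));
    close(p[0]); close(p[1]); close(epfd);
    fclose(f);
    return (pipe_ok == 1 && file_ok == 0) ? 0 : -1;
}
```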
53
54	\subsection{POSIX asynchronous I/O (AIO)}
55	An alternative to @O_NONBLOCK@ is the AIO interface.
56	Using AIO, programmers can enqueue operations which are to be performed
57	asynchronously by the kernel.
58	The kernel can communicate
59	completions of these operations in three ways:
60	it can spawn a new \gls{kthrd}; send a Linux signal; or
61	userspace can poll for completion of one or more operations.
62	Spawning a new \gls{kthrd} is not consistent with working at the user-level thread level, but Section~\ref{io:morethreads} discusses a related solution.
63	Signals and their associated interrupt handlers can also lead to fairly complicated
64	interactions between subsystems, and they have a non-trivial cost.
65	This leaves a single option: polling for completion---this is similar to the previously discussed
66	system calls.
While AIO only supports read and write operations on file descriptors, it does not have the same limitations as @O_NONBLOCK@, \ie, the file
descriptors can be regular files or block devices.
69	AIO also supports batching multiple operations in a single system call.
70
71	AIO offers two different approaches to polling: @aio_error@ can be used as a spinning form of polling, returning @EINPROGRESS@ until the operation is completed, while @aio_suspend@ can be used similarly to @select@, @poll@ or @epoll@, to wait until one or more requests have been completed.
72	Asynchronous interfaces normally handle more of the complexity than retry-based interfaces, which is convenient for \io multiplexing.
73	However, even if AIO requests can be submitted concurrently, @aio_suspend@ suffers from the same limitation as @select@ and @poll@: the interest set cannot be dynamically changed while a call to @aio_suspend@ is in progress.
AIO also suffers from the limitation of not specifying which requests have been completed, \ie, programmers have to poll each request in the interest set using @aio_error@ to identify the completed requests.
75	This limitation means that, like @select@ and @poll@ but not @epoll@, the time needed to examine polling results increases based on the total number of requests monitored, not the number of completed requests.
76	Finally, AIO does not seem to be a popular interface, which I believe is due in part to this poor polling interface.
77	Linus Torvalds talks about this interface as follows:
78
79	\begin{displayquote}
80	AIO is a horrible ad-hoc design, with the main excuse being ``other,
81	less gifted people, made that design, and we are implementing it for
82	compatibility because database people - who seldom have any shred of
83	taste - actually use it''.
84
85	But AIO was always really really ugly.
86
87	\begin{flushright}
88	-- Linus Torvalds~\cite{AIORant}
89	\end{flushright}
90	\end{displayquote}
91
92	Interestingly, in this e-mail, Linus goes on to describe
93	``a true \textit{asynchronous system call} interface''
94	that does
95	``[an] arbitrary system call X with arguments A, B, C, D asynchronously using a kernel thread''
96	in
97	``some kind of arbitrary \textit{queue up asynchronous system call} model''.
98	This description is quite close to the interface described in the next section.
99
100	\subsection{\lstinline{io_uring}}
101	A very recent addition to Linux, @io_uring@~\cite{MAN:io_uring}, is a framework that aims to solve many of the problems listed in the above interfaces.
102	Like AIO, it represents \io operations as entries added to a queue.
103	But like @epoll@, new requests can be submitted while a blocking call waiting for requests to complete is already in progress.
104	The @io_uring@ interface uses two ring buffers (referred to simply as rings) at its core: a submit ring, to which programmers push \io requests, and a completion ring, from which programmers poll for completion.
105
106	One of the big advantages over the prior interfaces is that @io_uring@ also supports a much wider range of operations.
107	In addition to supporting reads and writes to any file descriptor like AIO, it also supports other operations, like @open@, @close@, @fsync@, @accept@, @connect@, @send@, @recv@, @splice@, \etc.
108
109	On top of these, @io_uring@ adds many extras like avoiding copies between the kernel and user space using shared memory, allowing different mechanisms to communicate with device drivers, and supporting chains of requests, \ie, requests that automatically trigger follow-up requests on completion.
110
111	\subsection{Extra Kernel Threads}\label{io:morethreads}
112	Finally, if the operating system does not offer a satisfactory form of asynchronous \io operations, an ad hoc solution is to create a pool of \glspl{kthrd} and delegate operations to it to avoid blocking \glspl{proc}, which is a compromise for multiplexing.
113	In the worst case, where all \ats are consistently blocking on \io, it devolves into 1-to-1 threading.
114	However, regardless of the frequency of \io operations, it achieves the fundamental goal of not blocking \glspl{proc} when \ats are ready to run.
115	This approach is used by languages like Go~\cite{GITHUB:go}, frameworks like libuv~\cite{libuv}, and web servers like Apache~\cite{apache} and NGINX~\cite{nginx}, since it has the advantage that it can easily be used across multiple operating systems.
116	This advantage is especially relevant for languages like Go, which offer a homogeneous \glsxtrshort{api} across all platforms.
117	Contrast this to C, which has a very limited standard \glsxtrshort{api} for \io, \eg, the C standard library has no networking.
118
119	\subsection{Discussion}
120	These options effectively fall into two broad camps: waiting for \io to be ready, versus waiting for \io to complete.
121	All operating systems that support asynchronous \io must offer an interface along at least one of these lines, but the details vary drastically.
122	For example, FreeBSD offers @kqueue@~\cite{MAN:bsd/kqueue}, which behaves similarly to @epoll@, but with some small quality of life improvements, while Windows (Win32)~\cite{win:overlap} offers ``overlapped I/O'', which handles submissions similarly to @O_NONBLOCK@ with extra flags on the synchronous system call, but waits for completion events, similarly to @io_uring@.
123
124	For this project, I selected @io_uring@, in large part because of its generality.
While @epoll@ has been shown to be a good solution for socket \io~\cite{Karsten20}, @io_uring@'s transparent support for files, pipes, and more complex operations, like @splice@ and @tee@, makes it a better choice as the foundation for a general \io subsystem.
126
127	\section{Event-Engine}
128	An event engine's responsibility is to use the kernel interface to multiplex many \io operations onto few \glspl{kthrd}.
129	In concrete terms, this means \ats enter the engine through an interface, the event engine then starts an operation and parks the calling \ats, and then returns control to the \gls{proc}.
130	The parked \ats are then rescheduled by the event engine once the desired operation has been completed.
131
132	\subsection{\lstinline{io_uring} in depth}\label{iouring}
133	Before going into details on the design of my event engine, more details on @io_uring@ usage are presented, each important in the design of the engine.
134	Figure~\ref{fig:iouring} shows an overview of an @io_uring@ instance.
135	Two ring buffers are used to communicate with the kernel: one for submissions~(left) and one for completions~(right).
136	The submission ring contains \newterm{Submit Queue Entries} (SQE), produced (appended) by the application when an operation starts and then consumed by the kernel.
137	The completion ring contains \newterm{Completion Queue Entries} (CQE), produced (appended) by the kernel when an operation completes and then consumed by the application.
138	The submission ring contains indexes into the SQE array (denoted \emph{S} in the figure) containing entries describing the I/O operation to start;
139	the completion ring contains entries for the completed I/O operation.
140	Multiple @io_uring@ instances can be created, in which case they each have a copy of the data structures in the figure.
141
142	\begin{figure}
143	\centering
144	\input{io_uring.pstex_t}
145	\caption[Overview of \lstinline{io_uring}]{Overview of \lstinline{io_uring} \smallskip\newline Two ring buffers are used to communicate with the kernel, one for completions~(right) and one for submissions~(left).
146	While the completion ring contains plain data, the submission ring contains only references.
147	These references are indexes into an array (denoted \emph{S}), which is created at the same time as the two rings and is also readable by the kernel.}
148	\label{fig:iouring}
149	\end{figure}
150
New \io operations are submitted to the kernel following four steps, which use the components shown in the figure.
152	\begin{enumerate}
153	\item
154	An SQE is allocated from the pre-allocated array \emph{S}.
155	This array is created at the same time as the @io_uring@ instance, is in kernel-locked memory visible by both the kernel and the application, and has a fixed size determined at creation.
156	How these entries are allocated is not important for the functioning of @io_uring@;
157	the only requirement is that no entry is reused before the kernel has consumed it.
158	\item
159	The SQE is filled according to the desired operation.
160	This step is straightforward.
161	The only detail worth mentioning is that SQEs have a @user_data@ field that must be filled to match submission and completion entries.
162	\item
163	The SQE is submitted to the submission ring by appending the index of the SQE to the ring following regular ring buffer steps: \lstinline{buffer[head] = item; head++}.
Since the head is visible to the kernel, some memory barriers may be required to prevent the compiler and the processor from reordering these operations.
165	Since the submission ring is a regular ring buffer, more than one SQE can be added at once and the head is updated only after all entries are updated.
Note, SQEs can be filled and submitted in any order, \eg in Figure~\ref{fig:iouring} the submission order is S0, S3, S2, while S1 has not been submitted.
167	\item
168	The kernel is notified of the change to the ring using the system call @io_uring_enter@.
169	The number of elements appended to the submission ring is passed as a parameter and the number of elements consumed is returned.
170	The @io_uring@ instance can be constructed so this step is not required, but this feature requires that the process have elevated privilege.% and an early version of @io_uring@ had additional restrictions.
171	\end{enumerate}
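The four steps above can be sketched on a mock ring that lives entirely in user space.
Everything here is a stand-in: @struct mock_sq@ and the @sq_*@ functions are hypothetical, no kernel is involved, and @sq_enter@ merely plays the role of the @io_uring_enter@ call.
The sketch keeps the naming used in this section, where the application advances the head; the kernel's own headers call the producer index of the submission ring the tail.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define ENTRIES 4u                       /* fixed when the mock instance is created */

struct mock_sqe { int opcode; int fd; uint64_t user_data; };

struct mock_sq {
    struct mock_sqe  sqes[ENTRIES];      /* the array S */
    bool             in_use[ENTRIES];    /* allocation state, private to the application */
    uint32_t         ring[ENTRIES];      /* submission ring: indexes into S */
    _Atomic uint32_t head;               /* producer index, read by the consumer */
    uint32_t         to_submit;          /* appended but not yet announced */
};

/* Step 1: allocate a given SQE; fails while the entry is still in use. */
bool sq_alloc(struct mock_sq *sq, uint32_t idx) {
    assert(idx < ENTRIES);
    if (sq->in_use[idx]) return false;
    sq->in_use[idx] = true;
    return true;
}

/* Step 2: fill the SQE; user_data later identifies the matching completion. */
void sq_fill(struct mock_sq *sq, uint32_t idx, int opcode, int fd, uint64_t user_data) {
    assert(idx < ENTRIES);
    sq->sqes[idx] = (struct mock_sqe){ opcode, fd, user_data };
}

/* Step 3: write the index, then publish the new head with release ordering so
   the consumer never observes the head before the entry it covers. */
void sq_append(struct mock_sq *sq, uint32_t idx) {
    uint32_t head = atomic_load_explicit(&sq->head, memory_order_relaxed);
    sq->ring[head % ENTRIES] = idx;
    atomic_store_explicit(&sq->head, head + 1, memory_order_release);
    sq->to_submit++;
}

/* Step 4: stand-in for the system call; reports how many entries were
   announced and resets the pending count. */
uint32_t sq_enter(struct mock_sq *sq) {
    uint32_t n = sq->to_submit;
    sq->to_submit = 0;
    return n;
}
```

Allocating, filling and appending entries 0, 3 and 2 in that order reproduces the ring contents shown in Figure~\ref{fig:iouring}, with entry 1 left unallocated.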
172
173	\begin{sloppypar}
174	The completion side is simpler: applications call @io_uring_enter@ with the flag @IORING_ENTER_GETEVENTS@ to wait on a desired number of operations to complete.
175	The same call can be used to both submit SQEs and wait for operations to complete.
176	When operations do complete, the kernel appends a CQE to the completion ring and advances the head of the ring.
177	Each CQE contains the result of the operation as well as a copy of the @user_data@ field of the SQE that triggered the operation.
178	The @io_uring_enter@ system call is only needed if the application wants to block waiting for operations to complete or to flush the submission ring.
179	@io_uring@ supports option @IORING_SETUP_SQPOLL@ at creation, which can remove the need for the system call for submissions.
180	\end{sloppypar}
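The completion side can be mocked in the same spirit.
In this sketch, @cq_post@ stands in for the kernel appending a CQE and @cq_drain@ for the application consuming them; all names are hypothetical, and the consumer index is a plain integer only because the mock runs on one thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define CQ_ENTRIES 8u

/* The two CQE fields discussed above: the result and the copied user_data. */
struct mock_cqe { uint64_t user_data; int32_t res; };

struct mock_cq {
    struct mock_cqe  cqes[CQ_ENTRIES];
    _Atomic uint32_t produced;   /* advanced by the (mock) kernel */
    uint32_t         consumed;   /* advanced by the application */
};

/* Mock kernel side: append one CQE, then publish it. */
void cq_post(struct mock_cq *cq, uint64_t user_data, int32_t res) {
    uint32_t p = atomic_load_explicit(&cq->produced, memory_order_relaxed);
    assert(p - cq->consumed < CQ_ENTRIES);       /* ring must not overflow */
    cq->cqes[p % CQ_ENTRIES] = (struct mock_cqe){ user_data, res };
    atomic_store_explicit(&cq->produced, p + 1, memory_order_release);
}

/* Application side: copy out up to max available CQEs in completion order.
   Returns the number of entries consumed. */
uint32_t cq_drain(struct mock_cq *cq, struct mock_cqe *out, uint32_t max) {
    uint32_t end = atomic_load_explicit(&cq->produced, memory_order_acquire);
    uint32_t n = 0;
    while (cq->consumed != end && n < max) {
        out[n++] = cq->cqes[cq->consumed % CQ_ENTRIES];
        cq->consumed++;
    }
    return n;
}
```

Because completions arrive in whatever order operations finish, the caller relies on @user_data@, not on position in the ring, to find the request each result belongs to.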
181
182	The @io_uring_enter@ system call is protected by a lock inside the kernel.
183	This protection means that concurrent calls to @io_uring_enter@ using the same instance are possible, but there is no performance gained from parallel calls to @io_uring_enter@.
184	It is possible to do the first three submission steps in parallel;
185	however, doing so requires careful synchronization.
186
187	@io_uring@ also introduces constraints on the number of simultaneous operations that can be ``in flight''.
188	First, SQEs are allocated from a fixed-size array, meaning that there is a hard limit to how many SQEs can be submitted at once.
189	Second, the @io_uring_enter@ system call can fail because ``The kernel [...] ran out of resources to handle [a request]'' or ``The application is attempting to overcommit the number of requests it can have pending.''.
190	This restriction means \io request bursts may have to be subdivided and submitted in chunks at a later time.
191
192	An important detail to keep in mind is that just like ``The cloud is just someone else's computer''~\cite{xkcd:cloud}, asynchronous operations are just operations using someone else's threads.
193	Indeed, asynchronous operations can require computation time to complete, which means that if this time is not taken from the thread that triggered the asynchronous operation, it must be taken from some other threads.
194	In this case, the @io_uring@ operations that cannot be handled directly in the system call must be delegated to some other \gls{kthrd}.
195	To this end, @io_uring@ maintains multiple \glspl{kthrd} inside the kernel that are not exposed to the user.
196	Three kinds of operations that can need the \glspl{kthrd} are:
197
198	\paragraph{Operations using} @IOSQE_ASYNC@.
This is a straightforward case: users can explicitly set the @IOSQE_ASYNC@ flag on an SQE to specify that it \emph{must} be delegated to a different \gls{kthrd}.
200
201	\paragraph{Bounded operations.}
This is also a fairly simple case. As mentioned earlier in this chapter, ``[@O_NONBLOCK@] has no effect for regular files and block devices''.
203	Therefore, @io_uring@ handles this case by delegating operations on regular files and block devices.
204	In fact, @io_uring@ maintains a pool of \glspl{kthrd} dedicated to these operations, which are referred to as \newterm{bounded workers}.
205
206	\paragraph{Unbounded operations that must be retried.}
207	While operations like reads on sockets can return @EAGAIN@ instead of blocking the \gls{kthrd}, in the case these operations return @EAGAIN@ they must be retried by @io_uring@ once the data is available on the socket.
208	Since this retry cannot necessarily be done in the system call, \ie, using the application's \gls{kthrd}, @io_uring@ must delegate these calls to \glspl{kthrd} in the kernel.
209	@io_uring@ maintains a separate pool for these operations.
210	The \glspl{kthrd} in this pool are referred to as \newterm{unbounded workers}.
211	Once unbounded operations are ready to be retried, one of the workers is woken up and it will handle the retry inside the kernel.
212	Unbounded workers are also responsible for handling operations using @IOSQE_ASYNC@.
213
214	@io_uring@ implicitly spawns and joins both the bounded and unbounded workers based on its evaluation of the needs of the workload.
215	This effectively encapsulates the work that is needed when using @epoll@.
Indeed, @io_uring@ does not change Linux's underlying handling of \io operations; it simply offers an asynchronous \glsxtrshort{api} on top of the existing system.
217
218
219	\subsection{Multiplexing \io: Submission}
220
221	The submission side is the most complicated aspect of @io_uring@ and the completion side effectively follows from the design decisions made on the submission side.
222	While there is freedom in designing the submission side, there are some realities of @io_uring@ that must be taken into account.
223	It is possible to do the first steps of submission in parallel;
224	however, the duration of the system call scales with the number of entries submitted.
225	The consequence is that the amount of parallelism used to prepare submissions for the next system call is limited.
226	Beyond this limit, the length of the system call is the throughput-limiting factor.
227	I concluded from early experiments that preparing submissions seems to take almost as long as the system call itself, which means that with a single @io_uring@ instance, there is no benefit in terms of \io throughput to having more than two \glspl{hthrd}.
228	Therefore, the design of the submission engine must manage multiple instances of @io_uring@ running in parallel, effectively sharding @io_uring@ instances.
229	Since completions are sent to the instance where requests were submitted, all instances with pending operations must be polled continuously\footnote{
230	As described in Chapter~\ref{practice}, this does not translate into high CPU usage.}.
231	Note that once an operation completes, there is nothing that ties it to the @io_uring@ instance that handled it --- nothing prevents a new operation, with for example the same file descriptor, from using a different @io_uring@ instance.
232
233	A complicating aspect of submission is @io_uring@'s support for chains of operations, where the completion of an operation triggers the submission of the next operation on the link.
234	SQEs forming a chain must be allocated from the same instance and must be contiguous in the Submission Ring (see Figure~\ref{fig:iouring}).
235	The consequence of this feature is that filling SQEs can be arbitrarily complex, and therefore, users may need to run arbitrary code between allocation and submission.
236	For this work, supporting chains is not a requirement of the \CFA \io subsystem, but it is still valuable.
237	Support for this feature can be fulfilled simply by supporting arbitrary user code between allocation and submission.
238
239	Similar to scheduling, sharding @io_uring@ instances can be done privately, \ie, one instance per \proc, in decoupled pools, \ie, a pool of \procs using a pool of @io_uring@ instances without one-to-one coupling between any given instance and any given \gls{proc}, or some mix of the two.
240	These three sharding approaches are analyzed.
241
242	\subsubsection{Private Instances}
243	The private approach creates one ring instance per \gls{proc}, \ie one-to-one coupling.
244	This alleviates the need for synchronization on the submissions, requiring only that \ats are not time-sliced during submission steps.
245	This requirement is the same as accessing @thread_local@ variables, where a \at is accessing kernel-thread data, is time-sliced, and continues execution on another kernel thread but is now accessing the wrong data.
246	This failure is the \newterm{serially reusable problem}~\cite{SeriallyReusable}.
247	Hence, allocated SQEs must be submitted to the same ring on the same \gls{proc}, which effectively forces the application to submit SQEs in order of allocation.\footnote{
248	To remove this requirement, a \at needs the ability to ``yield to a specific \gls{proc}'', \ie, \park with the guarantee it unparks on a specific \gls{proc}, \ie the \gls{proc} attached to the correct ring.}
249	From the subsystem's point of view, the allocation and submission are sequential, greatly simplifying both.
250	In this design, allocation and submission form a partitioned ring buffer, as shown in Figure~\ref{fig:pring}.
251	Once added to the ring buffer, the attached \gls{proc} has a significant amount of flexibility with regard to when to perform the system call.
252	Possible options are: when the \gls{proc} runs out of \ats to run, after running a given number of \ats, \etc.
253
254	\begin{figure}
255	\centering
256	\input{pivot_ring.pstex_t}
257	\caption[Partitioned ring buffer]{Partitioned ring buffer \smallskip\newline Allocated SQEs are appended to the first partition.
258	When submitting, the partition is advanced.
259	The kernel considers the partition as the head of the ring.}
260	\label{fig:pring}
261	\end{figure}
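One possible reading of the partitioned ring buffer is sketched below.
This is an interpretation for illustration rather than the actual data structure: the type @struct pring@ and its three counters are assumptions, and the ring position doubles as the SQE index because allocation and submission are strictly sequential on a single \gls{proc}.

```c
#include <assert.h>
#include <stdint.h>

#define PRING_SIZE 4u

/* Counters only ever grow; positions are taken modulo PRING_SIZE.
   No atomics are needed because a single processor owns the ring. */
struct pring {
    uint32_t allocated;   /* end of the first partition: SQEs handed out */
    uint32_t submitted;   /* the partition: what the kernel treats as the head */
    uint32_t consumed;    /* entries the kernel is finished with */
};

/* Allocate the next SQE in ring order; returns its index, or -1 if every
   entry is either awaiting submission or still owned by the kernel. */
int pring_alloc(struct pring *r) {
    if (r->allocated - r->consumed == PRING_SIZE) return -1;
    return (int)(r->allocated++ % PRING_SIZE);
}

/* Advance the partition over everything allocated so far; returns how many
   entries were newly exposed. Submission order equals allocation order. */
uint32_t pring_submit(struct pring *r) {
    uint32_t n = r->allocated - r->submitted;
    r->submitted = r->allocated;
    return n;
}

/* Record that the kernel has consumed n submitted entries, freeing them. */
void pring_consume(struct pring *r, uint32_t n) {
    assert(n <= r->submitted - r->consumed);
    r->consumed += n;
}
```

The allocation failure in @pring_alloc@ is the case discussed next: the \at has no local way to make progress until the kernel consumes entries.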
262
263	This approach has the advantage that it does not require much of the synchronization needed in a shared approach.
264	However, this benefit means \ats submitting \io operations have less flexibility: they cannot \park or yield, and several exceptional cases are handled poorly.
265	Instances running out of SQEs cannot run \ats wanting to do \io operations.
266	In this case, the \io \at needs to be moved to a different \gls{proc}, and the only current way of achieving this is to @yield()@ hoping to be scheduled on a different \gls{proc} with free SQEs, which is not guaranteed to ever occur.
267
268	A more involved version of this approach tries to solve these problems using a pattern called \newterm{helping}.
269	\Glspl{at} that cannot submit \io operations, either because of an allocation failure or \glslink{atmig}{migration} to a different \gls{proc} between allocation and submission, create an \io object and add it to a list of pending submissions per \gls{proc} and a list of pending allocations, probably per cluster.
270	While there is still a strong coupling between \glspl{proc} and @io_uring@ instances, these data structures allow moving \ats to a specific \gls{proc}, when the current \gls{proc} cannot fulfill the \io request.
271
272	Imagine a simple scenario with two \ats on two \glspl{proc}, where one \at submits an \io operation and then sets a flag, while the other \at spins until the flag is set.
273	Assume both \ats are running on the same \gls{proc}, and the \io \at is preempted between allocation and submission, moved to the second \gls{proc}, and the original \gls{proc} starts running the spinning \at.
274	In this case, the helping solution has the \io \at append an \io object to the submission list of the first \gls{proc}, where the allocation was made.
275	No other \gls{proc} can help the \at since @io_uring@ instances are strongly coupled to \glspl{proc}.
276	However, the \io \gls{proc} is unable to help because it is executing the spinning \at.
277	This results in a deadlock.
278	While this example is artificial, in the presence of many \ats, this problem can arise ``in the wild''.
279	Furthermore, this pattern is difficult to reliably detect and avoid.
280	Once in this situation, the only escape is to interrupt the spinning \at, either directly or via some regular preemption, \eg time slicing.
281	Having to interrupt \ats for this purpose is costly, the latency can be large between interrupts, and the situation may be hard to detect.
282	Interrupts are needed here entirely because the \gls{proc} is tied to an instance it is not using.
283	Therefore, a more satisfying solution is for the \at submitting the operation to notice that the instance is unused and simply go ahead and use it.
284	This approach is presented shortly.
285
286	\subsubsection{Public Instances}
287	The public approach creates decoupled pools of @io_uring@ instances and processors, \ie without one-to-one coupling.
288	\Glspl{at} attempting an \io operation pick one of the available instances and submit the operation to that instance.
289	Since there is no coupling between @io_uring@ instances and \glspl{proc} in this approach, \ats running on more than one \gls{proc} can attempt to submit to the same instance concurrently.
290	Because @io_uring@ effectively sets the amount of sharding needed to avoid contention on its internal locks, performance in this approach is based on two aspects:
291	\begin{itemize}
292	\item
293	The synchronization needed to submit does not induce more contention than @io_uring@ already does.
294	\item
295	The scheme to route \io requests to specific @io_uring@ instances does not introduce contention.
296	This aspect is very important because it comes into play before the sharding of instances, and as such, all \glspl{hthrd} can contend on the routing algorithm.
297	\end{itemize}
298
299	Allocation in this scheme is fairly easy.
300	Free SQEs, \ie, SQEs that are not currently being used to represent a request, can be written-to safely, and have a field called @user_data@ that the kernel only reads to copy to CQEs.
301	Allocation also does not require ordering guarantees as all free SQEs are interchangeable.
302	The only added complexity is that the number of SQEs is fixed, which means allocation can fail.
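Since free SQEs are interchangeable and allocation may fail, a lock-free bitmask is one way to sketch the allocator.
This is an assumed design for illustration, not necessarily the one used in \CFA; @struct sqe_pool@ and the @pool_*@ functions are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define SQE_COUNT 4          /* fixed at instance creation; at most 63 here */

struct sqe_pool { _Atomic uint64_t free_mask; };   /* bit i set: SQE i is free */

void pool_init(struct sqe_pool *p) {
    atomic_init(&p->free_mask, (UINT64_C(1) << SQE_COUNT) - 1);
}

/* Claim any free SQE. Returns its index, or -1 when none is free, which the
   routing layer has to treat as "try another instance or wait". */
int pool_alloc(struct sqe_pool *p) {
    uint64_t mask = atomic_load_explicit(&p->free_mask, memory_order_relaxed);
    while (mask != 0) {
        int idx = 0;
        while (((mask >> idx) & 1) == 0) idx++;        /* lowest free entry */
        uint64_t claimed = mask & ~(UINT64_C(1) << idx);
        if (atomic_compare_exchange_weak_explicit(&p->free_mask, &mask, claimed,
                memory_order_acquire, memory_order_relaxed))
            return idx;
        /* The failed exchange reloaded mask; retry with the fresh value. */
    }
    return -1;
}

/* Return an SQE to the pool once the kernel has consumed it. */
void pool_free(struct sqe_pool *p, int idx) {
    assert(idx >= 0 && idx < SQE_COUNT);
    atomic_fetch_or_explicit(&p->free_mask, UINT64_C(1) << idx, memory_order_release);
}
```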
303
304	Allocation failures need to be pushed to a routing algorithm: \ats attempting \io operations must not be directed to @io_uring@ instances without sufficient SQEs available.
305	Furthermore, the routing algorithm should block operations upfront if none of the instances have available SQEs.
306
307	Once an SQE is allocated, \ats insert the \io request information and keep track of the SQE index and the instance it belongs to.
308
309	Once an SQE is filled in, it is added to the submission ring buffer, an operation that is not thread safe, and then the kernel must be notified using the @io_uring_enter@ system call.
The submission ring buffer is the same size as the pre-allocated SQE buffer; therefore, pushing to the ring buffer cannot fail, because a failure would mean an SQE appears multiple times in the ring buffer, which is undefined behaviour.
311	However, as mentioned, the system call itself can fail with the expectation that it can be retried once some submitted operations are complete.
312
313	Since multiple SQEs can be submitted to the kernel at once, it is important to strike a balance between batching and latency.
314	Operations that are ready to be submitted should be batched together in few system calls, but at the same time, operations should not be left pending for long periods before being submitted.
315	Balancing submission can be handled by either designating one of the submitting \ats as the \at responsible for the system call for the current batch of SQEs or by having some other party regularly submit all ready SQEs, \eg, the poller \at mentioned later in this section.
316
317	Ideally, when multiple \ats attempt to submit operations to the same @io_uring@ instance, all requests should be batched together and one of the \ats is designated to do the system call on behalf of the others, called the \newterm{submitter}.
318	However, in practice, \io requests must be handled promptly so there is a need to guarantee everything missed by the current submitter is seen by the next one.
319	Indeed, as long as there is a ``next'' submitter, \ats submitting new \io requests can move on, knowing that some future system call includes their request.
320	Once the system call is done, the submitter must also free SQEs so that the allocator can reuse them.
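
One possible way to designate a submitter is sketched below using C11 atomics. It is an assumption-laden illustration rather than the algorithm used in \CFA: @submit_state@, @submit@ and @flush@ are hypothetical names, @flush@ only counts what a real system call would submit, and freeing SQEs after the call is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared state for one io_uring instance. */
struct submit_state {
    atomic_uint pending;   /* SQEs appended to the ring but not yet flushed */
    atomic_flag busy;      /* set while some thread acts as the submitter */
    unsigned flushed;      /* total SQEs flushed; only touched by the submitter */
    unsigned syscalls;     /* number of (simulated) system calls */
};

static void submit_init(struct submit_state *s) {
    atomic_init(&s->pending, 0);
    atomic_flag_clear(&s->busy);
    s->flushed = 0;
    s->syscalls = 0;
}

/* Stand-in for the system call that hands n SQEs to the kernel. */
static void flush(struct submit_state *s, unsigned n) {
    assert(n > 0);
    s->flushed += n;
    s->syscalls++;
}

/* Called after the caller appended one SQE to the ring.
   Returns true if the caller performed at least one flush. */
static bool submit(struct submit_state *s) {
    atomic_fetch_add(&s->pending, 1);
    bool did_flush = false;
    while (atomic_load(&s->pending) > 0) {
        /* A submitter already exists: it re-checks pending after
           releasing busy, so this request is not lost. */
        if (atomic_flag_test_and_set(&s->busy)) break;
        unsigned n = atomic_exchange(&s->pending, 0);
        if (n > 0) { flush(s, n); did_flush = true; }
        atomic_flag_clear(&s->busy);
    }
    return did_flush;
}
```

Because the request is counted before the flag is tested, a request that loses the race is either picked up by the current submitter when it re-checks @pending@ or by a later one, which is the guarantee described above.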
321
322	Finally, the completion side is much simpler since the @io_uring@ system call enforces a natural synchronization point.
323	Polling simply needs to regularly do the system call, go through the produced CQEs and communicate the result back to the originating \ats.
324	Since CQEs only own a signed 32-bit result, in addition to the copy of the @user_data@ field, all that is needed to communicate the result is a simple future~\cite{wiki:future}.
325	If the submission side does not designate submitters, polling can also submit all SQEs as it is polling events.
326	A simple approach to polling is to allocate a user-level \at per @io_uring@ instance and simply let the poller \ats poll their respective instances when scheduled.
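
The completion path can be illustrated with the sketch below. It assumes the address of a future is stored in @user_data@ at submission time; @cqe_sketch@ only mirrors the two CQE fields discussed above and is not the kernel structure, and @io_future@, @fulfil@, @future_poll@ and @drain@ are hypothetical names. A real runtime would block and unblock the waiting \at instead of polling the future.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the two CQE fields used here; not the kernel's definition. */
struct cqe_sketch {
    uint64_t user_data;   /* copied from the SQE */
    int32_t  res;         /* signed 32-bit result of the operation */
};

/* Hypothetical minimal future carrying the 32-bit result. */
struct io_future {
    atomic_bool done;
    int32_t     result;
};

static void future_init(struct io_future *f) {
    atomic_init(&f->done, false);
    f->result = 0;
}

/* Poller side: store the result, then publish it. */
static void fulfil(struct io_future *f, int32_t res) {
    f->result = res;
    atomic_store_explicit(&f->done, true, memory_order_release);
}

/* Submitting side: returns true and sets *out once the result is available. */
static bool future_poll(struct io_future *f, int32_t *out) {
    if (!atomic_load_explicit(&f->done, memory_order_acquire)) return false;
    *out = f->result;
    return true;
}

/* Go through n produced CQEs and fulfil the future each one refers to.
   Returns the number of CQEs processed. */
static unsigned drain(const struct cqe_sketch *cqes, unsigned n) {
    for (unsigned i = 0; i < n; i++) {
        struct io_future *f = (struct io_future *)(uintptr_t)cqes[i].user_data;
        assert(f != NULL);
        fulfil(f, cqes[i].res);
    }
    return n;
}
```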
327
328	The big advantage of the pool of SQE instances approach is that it is fairly flexible.
329	It does not impose restrictions on what \ats submitting \io operations can and cannot do between allocations and submissions.
330	It can also gracefully handle running out of resources, whether the instance has no free SQEs or the kernel returns @EBUSY@.
331	The downside to this approach is that many of the steps used for submitting need complex synchronization to work properly.
332	The routing and allocation algorithm needs to keep track of which ring instances have available SQEs, block incoming requests if no instance is available, prevent barging if \ats are already queued up waiting for SQEs and handle SQEs being freed.
333	The submission side needs to safely append SQEs to the ring buffer, correctly handle chains, make sure no SQE is dropped or left pending forever, notify the allocation side when SQEs can be reused, and handle the kernel returning @EBUSY@.
334	All this synchronization has a significant cost, compared to the private-instance approach which does not have synchronization costs in most cases.
335
336	\subsubsection{Instance borrowing}
337	Both of the prior approaches have undesirable aspects that stem from tight or loose coupling between @io_uring@ and \glspl{proc}.
338	The first approach suffers from tight coupling, causing problems when a \gls{proc} does not benefit from the coupling.
339	The second approach suffers from loose coupling, causing operations to have synchronization overhead, which tighter coupling avoids.
340	When \glspl{proc} are continuously issuing \io operations, tight coupling is valuable since it avoids synchronization costs.
341	However, in unlikely failure cases or when \glspl{proc} are not using their instances, tight coupling is no longer advantageous.
342	A compromise between these approaches is to allow tight coupling but have the option to revoke the coupling dynamically when failure cases arise.
343	I call this approach \newterm{instance borrowing}.\footnote{
344	While instance borrowing looks similar to work sharing and stealing, I think it is different enough to warrant a different verb to avoid confusion.}
345
346	As mentioned later in this section, this approach is not ultimately used, but here is still a high-level outline of the algorithm.
347	In this approach, each cluster, see Figure~\ref{fig:system}, owns a pool of @io_uring@ instances managed by an \newterm{arbiter}.
348	When a \at attempts to issue an \io operation, it asks for an instance from the arbiter, and issues requests to that instance.
349	This instance is now bound to the \gls{proc} the \at is running on.
350	This binding is kept until the arbiter decides to revoke it, taking back the instance and reverting the \gls{proc} to its initial \io state.
351	This tight coupling means that synchronization can be minimal since only one \gls{proc} can use the instance at a time, akin to the private instances approach.
352	However, it differs in that revocation by the arbiter means this approach does not suffer from the deadlock scenario described above.
353
354	Arbitration is needed in the following cases:
355	\begin{enumerate}
356	\item The current \gls{proc} does not hold an instance.
357	\item The current instance does not have sufficient SQEs to satisfy the request.
358	\item The current \gls{proc} has a wrong instance.
359	This happens if the submitting \at context-switched between allocation and submission: \newterm{external submissions}.
360	\end{enumerate}
361	However, even when the arbiter is not directly needed, \glspl{proc} need to make sure that their instance ownership is not being revoked, which is accomplished by a lock-\emph{less} handshake.\footnote{
362	Note the handshake is not lock-\emph{free}~\cite{wiki:lockfree} since it lacks the proper progress guarantee.}
363	A \gls{proc} raises a local flag before using its borrowed instance and checks if the instance is marked as revoked or if the arbiter has raised its flag.
364	If not, it proceeds, otherwise it delegates the operation to the arbiter.
365	Once the operation is completed, the \gls{proc} lowers its local flag.
366
367	Correspondingly, before revoking an instance, the arbiter marks the instance and then waits for the \gls{proc} using it to lower its local flag.
368	Only then does it reclaim the instance and potentially assign it to another \gls{proc}.
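
The handshake can be sketched with two flags and C11 sequentially consistent atomics. This is a simplified illustration under stated assumptions, not the code that was evaluated: it merges the revocation mark and the arbiter's flag into a single @revoked@ flag, all names are hypothetical, and the arbiter's wait is split into a mark step and a check step so that the example stays single-threaded.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical handshake state for one borrowed instance. */
struct borrow {
    atomic_bool in_use;    /* raised by the processor while using the instance */
    atomic_bool revoked;   /* raised by the arbiter to take the instance back */
};

static void borrow_init(struct borrow *b) {
    atomic_init(&b->in_use, false);
    atomic_init(&b->revoked, false);
}

/* Processor side: raise the local flag first, then check for revocation.
   Returns true if the instance may be used; the caller must then call
   proc_exit. Returns false if the operation must go to the arbiter. */
static bool proc_enter(struct borrow *b) {
    atomic_store(&b->in_use, true);
    if (atomic_load(&b->revoked)) {
        atomic_store(&b->in_use, false);
        return false;
    }
    return true;
}

/* Processor side: lower the local flag once the operation is completed. */
static void proc_exit(struct borrow *b) {
    assert(atomic_load(&b->in_use));
    atomic_store(&b->in_use, false);
}

/* Arbiter side, step 1: mark the instance as being revoked. */
static void arbiter_mark(struct borrow *b) {
    atomic_store(&b->revoked, true);
}

/* Arbiter side, step 2: the instance can be reclaimed only once the
   processor has lowered its flag; a real arbiter waits on this condition. */
static bool arbiter_can_reclaim(struct borrow *b) {
    return !atomic_load(&b->in_use);
}
```

Each side writes its own flag before reading the other's, so at least one of them observes the conflict. The arbiter may wait indefinitely on a slow processor, which is why the handshake is lock-less rather than lock-free.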
369
370	The arbiter maintains four lists around which it makes its decisions:
371	\begin{enumerate}
372	\item A list of pending submissions.
373	\item A list of pending allocations.
374	\item A list of instances currently borrowed by \glspl{proc}.
375	\item A list of instances currently available.
376	\end{enumerate}
377
378	\paragraph{External Submissions} are handled by the arbiter by revoking the appropriate instance and adding the submission to the submission ring.
379	However, there is no need to immediately revoke the instance.
380	External submissions must simply be added to the ring before the next system call, \ie, when the submission ring is flushed.
381	This means whoever is responsible for the system call first checks whether the instance has any external submissions.
382	If so, it asks the arbiter to revoke the instance and add the external submissions to the ring.
383
384	\paragraph{Pending Allocations} are handled by the arbiter when it has available instances and can directly hand over the instance and satisfy the request.
385	Otherwise, it must hold on to the list of threads until SQEs are made available again.
386	This handling is more complex when an allocation requires multiple SQEs, since the arbiter must decide between satisfying requests in FIFO order and satisfying requests for fewer SQEs first.
387
388	While an arbiter has the potential to solve many of the problems mentioned above, it also introduces a significant amount of complexity.
389	Tracking which processors are borrowing which instances and which instances have SQEs available ends up adding a significant synchronization prelude to any I/O operation.
390	Any submission must start with a handshake that pins the currently borrowed instance, if available.
391	An attempt to allocate is then made, but the arbiter can concurrently be attempting to allocate from the same instance from a different \gls{hthrd}.
392	Once the allocation is completed, the submission must check that the instance is still borrowed before attempting to flush.
393	These synchronization steps turn out to have a similar cost to the multiple shared-instances approach.
394	Furthermore, if the number of instances does not match the number of processors actively submitting I/O, the system can fall into a state where instances are constantly being revoked and end up cycling the processors, which leads to significant cache deterioration.
395	For these reasons, this approach, which sounds promising on paper, does not improve on the private instance approach in practice.
396
397	\section{Interface}
398	The final part of the \io subsystem is its interface.
399	Multiple approaches can be offered to programmers, each with advantages and disadvantages.
400	The new \CFA \io subsystem can replace the C runtime API or extend it, and in the latter case, the interface can go from very similar to vastly different.
401	The following sections discuss some useful options, using @read@ as an example.
402	The standard Linux interface for C is:
403	\begin{cfa}
404	ssize_t read(int fd, void *buf, size_t count);
405	\end{cfa}
406
407	\subsection{Replacement}
408	Replacing the C \io subsystem is the more intrusive and draconian approach.
409	The goal is to convince the compiler and linker to replace any calls to @read@ by calls to the \CFA implementation instead of glibc's.
410	This rerouting has the advantage of working transparently and supporting existing binaries without necessarily needing recompilation.
411	It also offers a presumably well known and familiar API that C programmers can simply continue to work with.
412	%However, this approach also entails a plethora of subtle technical challenges, which generally boil down to making a perfect replacement.
413	However, when using this approach, any and all calls to the C \io subsystem must be replaced, since using a mix of the C and \CFA \io subsystems can easily lead to esoteric concurrency bugs.
414	This approach was rejected as being laudable but infeasible.
415
416	\subsection{Synchronous Extension}
417	Another interface option is to offer an interface different in name only.
418	In this approach, an alternative call is created for each supported system call.
419	For example:
420	\begin{cfa}
421	ssize_t cfa_read(int fd, void *buf, size_t count);
422	\end{cfa}
423	The new @cfa_read@ would have the same interface, behaviour, and guarantees as the @read@ system call, but allow the runtime system to use user-level blocking instead of kernel-level blocking.
424
425	This approach is feasible and still familiar to C programmers.
426	It comes with the caveat that any code attempting to use it must be modified, which is a problem considering the number of existing legacy C binaries.
427	However, it has the advantage of implementation simplicity.
428	Finally, there is a certain irony to using a blocking synchronous interface for a feature often referred to as ``non-blocking'' \io.
429
430	\subsection{Asynchronous Extension}
431	A fairly traditional way of providing asynchronous interactions is using a future mechanism~\cite{multilisp}, \eg:
432	\begin{cfa}
433	future(ssize_t) read(int fd, void *buf, size_t count);
434	\end{cfa}
435	where the generic @future@ is fulfilled when the read completes, with the count of bytes actually read, which may be less than the number of bytes requested.
436	The data read is placed in @buf@.
437	The problem is that both the bytes count and data form the synchronization object, not just the bytes read.
438	Hence, the buffer cannot be reused until the operation completes but the synchronization on the future does not enforce this.
439	A classical asynchronous API is:
440	\begin{cfa}
441	future([ssize_t, void *]) read(int fd, size_t count);
442	\end{cfa}
443	where the future tuple covers the components that require synchronization.
444	However, this interface immediately introduces memory lifetime challenges since the call must effectively allocate a buffer to be returned.
445	Because of the performance implications of this API, the first approach is considered preferable as it is more familiar to C programmers.
446
447	\subsection{Direct \lstinline{io_uring} Interface}
448	The last interface directly exposes the underlying @io_uring@ interface, \eg:
449	\begin{cfa}
450	array(SQE, want) cfa_io_allocate(int want);
451	void cfa_io_submit( const array(SQE, have) & );
452	\end{cfa}
453	where the generic @array@ contains an array of SQEs with a size that may be less than the request.
454	This offers more flexibility to users wanting to fully utilize all of the @io_uring@ features.
455	However, it is not the most user-friendly option.
456	It obviously imposes a strong dependency between user code and @io_uring@ but at the same time restricts users to usages that are compatible with how \CFA internally uses @io_uring@.
457
458	As of writing this document, \CFA offers both a synchronous extension and the first approach to the asynchronous extension:
459	\begin{cfa}
460	ssize_t cfa_read(int fd, void *buf, size_t count);
461	future(ssize_t) async_read(int fd, void *buf, size_t count);
462	\end{cfa}
