source: doc/theses/thierry_delisle_PhD/thesis/text/io.tex @ 6a896b0

Last change on this file since 6a896b0 was 847bb6f, checked in by Peter A. Buhr <pabuhr@…>, 22 months ago

proofread chapter text/io.tex, and updates in other chapaters

\chapter{User Level \io}
As mentioned in Section~\ref{prev:io}, user-level \io requires multiplexing the \io operations of many \glspl{thrd} onto fewer \glspl{proc} using asynchronous \io operations.
Different operating systems offer various forms of asynchronous operations and, as mentioned in Chapter~\ref{intro}, this work is exclusively focused on the Linux operating system.
4
5\section{Kernel Interface}
6Since this work fundamentally depends on operating-system support, the first step of this design is to discuss the available interfaces and pick one (or more) as the foundation for the non-blocking \io subsystem in this work.
7
8\subsection{\lstinline{O_NONBLOCK}}
In Linux, files can be opened with the flag @O_NONBLOCK@~\cite{MAN:open} (or @SOCK_NONBLOCK@~\cite{MAN:accept}, the equivalent for sockets) to use the file descriptors in ``nonblocking mode''.
10In this mode, ``Neither the @open()@ nor any subsequent \io operations on the [opened file descriptor] will cause the calling process to wait''~\cite{MAN:open}.
11This feature can be used as the foundation for the non-blocking \io subsystem.
12However, for the subsystem to know when an \io operation completes, @O_NONBLOCK@ must be used in conjunction with a system call that monitors when a file descriptor becomes ready, \ie, the next \io operation on it does not cause the process to wait.\footnote{
13In this context, ready means \emph{some} operation can be performed without blocking.
14It does not mean an operation returning \lstinline{EAGAIN} succeeds on the next try.
For example, a ready read may only return a subset of requested bytes and the read must be issued again for the remaining bytes, at which point it may return \lstinline{EAGAIN}.}
16This mechanism is also crucial in determining when all \glspl{thrd} are blocked and the application \glspl{kthrd} can now block.
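
As a minimal sketch of this behaviour, the following C fragment puts the read end of a pipe in non-blocking mode and then reads from it while it is empty.
The helper names @set_nonblocking@ and @empty_pipe_read_would_block@ are hypothetical and exist only for this illustration; @fcntl@, @pipe@ and @read@ are the standard POSIX calls.

\begin{lstlisting}[language=C]
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Switch an already-open descriptor to non-blocking mode.
   Returns 0 on success and -1 on error. */
int set_nonblocking(int fd) {
	int flags = fcntl(fd, F_GETFL, 0);
	if (flags == -1) return -1;
	return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Read from an empty non-blocking pipe.
   Returns 1 if the read failed immediately with EAGAIN/EWOULDBLOCK
   instead of waiting, and 0 in every other case. */
int empty_pipe_read_would_block(void) {
	int fds[2];
	char byte;
	if (pipe(fds) == -1) return 0;
	if (set_nonblocking(fds[0]) == -1) {
		close(fds[0]); close(fds[1]);
		return 0;
	}
	ssize_t n = read(fds[0], &byte, 1);
	/* errno is inspected before any other call can overwrite it */
	int would_block = (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK));
	close(fds[0]); close(fds[1]);
	return would_block;
}
\end{lstlisting}

The caller learns that the operation could not proceed, but not when to retry, which is why a readiness-monitoring system call is still needed.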
17
There are three options to monitor file descriptors in Linux:\footnote{
For simplicity, this section omits \lstinline{pselect} and \lstinline{ppoll}.
The difference between these system calls and \lstinline{select} and \lstinline{poll}, respectively, is not relevant for this discussion.}
@select@~\cite{MAN:select}, @poll@~\cite{MAN:poll} and @epoll@~\cite{MAN:epoll}.
22All three of these options offer a system call that blocks a \gls{kthrd} until at least one of many file descriptors becomes ready.
23The group of file descriptors being waited on is called the \newterm{interest set}.
24
25\paragraph{\lstinline{select}} is the oldest of these options, and takes as input a contiguous array of bits, where each bit represents a file descriptor of interest.
26Hence, the array length must be as long as the largest FD currently of interest.
27On return, it outputs the set in place to identify which of the file descriptors changed state.
28This destructive change means selecting in a loop requires re-initializing the array for each iteration.
29Another limit of @select@ is that calls from different \glspl{kthrd} sharing FDs are independent.
30Hence, if one \gls{kthrd} is managing the select calls, other threads can only add/remove to/from the manager's interest set through synchronized calls to update the interest set.
31However, these changes are only reflected when the manager makes its next call to @select@.
32Note, it is possible for the manager thread to never unblock if its current interest set never changes, \eg the sockets/pipes/ttys it is waiting on never get data again.
33Often the I/O manager has a timeout, polls, or is sent a signal on changes to mitigate this problem.
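
The destructive update can be seen in the following sketch, where the bit set is rebuilt from the application's own list of FDs on every call.
The function names and the use of pipes are assumptions made for the example, and all FDs are assumed to be smaller than @FD_SETSIZE@.

\begin{lstlisting}[language=C]
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>

/* Wait up to timeout_ms for any of the nfds descriptors to become readable.
   ready[i] is set to 1 if fds[i] is readable, else 0.
   Returns the value returned by select. */
int wait_readable_select(const int *fds, int nfds, int timeout_ms, int *ready) {
	fd_set set;
	int maxfd = -1;
	FD_ZERO(&set);  /* rebuilt on every call: select overwrites the set */
	for (int i = 0; i < nfds; i++) {
		FD_SET(fds[i], &set);
		if (fds[i] > maxfd) maxfd = fds[i];
	}
	struct timeval tv = { .tv_sec = timeout_ms / 1000,
	                      .tv_usec = (timeout_ms % 1000) * 1000 };
	int n = select(maxfd + 1, &set, NULL, NULL, &tv);
	for (int i = 0; i < nfds; i++)
		ready[i] = (n > 0) && FD_ISSET(fds[i], &set);
	return n;
}

/* Two pipes, only the second has data: returns 1 if select reports exactly that. */
int demo_select(void) {
	int a[2], b[2], ready[2];
	if (pipe(a) == -1) return 0;
	if (pipe(b) == -1) { close(a[0]); close(a[1]); return 0; }
	int ok = write(b[1], "x", 1) == 1;
	int fds[2] = { a[0], b[0] };
	int n = wait_readable_select(fds, 2, 0, ready);
	ok = ok && n == 1 && ready[0] == 0 && ready[1] == 1;
	close(a[0]); close(a[1]); close(b[0]); close(b[1]);
	return ok;
}
\end{lstlisting}

Note the first argument to @select@ is the largest FD plus one, which is the cost mentioned above for widely spaced FDs.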
34
35\begin{comment}
36From: Tim Brecht <brecht@uwaterloo.ca>
37Subject: Re: FD sets
38Date: Wed, 6 Jul 2022 00:29:41 +0000
39
40Large number of open files
41--------------------------
42
43In order to be able to use more than the default number of open file
44descriptors you may need to:
45
46o increase the limit on the total number of open files /proc/sys/fs/file-max
47  (on Linux systems)
48
49o increase the size of FD_SETSIZE
50  - the way I often do this is to figure out which include file __FD_SETSIZE
51    is defined in, copy that file into an appropriate directory in ./include,
52    and then modify it so that if you use -DBIGGER_FD_SETSIZE the larger size
53    gets used
54
55  For example on a RH 9.0 distribution I've copied
56  /usr/include/bits/typesizes.h into ./include/i386-linux/bits/typesizes.h
57
58  Then I modify typesizes.h to look something like:
59
60  #ifdef BIGGER_FD_SETSIZE
61  #define __FD_SETSIZE            32767
62  #else
63  #define __FD_SETSIZE            1024
64  #endif
65
66  Note that the since I'm moving and testing the userver on may different
67  machines the Makefiles are set up to use -I ./include/$(HOSTTYPE)
68
69  This way if you redefine the FD_SETSIZE it will get used instead of the
70  default original file.
71\end{comment}
72
73\paragraph{\lstinline{poll}} is the next oldest option, and takes as input an array of structures containing the FD numbers rather than their position in an array of bits, allowing a more compact input for interest sets that contain widely spaced FDs.
74(For small interest sets with densely packed FDs, the @select@ bit mask can take less storage, and hence, copy less information into the kernel.)
Furthermore, @poll@ is non-destructive, so the array of structures does not have to be re-initialized on every call.
Like @select@, @poll@ suffers from the limitation that the interest set cannot be changed by other \glspl{kthrd} while a manager thread is blocked in @poll@.
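
A small sketch of the non-destructive interface follows: the same @pollfd@ array is passed to two consecutive calls, and only the output field @revents@ changes.
The function name @demo_poll@ and the pipe used as the event source are assumptions for the example.

\begin{lstlisting}[language=C]
#include <poll.h>
#include <unistd.h>

/* Poll the read end of a pipe before and after writing to it,
   reusing the same interest array. Returns 1 if the results are as expected. */
int demo_poll(void) {
	int fds[2];
	if (pipe(fds) == -1) return 0;
	struct pollfd interest[1] = {
		{ .fd = fds[0], .events = POLLIN, .revents = 0 }
	};
	int before = poll(interest, 1, 0);  /* empty pipe: nothing is ready */
	int wrote = write(fds[1], "x", 1) == 1;
	int after = poll(interest, 1, 0);   /* array reused without re-initialization */
	int ok = wrote && before == 0 && after == 1
		&& (interest[0].revents & POLLIN)  /* output field */
		&& interest[0].events == POLLIN;   /* input field is untouched */
	close(fds[0]); close(fds[1]);
	return ok;
}
\end{lstlisting}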
77
78\paragraph{\lstinline{epoll}} follows after @poll@, and places the interest set in the kernel rather than the application, where it is managed by an internal \gls{kthrd}.
79There are two separate functions: one to add to the interest set and another to check for FDs with state changes.
80This dynamic capability is accomplished by creating an \emph{epoll instance} with a persistent interest set, which is used across multiple calls.
81As the interest set is augmented, the changes become implicitly part of the interest set for a blocked manager \gls{kthrd}.
82This capability significantly reduces synchronization between \glspl{kthrd} and the manager calling @epoll@.
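
The split between the two functions can be sketched as follows, using @epoll_ctl@ to augment the persistent interest set and @epoll_wait@ to check for state changes.
For brevity, this single-threaded example does not show another \gls{kthrd} adding FDs while the manager is blocked; the function name @demo_epoll@ is hypothetical.

\begin{lstlisting}[language=C]
#include <sys/epoll.h>
#include <unistd.h>

/* Register the read end of a pipe once, then wait twice on the same
   epoll instance. Returns 1 if the results are as expected. */
int demo_epoll(void) {
	int fds[2];
	if (pipe(fds) == -1) return 0;
	int ep = epoll_create1(0);  /* epoll instance holding the interest set */
	if (ep == -1) { close(fds[0]); close(fds[1]); return 0; }

	/* First function: add an FD to the kernel-side interest set. */
	struct epoll_event ev = { .events = EPOLLIN, .data = { .fd = fds[0] } };
	int ok = epoll_ctl(ep, EPOLL_CTL_ADD, fds[0], &ev) == 0;

	/* Second function: check for FDs with state changes. */
	struct epoll_event out[4];
	int before = epoll_wait(ep, out, 4, 0);  /* nothing written yet */
	ok = ok && write(fds[1], "x", 1) == 1;
	int after = epoll_wait(ep, out, 4, 0);   /* interest set persisted */
	ok = ok && before == 0 && after == 1 && out[0].data.fd == fds[0];

	close(ep); close(fds[0]); close(fds[1]);
	return ok;
}
\end{lstlisting}

Unlike @select@ and @poll@, only the ready FDs are returned, so the cost of examining the result depends on the number of events rather than the size of the interest set.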
83
84However, all three of these I/O systems have limitations.
85The @man@ page for @O_NONBLOCK@ mentions that ``[@O_NONBLOCK@] has no effect for regular files and block devices'', which means none of these three system calls are viable multiplexing strategies for these types of \io operations.
86Furthermore, @epoll@ has been shown to have problems with pipes and ttys~\cit{Peter's examples in some fashion}.
87Finally, none of these are useful solutions for multiplexing \io operations that do not have a corresponding file descriptor and can be awkward for operations using multiple file descriptors.
88
89\subsection{POSIX asynchronous I/O (AIO)}
90An alternative to @O_NONBLOCK@ is the AIO interface.
91Its interface lets programmers enqueue operations to be performed asynchronously by the kernel.
Completions of these operations can be communicated in various ways: either by spawning a new \gls{kthrd}, sending a Linux signal, or by polling for completion of one or more operations.
For this work, spawning a new \gls{kthrd} is counter-productive but a related solution is discussed in Section~\ref{io:morethreads}.
Using interrupt handlers can also lead to fairly complicated interactions between subsystems and has non-trivial cost.
This leaves polling for completion, which is similar to the previous system calls.
Although AIO only supports read and write operations on file descriptors, it does not have the same limitation as @O_NONBLOCK@, \ie, the file descriptors can be regular files and block devices.
97It also supports batching multiple operations in a single system call.
98
99AIO offers two different approaches to polling: @aio_error@ can be used as a spinning form of polling, returning @EINPROGRESS@ until the operation is completed, and @aio_suspend@ can be used similarly to @select@, @poll@ or @epoll@, to wait until one or more requests have completed.
100For the purpose of \io multiplexing, @aio_suspend@ is the best interface.
101However, even if AIO requests can be submitted concurrently, @aio_suspend@ suffers from the same limitation as @select@ and @poll@, \ie, the interest set cannot be dynamically changed while a call to @aio_suspend@ is in progress.
AIO also suffers from the limitation of not specifying which requests have completed, \ie programmers have to poll each request in the interest set using @aio_error@ to identify the completed requests.
103This limitation means that, like @select@ and @poll@ but not @epoll@, the time needed to examine polling results increases based on the total number of requests monitored, not the number of completed requests.
104Finally, AIO does not seem to be a popular interface, which I believe is due in part to this poor polling interface.
105Linus Torvalds talks about this interface as follows:
106
107\begin{displayquote}
108        AIO is a horrible ad-hoc design, with the main excuse being ``other,
109        less gifted people, made that design, and we are implementing it for
110        compatibility because database people - who seldom have any shred of
111        taste - actually use it''.
112
113        But AIO was always really really ugly.
114
115        \begin{flushright}
116                -- Linus Torvalds~\cite{AIORant}
117        \end{flushright}
118\end{displayquote}
119
120Interestingly, in this e-mail, Linus goes on to describe
121``a true \textit{asynchronous system call} interface''
122that does
123``[an] arbitrary system call X with arguments A, B, C, D asynchronously using a kernel thread''
124in
125``some kind of arbitrary \textit{queue up asynchronous system call} model''.
126This description is actually quite close to the interface described in the next section.
127
128\subsection{\lstinline{io_uring}}
129A very recent addition to Linux, @io_uring@~\cite{MAN:io_uring}, is a framework that aims to solve many of the problems listed in the above interfaces.
130Like AIO, it represents \io operations as entries added to a queue.
But like @epoll@, new requests can be submitted while a blocking call waiting for requests to complete is already in progress.
132The @io_uring@ interface uses two ring buffers (referred to simply as rings) at its core: a submit ring to which programmers push \io requests and a completion ring from which programmers poll for completion.
133
134One of the big advantages over the prior interfaces is that @io_uring@ also supports a much wider range of operations.
135In addition to supporting reads and writes to any file descriptor like AIO, it supports other operations like @open@, @close@, @fsync@, @accept@, @connect@, @send@, @recv@, @splice@, \etc.
136
137On top of these, @io_uring@ adds many extras like avoiding copies between the kernel and user-space using shared memory, allowing different mechanisms to communicate with device drivers, and supporting chains of requests, \ie, requests that automatically trigger followup requests on completion.
138
139\subsection{Extra Kernel Threads}\label{io:morethreads}
140Finally, if the operating system does not offer a satisfactory form of asynchronous \io operations, an ad-hoc solution is to create a pool of \glspl{kthrd} and delegate operations to it to avoid blocking \glspl{proc}, which is a compromise for multiplexing.
141In the worst case, where all \glspl{thrd} are consistently blocking on \io, it devolves into 1-to-1 threading.
142However, regardless of the frequency of \io operations, it achieves the fundamental goal of not blocking \glspl{proc} when \glspl{thrd} are ready to run.
143This approach is used by languages like Go\cit{Go}, frameworks like libuv\cit{libuv}, and web servers like Apache~\cite{apache} and Nginx~\cite{nginx}, since it has the advantage that it can easily be used across multiple operating systems.
144This advantage is especially relevant for languages like Go, which offer a homogeneous \glsxtrshort{api} across all platforms.
In contrast, C has a very limited standard \glsxtrshort{api} for \io, \eg, the C standard library has no networking.
146
147\subsection{Discussion}
148These options effectively fall into two broad camps: waiting for \io to be ready versus waiting for \io to complete.
149All operating systems that support asynchronous \io must offer an interface along one of these lines, but the details vary drastically.
For example, FreeBSD offers @kqueue@~\cite{MAN:bsd/kqueue}, which behaves similarly to @epoll@, but with some small quality-of-use improvements, while Windows (Win32)~\cit{https://docs.microsoft.com/en-us/windows/win32/fileio/synchronous-and-asynchronous-i-o} offers ``overlapped I/O'', which handles submissions similarly to @O_NONBLOCK@ with extra flags on the synchronous system call, but waits for completion events, similarly to @io_uring@.
151
For this project, I selected @io_uring@, in large part because of its generality.
While @epoll@ has been shown to be a good solution for socket \io~\cite{DBLP:journals/pomacs/KarstenB20}, @io_uring@'s transparent support for files, pipes, and more complex operations, like @splice@ and @tee@, makes it a better choice as the foundation for a general \io subsystem.
154
155\section{Event-Engine}
156An event engine's responsibility is to use the kernel interface to multiplex many \io operations onto few \glspl{kthrd}.
157In concrete terms, this means \glspl{thrd} enter the engine through an interface, the event engine then starts an operation and parks the calling \glspl{thrd}, returning control to the \gls{proc}.
158The parked \glspl{thrd} are then rescheduled by the event engine once the desired operation has completed.
159
160\subsection{\lstinline{io_uring} in depth}
161Before going into details on the design of my event engine, more details on @io_uring@ usage are presented, each important in the design of the engine.
162Figure~\ref{fig:iouring} shows an overview of an @io_uring@ instance.
163Two ring buffers are used to communicate with the kernel: one for submissions~(left) and one for completions~(right).
164The submission ring contains entries, \newterm{Submit Queue Entries} (SQE), produced (appended) by the application when an operation starts and then consumed by the kernel.
165The completion ring contains entries, \newterm{Completion Queue Entries} (CQE), produced (appended) by the kernel when an operation completes and then consumed by the application.
166The submission ring contains indexes into the SQE array (denoted \emph{S} in the figure) containing entries describing the I/O operation to start;
167the completion ring contains entries for the completed I/O operation.
168Multiple @io_uring@ instances can be created, in which case they each have a copy of the data structures in the figure.
169
170\begin{figure}
171        \centering
172        \input{io_uring.pstex_t}
        \caption[Overview of \lstinline{io_uring}]{Overview of \lstinline{io_uring} \smallskip\newline Two ring buffers are used to communicate with the kernel, one for completions~(right) and one for submissions~(left). The submission ring indexes into a pre-allocated array (denoted \emph{S}) instead.}
174        \label{fig:iouring}
175\end{figure}
176
177New \io operations are submitted to the kernel following 4 steps, which use the components shown in the figure.
178\begin{enumerate}
179\item
180An SQE is allocated from the pre-allocated array \emph{S}.
181This array is created at the same time as the @io_uring@ instance, is in kernel-locked memory visible by both the kernel and the application, and has a fixed size determined at creation.
182How these entries are allocated is not important for the functioning of @io_uring@;
183the only requirement is that no entry is reused before the kernel has consumed it.
184\item
185The SQE is filled according to the desired operation.
This step is straightforward.
187The only detail worth mentioning is that SQEs have a @user_data@ field that must be filled in order to match submission and completion entries.
188\item
189The SQE is submitted to the submission ring by appending the index of the SQE to the ring following regular ring buffer steps: \lstinline{buffer[head] = item; head++}.
190Since the head is visible to the kernel, some memory barriers may be required to prevent the compiler from reordering these operations.
191Since the submission ring is a regular ring buffer, more than one SQE can be added at once and the head is updated only after all entries are updated.
Note, SQEs can be filled and submitted in any order, \eg in Figure~\ref{fig:iouring} the submission order is S0, S3, S2 and S1 has not been submitted.
193\item
194The kernel is notified of the change to the ring using the system call @io_uring_enter@.
195The number of elements appended to the submission ring is passed as a parameter and the number of elements consumed is returned.
196The @io_uring@ instance can be constructed so this step is not required, but this requires elevated privilege.% and an early version of @io_uring@ had additional restrictions.
197\end{enumerate}
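
The steps above can be sketched with a user-space model that involves no kernel at all.
All names (@model_sq@, @model_push@, @model_consume@) are hypothetical, a single producer is assumed, and @model_consume@ merely stands in for the kernel side of step 4.
The producer index is called @tail@ here, which is the name used by the @io_uring@ ABI for the index the text calls the head.

\begin{lstlisting}[language=C]
#include <stdatomic.h>
#include <stdint.h>

#define ENTRIES 4u  /* power of two, so an index wraps with a mask */

struct model_sqe { uint8_t opcode; int fd; uint64_t user_data; };

struct model_sq {
	struct model_sqe sqes[ENTRIES];  /* array S, fixed size */
	uint32_t ring[ENTRIES];          /* submission ring of indexes into S */
	_Atomic uint32_t tail;           /* producer index, written by the application */
	_Atomic uint32_t head;           /* consumer index, written by the "kernel" */
};

void model_init(struct model_sq *sq) {
	atomic_init(&sq->tail, 0);
	atomic_init(&sq->head, 0);
}

/* Steps 2 and 3: fill SQE idx, then append its index to the ring.
   Returns 0 on success and -1 if the ring is full. */
int model_push(struct model_sq *sq, uint32_t idx, uint8_t opcode, int fd,
               uint64_t user_data) {
	uint32_t tail = atomic_load_explicit(&sq->tail, memory_order_relaxed);
	uint32_t head = atomic_load_explicit(&sq->head, memory_order_acquire);
	if (tail - head == ENTRIES) return -1;
	sq->sqes[idx] = (struct model_sqe){ .opcode = opcode, .fd = fd,
	                                    .user_data = user_data };
	sq->ring[tail & (ENTRIES - 1)] = idx;
	/* release: the entry is fully written before the new index is visible */
	atomic_store_explicit(&sq->tail, tail + 1, memory_order_release);
	return 0;
}

/* Stand-in for the kernel: consume entries in ring order and record user_data. */
uint32_t model_consume(struct model_sq *sq, uint64_t *seen, uint32_t max) {
	uint32_t head = atomic_load_explicit(&sq->head, memory_order_relaxed);
	uint32_t tail = atomic_load_explicit(&sq->tail, memory_order_acquire);
	uint32_t n = 0;
	while (head != tail && n < max) {
		seen[n++] = sq->sqes[sq->ring[head & (ENTRIES - 1)]].user_data;
		head++;
	}
	atomic_store_explicit(&sq->head, head, memory_order_release);
	return n;
}

/* Submit S0, S3, S2 as in the figure; returns 1 if consumed in that order. */
int demo_submission_order(void) {
	struct model_sq sq;
	uint64_t seen[ENTRIES];
	model_init(&sq);
	model_push(&sq, 0, 1, 3, 100);
	model_push(&sq, 3, 1, 3, 103);
	model_push(&sq, 2, 1, 3, 102);
	uint32_t n = model_consume(&sq, seen, ENTRIES);
	return n == 3 && seen[0] == 100 && seen[1] == 103 && seen[2] == 102;
}
\end{lstlisting}

Because the ring stores indexes rather than the SQEs themselves, the order of consumption is the order of the pushes, independent of where the entries sit in the array.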
198
199\begin{sloppypar}
200The completion side is simpler: applications call @io_uring_enter@ with the flag @IORING_ENTER_GETEVENTS@ to wait on a desired number of operations to complete.
201The same call can be used to both submit SQEs and wait for operations to complete.
202When operations do complete, the kernel appends a CQE to the completion ring and advances the head of the ring.
203Each CQE contains the result of the operation as well as a copy of the @user_data@ field of the SQE that triggered the operation.
204It is not necessary to call @io_uring_enter@ to get new events because the kernel can directly modify the completion ring.
205The system call is only needed if the application wants to block waiting for operations to complete.
206\end{sloppypar}
207
208The @io_uring_enter@ system call is protected by a lock inside the kernel.
This protection means that concurrent calls to @io_uring_enter@ using the same instance are possible, but there is no performance gained from parallel calls to @io_uring_enter@.
210It is possible to do the first three submission steps in parallel;
211however, doing so requires careful synchronization.
212
213@io_uring@ also introduces constraints on the number of simultaneous operations that can be ``in flight''.
214First, SQEs are allocated from a fixed-size array, meaning that there is a hard limit to how many SQEs can be submitted at once.
Second, the @io_uring_enter@ system call can fail because ``The kernel [...] ran out of resources to handle [a request]'' or ``The application is attempting to overcommit the number of requests it can have pending''.
216This restriction means \io request bursts may have to be subdivided and submitted in chunks at a later time.
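
One way to subdivide a burst is sketched below.
The callback type @submit_fn@ is a hypothetical stand-in for the system call: it returns the number of entries accepted, or -1 with @errno@ set, where @EAGAIN@ and @EBUSY@ are assumed to be the two transient failures quoted above.
The fake kernel with a fixed budget exists only to exercise the loop.

\begin{lstlisting}[language=C]
#include <errno.h>
#include <stddef.h>

/* Hypothetical hook: hand count entries to the kernel. */
typedef int (*submit_fn)(void *ctx, size_t count);

/* Submit up to pending entries in chunks of at most chunk entries.
   Stops at the first failure and returns how many entries were accepted,
   so the caller can retry the remainder after some completions. */
size_t submit_in_chunks(submit_fn submit, void *ctx, size_t pending, size_t chunk) {
	size_t done = 0;
	while (done < pending) {
		size_t want = pending - done < chunk ? pending - done : chunk;
		int got = submit(ctx, want);
		if (got <= 0) break;  /* e.g. EAGAIN or EBUSY: try again later */
		done += (size_t)got;
	}
	return done;
}

/* Fake kernel that accepts entries until its budget is exhausted. */
static int fake_submit(void *ctx, size_t count) {
	size_t *budget = ctx;
	if (*budget == 0) { errno = EBUSY; return -1; }
	size_t taken = count < *budget ? count : *budget;
	*budget -= taken;
	return (int)taken;
}

/* Burst of 10 entries against a budget of 6, then a retry of the rest. */
int demo_chunks(void) {
	size_t budget = 6;
	size_t first = submit_in_chunks(fake_submit, &budget, 10, 4);
	budget = 10;  /* pretend some operations completed */
	size_t second = submit_in_chunks(fake_submit, &budget, 10 - first, 4);
	return first == 6 && second == 4;
}
\end{lstlisting}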
217
218\subsection{Multiplexing \io: Submission}
219
220The submission side is the most complicated aspect of @io_uring@ and the completion side effectively follows from the design decisions made in the submission side.
221While there is freedom in designing the submission side, there are some realities of @io_uring@ that must be taken into account.
222It is possible to do the first steps of submission in parallel;
223however, the duration of the system call scales with the number of entries submitted.
224The consequence is that the amount of parallelism used to prepare submissions for the next system call is limited.
225Beyond this limit, the length of the system call is the throughput limiting factor.
226I concluded from early experiments that preparing submissions seems to take almost as long as the system call itself, which means that with a single @io_uring@ instance, there is no benefit in terms of \io throughput to having more than two \glspl{hthrd}.
227Therefore, the design of the submission engine must manage multiple instances of @io_uring@ running in parallel, effectively sharding @io_uring@ instances.
228Since completions are sent to the instance where requests were submitted, all instances with pending operations must be polled continuously\footnote{
229As described in Chapter~\ref{practice}, this does not translate into constant CPU usage.}.
230Note that once an operation completes, there is nothing that ties it to the @io_uring@ instance that handled it.
There is nothing preventing a new operation with, \eg, the same file descriptors from being submitted to a different @io_uring@ instance.
232
233A complicating aspect of submission is @io_uring@'s support for chains of operations, where the completion of an operation triggers the submission of the next operation on the link.
234SQEs forming a chain must be allocated from the same instance and must be contiguous in the Submission Ring (see Figure~\ref{fig:iouring}).
235The consequence of this feature is that filling SQEs can be arbitrarily complex, and therefore, users may need to run arbitrary code between allocation and submission.
236Supporting chains is not a requirement of the \io subsystem, but it is still valuable.
237Support for this feature can be fulfilled simply by supporting arbitrary user code between allocation and submission.
238
Similar to scheduling, sharding @io_uring@ instances can be done privately, \ie, one instance per \gls{proc}, in decoupled pools, \ie, a pool of \glspl{proc} uses a pool of @io_uring@ instances without one-to-one coupling between any given instance and any given \gls{proc}, or some mix of the two.
240These three sharding approaches are analyzed.
241
242\subsubsection{Private Instances}
243The private approach creates one ring instance per \gls{proc}, \ie one-to-one coupling.
244This alleviates the need for synchronization on the submissions, requiring only that \glspl{thrd} are not time-sliced during submission steps.
245This requirement is the same as accessing @thread_local@ variables, where a \gls{thrd} is accessing kernel-thread data, is time-sliced, and continues execution on another kernel thread but is now accessing the wrong data.
246This failure is the serially reusable problem~\cite{SeriallyReusable}.
247Hence, allocated SQEs must be submitted to the same ring on the same \gls{proc}, which effectively forces the application to submit SQEs in allocation order.\footnote{
248To remove this requirement, a \gls{thrd} needs the ability to ``yield to a specific \gls{proc}'', \ie, park with the guarantee it unparks on a specific \gls{proc}, \ie the \gls{proc} attached to the correct ring.}
249From the subsystem's point of view, the allocation and submission are sequential, greatly simplifying both.
250In this design, allocation and submission form a partitioned ring buffer as shown in Figure~\ref{fig:pring}.
251Once added to the ring buffer, the attached \gls{proc} has a significant amount of flexibility with regards to when to perform the system call.
252Possible options are: when the \gls{proc} runs out of \glspl{thrd} to run, after running a given number of \glspl{thrd}, etc.
253
254\begin{figure}
255        \centering
256        \input{pivot_ring.pstex_t}
        \caption[Partitioned ring buffer]{Partitioned ring buffer \smallskip\newline Allocated SQEs are appended to the first partition.
258        When submitting, the partition is advanced.
259        The kernel considers the partition as the head of the ring.}
260        \label{fig:pring}
261\end{figure}
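
The partitioned ring buffer can be sketched with three monotonically increasing indexes, under my reading of Figure~\ref{fig:pring}.
The names are hypothetical, and no atomic operations appear because the sketch assumes a single \gls{proc} owns the ring; in a real instance the partition index is also read by the kernel.

\begin{lstlisting}[language=C]
#include <stdint.h>

#define SLOTS 4u  /* power of two */

struct pring {
	uint32_t consumed;   /* everything before this was consumed by the kernel */
	uint32_t submitted;  /* the partition: what the kernel sees as the end */
	uint32_t allocated;  /* end of entries allocated but not yet submitted */
};

/* Allocation appends after the partition. Returns a slot index or -1 if full. */
int pring_alloc(struct pring *r) {
	if (r->allocated - r->consumed == SLOTS) return -1;
	return (int)(r->allocated++ & (SLOTS - 1));
}

/* Submission advances the partition over all allocated entries,
   hence entries are submitted in allocation order. Returns the count. */
uint32_t pring_submit(struct pring *r) {
	uint32_t n = r->allocated - r->submitted;
	r->submitted = r->allocated;
	return n;
}

/* Stand-in for the kernel consuming n submitted entries. */
void pring_consume(struct pring *r, uint32_t n) { r->consumed += n; }

int demo_pring(void) {
	struct pring r = { 0, 0, 0 };
	int ok = pring_alloc(&r) == 0 && pring_alloc(&r) == 1 && pring_alloc(&r) == 2;
	ok = ok && pring_submit(&r) == 3;
	ok = ok && pring_alloc(&r) == 3;
	ok = ok && pring_alloc(&r) == -1;  /* all four slots are in use */
	pring_consume(&r, 2);
	ok = ok && pring_alloc(&r) == 0;   /* slot 0 is free again */
	ok = ok && pring_submit(&r) == 2;
	return ok;
}
\end{lstlisting}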
262
263This approach has the advantage that it does not require much of the synchronization needed in a shared approach.
264However, this benefit means \glspl{thrd} submitting \io operations have less flexibility: they cannot park or yield, and several exceptional cases are handled poorly.
265Instances running out of SQEs cannot run \glspl{thrd} wanting to do \io operations.
266In this case, the \io \gls{thrd} needs to be moved to a different \gls{proc}, and the only current way of achieving this is to @yield()@ hoping to be scheduled on a different \gls{proc} with free SQEs, which is not guaranteed.
267
268A more involved version of this approach tries to solve these problems using a pattern called \newterm{helping}.
269\Glspl{thrd} that cannot submit \io operations, either because of an allocation failure or migration to a different \gls{proc} between allocation and submission, create an \io object and add it to a list of pending submissions per \gls{proc} and a list of pending allocations, probably per cluster.
270While there is still the strong coupling between \glspl{proc} and @io_uring@ instances, these data structures allow moving \glspl{thrd} to a specific \gls{proc}, when the current \gls{proc} cannot fulfill the \io request.
271
272Imagine a simple scenario with two \glspl{thrd} on two \glspl{proc}, where one \gls{thrd} submits an \io operation and then sets a flag, while the other \gls{thrd} spins until the flag is set.
273Assume both \glspl{thrd} are running on the same \gls{proc}, and the \io \gls{thrd} is preempted between allocation and submission, moved to the second \gls{proc}, and the original \gls{proc} starts running the spinning \gls{thrd}.
274In this case, the helping solution has the \io \gls{thrd} append an \io object to the submission list of the first \gls{proc}, where the allocation was made.
275No other \gls{proc} can help the \gls{thrd} since @io_uring@ instances are strongly coupled to \glspl{proc}.
276However, the \io \gls{proc} is unable to help because it is executing the spinning \gls{thrd} resulting in a deadlock.
277While this example is artificial, in the presence of many \glspl{thrd}, it is possible for this problem to arise ``in the wild''.
278Furthermore, this pattern is difficult to reliably detect and avoid.
Once in this situation, the only escape is to interrupt the spinning \gls{thrd}, either directly or via some regular preemption (\eg time slicing).
280Having to interrupt \glspl{thrd} for this purpose is costly, the latency can be large between interrupts, and the situation may be hard to detect.
281% However, a more important reason why interrupting the \gls{thrd} is not a satisfying solution is that the \gls{proc} is using the instance it is tied to.
282% If it were to use it, then helping could be done as part of the usage.
283Interrupts are needed here entirely because the \gls{proc} is tied to an instance it is not using.
284Therefore, a more satisfying solution is for the \gls{thrd} submitting the operation to notice that the instance is unused and simply go ahead and use it.
285This approach is presented shortly.
286
287\subsubsection{Public Instances}
288The public approach creates decoupled pools of @io_uring@ instances and processors, \ie without one-to-one coupling.
289\Glspl{thrd} attempting an \io operation pick one of the available instances and submit the operation to that instance.
290Since there is no coupling between @io_uring@ instances and \glspl{proc} in this approach, \glspl{thrd} running on more than one \gls{proc} can attempt to submit to the same instance concurrently.
291Because @io_uring@ effectively sets the amount of sharding needed to avoid contention on its internal locks, performance in this approach is based on two aspects:
292\begin{itemize}
293\item
294The synchronization needed to submit does not induce more contention than @io_uring@ already does.
295\item
296The scheme to route \io requests to specific @io_uring@ instances does not introduce contention.
297This aspect has an oversized importance because it comes into play before the sharding of instances, and as such, all \glspl{hthrd} can contend on the routing algorithm.
298\end{itemize}
299
300Allocation in this scheme is fairly easy.
301Free SQEs, \ie, SQEs that are not currently being used to represent a request, can be written to safely and have a field called @user_data@ that the kernel only reads to copy to @cqe@s.
302Allocation also requires no ordering guarantee as all free SQEs are interchangeable.
303% This requires a simple concurrent bag.
304The only added complexity is that the number of SQEs is fixed, which means allocation can fail.
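
A minimal sketch of such an allocator is a bit mask of free SQEs manipulated with atomic operations.
This is an illustration, not the allocator used in this work: the names are hypothetical, the pool is limited to 64 entries, and @__builtin_ctzll@ is a GCC/Clang extension.

\begin{lstlisting}[language=C]
#include <stdatomic.h>
#include <stdint.h>

struct sqe_pool { _Atomic uint64_t free_mask; };  /* bit i set: SQE i is free */

/* Mark the first n SQEs (n at most 64) as free. */
void pool_init(struct sqe_pool *p, unsigned n) {
	atomic_init(&p->free_mask, n >= 64 ? ~0ull : ((1ull << n) - 1));
}

/* Take any free SQE; all free entries are interchangeable.
   Returns its index, or -1 if the pool is exhausted. */
int pool_alloc(struct sqe_pool *p) {
	uint64_t mask = atomic_load_explicit(&p->free_mask, memory_order_relaxed);
	while (mask != 0) {
		int idx = __builtin_ctzll(mask);  /* lowest set bit */
		uint64_t bit = 1ull << idx;
		if (atomic_compare_exchange_weak_explicit(&p->free_mask, &mask,
				mask & ~bit, memory_order_acquire, memory_order_relaxed))
			return idx;
		/* on failure, mask holds the current value and the loop retries */
	}
	return -1;  /* allocation can fail: the number of SQEs is fixed */
}

/* Return an SQE to the pool once the kernel has consumed it. */
void pool_free(struct sqe_pool *p, int idx) {
	atomic_fetch_or_explicit(&p->free_mask, 1ull << idx, memory_order_release);
}

int demo_pool(void) {
	struct sqe_pool p;
	pool_init(&p, 2);
	int ok = pool_alloc(&p) == 0 && pool_alloc(&p) == 1;
	ok = ok && pool_alloc(&p) == -1;  /* exhausted */
	pool_free(&p, 0);
	ok = ok && pool_alloc(&p) == 0;   /* reusable after being freed */
	return ok;
}
\end{lstlisting}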
305
306Allocation failures need to be pushed to a routing algorithm: \glspl{thrd} attempting \io operations must not be directed to @io_uring@ instances without sufficient SQEs available.
307Furthermore, the routing algorithm should block operations up-front, if none of the instances have available SQEs.
308
309Once an SQE is allocated, \glspl{thrd} insert the \io request information, and keep track of the SQE index and the instance it belongs to.
310
311Once an SQE is filled in, it is added to the submission ring buffer, an operation that is not thread-safe, and then the kernel must be notified using the @io_uring_enter@ system call.
312The submission ring buffer is the same size as the pre-allocated SQE buffer, therefore pushing to the ring buffer cannot fail because it is invalid to have the same \lstinline{sqe} multiple times in a ring buffer.
313However, as mentioned, the system call itself can fail with the expectation that it can be retried once some submitted operations complete.
314
315Since multiple SQEs can be submitted to the kernel at once, it is important to strike a balance between batching and latency.
Operations that are ready to be submitted should be batched together in few system calls, but at the same time, operations should not be left pending for long periods of time before being submitted.
Balancing submission can be handled by either designating one of the submitting \glspl{thrd} as being responsible for the system call for the current batch of SQEs or by having some other party regularly submitting all ready SQEs, \eg, the poller \gls{thrd} mentioned later in this section.
318
319Ideally, when multiple \glspl{thrd} attempt to submit operations to the same @io_uring@ instance, all requests should be batched together and one of the \glspl{thrd} is designated to do the system call on behalf of the others, called the \newterm{submitter}.
However, in practice, \io requests must be handled promptly so there is a need to guarantee everything missed by the current submitter is seen by the next one.
321Indeed, as long as there is a ``next'' submitter, \glspl{thrd} submitting new \io requests can move on, knowing that some future system call includes their request.
Once the system call is done, the submitter must also free SQEs so that the allocator can reuse them.
323
324Finally, the completion side is much simpler since the @io_uring@ system-call enforces a natural synchronization point.
325Polling simply needs to regularly do the system call, go through the produced CQEs and communicate the result back to the originating \glspl{thrd}.
Since CQEs only own a signed 32-bit result, in addition to the copy of the @user_data@ field, all that is needed to communicate the result is a simple future~\cite{wiki:future}.
327If the submission side does not designate submitters, polling can also submit all SQEs as it is polling events.
328A simple approach to polling is to allocate a \gls{thrd} per @io_uring@ instance and simply let the poller \glspl{thrd} poll their respective instances when scheduled.
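
The future-based hand-off can be sketched as follows, assuming the submitting \gls{thrd} stores the address of its future in @user_data@.
The types @io_future@ and @model_cqe@ are hypothetical simplifications, and a real runtime would unpark the waiting \gls{thrd} where this sketch only sets a flag.

\begin{lstlisting}[language=C]
#include <stdatomic.h>
#include <stdint.h>

struct io_future { _Atomic int done; int32_t result; };

/* Simplified CQE: the copied user_data and the signed 32-bit result. */
struct model_cqe { uint64_t user_data; int32_t res; };

void future_init(struct io_future *f) { atomic_init(&f->done, 0); f->result = 0; }

/* Submission side: the value to place in the user_data field of the SQE. */
uint64_t future_to_user_data(struct io_future *f) { return (uint64_t)(uintptr_t)f; }

/* Poller side: fulfil the future named by each CQE. Returns the number processed. */
unsigned drain_completions(const struct model_cqe *cqes, unsigned count) {
	for (unsigned i = 0; i < count; i++) {
		struct io_future *f = (struct io_future *)(uintptr_t)cqes[i].user_data;
		f->result = cqes[i].res;
		/* release: the result is written before the flag becomes visible */
		atomic_store_explicit(&f->done, 1, memory_order_release);
	}
	return count;
}

/* Waiting side: returns 1 and stores the result if the operation completed. */
int future_poll(struct io_future *f, int32_t *out) {
	if (!atomic_load_explicit(&f->done, memory_order_acquire)) return 0;
	*out = f->result;
	return 1;
}

/* Two operations completing out of order; a negative result models an error. */
int demo_future(void) {
	struct io_future a, b;
	int32_t ra = 0, rb = 0;
	future_init(&a); future_init(&b);
	struct model_cqe cqes[2] = {
		{ .user_data = future_to_user_data(&b), .res = -11 },
		{ .user_data = future_to_user_data(&a), .res = 42 },
	};
	int ok = future_poll(&a, &ra) == 0;  /* nothing has completed yet */
	ok = ok && drain_completions(cqes, 2) == 2;
	ok = ok && future_poll(&a, &ra) == 1 && ra == 42;
	ok = ok && future_poll(&b, &rb) == 1 && rb == -11;
	return ok;
}
\end{lstlisting}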
329
The big advantage of the pool-of-instances approach is that it is fairly flexible.
331It does not impose restrictions on what \glspl{thrd} submitting \io operations can and cannot do between allocations and submissions.
332It also can gracefully handle running out of resources, SQEs or the kernel returning @EBUSY@.
333The down side to this approach is that many of the steps used for submitting need complex synchronization to work properly.
334The routing and allocation algorithm needs to keep track of which ring instances have available SQEs, block incoming requests if no instance is available, prevent barging if \glspl{thrd} are already queued up waiting for SQEs and handle SQEs being freed.
335The submission side needs to safely append SQEs to the ring buffer, correctly handle chains, make sure no SQE is dropped or left pending forever, notify the allocation side when SQEs can be reused, and handle the kernel returning @EBUSY@.
336All this synchronization has a significant cost, and compared to the private-instance approach, this synchronization is entirely overhead.

\subsubsection{Instance borrowing}
Both of the prior approaches have undesirable aspects that stem from tight or loose coupling between @io_uring@ and \glspl{proc}.
The first approach suffers from tight coupling causing problems when a \gls{proc} does not benefit from the coupling.
The second approach suffers from loose coupling causing operations to have synchronization overhead, which tighter coupling avoids.
When \glspl{proc} are continuously issuing \io operations, tight coupling is valuable since it avoids synchronization costs.
However, in unlikely failure cases or when \glspl{proc} are not using their instances, tight coupling is no longer advantageous.
A compromise between these approaches is to allow tight coupling but have the option to revoke the coupling dynamically when failure cases arise.
I call this approach \newterm{instance borrowing}.\footnote{
While instance borrowing looks similar to work sharing and stealing, I think it is different enough to warrant a different verb to avoid confusion.}

In this approach, each cluster (see Figure~\ref{fig:system}) owns a pool of @io_uring@ instances managed by an \newterm{arbiter}.
When a \gls{thrd} attempts to issue an \io operation, it asks for an instance from the arbiter and issues requests to that instance.
This instance is now bound to the \gls{proc} the \gls{thrd} is running on.
This binding is kept until the arbiter decides to revoke it, taking back the instance and reverting the \gls{proc} to its initial state with respect to \io.
This tight coupling means that synchronization can be minimal since only one \gls{proc} can use the instance at a time, akin to the private instances approach.
However, it differs in that revocation by the arbiter (an interrupt) means this approach does not suffer from the deadlock scenario described above.

Arbitration is needed in the following cases:
\begin{enumerate}
	\item The current \gls{proc} does not hold an instance.
	\item The current instance does not have sufficient SQEs to satisfy the request.
	\item The current \gls{proc} has the wrong instance; this happens if the submitting \gls{thrd} context-switched between allocation and submission, called \newterm{external submissions}.
\end{enumerate}
However, even when the arbiter is not directly needed, \glspl{proc} need to make sure that their instance ownership is not being revoked, which is accomplished by a lock-\emph{less} handshake.\footnote{
Note the handshake is not lock \emph{free} since it lacks the proper progress guarantee.}
A \gls{proc} raises a local flag before using its borrowed instance and checks if the instance is marked as revoked or if the arbiter has raised its flag.
If not, it proceeds, otherwise it delegates the operation to the arbiter.
Once the operation is completed, the \gls{proc} lowers its local flag.

Correspondingly, before revoking an instance, the arbiter marks the instance and then waits for the \gls{proc} using it to lower its local flag.
Only then does it reclaim the instance and potentially assign it to another \gls{proc}.
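
The two sides of the handshake can be sketched with C11 sequentially-consistent atomics.
This is a simplified illustration, not the actual implementation: the names are hypothetical, and the instance's revoked mark and the arbiter's own flag are collapsed into a single @revoked@ flag.
Each side writes its own flag before reading the other's, so at least one of them observes the conflict.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical handshake state attached to a borrowed instance. */
struct instance {
	atomic_bool in_use;  /* raised by the borrowing processor */
	atomic_bool revoked; /* raised by the arbiter before reclaiming */
};

void instance_init(struct instance *inst) {
	assert(inst != NULL);
	atomic_init(&inst->in_use, false);
	atomic_init(&inst->revoked, false);
}

/* Processor side: raise the local flag, then check for revocation.
   Returns true if the instance may be used; the caller must then call
   proc_release(). Returns false if the operation must go to the arbiter. */
bool proc_try_acquire(struct instance *inst) {
	atomic_store(&inst->in_use, true);
	if (atomic_load(&inst->revoked)) {
		atomic_store(&inst->in_use, false);
		return false;
	}
	return true;
}

/* Processor side: lower the local flag once the operation is done. */
void proc_release(struct instance *inst) {
	atomic_store(&inst->in_use, false);
}

/* Arbiter side: mark the instance as revoked. */
void arbiter_mark_revoked(struct instance *inst) {
	atomic_store(&inst->revoked, true);
}

/* Arbiter side: true once the processor has lowered its flag.
   A real arbiter would spin or block on this condition. */
bool arbiter_can_reclaim(struct instance *inst) {
	return atomic_load(&inst->revoked) && !atomic_load(&inst->in_use);
}
```

The arbiter waiting on @arbiter_can_reclaim@ has no bound on how long the \gls{proc} keeps its flag raised, which is why such a handshake is lock-less but not lock-free.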

The arbiter maintains four lists around which it makes its decisions:
\begin{enumerate}
	\item A list of pending submissions.
	\item A list of pending allocations.
	\item A list of instances currently borrowed by \glspl{proc}.
	\item A list of instances currently available.
\end{enumerate}

\paragraph{External Submissions} are handled by the arbiter by revoking the appropriate instance and adding the submission to the submission ring.
However, there is no need to immediately revoke the instance.
External submissions must simply be added to the ring before the next system call, \ie, when the submission ring is flushed.
This means whoever is responsible for the system call first checks if the instance has any external submissions.
If so, it asks the arbiter to revoke the instance and add the external submissions to the ring.

\paragraph{Pending Allocations} are handled by the arbiter when it has available instances and can directly hand over the instance and satisfy the request.
Otherwise, it must hold onto the list of threads until SQEs are made available again.
This handling is more complex when an allocation requires multiple SQEs, since the arbiter must decide between satisfying requests in FIFO order or satisfying requests for fewer SQEs first.
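
The trade-off between the two orderings can be made concrete with a small sketch.
The functions @pick_fifo@ and @pick_first_fit@ are hypothetical and only model the decision itself: given the number of SQEs wanted by each queued request, oldest first, and the number of SQEs currently available, each returns the index of the request to satisfy or -1 if none is chosen.

```c
#include <assert.h>
#include <stddef.h>

/* Strict FIFO: only the oldest request is considered, so a large request
   at the head delays every request queued behind it. */
int pick_fifo(const unsigned *want, size_t n, unsigned avail) {
	assert(want != NULL || n == 0);
	return (n > 0 && want[0] <= avail) ? 0 : -1;
}

/* First fit: the oldest request that fits is served, so small requests can
   overtake a large one, which may then wait indefinitely. */
int pick_first_fit(const unsigned *want, size_t n, unsigned avail) {
	assert(want != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (want[i] <= avail) return (int)i;
	}
	return -1;
}
```

With requests for 4, 1 and 2 SQEs queued and 2 SQEs available, strict FIFO serves nothing while first fit serves the second request, at the cost of possibly starving the first.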

While an arbiter has the potential to solve many of the problems mentioned above, it also introduces a significant amount of complexity.
Tracking which processors are borrowing which instances and which instances have SQEs available ends up adding a significant synchronization prelude to any \io operation.
Any submission must start with a handshake that pins the currently borrowed instance, if available.
An attempt to allocate is then made, but the arbiter can concurrently be attempting to allocate from the same instance from a different \gls{hthrd}.
Once the allocation is completed, the submission must check that the instance is still borrowed before attempting to flush.
These synchronization steps turn out to have a similar cost to the multiple shared-instances approach.
Furthermore, if the number of instances does not match the number of processors actively submitting \io, the system can fall into a state where instances are constantly being revoked and end up cycling the processors, which leads to significant cache deterioration.
For these reasons, this approach, which sounds promising on paper, does not improve on the private instance approach in practice.

\subsubsection{Private Instances V2}

% Verbs of this design

% Allocation: obtaining an sqe from which to fill in the io request, enforces the io instance to use since it must be the one which provided the sqe. Must interact with the arbiter if the instance does not have enough sqe for the allocation. (Typical allocation will ask for only one sqe, but chained sqe must be allocated from the same context so chains of sqe must be allocated in bulks)

% Submission: simply adds the sqe(s) to some data structure to communicate that they are ready to go. This operation can't fail because there are as many spots in the submit buffer than there are sqes. Must interact with the arbiter only if the thread was moved between the allocation and the submission.

% Flushing: Taking all the sqes that were submitted and making them visible to the kernel, also counting them in order to figure out what to_submit should be. Must be thread-safe with submission. Has to interact with the Arbiter if there are external submissions. Can't simply use a protected queue because adding to the array is not safe if the ring is still available for submitters. Flushing must therefore: check if there are external pending requests if so, ask the arbiter to flush otherwise use the fast flush operation.

% Collect: Once the system call is done, it returns how many sqes were consumed by the system. These must be freed for allocation. Must interact with the arbiter to notify that things are now ready.

% Handle: process all the produced cqe. No need to interact with any of the submission operations or the arbiter.


% alloc():
%	proc.io->in_use = true, __ATOMIC_ACQUIRE
%	if cltr.io.flag || !proc.io || proc.io->flag:
%		return alloc_slow(cltr.io, proc.io)

%	a = alloc_fast(proc.io)
%	if a:
%		proc.io->in_use = false, __ATOMIC_RELEASE
%		return a

%	return alloc_slow(cltr.io)

% alloc_fast()
%	left = proc.io->submit_q.free.tail - proc.io->submit_q.free.head
%	if num_entries - left < want:
%		return None

%	a = ready[head]
%	head = head + 1, __ATOMIC_RELEASE

% alloc_slow()
%	cltr.io.flag = true, __ATOMIC_ACQUIRE
%	while(proc.io && proc.io->in_use) pause;



% submit(a):
%	proc.io->in_use = true, __ATOMIC_ACQUIRE
%	if cltr.io.flag || proc.io != alloc.io || proc.io->flag:
%		return submit_slow(cltr.io)

%	submit_fast(proc.io, a)
%	proc.io->in_use = false, __ATOMIC_RELEASE

% polling()
%	loop:
%		yield
%		flush()
%		io_uring_enter
%		collect
%		handle()

\section{Interface}

The last important part of the \io subsystem is its interface.
There are multiple approaches that can be offered to programmers, each with advantages and disadvantages.
The new \io subsystem can replace the C runtime API or extend it, and in the latter case, the interface can go from very similar to vastly different.
The following sections discuss some useful options using @read@ as an example.
The standard Linux interface for C is:
\begin{lstlisting}
ssize_t read(int fd, void *buf, size_t count);
\end{lstlisting}

\subsection{Replacement}
Replacing the C \glsxtrshort{api} is the more intrusive and draconian approach.
The goal is to convince the compiler and linker to redirect any calls to @read@ to the \CFA implementation instead of glibc's.
This rerouting has the advantage of working transparently and supporting existing binaries without needing recompilation.
It also offers a presumably well-known and familiar API that C programmers can simply continue to work with.
However, this approach also entails a plethora of subtle technical challenges, which generally boil down to making a perfect replacement.
If the \CFA interface replaces only \emph{some} of the calls to glibc, then this can easily lead to esoteric concurrency bugs.
Since the gcc ecosystem does not offer a scheme for perfect replacement, this approach was rejected as being laudable but infeasible.

\subsection{Synchronous Extension}
Another interface option is to offer an interface different in name only.
For example:
\begin{lstlisting}
ssize_t cfa_read(int fd, void *buf, size_t count);
\end{lstlisting}
This approach is feasible and still familiar to C programmers.
It comes with the caveat that any code attempting to use it must be recompiled, which is a problem considering the number of existing legacy C binaries.
However, it has the advantage of implementation simplicity.
Finally, there is a certain irony to using a blocking synchronous interface for a feature often referred to as ``non-blocking'' \io.

\subsection{Asynchronous Extension}
A fairly traditional way of providing asynchronous interactions is using a future mechanism~\cite{multilisp}, \eg:
\begin{lstlisting}
future(ssize_t) read(int fd, void *buf, size_t count);
\end{lstlisting}
where the generic @future@ is fulfilled when the read completes and it contains the number of bytes read, which may be less than the number of bytes requested.
The data read is placed in @buf@.
The problem is that both the bytes read and the data form the synchronization object, not just the bytes read.
Hence, the buffer cannot be reused until the operation completes, but the synchronization does not cover the buffer.
A classical asynchronous API is:
\begin{lstlisting}
future([ssize_t, void *]) read(int fd, size_t count);
\end{lstlisting}
where the future tuple covers the components that require synchronization.
However, this interface immediately introduces memory lifetime challenges since the call must effectively allocate a buffer to be returned.
Because of the performance implications of this API, the first approach is considered preferable as it is more familiar to C programmers.

\subsection{Direct \lstinline{io_uring} Interface}
The last interface directly exposes the underlying @io_uring@ interface, \eg:
\begin{lstlisting}
array(SQE, want) cfa_io_allocate(int want);
void cfa_io_submit( const array(SQE, have) & );
\end{lstlisting}
where the generic @array@ contains an array of SQEs with a size that may be less than the request.
This offers more flexibility to users wanting to fully utilize all of the @io_uring@ features.
However, it is not the most user-friendly option.
It obviously imposes a strong dependency between user code and @io_uring@ while at the same time restricting users to usages that are compatible with how \CFA internally uses @io_uring@.