source: doc/theses/thierry_delisle_PhD/thesis/text/io.tex @ 2a859b5

Last change on this file since 2a859b5 was 3112733, checked in by Thierry Delisle <tdelisle@…>, 2 years ago
Filled in all of Chapter 4. It's not great but it's worth discussing

Rev	Line
[d4a4b17]	1	\chapter{User Level \io}
[f1bce515]	2	As mentioned in Section~\ref{prev:io}, User-Level \io requires multiplexing the \io operations of many \glspl{thrd} onto fewer \glspl{proc} using asynchronous \io operations.
[c5af4f9]	3	Different operating systems offer various forms of asynchronous operations and, as mentioned in Chapter~\ref{intro}, this work is exclusively focused on the Linux operating-system.
[86c1f1c3]	4
[c292244]	5	\section{Kernel Interface}
[c6640a3]	6	Since this work fundamentally depends on operating-system support, the first step of any design is to discuss the available interfaces and pick one (or more) as the foundations of the non-blocking \io subsystem.
[86c1f1c3]	7
[c6640a3]	8	\subsection{\lstinline{O_NONBLOCK}}
[f1bce515]	9	In Linux, files can be opened with the flag @O_NONBLOCK@~\cite{MAN:open} (or @SOCK_NONBLOCK@~\cite{MAN:accept}, the equivalent for sockets) to use the file descriptors in ``nonblocking mode''.
	10	In this mode, ``Neither the @open()@ nor any subsequent \io operations on the [opened file descriptor] will cause the calling process to wait''~\cite{MAN:open}.
	11	This feature can be used as the foundation for the non-blocking \io subsystem.
	12	However, for the subsystem to know when an \io operation completes, @O_NONBLOCK@ must be used in conjunction with a system call that monitors when a file descriptor becomes ready, \ie, the next \io operation on it does not cause the process to wait
	13	\footnote{In this context, ready means \emph{some} operation can be performed without blocking.
	14	It does not mean an operation returning \lstinline{EAGAIN} succeeds on the next try.
	15	For example, a ready read may only return a subset of bytes and the read must be issued again for the remaining bytes, at which point it may return \lstinline{EAGAIN}.}.
[c6640a3]	16	This mechanism is also crucial in determining when all \glspl{thrd} are blocked and the application \glspl{kthrd} can now block.
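
As a minimal sketch of this pairing, assuming a pipe as the file descriptor and \lstinline{poll} as the monitoring call, the hypothetical function \lstinline{nonblock_then_poll} below first observes \lstinline{EAGAIN} on an empty non-blocking pipe and then waits for readiness before reading again.
It is only an illustration of the mechanism, not the subsystem's implementation.
\begin{lstlisting}[language=C]
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical demo: returns 1 when a non-blocking read on an empty pipe
 * fails with EAGAIN and poll() later reports the descriptor as readable. */
int nonblock_then_poll(void) {
    int fds[2];
    char byte = 0;
    if (pipe(fds) != 0) return 0;

    /* Put the read end in non-blocking mode. */
    int flags = fcntl(fds[0], F_GETFL, 0);
    int ok = flags != -1 && fcntl(fds[0], F_SETFL, flags | O_NONBLOCK) != -1;

    /* The pipe is empty: read() does not wait, it fails with EAGAIN. */
    ok = ok && read(fds[0], &byte, 1) == -1
            && (errno == EAGAIN || errno == EWOULDBLOCK);

    /* Make the read end ready, then let poll() report it (1 s timeout). */
    ok = ok && write(fds[1], "x", 1) == 1;
    struct pollfd pfd = { .fd = fds[0], .events = POLLIN, .revents = 0 };
    ok = ok && poll(&pfd, 1, 1000) == 1 && (pfd.revents & POLLIN) != 0;

    /* Ready only promises that this one read does not block. */
    ok = ok && read(fds[0], &byte, 1) == 1 && byte == 'x';

    close(fds[0]);
    close(fds[1]);
    return ok;
}
\end{lstlisting}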
[86c1f1c3]	17
[f1bce515]	18	There are three options to monitor file descriptors in Linux
	19	\footnote{For simplicity, this section omits \lstinline{pselect} and \lstinline{ppoll}.
	20	The difference between these system calls and \lstinline{select} and \lstinline{poll}, respectively, is not relevant for this discussion.}:
	21	@select@~\cite{MAN:select}, @poll@~\cite{MAN:poll} and @epoll@~\cite{MAN:epoll}.
	22	All three of these options offer a system call that blocks a \gls{kthrd} until at least one of many file descriptors becomes ready.
	23	The group of file descriptors being waited on is called the \newterm{interest set}.
	24
	25	\paragraph{\lstinline{select}} is the oldest of these options. It takes as input a contiguous array of bits, where each bit represents a file descriptor of interest.
	26	On return, it modifies the set in place to identify which of the file descriptors changed status.
	27	This destructive change means that calling @select@ in a loop requires re-initializing the array each time. In addition, the number of file descriptors supported has a hard limit.
	28	Another limit of @select@ is that once the call is started, the interest set can no longer be modified.
	29	Monitoring a new file descriptor generally requires aborting any in-progress call to @select@
	30	\footnote{Starting a new call to \lstinline{select} is possible but requires a distinct kernel thread, and as a result is not an acceptable multiplexing solution when the interest set is large and highly dynamic unless the number of parallel calls to \lstinline{select} can be strictly bounded.}.
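
The destructive update can be observed with a small sketch, assuming two pipes of which only one has data; the function name \lstinline{select_rewrites_interest_set} is hypothetical.
After the call returns, the descriptor that is not ready has been erased from the input set, which is why the set must be rebuilt before every call.
\begin{lstlisting}[language=C]
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical demo: returns 1 when select() removed the non-ready
 * descriptor from the set that was passed in. */
int select_rewrites_interest_set(void) {
    int a[2], b[2];
    if (pipe(a) != 0) return 0;
    if (pipe(b) != 0) { close(a[0]); close(a[1]); return 0; }

    int ok = write(a[1], "x", 1) == 1;   /* only pipe a becomes readable */

    /* Build the interest set: one bit per file descriptor. */
    fd_set interest;
    FD_ZERO(&interest);
    FD_SET(a[0], &interest);
    FD_SET(b[0], &interest);
    int maxfd = a[0] > b[0] ? a[0] : b[0];

    struct timeval timeout = { .tv_sec = 1, .tv_usec = 0 };
    ok = ok && select(maxfd + 1, &interest, NULL, NULL, &timeout) == 1;

    /* The input set now only holds the ready descriptor: b[0] is gone
     * and has to be added again before the next call. */
    ok = ok && FD_ISSET(a[0], &interest) && !FD_ISSET(b[0], &interest);

    close(a[0]); close(a[1]);
    close(b[0]); close(b[1]);
    return ok;
}
\end{lstlisting}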
	31
	32	\paragraph{\lstinline{poll}} is an improvement over @select@, which removes the hard limit on the number of file descriptors and the need to re-initialize the input on every call.
	33	It works using an array of structures as an input rather than an array of bits, thus allowing a more compact input for small interest sets.
	34	Like @select@, @poll@ suffers from the limitation that the interest set cannot be changed while the call is blocked.
	35
	36	\paragraph{\lstinline{epoll}} further improves on these two functions by allowing file descriptors to be dynamically added to and removed from the interest set while a \gls{kthrd} is blocked on an @epoll@ call.
	37	This dynamic capability is accomplished by creating an \emph{epoll instance} with a persistent interest set, which is used across multiple calls.
	38	This capability significantly reduces synchronization overhead on the part of the caller (in this case the \io subsystem), since the interest set can be modified when adding or removing file descriptors without having to synchronize with other \glspl{kthrd} potentially calling @epoll@.
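
The persistent interest set can be illustrated with a single-threaded sketch, assuming two pipes and the default level-triggered mode; the function name \lstinline{epoll_keeps_interest_set} is hypothetical.
The sketch does not show a concurrent modification while another \gls{kthrd} is blocked; it only shows that a descriptor registered once stays registered across calls and that the set can grow between calls without being rebuilt.
\begin{lstlisting}[language=C]
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical demo: returns 1 when the interest set of an epoll instance
 * persists across calls and can be extended between them. */
int epoll_keeps_interest_set(void) {
    int p[2], q[2];
    if (pipe(p) != 0) return 0;
    if (pipe(q) != 0) { close(p[0]); close(p[1]); return 0; }
    int ep = epoll_create1(0);
    if (ep == -1) {
        close(p[0]); close(p[1]); close(q[0]); close(q[1]);
        return 0;
    }

    /* Register p[0] once; the registration lives in the epoll instance. */
    struct epoll_event ev = { .events = EPOLLIN, .data = { .fd = p[0] } };
    int ok = epoll_ctl(ep, EPOLL_CTL_ADD, p[0], &ev) == 0;
    ok = ok && write(p[1], "x", 1) == 1;

    struct epoll_event out[4];
    ok = ok && epoll_wait(ep, out, 4, 1000) == 1 && out[0].data.fd == p[0];

    /* Extend the interest set; p[0] is not registered again. */
    ev.data.fd = q[0];
    ok = ok && epoll_ctl(ep, EPOLL_CTL_ADD, q[0], &ev) == 0;
    ok = ok && write(q[1], "y", 1) == 1;

    /* Level-triggered: p[0] is still unread, so both are reported. */
    ok = ok && epoll_wait(ep, out, 4, 1000) == 2;

    close(ep);
    close(p[0]); close(p[1]);
    close(q[0]); close(q[1]);
    return ok;
}
\end{lstlisting}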
	39
	40	However, all three of these system calls have limitations.
	41	The @man@ page for @O_NONBLOCK@ mentions that ``[@O_NONBLOCK@] has no effect for regular files and block devices'', which means none of these three system calls are viable multiplexing strategies for these types of \io operations.
	42	Furthermore, @epoll@ has been shown to have problems with pipes and ttys~\cit{Peter's examples in some fashion}.
	43	Finally, none of these are useful solutions for multiplexing \io operations that do not have a corresponding file descriptor and can be awkward for operations using multiple file descriptors.
[c292244]	44
[c6640a3]	45	\subsection{POSIX asynchronous I/O (AIO)}
[f1bce515]	46	An alternative to @O_NONBLOCK@ is the AIO interface.
	47	Its interface lets programmers enqueue operations to be performed asynchronously by the kernel.
	48	Completions of these operations can be communicated in various ways: either by spawning a new \gls{kthrd}, sending a Linux signal, or by polling for completion of one or more operations.
	49	For this work, spawning a new \gls{kthrd} is counter-productive but a related solution is discussed in Section~\ref{io:morethreads}.
	50	Using signal handlers can also lead to fairly complicated interactions between subsystems and has a non-trivial cost.
	51	This leaves polling for completion, which is similar to the previous system calls.
	52	Although AIO only supports read and write operations on file descriptors, it does not have the same limitation as @O_NONBLOCK@, \ie, the file descriptors can be regular files and block devices.
	53	It also supports batching multiple operations in a single system call.
	54
	55	AIO offers two different approaches to polling: @aio_error@ can be used as a spinning form of polling, returning @EINPROGRESS@ until the operation is completed, and @aio_suspend@ can be used similarly to @select@, @poll@ or @epoll@, to wait until one or more requests have completed.
	56	For the purpose of \io multiplexing, @aio_suspend@ is the best interface.
	57	However, even if AIO requests can be submitted concurrently, @aio_suspend@ suffers from the same limitation as @select@ and @poll@, \ie, the interest set cannot be dynamically changed while a call to @aio_suspend@ is in progress.
	58	AIO also suffers from the limitation of not specifying which requests have completed, \ie, programmers have to poll each request in the interest set using @aio_error@ to identify the completed requests.
	59	This limitation means that, like @select@ and @poll@ but not @epoll@, the time needed to examine polling results increases based on the total number of requests monitored, not the number of completed requests.
	60	Finally, AIO does not seem to be a popular interface, which I believe is due in part to this poor polling interface.
	61	Linus Torvalds talks about this interface as follows:
[b9537e6]	62
	63	\begin{displayquote}
[c6640a3]	64	AIO is a horrible ad-hoc design, with the main excuse being ``other,
[b9537e6]	65	less gifted people, made that design, and we are implementing it for
	66	compatibility because database people - who seldom have any shred of
[c6640a3]	67	taste - actually use it''.
[b9537e6]	68
	69	But AIO was always really really ugly.
	70
	71	\begin{flushright}
	72	-- Linus Torvalds\cit{https://lwn.net/Articles/671657/}
	73	\end{flushright}
	74	\end{displayquote}
	75
[c6640a3]	76	Interestingly, in this e-mail, Linus goes on to describe
[b9537e6]	77	``a true \textit{asynchronous system call} interface''
	78	that does
	79	``[an] arbitrary system call X with arguments A, B, C, D asynchronously using a kernel thread''
	80	in
	81	``some kind of arbitrary \textit{queue up asynchronous system call} model''.
[c292244]	82	This description is actually quite close to the interface described in the next section.
	83
[c6640a3]	84	\subsection{\lstinline{io_uring}}
[f1bce515]	85	A very recent addition to Linux, @io_uring@~\cite{MAN:io_uring}, is a framework that aims to solve many of the problems listed in the above interfaces.
	86	Like AIO, it represents \io operations as entries added to a queue.
	87	But like @epoll@, new requests can be submitted while a blocking call waiting for requests to complete is already in progress.
	88	The @io_uring@ interface uses two ring buffers (referred to simply as rings) at its core: a submit ring to which programmers push \io requests and a completion ring from which programmers poll for completion.
[b9537e6]	89
[f1bce515]	90	One of the big advantages over the prior interfaces is that @io_uring@ also supports a much wider range of operations.
	91	In addition to supporting reads and writes to any file descriptor like AIO, it supports other operations like @open@, @close@, @fsync@, @accept@, @connect@, @send@, @recv@, @splice@, \etc.
[c292244]	92
[c6640a3]	93	On top of these, @io_uring@ adds many extras like avoiding copies between the kernel and user-space using shared memory, allowing different mechanisms to communicate with device drivers, and supporting chains of requests, \ie, requests that automatically trigger followup requests on completion.
[86c1f1c3]	94
[b9537e6]	95	\subsection{Extra Kernel Threads}\label{io:morethreads}
[f1bce515]	96	Finally, if the operating system does not offer a satisfactory form of asynchronous \io operations, an ad-hoc solution is to create a pool of \glspl{kthrd} and delegate operations to it to avoid blocking \glspl{proc}, which is a compromise for multiplexing.
	97	In the worst case, where all \glspl{thrd} are consistently blocking on \io, it devolves into 1-to-1 threading.
	98	However, regardless of the frequency of \io operations, it achieves the fundamental goal of not blocking \glspl{proc} when \glspl{thrd} are ready to run.
	99	This approach is used by languages like Go\cit{Go} and frameworks like libuv\cit{libuv}, since it has the advantage that it can easily be used across multiple operating systems.
	100	This advantage is especially relevant for languages like Go, which offer a homogeneous \glsxtrshort{api} across all platforms,
	101	as opposed to C, which has a very limited standard \glsxtrshort{api} for \io, \eg, the C standard library has no networking.
[86c1f1c3]	102
	103	\subsection{Discussion}
[f1bce515]	104	These options effectively fall into two broad camps: waiting for \io to be ready versus waiting for \io to complete.
	105	All operating systems that support asynchronous \io must offer an interface along one of these lines, but the details vary drastically.
	106	For example, FreeBSD offers @kqueue@~\cite{MAN:bsd/kqueue}, which behaves similarly to @epoll@, but with some small quality-of-use improvements, while Windows (Win32)~\cit{https://docs.microsoft.com/en-us/windows/win32/fileio/synchronous-and-asynchronous-i-o} offers ``overlapped I/O'', which handles submissions similarly to @O_NONBLOCK@ with extra flags on the synchronous system call, but waits for completion events, similarly to @io_uring@.
[86c1f1c3]	107
[f1bce515]	108	For this project, I selected @io_uring@, in large part because of its generality.
	109	While @epoll@ has been shown to be a good solution for socket \io~\cite{DBLP:journals/pomacs/KarstenB20}, @io_uring@'s transparent support for files, pipes, and more complex operations, like @splice@ and @tee@, makes it a better choice as the foundation for a general \io subsystem.
[86c1f1c3]	110
	111	\section{Event-Engine}
[f1bce515]	112	An event engine's responsibility is to use the kernel interface to multiplex many \io operations onto few \glspl{kthrd}.
	113	In concrete terms, this means \glspl{thrd} enter the engine through an interface; the event engine then starts the operation and parks the calling \gls{thrd}, returning control to the \gls{proc}.
	114	The parked \glspl{thrd} are then rescheduled by the event engine once the desired operation has completed.
[c6640a3]	115
	116	\subsection{\lstinline{io_uring} in depth}
	117	Before going into details on the design of my event engine, more details on @io_uring@ usage are presented, each important in the design of the engine.
	118	Figure~\ref{fig:iouring} shows an overview of an @io_uring@ instance.
	119	Two ring buffers are used to communicate with the kernel: one for submissions~(left) and one for completions~(right).
	120	The submission ring contains entries, \newterm{Submit Queue Entries} (SQE), produced (appended) by the application when an operation starts and then consumed by the kernel.
	121	The completion ring contains entries, \newterm{Completion Queue Entries} (CQE), produced (appended) by the kernel when an operation completes and then consumed by the application.
[f1bce515]	122	More precisely, the submission ring contains indexes into the SQE array (denoted \emph{S} in the figure), whose entries describe the I/O operations to start;
[c6640a3]	123	the completion ring directly contains the entries for the completed I/O operations.
	124	Multiple @io_uring@ instances can be created, in which case they each have a copy of the data structures in the figure.
[c292244]	125
	126	\begin{figure}
	127	\centering
	128	\input{io_uring.pstex_t}
[f1bce515]	129	\caption[Overview of \lstinline{io_uring}]{Overview of \lstinline{io_uring} \smallskip\newline Two ring buffers are used to communicate with the kernel, one for completions~(right) and one for submissions~(left). The submission ring contains indexes into a pre-allocated array (denoted \emph{S}) rather than the entries themselves.}
[c292244]	130	\label{fig:iouring}
	131	\end{figure}
	132
[c6640a3]	133	New \io operations are submitted to the kernel following 4 steps, which use the components shown in the figure.
	134	\begin{enumerate}
	135	\item
[f1bce515]	136	An SQE is allocated from the pre-allocated array (denoted \emph{S} in Figure~\ref{fig:iouring}).
	137	This array is created at the same time as the @io_uring@ instance, is in kernel-locked memory visible by both the kernel and the application, and has a fixed size determined at creation.
	138	How these entries are allocated is not important for the functioning of @io_uring@, the only requirement is that no entry is reused before the kernel has consumed it.
[c6640a3]	139	\item
[f1bce515]	140	The SQE is filled according to the desired operation.
	141	This step is straightforward; the only detail worth mentioning is that SQEs have a @user_data@ field that must be filled in order to match submission and completion entries.
[c6640a3]	142	\item
[f1bce515]	143	The SQE is submitted to the submission ring by appending the index of the SQE to the ring following regular ring buffer steps: \lstinline{buffer[head] = item; head++}.
	144	Since the head is visible to the kernel, some memory barriers may be required to prevent the compiler from reordering these operations.
	145	Since the submission ring is a regular ring buffer, more than one SQE can be added at once and the head is updated only after all entries are updated.
[c6640a3]	146	\item
[f1bce515]	147	The kernel is notified of the change to the ring using the system call @io_uring_enter@.
	148	The number of elements appended to the submission ring is passed as a parameter and the number of elements consumed is returned.
	149	The @io_uring@ instance can be constructed so this step is not required, but this requires elevated privilege.% and an early version of @io_uring@ had additional restrictions.
[c6640a3]	150	\end{enumerate}
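
Steps 2 and 3 can be sketched in isolation as follows.
This is a simulation under stated assumptions, not the real interface: \lstinline{toy_sqe} and \lstinline{toy_submit_ring} are hypothetical, heavily reduced stand-ins for the structures shared with the kernel, the SQE index is assumed to come from step 1, and step 4 (the system call) is omitted.
The sketch keeps the naming used in the steps above and calls the producer index \lstinline{head}; the \lstinline{io_uring} interface itself names the index advanced by the application on the submission ring the tail.
\begin{lstlisting}[language=C]
#include <stdatomic.h>
#include <stdint.h>

#define SQ_ENTRIES 8u                 /* power of two, fixed at creation */
#define SQ_MASK (SQ_ENTRIES - 1u)

/* Hypothetical reduced SQE; the real structure has many more fields. */
struct toy_sqe { uint8_t opcode; int32_t fd; uint64_t user_data; };

struct toy_submit_ring {
    struct toy_sqe sqes[SQ_ENTRIES];  /* array S */
    uint32_t ring[SQ_ENTRIES];        /* submission ring: indexes into S */
    _Atomic uint32_t head;            /* producer index (io_uring: "tail") */
};

/* Steps 2 and 3 for an SQE index obtained in step 1. */
void toy_submit(struct toy_submit_ring *r, uint32_t idx,
                uint8_t opcode, int32_t fd, uint64_t user_data) {
    struct toy_sqe *sqe = &r->sqes[idx];
    sqe->opcode = opcode;             /* step 2: fill the SQE */
    sqe->fd = fd;
    sqe->user_data = user_data;

    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    r->ring[head & SQ_MASK] = idx;    /* step 3: buffer[head] = item */
    /* Release ordering: the entry is written before the new head
     * becomes visible to the consumer. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
}

/* Demo: submits two SQEs and returns 1 if the ring holds them in order. */
int toy_submit_demo(void) {
    struct toy_submit_ring r;
    atomic_init(&r.head, 0);
    toy_submit(&r, 5, 1, 3, 0xA);
    toy_submit(&r, 2, 1, 4, 0xB);
    return atomic_load(&r.head) == 2 && r.ring[0] == 5 && r.ring[1] == 2
        && r.sqes[5].user_data == 0xA && r.sqes[2].user_data == 0xB;
}
\end{lstlisting}
The release store plays the role of the memory barrier mentioned in step 3: it prevents both the compiler and the processor from publishing the new index before the entry itself.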
[c292244]	151
[c6640a3]	152	\begin{sloppypar}
[f1bce515]	153	The completion side is simpler: applications call @io_uring_enter@ with the flag @IORING_ENTER_GETEVENTS@ to wait on a desired number of operations to complete.
	154	The same call can be used to both submit SQEs and wait for operations to complete.
	155	When operations do complete, the kernel appends a CQE to the completion ring and advances the head of the ring.
	156	Each CQE contains the result of the operation as well as a copy of the @user_data@ field of the SQE that triggered the operation.
	157	It is not necessary to call @io_uring_enter@ to get new events because the kernel can directly modify the completion ring.
	158	The system call is only needed if the application wants to block waiting for operations to complete.
[c6640a3]	159	\end{sloppypar}
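
The system-call-free path can be sketched as a simulation, assuming a hypothetical reduced \lstinline{toy_cqe} and a function \lstinline{toy_kernel_post} that stands in for the kernel.
As in the paragraph above, the index advanced by the producer (here the kernel) is called \lstinline{head}; the \lstinline{io_uring} interface names it the tail of the completion ring.
\begin{lstlisting}[language=C]
#include <stdatomic.h>
#include <stdint.h>

#define CQ_ENTRIES 8u
#define CQ_MASK (CQ_ENTRIES - 1u)

/* Hypothetical reduced CQE: the result and the copied user_data. */
struct toy_cqe { uint64_t user_data; int32_t res; };

struct toy_completion_ring {
    struct toy_cqe cqes[CQ_ENTRIES];
    _Atomic uint32_t head;   /* producer index, advanced by the kernel */
    _Atomic uint32_t tail;   /* consumer index, advanced by the application */
};

/* Stand-in for the kernel completing an operation. */
void toy_kernel_post(struct toy_completion_ring *r,
                     uint64_t user_data, int32_t res) {
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    r->cqes[head & CQ_MASK] =
        (struct toy_cqe){ .user_data = user_data, .res = res };
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
}

/* Consumes every available CQE without a system call; returns the count. */
unsigned toy_drain(struct toy_completion_ring *r,
                   void (*handle)(uint64_t user_data, int32_t res)) {
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    unsigned count = 0;
    for (; tail != head; tail++, count++) {
        const struct toy_cqe *cqe = &r->cqes[tail & CQ_MASK];
        handle(cqe->user_data, cqe->res);
    }
    /* Hand the consumed slots back to the producer. */
    atomic_store_explicit(&r->tail, tail, memory_order_release);
    return count;
}

static int64_t toy_sum;
static void toy_add(uint64_t user_data, int32_t res) {
    (void)user_data;
    toy_sum += res;
}

/* Demo: two completions are drained once, then the ring is empty. */
int toy_drain_demo(void) {
    struct toy_completion_ring r;
    atomic_init(&r.head, 0);
    atomic_init(&r.tail, 0);
    toy_sum = 0;
    toy_kernel_post(&r, 1, 100);
    toy_kernel_post(&r, 2, -11);   /* failures are negative results */
    return toy_drain(&r, toy_add) == 2 && toy_sum == 89
        && toy_drain(&r, toy_add) == 0;
}
\end{lstlisting}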
[c292244]	160
[f1bce515]	161	The @io_uring_enter@ system call is protected by a lock inside the kernel.
	162	This protection means that concurrent calls to @io_uring_enter@ using the same instance are possible, but there is no performance gained from parallel calls to @io_uring_enter@.
	163	It is possible to do the first three submission steps in parallel; however, doing so requires careful synchronization.
[c292244]	164
[f1bce515]	165	@io_uring@ also introduces constraints on the number of simultaneous operations that can be ``in flight''.
	166	Obviously, SQEs are allocated from a fixed-size array, meaning that there is a hard limit to how many SQEs can be submitted at once.
	167	In addition, the @io_uring_enter@ system call can fail because ``The kernel [...] ran out of resources to handle [a request]'' or ``The application is attempting to overcommit the number of requests it can have pending.''.
	168	This restriction means \io request bursts may have to be subdivided and submitted in chunks at a later time.
[d4a4b17]	169
	170	\subsection{Multiplexing \io: Submission}
[f1bce515]	171	The submission side is the most complicated aspect of @io_uring@ and the completion side effectively follows from the design decisions made in the submission side.
	172	While it is possible to do the first steps of submission in parallel, the duration of the system call scales with the number of entries submitted.
	173	The consequence is that the amount of parallelism used to prepare submissions for the next system call is limited.
	174	Beyond this limit, the length of the system call is the throughput limiting factor.
[f2bc9fa]	175	I concluded from early experiments that preparing submissions seems to take at most as long as the system call itself, which means that with a single @io_uring@ instance, there is no benefit in terms of \io throughput to having more than two \glspl{hthrd}.
[f1bce515]	176	Therefore the design of the submission engine must manage multiple instances of @io_uring@ running in parallel, effectively sharding @io_uring@ instances.
	177	Similarly to scheduling, this sharding can be done privately, \ie, one instance per \gls{proc}, in decoupled pools, \ie, a pool of \glspl{proc} uses a pool of @io_uring@ instances without one-to-one coupling between any given instance and any given \gls{proc}, or some mix of the two.
	178	Since completions are sent to the instance where requests were submitted, all instances with pending operations must be polled continuously
	179	\footnote{As will be described in Chapter~\ref{practice}, this does not translate into constant cpu usage.}.
[c5af4f9]	180	Note that once an operation completes, there is nothing that ties it to the @io_uring@ instance that handled it.
	181	There is nothing preventing a new operation with, for example, the same file descriptor from being submitted to a different @io_uring@ instance.
[f1bce515]	182
	183	A complicating aspect of submission is @io_uring@'s support for chains of operations, where the completion of an operation triggers the submission of the next operation on the link.
	184	SQEs forming a chain must be allocated from the same instance and must be contiguous in the Submission Ring (see Figure~\ref{fig:iouring}).
	185	The consequence of this feature is that filling SQEs can be arbitrarily complex and therefore users may need to run arbitrary code between allocation and submission.
	186	Supporting chains is not a requirement of the \io subsystem, but it is still valuable.
	187	Support for this feature can be fulfilled simply by supporting arbitrary user code between allocation and submission.
	188
	189	\subsubsection{Public Instances}
	190	One approach is to have multiple shared instances.
	191	\Glspl{thrd} attempting \io operations pick one of the available instances and submit operations to that instance.
	192	Since there is no coupling between \glspl{proc} and @io_uring@ instances in this approach, \glspl{thrd} running on more than one \gls{proc} can attempt to submit to the same instance concurrently.
	193	Since @io_uring@ effectively sets the amount of sharding needed to avoid contention on its internal locks, performance in this approach is based on two aspects: the synchronization needed to submit does not induce more contention than @io_uring@ already does and the scheme to route \io requests to specific @io_uring@ instances does not introduce contention.
	194	This second aspect has an oversized importance because it comes into play before the sharding of instances, and as such, all \glspl{hthrd} can contend on the routing algorithm.
	195
	196	Allocation in this scheme can be handled fairly easily.
	197	Free SQEs, \ie, SQEs that are not currently being used to represent a request, can be written to safely and have a field called @user_data@, which the kernel only reads to copy to CQEs.
	198	Allocation also requires no ordering guarantee as all free SQEs are interchangeable.
	199	Therefore, allocation only requires a simple concurrent bag.
	200	The only added complexity is that the number of SQEs is fixed, which means allocation can fail.
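
One possible shape for such a bag, given as a sketch rather than as the allocator used in this work, is an atomic bitmap with one bit per SQE.
The names \lstinline{sqe_bag}, \lstinline{sqe_bag_alloc} and \lstinline{sqe_bag_free} are hypothetical, the instance is assumed to have exactly 64 SQEs, and \lstinline{__builtin_ctzll} is a GCC/Clang builtin.
\begin{lstlisting}[language=C]
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical bag of free SQE indexes: bit i set means SQE i is free. */
struct sqe_bag { _Atomic uint64_t free_mask; };

void sqe_bag_init(struct sqe_bag *b) {
    atomic_init(&b->free_mask, UINT64_MAX);   /* all 64 SQEs start free */
}

/* Returns the index of a free SQE, or -1 when none is available.
 * No ordering is promised: any free SQE is as good as another. */
int sqe_bag_alloc(struct sqe_bag *b) {
    uint64_t mask = atomic_load_explicit(&b->free_mask, memory_order_relaxed);
    while (mask != 0) {
        int idx = __builtin_ctzll(mask);      /* lowest free index */
        uint64_t next = mask & (mask - 1);    /* same mask without that bit */
        if (atomic_compare_exchange_weak_explicit(
                &b->free_mask, &mask, next,
                memory_order_acquire, memory_order_relaxed))
            return idx;
        /* The failed exchange reloaded mask; retry with the new value. */
    }
    return -1;                                /* fixed size: can fail */
}

/* Returns an SQE to the bag once the kernel has consumed it. */
void sqe_bag_free(struct sqe_bag *b, int idx) {
    atomic_fetch_or_explicit(&b->free_mask, UINT64_C(1) << idx,
                             memory_order_release);
}

/* Demo: 64 allocations succeed, the 65th fails, a freed SQE is reused. */
int sqe_bag_demo(void) {
    struct sqe_bag bag;
    sqe_bag_init(&bag);
    for (int i = 0; i < 64; i++)
        if (sqe_bag_alloc(&bag) < 0) return 0;
    if (sqe_bag_alloc(&bag) != -1) return 0;
    sqe_bag_free(&bag, 10);
    return sqe_bag_alloc(&bag) == 10;
}
\end{lstlisting}
A single word limits the bag to 64 entries and concentrates contention on one cache line, so a larger instance would need several words or a different structure.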
	201
[f2bc9fa]	202	Allocation failures need to be pushed up to a routing algorithm: \glspl{thrd} attempting \io operations must not be directed to @io_uring@ instances without sufficient SQEs available.
[f1bce515]	203	Furthermore, the routing algorithm should block operations up-front if none of the instances have available SQEs.
[d4a4b17]	204
[c6640a3]	205	Once an SQE is allocated, a \gls{thrd} can fill it normally; the \gls{thrd} simply needs to keep track of the SQE index and which instance it belongs to.
[d4a4b17]	206
[f1bce515]	207	Once an SQE is filled in, it must be added to the submission ring buffer, an operation that is not thread-safe by itself, and the kernel must be notified using the @io_uring_enter@ system call.
	208	The submission ring buffer is the same size as the pre-allocated SQE buffer, therefore pushing to the ring buffer cannot fail
	209	\footnote{This is because it is invalid to have the same \lstinline{sqe} multiple times in the ring buffer.}.
	210	However, as mentioned, the system call itself can fail with the expectation that it will be retried once some of the already submitted operations complete.
	211	Since multiple SQEs can be submitted to the kernel at once, it is important to strike a balance between batching and latency.
	212	Operations that are ready to be submitted should be batched together in few system calls, but at the same time, operations should not be left pending for long periods of time before being submitted.
	213	This can be handled by either designating one of the submitting \glspl{thrd} as being responsible for the system call for the current batch of SQEs or by having some other party regularly submit all ready SQEs, \eg, the poller \gls{thrd} mentioned later in this section.
	214
	215	In the case of designating a \gls{thrd}, ideally, when multiple \glspl{thrd} attempt to submit operations to the same @io_uring@ instance, all requests would be batched together and one of the \glspl{thrd} would do the system call on behalf of the others, referred to as the \newterm{submitter}.
[f2bc9fa]	216	In practice however, it is important that the \io requests are not left pending indefinitely and as such, it may be required to have a ``next submitter'' that guarantees everything that is missed by the current submitter is seen by the next one.
[f1bce515]	217	Indeed, as long as there is a ``next'' submitter, \glspl{thrd} submitting new \io requests can move on, knowing that some future system call will include their request.
	218	Once the system call is done, the submitter must also free SQEs so that the allocator can reuse them.
	219
	220	Finally, the completion side is much simpler since the @io_uring@ system call enforces a natural synchronization point.
	221	Polling simply needs to regularly do the system call, go through the produced CQEs and communicate the result back to the originating \glspl{thrd}.
	222	Since CQEs only carry a signed 32-bit result, in addition to the copy of the @user_data@ field, all that is needed to communicate the result is a simple future~\cite{wiki:future}.
	223	If the submission side does not designate submitters, polling can also submit all SQEs as it is polling events.
	224	A simple approach to polling is to allocate a \gls{thrd} per @io_uring@ instance and simply let the poller \glspl{thrd} poll their respective instances when scheduled.
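
A minimal sketch of such a future follows, assuming the address of the future is what gets stored in \lstinline{user_data}.
The type \lstinline{io_future} and its functions are hypothetical, and the sketch only exposes a non-blocking check; a real engine would park the waiting \gls{thrd} and reschedule it on fulfilment.
\begin{lstlisting}[language=C]
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical future carrying the signed 32-bit result of one operation. */
struct io_future {
    _Atomic int fulfilled;
    int32_t result;
};

void io_future_init(struct io_future *f) {
    atomic_init(&f->fulfilled, 0);
    f->result = 0;
}

/* Submission side: the value to store in the SQE's user_data field. */
uint64_t io_future_as_user_data(struct io_future *f) {
    return (uint64_t)(uintptr_t)f;
}

/* Completion side: called by the poller with the fields of a CQE. */
void io_future_fulfil(uint64_t user_data, int32_t res) {
    struct io_future *f = (struct io_future *)(uintptr_t)user_data;
    f->result = res;
    /* Publish the result before the flag. */
    atomic_store_explicit(&f->fulfilled, 1, memory_order_release);
    /* A real engine would reschedule the parked thread here. */
}

/* Returns 1 and stores the result if the operation has completed. */
int io_future_poll(struct io_future *f, int32_t *res) {
    if (!atomic_load_explicit(&f->fulfilled, memory_order_acquire))
        return 0;
    *res = f->result;
    return 1;
}

/* Demo: the future is empty until the poller sees the matching CQE. */
int io_future_demo(void) {
    struct io_future f;
    int32_t res = 0;
    io_future_init(&f);
    uint64_t tag = io_future_as_user_data(&f);
    if (io_future_poll(&f, &res)) return 0;
    io_future_fulfil(tag, 512);    /* e.g. a read that returned 512 bytes */
    return io_future_poll(&f, &res) && res == 512;
}
\end{lstlisting}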
	225
	226	With this pool of instances approach, the big advantage is that it is fairly flexible.
	227	It does not impose restrictions on what \glspl{thrd} submitting \io operations can and cannot do between allocations and submissions.
[f2bc9fa]	228	It can also gracefully handle running out of resources, whether SQEs or the kernel returning @EBUSY@.
[f1bce515]	229	The downside is that many of the steps used for submitting need complex synchronization to work properly.
	230	The routing and allocation algorithm needs to keep track of which ring instances have available SQEs, block incoming requests if no instance is available, prevent barging if \glspl{thrd} are already queued up waiting for SQEs and handle SQEs being freed.
	231	The submission side needs to safely append SQEs to the ring buffer, correctly handle chains, make sure no SQE is dropped or left pending forever, notify the allocation side when SQEs can be reused and handle the kernel returning @EBUSY@.
[f2bc9fa]	232	All this synchronization may have a significant cost and, compared to the next approach presented, this synchronization is entirely overhead.
[52f6250]	233
	234	\subsubsection{Private Instances}
[f1bce515]	235	Another approach is to simply create one ring instance per \gls{proc}.
[f2bc9fa]	236	This alleviates the need for synchronization on the submissions, requiring only that \glspl{thrd} are not interrupted in between two submission steps.
[f1bce515]	237	This is effectively the same requirement as using @thread_local@ variables.
	238	Since SQEs that are allocated must be submitted to the same ring, on the same \gls{proc}, this effectively forces the application to submit SQEs in allocation order
	239	\footnote{The actual requirement is that \glspl{thrd} cannot context switch between allocation and submission.
	240	This requirement means that from the subsystem's point of view, the allocation and submission are sequential.
	241	To remove this requirement, a \gls{thrd} would need the ability to ``yield to a specific \gls{proc}'', \ie, park with the promise that it will be run next on a specific \gls{proc}, the \gls{proc} attached to the correct ring.}
	242	, greatly simplifying both allocation and submission.
[c5af4f9]	243	In this design, allocation and submission form a partitioned ring buffer as shown in Figure~\ref{fig:pring}.
[f1bce515]	244	Once added to the ring buffer, the attached \gls{proc} has a significant amount of flexibility with regards to when to do the system call.
[c5af4f9]	245	Possible options are: when the \gls{proc} runs out of \glspl{thrd} to run, after running a given number of \glspl{thrd}, \etc.
[d4a4b17]	246
	247	\begin{figure}
	248	\centering
	249	\input{pivot_ring.pstex_t}
[f1bce515]	250	\caption[Partitioned ring buffer]{Partitioned ring buffer \smallskip\newline Allocated SQEs are appended to the first partition.
	251	When submitting, the partition is advanced.
	252	The kernel considers the partition as the head of the ring.}
[d4a4b17]	253	\label{fig:pring}
	254	\end{figure}
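
One reading of this partitioned ring buffer can be sketched as a simulation, assuming SQEs are handed out in ring order so that the allocation index doubles as the ring position.
The type \lstinline{private_ring}, its fields and the \lstinline{consumed} counter (standing in for the kernel) are hypothetical.
Because only the owning \gls{proc} allocates and submits, the only synchronized operation is the store that publishes the partition.
\begin{lstlisting}[language=C]
#include <stdatomic.h>
#include <stdint.h>

#define PR_ENTRIES 4u
#define PR_MASK (PR_ENTRIES - 1u)

/* Hypothetical private instance: only the owning proc allocates/submits. */
struct private_ring {
    uint32_t ring[PR_ENTRIES];    /* SQE indexes, in allocation order */
    uint32_t allocated;           /* end of the allocated partition */
    _Atomic uint32_t partition;   /* boundary published to the kernel */
    uint32_t consumed;            /* entries the kernel is done with */
};

void private_ring_init(struct private_ring *r) {
    r->allocated = 0;
    r->consumed = 0;
    atomic_init(&r->partition, 0);
}

/* Allocation appends to the first partition; -1 when no SQE is free. */
int private_ring_alloc(struct private_ring *r) {
    if (r->allocated - r->consumed == PR_ENTRIES) return -1;
    uint32_t idx = r->allocated & PR_MASK;
    r->ring[idx] = idx;           /* allocation order is submission order */
    r->allocated++;
    return (int)idx;
}

/* Submission advances the partition over everything allocated so far
 * and returns the number of newly published entries. */
uint32_t private_ring_submit(struct private_ring *r) {
    uint32_t old = atomic_load_explicit(&r->partition, memory_order_relaxed);
    atomic_store_explicit(&r->partition, r->allocated, memory_order_release);
    return r->allocated - old;
}

/* Demo: batches of allocations are published by moving the partition. */
int private_ring_demo(void) {
    struct private_ring r;
    private_ring_init(&r);
    if (private_ring_alloc(&r) != 0) return 0;
    if (private_ring_alloc(&r) != 1) return 0;
    if (private_ring_alloc(&r) != 2) return 0;
    if (private_ring_submit(&r) != 3) return 0;   /* one batch of three */
    if (private_ring_alloc(&r) != 3) return 0;
    if (private_ring_alloc(&r) != -1) return 0;   /* instance is full */
    if (private_ring_submit(&r) != 1) return 0;
    r.consumed = 2;                               /* kernel consumed two */
    return private_ring_alloc(&r) == 0 && atomic_load(&r.partition) == 4;
}
\end{lstlisting}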
	255
[f1bce515]	256	This approach has the advantage that it does not require much of the synchronization needed in the shared approach.
	257	This comes at the cost that \glspl{thrd} submitting \io operations have less flexibility, \ie, they cannot park or yield, and several exceptional cases are handled poorly.
	258	Instances running out of SQEs cannot run \glspl{thrd} wanting to do \io operations. In such a case, the \gls{thrd} needs to be moved to a different \gls{proc}, and the only current way of achieving this is to @yield()@, hoping to be scheduled on a different \gls{proc}, which is not guaranteed.
	259
	260	A more involved version of this approach can seem to solve most of these problems, using a pattern called \newterm{helping}.
	261	\Glspl{thrd} that wish to submit \io operations but cannot do so
	262	\footnote{either because of an allocation failure or because they were migrated to a different \gls{proc} between allocation and submission}
	263	create an object representing what they wish to achieve and add it to a list somewhere.
	264	For this particular problem, one solution would be to have a list of pending submissions per \gls{proc} and a list of pending allocations, probably per cluster.
	265	The problem with these ``solutions'' is that they are still bound by the strong coupling between \glspl{proc} and @io_uring@ instances.
	266	These data structures would allow moving \glspl{thrd} to a specific \gls{proc} when the current \gls{proc} cannot fulfill the \io request.
	267
	268	Imagine a simple case with two \glspl{thrd} on two \glspl{proc}: one \gls{thrd} submits an \io operation and then sets a flag, while the other \gls{thrd} spins until the flag is set.
	269	If the first \gls{thrd} is preempted between allocation and submission and moves to the other \gls{proc}, the original \gls{proc} could start running the spinning \gls{thrd}.
	270	If this happens, the helping ``solution'' is for the \io \gls{thrd} to append an item to the submission list of the \gls{proc} where the allocation was made.
	271	No other \gls{proc} can help the \gls{thrd} since @io_uring@ instances are strongly coupled to \glspl{proc}.
	272	However, in this case, the \gls{proc} is unable to help because it is executing the spinning \gls{thrd} mentioned when first describing this case
	273	\footnote{This particular example is completely artificial, but in the presence of many more \glspl{thrd}, it is not impossible that this problem would arise ``in the wild''.
	274	Furthermore, this pattern is difficult to reliably detect and avoid.}
	275	resulting in a deadlock.
	276	Once in this situation, the only escape is to interrupt the execution of the spinning \gls{thrd}, either directly or due to regular preemption; only then can the \gls{proc} take the time to handle the pending request to help.
	277	Interrupting \glspl{thrd} for this purpose is far from desirable: the cost is significant and the situation may be hard to detect.
	278	However, a more subtle reason why interrupting the \gls{thrd} is not a satisfying solution is that the \gls{proc} is not actually using the instance it is tied to.
	279	If it were to use it, then helping could be done as part of the usage.
	280	Interrupts are needed here entirely because the \gls{proc} is tied to an instance it is not using.
	281	Therefore a more satisfying solution would be for the \gls{thrd} submitting the operation to simply notice that the instance is unused and simply go ahead and use it.
	282	This is the approach presented next.
[14533d4]	283
	284	\subsubsection{Instance borrowing}
[f1bce515]	285	Both of the approaches presented above have undesirable aspects that stem from too loose or too tight coupling between @io_uring@ and \glspl{proc}.
	286	In the first approach, loose coupling meant that all operations have synchronization overhead that a tighter coupling can avoid.
	287	The second approach, on the other hand, suffers from tight coupling causing problems when the \glspl{proc} do not benefit from the coupling.
	288	While \glspl{proc} are continuously issuing \io operations, tight coupling is valuable since it avoids synchronization costs.
	289	However, in unlikely failure cases or when \glspl{proc} are not making use of their instance, tight coupling is no longer advantageous.
	290	A compromise between these approaches would be to allow tight coupling but have the option to revoke this coupling dynamically when failure cases arise.
	291	I call this approach ``instance borrowing''\footnote{While it looks similar to work-sharing and work-stealing, I think it is different enough from either to warrant a different verb to avoid confusion.}.
	292
	293	In this approach, each cluster owns a pool of @io_uring@ instances managed by an arbiter.
	294	When a \gls{thrd} attempts to issue an \io operation, it asks for an instance from the arbiter and issues requests to that instance.
	295	However, in doing so it ties the instance to the \gls{proc} it is currently running on.
	296	This coupling is kept until the arbiter decides to revoke it, taking back the instance and reverting the \gls{proc} to its initial state with respect to \io.
	297	This tight coupling means that synchronization can be minimal since only one \gls{proc} can use the instance at any given time, akin to the private instances approach.
	298	However, where it differs is that revocation from the arbiter means this approach does not suffer from the deadlock scenario described above.
	299
	300	Arbitration is needed in the following cases:
	301	\begin{enumerate}
	302	\item The current \gls{proc} does not currently hold an instance.
	303	\item The current instance does not have sufficient SQEs to satisfy the request.
	304	\item The current \gls{proc} has the wrong instance; this happens if the submitting \gls{thrd} context-switched between allocation and submission.
	305	I will refer to these as \newterm{External Submissions}.
	306	\end{enumerate}
	307	However, even when the arbiter is not directly needed, \glspl{proc} need to make sure that their ownership of the instance is not being revoked.
	308	This can be accomplished by a lock-less handshake\footnote{Note that the handshake is not Lock-\emph{Free} since it lacks the proper progress guarantee.}.
	309	A \gls{proc} raises a local flag before using its borrowed instance and checks if the instance is marked as revoked or if the arbiter has raised its flag.
	310	If not it proceeds, otherwise it delegates the operation to the arbiter.
	311	Once the operation is completed, the \gls{proc} lowers its local flag.
[14533d4]	312
[f1bce515]	313	Correspondingly, before revoking an instance the arbiter marks the instance and then waits for the \gls{proc} using it to lower its local flag.
	314	Only then does it reclaim the instance and potentially assign it to another \gls{proc}.
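
The handshake described above can be sketched with C11 atomics.
This is only an illustration under stated assumptions, not the actual \CFA runtime code: the type @io_instance@ and the functions @instance_init@, @proc_try_pin@, @proc_unpin@ and @arbiter_revoke@ are hypothetical names, the ``marked as revoked'' state and the arbiter's flag are merged into a single @revoked@ flag, and sequentially consistent ordering is assumed so that the store to one flag is ordered before the load of the other.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical per-instance handshake state; names are illustrative. */
struct io_instance {
	atomic_bool in_use;   /* raised by the borrowing proc while it uses the instance */
	atomic_bool revoked;  /* raised by the arbiter when it wants the instance back */
};

void instance_init(struct io_instance *inst) {
	atomic_init(&inst->in_use, false);
	atomic_init(&inst->revoked, false);
}

/* Proc side: raise the local flag, then check for revocation.
   Returns true if the proc may use the instance; on false, the flag is
   lowered again and the caller must delegate the operation to the arbiter. */
bool proc_try_pin(struct io_instance *inst) {
	atomic_store(&inst->in_use, true);      /* sequentially consistent store */
	if (atomic_load(&inst->revoked)) {      /* sequentially consistent load */
		atomic_store(&inst->in_use, false);
		return false;
	}
	return true;
}

/* Proc side: lower the local flag once the operation is completed. */
void proc_unpin(struct io_instance *inst) {
	atomic_store(&inst->in_use, false);
}

/* Arbiter side: mark the instance, then wait for the proc to lower its flag.
   The wait is unbounded, which is why the handshake is not lock-free. */
void arbiter_revoke(struct io_instance *inst) {
	atomic_store(&inst->revoked, true);
	while (atomic_load(&inst->in_use)) {
		/* spin until the borrowing proc calls proc_unpin */
	}
}
```

In this sketch, a \gls{proc} brackets each use of its borrowed instance with @proc_try_pin@ and @proc_unpin@, and falls back to the arbiter whenever @proc_try_pin@ returns false.
Because each side writes its own flag before reading the other's, at least one of the two parties observes the other, so the arbiter never reclaims an instance that a \gls{proc} believes it has pinned.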
[14533d4]	315
[f1bce515]	316	The arbiter maintains four lists around which it makes its decisions:
	317	\begin{enumerate}
	318	\item A list of pending submissions.
	319	\item A list of pending allocations.
	320	\item A list of instances currently borrowed by \glspl{proc}.
	321	\item A list of instances currently available.
	322	\end{enumerate}
[14533d4]	323
[f1bce515]	324	\paragraph{External Submissions} are handled by the arbiter by revoking the appropriate instance and adding the submission to the submission ring.
	325	There is no need to immediately revoke the instance however.
	326	External submissions must simply be added to the ring before the next system call, \ie, when the submission ring is flushed.
	327	This means that whoever is responsible for the system call first checks if the instance has any external submissions.
	328	If it is the case, it asks the arbiter to revoke the instance and add the external submissions to the ring.
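
The deferred handling of external submissions can be illustrated with a deliberately simplified, single-threaded model.
All names here (@instance@, @arbiter_merge_external@, @flush@, @RING_ENTRIES@) are hypothetical, the ring is modelled as a plain array of SQE indices, and the synchronization that a real implementation needs between submitters, the flusher and the arbiter is omitted.

```c
#include <assert.h>
#include <stddef.h>

#define RING_ENTRIES 8  /* assumed ring capacity for this sketch */

/* Simplified model of one io_uring instance. */
struct instance {
	unsigned ring[RING_ENTRIES];      /* SQE indices submitted by the owning proc */
	size_t   ring_len;
	unsigned external[RING_ENTRIES];  /* SQE indices queued by threads that migrated */
	size_t   external_len;
};

/* Stand-in for the arbiter's slow path: move external submissions into the ring.
   A real arbiter would first revoke the instance from its current proc. */
void arbiter_merge_external(struct instance *inst) {
	for (size_t i = 0; i < inst->external_len; i++) {
		assert(inst->ring_len < RING_ENTRIES);  /* this model assumes the ring has room */
		inst->ring[inst->ring_len++] = inst->external[i];
	}
	inst->external_len = 0;
}

/* Called by whoever is about to perform the system call.
   Returns the number of entries that would be handed to the kernel. */
size_t flush(struct instance *inst) {
	if (inst->external_len > 0) {
		arbiter_merge_external(inst);   /* slow path, only when needed */
	}
	size_t to_submit = inst->ring_len;
	inst->ring_len = 0;                 /* entries now belong to the kernel */
	return to_submit;
}
```

The point of the sketch is only the ordering: external submissions may sit in their list for an arbitrary time, as long as they are merged into the ring before the entry count for the next system call is computed.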
[14533d4]	329
[f1bce515]	330	\paragraph{Pending Allocations} can be more complicated to handle.
	331	If the arbiter has available instances, the arbiter can attempt to directly hand over the instance and satisfy the request.
[6db62fa]	332	Otherwise, it must hold onto the list of threads until SQEs are made available again.
	333	This handling becomes that much more complex if pending allocations require more than one SQE, since the arbiter must make a decision between satisfying requests in FIFO ordering or satisfying requests for fewer SQEs first.
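
The two orderings can be contrasted with a small sketch.
The type @pending_alloc@ and the functions @pick_fifo@ and @pick_smallest@ are hypothetical and only model the choice of which pending request to serve; they are not taken from the \CFA runtime.

```c
#include <stddef.h>

/* Hypothetical pending allocation: a thread waiting for `want` SQEs. */
struct pending_alloc {
	unsigned want;
};

/* FIFO policy: only the oldest request (index 0) may be served.
   Returns its index if it fits in the free SQEs, -1 otherwise. */
int pick_fifo(const struct pending_alloc *queue, size_t len, unsigned free_sqes) {
	return (len > 0 && queue[0].want <= free_sqes) ? 0 : -1;
}

/* Smallest-first policy: serve the smallest request that fits,
   preferring the oldest one among equally small requests.
   Returns its index, or -1 if no request fits. */
int pick_smallest(const struct pending_alloc *queue, size_t len, unsigned free_sqes) {
	int best = -1;
	for (size_t i = 0; i < len; i++) {
		if (queue[i].want <= free_sqes &&
		    (best < 0 || queue[i].want < queue[best].want)) {
			best = (int)i;
		}
	}
	return best;
}
```

With pending requests for 4, 1 and 2 SQEs and only 2 SQEs free, the FIFO policy serves nobody because the oldest request does not fit, whereas the smallest-first policy serves the request for 1 SQE.
The usual trade-off applies: FIFO ordering is fair but can leave SQEs idle, while serving small requests first keeps SQEs busy but can delay large requests indefinitely.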
	334
	335	While this arbiter has the potential to solve many of the problems mentioned above, it also introduces a significant amount of complexity.
	336	Tracking which processors are borrowing which instances and which instances have SQEs available ends up adding a significant synchronization prelude to any I/O operation.
	337	Any submission must start with a handshake that pins the currently borrowed instance, if available.
	338	An attempt to allocate is then made, but the arbiter can concurrently be attempting to allocate from the same instance from a different \gls{hthrd}.
	339	Once the allocation is completed, the submission must still check that the instance is still borrowed before attempting to flush.
	340	These extra synchronization steps end up having a similar cost to the multiple shared instances approach.
	341	Furthermore, if the number of instances does not match the number of processors actively submitting I/O, the system can fall into a state where instances are constantly being revoked and end up cycling among the processors, which leads to significant cache deterioration.
	342	For these reasons, this approach, which sounds promising on paper, does not improve on the private instance approach in practice.
	343
	344	\subsubsection{Private Instances V2}
	345
[14533d4]	346
	347
	348	% Verbs of this design
	349
	350	% Allocation: obtaining an sqe from which to fill in the io request, enforces the io instance to use since it must be the one which provided the sqe. Must interact with the arbiter if the instance does not have enough sqe for the allocation. (Typical allocation will ask for only one sqe, but chained sqe must be allocated from the same context so chains of sqe must be allocated in bulks)
	351
	352	% Submission: simply adds the sqe(s) to some data structure to communicate that they are ready to go. This operation can't fail because there are as many spots in the submit buffer as there are sqes. Must interact with the arbiter only if the thread was moved between the allocation and the submission.
	353
	354	% Flushing: Taking all the sqes that were submitted and making them visible to the kernel, also counting them in order to figure out what to_submit should be. Must be thread-safe with submission. Has to interact with the Arbiter if there are external submissions. Can't simply use a protected queue because adding to the array is not safe if the ring is still available for submitters. Flushing must therefore: check if there are external pending requests if so, ask the arbiter to flush otherwise use the fast flush operation.
	355
	356	% Collect: Once the system call is done, it returns how many sqes were consumed by the system. These must be freed for allocation. Must interact with the arbiter to notify that things are now ready.
	357
	358	% Handle: process all the produced cqe. No need to interact with any of the submission operations or the arbiter.
	359
	360
	361
	362
	363	% alloc():
	364	% proc.io->in_use = true, __ATOMIC_ACQUIRE
	365	% if cltr.io.flag \|\| !proc.io \|\| proc.io->flag:
	366	% return alloc_slow(cltr.io, proc.io)
	367
	368	% a = alloc_fast(proc.io)
	369	% if a:
	370	% proc.io->in_use = false, __ATOMIC_RELEASE
	371	% return a
	372
	373	% return alloc_slow(cltr.io)
	374
	375	% alloc_fast()
	376	% left = proc.io->submit_q.free.tail - proc.io->submit_q.free.head
	377	% if num_entries - left < want:
	378	% return None
	379
	380	% a = ready[head]
	381	% head = head + 1, __ATOMIC_RELEASE
	382
	383	% alloc_slow()
	384	% cltr.io.flag = true, __ATOMIC_ACQUIRE
	385	% while(proc.io && proc.io->in_use) pause;
	386
	387
	388
	389	% submit(a):
	390	% proc.io->in_use = true, __ATOMIC_ACQUIRE
	391	% if cltr.io.flag \|\| proc.io != alloc.io \|\| proc.io->flag:
	392	% return submit_slow(cltr.io)
[52f6250]	393
[14533d4]	394	% submit_fast(proc.io, a)
	395	% proc.io->in_use = false, __ATOMIC_RELEASE
[52f6250]	396
[14533d4]	397	% polling()
	398	% loop:
	399	% yield
	400	% flush()
	401	% io_uring_enter
	402	% collect
	403	% handle()
[d4a4b17]	404
[86c1f1c3]	405	\section{Interface}
[d4a4b17]	406	Finally, the last important part of the \io subsystem is its interface. There are multiple approaches that can be offered to programmers, each with advantages and disadvantages. The new \io subsystem can replace the C runtime's API or extend it. In the latter case, the interface can range from very similar to vastly different. The following sections discuss some useful options using @read@ as an example. The standard Linux interface for C is:
[c292244]	407
[3112733]	408	@ssize_t read(int fd, void *buf, size_t count);@
[c292244]	409
	410	\subsection{Replacement}
[3112733]	411	Replacing the C \glsxtrshort{api} is the more intrusive and draconian approach.
	412	The goal is to convince the compiler and linker to redirect any calls to @read@ to the \CFA implementation instead of glibc's.
	413	This has the advantage of potentially working transparently and supporting existing binaries without needing recompilation.
	414	It also offers a, presumably, well known and familiar API that C programmers can simply continue to work with.
	415	However, this approach also entails a plethora of subtle technical challenges, which generally boil down to making a perfect replacement.
	416	If the \CFA interface replaces only \emph{some} of the calls to glibc, then this can easily lead to esoteric concurrency bugs.
	417	Since the gcc ecosystem does not offer a scheme for such perfect replacement, this approach was rejected as being laudable but infeasible.
[c292244]	418
	419	\subsection{Synchronous Extension}
[3112733]	420	Another interface option is to simply offer an interface that is different in name only. For example:
	421
	422	@ssize_t cfa_read(int fd, void *buf, size_t count);@
	423
	424	\noindent This is much more feasible but still familiar to C programmers.
	425	It comes with the caveat that any code attempting to use it must be recompiled, which can be a big problem considering the number of existing legacy C binaries.
	426	However, it has the advantage of implementation simplicity.
[c292244]	427
	428	\subsection{Asynchronous Extension}
[3112733]	429	It is important to mention that there is a certain irony to using only synchronous, therefore blocking, interfaces for a feature often referred to as ``non-blocking'' \io.
	430	A fairly traditional way of offering an asynchronous interface is to use futures\cit{wikipedia futures}.
	431	A simple way of doing so is as follows:
	432
	433	@future(ssize_t) read(int fd, void *buf, size_t count);@
	434
	435	\noindent Note that this approach is not necessarily the most idiomatic usage of futures.
	436	The definition of read above ``returns'' the read content through an output parameter which cannot be synchronized on.
	437	A more classical asynchronous API could look more like:
	438
	439	@future([ssize_t, void *]) read(int fd, size_t count);@
	440
	441	\noindent However, this interface immediately introduces memory lifetime challenges since the call must effectively allocate a buffer to be returned.
	442	Because of the performance implications of this allocation, and because it is more familiar to C programmers, the first approach is considered preferable.
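
To make the first form concrete, the following is a minimal sketch of what such a future could look like in plain C.
The type @future_ssize@ and the functions @future_init@, @future_fulfil@ and @future_poll@ are hypothetical stand-ins for the @future(ssize_t)@ in the signature above, and the sketch only polls; a real runtime would presumably park the waiting \gls{thrd} instead.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <sys/types.h>  /* ssize_t */

/* Hypothetical future carrying the result of one read operation. */
struct future_ssize {
	atomic_bool ready;   /* set once the operation has completed */
	ssize_t     result;  /* byte count or negative error, valid once ready */
};

void future_init(struct future_ssize *f) {
	atomic_init(&f->ready, false);
	f->result = 0;
}

/* Called by whoever processes the completion of the operation.
   The release store publishes `result`, and any buffer contents written
   before this call, to a thread that later observes `ready`. */
void future_fulfil(struct future_ssize *f, ssize_t res) {
	f->result = res;
	atomic_store_explicit(&f->ready, true, memory_order_release);
}

/* Non-blocking check: returns true and stores the result in *out
   if the operation has completed, false otherwise. */
bool future_poll(struct future_ssize *f, ssize_t *out) {
	if (!atomic_load_explicit(&f->ready, memory_order_acquire)) {
		return false;
	}
	*out = f->result;
	return true;
}
```

With this shape, the caller keeps ownership of the buffer passed to @read@ and must simply not inspect it until the future reports completion, which avoids the allocation required by the second form.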
[c292244]	443
[c6640a3]	444	\subsection{Interface directly to \lstinline{io_uring}}
[3112733]	445	Finally, another interface that can be relevant is to directly expose the underlying \texttt{io\_uring} interface. For example:
	446
	447	@array(SQE, want) cfa_io_allocate(int want);@
	448
	449	@void cfa_io_submit( const array(SQE, have) & );@
	450
	451	\noindent This offers more flexibility to users wanting to fully use all of the \texttt{io\_uring} features.
	452	However, it is not the most user-friendly option.
	453	It obviously imposes a strong dependency between user code and \texttt{io\_uring}, while at the same time restricting users to usages that are compatible with how \CFA internally uses \texttt{io\_uring}.
	454
	455
