\chapter{User Level \io}
As mentioned in Section~\ref{prev:io}, User-Level \io requires multiplexing the \io operations of many \glspl{thrd} onto fewer \glspl{proc} using asynchronous \io operations. Various operating systems offer various forms of asynchronous operations and, as mentioned in Chapter~\ref{intro}, this work is exclusively focused on Linux.

\section{Kernel Interface}
Since this work fundamentally depends on operating-system support, the first step of any design is to discuss the available interfaces and pick one (or more) as the foundation of the \io subsystem.

\subsection{\lstinline|O_NONBLOCK|}
In Linux, files can be opened with the flag @O_NONBLOCK@~\cite{MAN:open} (or @SOCK_NONBLOCK@~\cite{MAN:accept}, the equivalent for sockets) to use the file descriptors in ``nonblocking mode''. In this mode, ``Neither the open() nor any subsequent \io operations on the [opened file descriptor] will cause the calling process to wait.'' This feature can be used as the foundation for the \io subsystem. However, for the subsystem to be able to block \glspl{thrd} until an operation completes, @O_NONBLOCK@ must be used in conjunction with a system call that monitors when a file descriptor becomes ready, \ie, when the next \io operation on it will not cause the process to wait\footnote{In this context, ready means that \emph{some} operation can be performed without blocking. It does not mean that the last operation that returned \lstinline|EAGAIN| will succeed on the next try. For example, a file that is ready to read but has only 1 byte available illustrates this distinction.}.
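
To make the pattern concrete, the following minimal C sketch (illustrative, not code from the runtime) shows an \io operation under @O_NONBLOCK@: the read either completes immediately or fails with an error signalling it would have blocked, at which point a user-level scheduler would park the \gls{thrd} and wait for readiness rather than spin:
\begin{lstlisting}[language=C]
// Sketch: reading from a file descriptor opened with O_NONBLOCK.
// EAGAIN/EWOULDBLOCK means the operation would have blocked; a user-level
// scheduler would park the thread here and wait for readiness instead of spinning.
#include <errno.h>
#include <unistd.h>

ssize_t nonblocking_read(int fd, void *buf, size_t count) {
	for (;;) {
		ssize_t ret = read(fd, buf, count);
		if (ret >= 0) return ret;                    // success, possibly partial
		if (errno != EAGAIN && errno != EWOULDBLOCK)
			return -1;                               // genuine error
		// would block: wait until the fd is reported ready, then retry
	}
}
\end{lstlisting}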

There are three options to monitor file descriptors in Linux\footnote{For simplicity, this section omits \lstinline|pselect| and \lstinline|ppoll|. The difference between these system calls and \lstinline|select| and \lstinline|poll| respectively is not relevant for this discussion.}: @select@~\cite{MAN:select}, @poll@~\cite{MAN:poll} and @epoll@~\cite{MAN:epoll}. All three options offer a system call that blocks a \gls{kthrd} until at least one of many file descriptors becomes ready. The group of file descriptors being waited on is often referred to as the \newterm{interest set}.

\paragraph{\lstinline|select|} is the oldest of these options. It takes as input a contiguous array of bits, where each bit represents a file descriptor of interest. On return, it modifies the set in place to identify which of the file descriptors changed status. This means that calling @select@ in a loop requires re-initializing the array each time, and the number of file descriptors supported has a hard limit. Another limitation of @select@ is that once the call is started, the interest set can no longer be modified. Monitoring a new file descriptor generally requires aborting any in-progress call to @select@\footnote{Starting a new call to \lstinline|select| in this case is possible but requires a distinct kernel thread, and as a result is not an acceptable multiplexing solution when the interest set is large and highly dynamic, unless the number of parallel calls to \lstinline|select| can be strictly bounded.}.

\paragraph{\lstinline|poll|} is an improvement over @select@ that removes the hard limit on the number of file descriptors and the need to re-initialize the input on every call. It works using an array of structures as input rather than an array of bits, thus allowing a more compact input for small interest sets. Like @select@, @poll@ suffers from the limitation that the interest set cannot be changed while the call is blocked.

\paragraph{\lstinline|epoll|} further improves on these two functions by allowing the interest set to be dynamically added to and removed from while a \gls{kthrd} is blocked on a call to @epoll@. This is done by creating an \emph{epoll instance} with a persistent interest set that is used across multiple calls. This advantage significantly reduces synchronization overhead on the part of the caller (in this case the \io subsystem), since the interest set can be modified when adding or removing file descriptors without having to synchronize with other \glspl{kthrd} potentially calling @epoll@.
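
As a minimal illustration of this workflow (the creation of the epoll instance via @epoll_create1@ is assumed to have happened elsewhere), a file descriptor is added to the persistent interest set with @epoll_ctl@ and a \gls{kthrd} blocks in @epoll_wait@:
\begin{lstlisting}[language=C]
// Sketch: the epoll workflow. The interest set is kept in the kernel and
// can be modified with epoll_ctl even while another kernel thread is
// blocked in epoll_wait on the same instance.
#include <sys/epoll.h>

int wait_for_ready(int epfd, int fd) {
	struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
	epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);   // grow the interest set
	struct epoll_event ready[16];
	int n = epoll_wait(epfd, ready, 16, -1);   // block until >= 1 fd is ready
	return n;                                  // ready[0..n) identify the ready fds
}
\end{lstlisting}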

However, all three of these system calls suffer from generality problems to some extent. The man page for @O_NONBLOCK@ mentions that ``[@O_NONBLOCK@] has no effect for regular files and block devices'', which means none of these three system calls is a viable multiplexing strategy for these types of \io operations. Furthermore, @epoll@ has been shown to have some problems with pipes and ttys\cit{Peter's examples in some fashion}. Finally, none of these is a useful solution for multiplexing \io operations that do not have a corresponding file descriptor, and they can be awkward for operations using multiple file descriptors.

\subsection{The POSIX asynchronous I/O (AIO)}
An alternative to using @O_NONBLOCK@ is to use the AIO interface. Its interface lets programmers enqueue operations to be performed asynchronously by the kernel. Completions of these operations can be communicated in various ways: either by sending a Linux signal, by spawning a new \gls{kthrd}, or by polling for completion of one or more operations. For the purpose of multiplexing operations, spawning a new \gls{kthrd} is counter-productive, but a related solution is discussed in Section~\ref{io:morethreads}. Since using interrupt handlers can also lead to fairly complicated interactions between subsystems, I concentrate on the different polling methods. AIO only supports read and write operations to file descriptors, but those do not have the same limitation as @O_NONBLOCK@, \ie, the file descriptors can be regular files and block devices. It also supports batching more than one of these operations in a single system call.

AIO offers two different approaches to polling: @aio_error@ can be used as a spinning form of polling, returning @EINPROGRESS@ until the operation is completed, and @aio_suspend@ can be used similarly to @select@, @poll@ or @epoll@, to wait until one or more requests have completed. For the purpose of \io multiplexing, @aio_suspend@ is the intended interface. Even if AIO requests can be submitted concurrently, @aio_suspend@ suffers from the same limitation as @select@ and @poll@, \ie, the interest set cannot be dynamically changed while a call to @aio_suspend@ is in progress. Unlike @select@ and @poll@, however, it also suffers from the limitation that it does not specify which requests have completed, meaning programmers then have to poll each request in the interest set using @aio_error@ to identify the completed requests. This means that, like @select@ and @poll@ but unlike @epoll@, the time needed to examine polling results increases with the total number of requests monitored, not the number of completed requests.
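
The following sketch shows this polling pattern (illustrative, with error handling elided): a request is enqueued with @aio_read@, then the caller blocks in @aio_suspend@ and must itself discover completion through @aio_error@:
\begin{lstlisting}[language=C]
// Sketch: POSIX AIO polling. aio_suspend blocks until at least one request
// in the list completes but does not identify which one, so each request
// must then be re-checked with aio_error.
#include <aio.h>
#include <errno.h>
#include <string.h>

ssize_t aio_blocking_read(int fd, void *buf, size_t count) {
	struct aiocb cb;
	memset(&cb, 0, sizeof(cb));
	cb.aio_fildes = fd;
	cb.aio_buf    = buf;
	cb.aio_nbytes = count;
	aio_read(&cb);                              // enqueue the asynchronous read

	const struct aiocb *list[1] = { &cb };
	while (aio_error(&cb) == EINPROGRESS)       // poll for completion
		aio_suspend(list, 1, NULL);             // block until something completes
	return aio_return(&cb);                     // fetch the operation's result
}
\end{lstlisting}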

AIO does not seem to be a particularly popular interface, which I believe is in part due to this less-than-ideal polling interface. Linus Torvalds describes the interface as follows:

\begin{displayquote}
	AIO is a horrible ad-hoc design, with the main excuse being "other,
	less gifted people, made that design, and we are implementing it for
	compatibility because database people - who seldom have any shred of
	taste - actually use it".

	But AIO was always really really ugly.

	\begin{flushright}
		-- Linus Torvalds\cit{https://lwn.net/Articles/671657/}
	\end{flushright}
\end{displayquote}

Interestingly, in this e-mail answer, Linus goes on to describe
``a true \textit{asynchronous system call} interface''
that does
``[an] arbitrary system call X with arguments A, B, C, D asynchronously using a kernel thread''
in
``some kind of arbitrary \textit{queue up asynchronous system call} model''.
This description is actually quite close to the interface described in the next section.

\subsection{\lstinline|io_uring|}
A very recent addition to Linux, @io_uring@~\cite{MAN:io_uring} is a framework that aims to solve many of the problems listed for the above-mentioned interfaces. Like AIO, it represents \io operations as entries added to a queue. But like @epoll@, new requests can be submitted while a blocking call waiting for requests to complete is already in progress. The @io_uring@ interface uses two ring buffers (referred to simply as rings) at its core: a submission ring to which programmers push \io requests, and a completion ring which programmers poll for completed operations.

One of the big advantages over the interfaces listed above is that @io_uring@ also supports a much wider range of operations. In addition to supporting reads and writes to any file descriptor like AIO, it supports other operations like @open@, @close@, @fsync@, @accept@, @connect@, @send@, @recv@, @splice@, \etc.

On top of these, @io_uring@ adds many ``bells and whistles'' like avoiding copies between the kernel and user space with shared memory, allowing different mechanisms to communicate with device drivers, and supporting chains of requests, \ie, requests that automatically trigger follow-up requests on completion.

\subsection{Extra Kernel Threads}\label{io:morethreads}
Finally, if the operating system does not offer a satisfying form of asynchronous \io operations, a solution is to fake it by creating a pool of \glspl{kthrd} and delegating operations to them in order to avoid blocking \glspl{proc}. This is a compromise on multiplexing. In the worst case, where all \glspl{thrd} are consistently blocking on \io, it devolves into 1-to-1 threading. However, regardless of the frequency of \io operations, it achieves the fundamental goal of not blocking \glspl{proc} when \glspl{thrd} are ready to run. This approach is used by languages like Go\cit{Go} and frameworks like libuv\cit{libuv}, since it has the advantage that it can easily be used across multiple operating systems. This advantage is especially relevant for languages like Go, which offer a homogeneous \glsxtrshort{api} across all platforms, as opposed to C, which has a very limited standard \glsxtrshort{api} for \io, \eg, the C standard library has no networking.

\subsection{Discussion}
These options effectively fall into two broad camps: waiting for \io to be ready versus waiting for \io to be completed. All operating systems that support asynchronous \io must offer an interface along one of these lines, but the details can vary drastically. For example, FreeBSD offers @kqueue@~\cite{MAN:bsd/kqueue}, which behaves similarly to @epoll@ but with some small quality-of-life improvements, while Windows (Win32)~\cit{https://docs.microsoft.com/en-us/windows/win32/fileio/synchronous-and-asynchronous-i-o} offers ``overlapped I/O'', which handles submissions similarly to @O_NONBLOCK@, with extra flags on the synchronous system call, but waits for completion events, similarly to @io_uring@.

For this project, I have chosen to use @io_uring@, in large part due to its generality. While @epoll@ has been shown to be a good solution for socket \io~\cite{DBLP:journals/pomacs/KarstenB20}, @io_uring@'s transparent support for files, pipes and more complex operations, like @splice@ and @tee@, makes it a better choice as the foundation for a general \io subsystem.

\section{Event-Engine}

The event engine's responsibility is to use the kernel interface to multiplex many \io operations onto few \glspl{kthrd}. In concrete terms, this means that \glspl{thrd} enter the engine through an interface, the event engine then starts the operations and parks the calling \glspl{thrd}, returning control to the \gls{proc}. The parked \glspl{thrd} are then rescheduled by the event engine once the desired operation has completed.

\subsection{\lstinline|io_uring| in depth}
Before going into details on the design of the event engine, I present some more details on the usage of @io_uring@ that are important for the design of the engine.

\begin{figure}
	\centering
	\input{io_uring.pstex_t}
	\caption[Overview of \lstinline|io_uring|]{Overview of \lstinline|io_uring| \smallskip\newline Two ring buffers are used to communicate with the kernel, one for completions~(right) and one for submissions~(left). The completion ring contains entries, \newterm{CQE}s: Completion Queue Entries, that are produced by the kernel when an operation completes and then consumed by the application. On the other hand, the application produces \newterm{SQE}s: Submit Queue Entries, which it appends to the submission ring for the kernel to consume. Unlike the completion ring, the submission ring does not contain the entries directly; it indexes into the SQE array (denoted \emph{S}) instead.}
	\label{fig:iouring}
\end{figure}

Figure~\ref{fig:iouring} shows an overview of an @io_uring@ instance. Multiple @io_uring@ instances can be created, in which case they each have a copy of the data structures in the figure. New \io operations are submitted to the kernel following four steps, which use the components shown in the figure.

\paragraph{First} an @sqe@ must be allocated from the pre-allocated array (denoted \emph{S} in Figure~\ref{fig:iouring}). This array is created at the same time as the @io_uring@ instance, is in kernel-locked memory, which means it is visible to both the kernel and the application, and has a fixed size determined at creation. How these entries are allocated is not important for the functioning of @io_uring@; the only requirement is that no entry is reused before the kernel has consumed it.

\paragraph{Second} the @sqe@ must be filled according to the desired operation. This step is straightforward; the only detail worth mentioning is that @sqe@s have a @user_data@ field that must be filled in order to match submission and completion entries.

\paragraph{Third} the @sqe@ must be submitted to the submission ring, which requires appending the index of the @sqe@ to the ring following regular ring-buffer steps: \lstinline|{ buffer[head] = item; head++ }|. Since the head is visible to the kernel, some memory barriers may be required to prevent the compiler and processor from reordering these operations. Since the submission ring is a regular ring buffer, more than one @sqe@ can be added at once and the head can be updated only after the entire batch has been added.

\paragraph{Finally} the kernel must be notified of the change to the ring using the system call @io_uring_enter@. The number of elements appended to the submission ring is passed as a parameter and the number of elements consumed is returned. The @io_uring@ instance can be constructed so that this step is not required, but this requires elevated privilege and early versions of @io_uring@ had additional restrictions.

The completion side is simpler: applications call @io_uring_enter@ with the flag @IORING_ENTER_GETEVENTS@ to wait for a desired number of operations to complete. The same call can be used to both submit @sqe@s and wait for operations to complete. When operations do complete, the kernel appends a @cqe@ to the completion ring and advances the head of the ring. Each @cqe@ contains the result of the operation as well as a copy of the @user_data@ field of the @sqe@ that triggered the operation. It is not necessary to call @io_uring_enter@ to get new events because the kernel can directly modify the completion ring; the system call is only needed if the application wants to block waiting for operations to complete.
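
For concreteness, the following sketch condenses the four submission steps and the completion side into a single synchronous read. It uses the liburing helper library, which wraps the ring manipulation and the @io_uring_enter@ calls; the raw interface performs the same steps by hand. The sketch is illustrative only and assumes a kernel recent enough to support the read opcode:
\begin{lstlisting}[language=C]
// Sketch: the submission steps and completion side, condensed into a
// synchronous read using the liburing helper library.
#include <liburing.h>

ssize_t uring_read(int fd, void *buf, unsigned count) {
	struct io_uring ring;
	io_uring_queue_init(8, &ring, 0);                    // create rings + sqe array

	struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);  // step 1: allocate an sqe
	io_uring_prep_read(sqe, fd, buf, count, 0);          // step 2: fill it in
	io_uring_sqe_set_data(sqe, buf);                     //         set user_data
	io_uring_submit(&ring);                              // steps 3-4: append + enter

	struct io_uring_cqe *cqe;
	io_uring_wait_cqe(&ring, &cqe);                      // enter with GETEVENTS
	ssize_t res = cqe->res;                              // operation result
	io_uring_cqe_seen(&ring, cqe);                       // consume the cqe
	io_uring_queue_exit(&ring);
	return res;
}
\end{lstlisting}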

The @io_uring_enter@ system call is protected by a lock inside the kernel. This means that concurrent calls to @io_uring_enter@ using the same instance are possible, but no performance can be gained from parallel calls to @io_uring_enter@. It is possible to do the first three submission steps in parallel; however, doing so requires careful synchronization.

@io_uring@ also introduces constraints on the number of operations that can be ``in flight'' at the same time. Obviously, @sqe@s are allocated from a fixed-size array, meaning that there is a hard limit to how many @sqe@s can be submitted at once. In addition, the @io_uring_enter@ system call can fail because ``The kernel [...] ran out of resources to handle [a request]'' or ``The application is attempting to overcommit the number of requests it can have pending.''. This means it can be necessary to handle bursts of \io requests by holding back some of the requests so they can be submitted at a later time.

\subsection{Multiplexing \io: Submission}
The submission side is the most complicated aspect of @io_uring@, and the completion side effectively follows from the design decisions made on the submission side.

While it is possible to do the first steps of submission in parallel, the duration of the system call scales with the number of entries submitted. The consequence is that the amount of parallelism that can be used to prepare submissions for the next system call is limited. Beyond this limit, the length of the system call is the throughput-limiting factor. I have concluded from early experiments that preparing submissions seems to take about as long as the system call itself, which means that with a single @io_uring@ instance, there is no benefit in terms of \io throughput to having more than two \glspl{hthrd}. Therefore the design of the submission engine must manage multiple instances of @io_uring@ running in parallel, effectively sharding @io_uring@ instances. Similarly to scheduling, this sharding can be done privately, \ie, one instance per \gls{proc}, in decoupled pools, \ie, a pool of \glspl{proc} using a pool of @io_uring@ instances without one-to-one coupling between any given instance and any given \gls{proc}, or some mix of the two. Since completions are sent to the instance where requests were submitted, all instances with pending operations must be polled continuously\footnote{As will be described in Chapter~\ref{practice}, this does not translate into constant CPU usage.}.

\subsubsection{Shared Instances}
One approach is to have multiple shared instances. \Glspl{thrd} attempting \io operations pick one of the available instances and submit operations to that instance. Since there is no coupling between \glspl{proc} and @io_uring@ instances in this approach, \glspl{thrd} running on more than one \gls{proc} can attempt to submit to the same instance concurrently. Since @io_uring@ effectively sets the amount of sharding needed to avoid contention on its internal locks, performance in this approach rests on two requirements: the synchronization needed to submit must not induce more contention than @io_uring@ already does, and the scheme to route \io requests to specific @io_uring@ instances must not introduce contention. This second requirement has outsized importance because it comes into play before the sharding of instances, and as such, all \glspl{hthrd} can contend on the routing algorithm.

Allocation in this scheme can be handled fairly easily. Free @sqe@s, \ie, @sqe@s that are not currently being used to represent a request, can be written to safely, and their @user_data@ field is only read by the kernel to copy to @cqe@s. Allocation also requires no ordering guarantee, as all free @sqe@s are interchangeable, so a simple concurrent bag suffices. The only added complexity is that the number of @sqe@s is fixed, which means allocation can fail. This is made worse by the fact that @io_uring@ users can chain together operations, in which case all @sqe@s forming a chain must be allocated from the same instance.

Allocation failures need to be pushed up to the routing algorithm: \glspl{thrd} attempting \io operations must not be directed to @io_uring@ instances without sufficient @sqe@s available. Furthermore, the routing algorithm should block operations up-front if none of the instances have available @sqe@s.

Once an @sqe@ is allocated, a \gls{thrd} can fill it in normally; it simply needs to keep track of the @sqe@ index and the instance it belongs to.

Once an @sqe@ is filled in, the @sqe@ must be added to the submission ring buffer, an operation that is not thread-safe in itself, and the kernel must be notified using the @io_uring_enter@ system call. The submission ring buffer is the same size as the pre-allocated @sqe@ buffer, therefore pushing to the ring buffer cannot fail\footnote{This is because it is invalid to have the same \lstinline|sqe| multiple times in the ring buffer.}. However, as mentioned, the system call itself can fail, with the expectation that it will be retried once some of the already submitted operations complete. Since multiple @sqe@s can be submitted to the kernel at once, it is important to strike a balance between batching and latency. Operations that are ready to be submitted should be batched together in few system calls, but at the same time, operations should not be left pending for long periods of time before being submitted. This can be handled either by designating one of the submitting \glspl{thrd} as being responsible for the system call for the current batch of @sqe@s, or by having some other party regularly submit all ready @sqe@s, \eg, the poller \gls{thrd} mentioned later in this section.

In the case of designating a \gls{thrd}, ideally, when multiple \glspl{thrd} attempt to submit operations to the same @io_uring@ instance, all requests would be batched together and one of the \glspl{thrd} would do the system call on behalf of the others, referred to as the \newterm{submitter}. In practice however, it is important that the \io requests are not left pending indefinitely, and as such, it may be required to have a current submitter and a next submitter. Indeed, as long as there is a ``next'' submitter, \glspl{thrd} submitting new \io requests can move on, knowing that some future system call will include their request. Once the system call is done, the submitter must also free @sqe@s so that the allocator can reuse them.
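
The following sketch shows one possible shape of this idea using C11 atomics; the structure and names are illustrative, not those of the actual runtime. A \gls{thrd} publishes its request and moves on if a submitter is already active, while the submitter re-checks the pending count after releasing its role so that no request is left stranded:
\begin{lstlisting}[language=C]
// Sketch (illustrative names): submitter designation with C11 atomics.
#include <stdatomic.h>
#include <stdbool.h>

struct instance {
	atomic_uint pending;     // sqes appended but not yet submitted
	atomic_bool submitting;  // does the instance currently have a submitter?
};

void submit(struct instance *inst) {
	atomic_fetch_add(&inst->pending, 1);        // publish the new request
	for (;;) {
		bool expected = false;
		if (!atomic_compare_exchange_strong(&inst->submitting, &expected, true))
			return;                             // active submitter covers our request
		unsigned batch = atomic_exchange(&inst->pending, 0);
		if (batch) { /* io_uring_enter(...), then free the consumed sqes */ }
		atomic_store(&inst->submitting, false); // release the submitter role
		// a request may have arrived after the exchange but before the release;
		// if so, loop and try to become the submitter again
		if (atomic_load(&inst->pending) == 0) return;
	}
}
\end{lstlisting}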

Finally, the completion side is much simpler, since the @io_uring@ system call enforces a natural synchronization point. Polling simply needs to regularly do the system call, go through the produced @cqe@s and communicate the results back to the originating \glspl{thrd}. Since @cqe@s only hold a signed 32-bit result, in addition to the copy of the @user_data@ field, all that is needed to communicate the result is a simple future~\cite{wiki:future}. If the submission side does not designate submitters, polling can also submit all @sqe@s as it is polling events. A simple approach to polling is to allocate a \gls{thrd} per @io_uring@ instance and simply let the poller \glspl{thrd} poll their respective instances when scheduled. This design is especially convenient for reasons explained in Chapter~\ref{practice}.
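
A sketch of such a future is shown below; the field names are illustrative. The poller recovers the future through the @cqe@'s @user_data@ field, copies in the result, and unblocks the waiting \gls{thrd}:
\begin{lstlisting}[language=C]
// Sketch (illustrative): a one-shot future carrying the 32-bit result.
#include <stdatomic.h>
#include <stdint.h>

struct io_future {
	atomic_bool done;       // set by the poller once the cqe is consumed
	int32_t     result;     // copy of cqe->res
	void       *waiter;     // parked thread to reschedule, runtime-specific
};

// Poller side, called once per cqe; the future pointer comes from user_data.
void fulfil(struct io_future *f, int32_t res) {
	f->result = res;
	atomic_store_explicit(&f->done, true, memory_order_release);
	// unpark(f->waiter);   // reschedule the originating thread
}
\end{lstlisting}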

The big advantage of this pool-of-instances approach is that it is fairly flexible. It does not impose restrictions on what \glspl{thrd} submitting \io operations can and cannot do between allocations and submissions. It can also gracefully handle running out of resources, \ie, @sqe@s, or the kernel returning @EBUSY@. The downside is that many of the steps used for submitting need complex synchronization to work properly. The routing and allocation algorithm needs to keep track of which ring instances have available @sqe@s, block incoming requests if no instance is available, prevent barging if \glspl{thrd} are already queued up waiting for @sqe@s, and handle @sqe@s being freed. The submission side needs to safely append @sqe@s to the ring buffer, make sure no @sqe@ is dropped or left pending forever, notify the allocation side when @sqe@s can be reused, and handle the kernel returning @EBUSY@. All this synchronization may have a significant cost and, compared to the approach presented next, this synchronization is entirely overhead.

\subsubsection{Private Instances}
Another approach is to simply create one ring instance per \gls{proc}. This alleviates the need for synchronization on submissions, requiring only that \glspl{thrd} are not interrupted between two submission steps. This is effectively the same requirement as using @thread_local@ variables. Since @sqe@s that are allocated must be submitted to the same ring, on the same \gls{proc}, this effectively forces the application to submit @sqe@s in allocation order\footnote{The actual requirement is that \glspl{thrd} cannot context switch between allocation and submission. This requirement means that from the subsystem's point of view, the allocation and submission are sequential. To remove this requirement, a \gls{thrd} would need the ability to ``yield to a specific \gls{proc}'', \ie, park with the promise that it will be run next on a specific \gls{proc}, the \gls{proc} attached to the correct ring.}, greatly simplifying both allocation and submission. In this design, allocation and submission form a partitioned ring buffer, as shown in Figure~\ref{fig:pring}. Once added to the ring buffer, the attached \gls{proc} has a significant amount of flexibility with regards to when to do the system call. Possible options are: when the \gls{proc} runs out of \glspl{thrd} to run, after running a given number of \glspl{thrd}, \etc.

\begin{figure}
	\centering
	\input{pivot_ring.pstex_t}
	\caption[Partitioned ring buffer]{Partitioned ring buffer \smallskip\newline Allocated sqes are appended to the first partition. When submitting, the partition is simply advanced to include all the sqes that should be submitted. The kernel considers the partition as the head of the ring.}
	\label{fig:pring}
\end{figure}

This approach has the advantage that it does not require much of the synchronization needed in the shared approach. This comes at the cost that \glspl{thrd} submitting \io operations have less flexibility: they cannot park or yield, and several exceptional cases are handled poorly. Instances running out of @sqe@s cannot run \glspl{thrd} wanting to do \io operations; in such a case, the \gls{thrd} needs to be moved to a different \gls{proc}, and the only current way of achieving this is to @yield()@ hoping to be scheduled on a different \gls{proc}, which is not guaranteed.

A more involved version of this approach can seem to solve most of these problems, using a pattern called \newterm{helping}. \Glspl{thrd} that wish to submit \io operations but cannot do so, either because of an allocation failure or because they were migrated to a different \gls{proc} between allocation and submission, create an object representing what they wish to achieve and add it to a list somewhere. For this particular problem, one solution would be to have a list per \gls{proc} of submissions that could not be completed because the \gls{thrd} was moved, and another list, probably per cluster, of \glspl{thrd} that were unable to allocate enough @sqe@s. The problem with these ``solutions'' is that they are still bound by the strong coupling between \glspl{proc} and @io_uring@ instances. Imagine a simple case with two \glspl{thrd} on two \glspl{proc}: one \gls{thrd} submits an \io operation and then sets a flag, while the other \gls{thrd} spins until the flag is set. If the first \gls{thrd} is preempted between allocation and submission and moves to the other \gls{proc}, the original \gls{proc} could start running the spinning \gls{thrd}. If this happens, the helping ``solution'' is for the \gls{thrd} wanting to submit its \io operation to append an item to a list belonging to the \gls{proc} where the allocation was made. No other \gls{proc} can help the \gls{thrd} since @io_uring@ instances are strongly coupled to \glspl{proc}. However, in this case, the \gls{proc} is unable to help because it is executing the spinning \gls{thrd} mentioned above\footnote{This particular example is completely artificial, but in the presence of many more \glspl{thrd}, it is not impossible that this problem would arise ``in the wild'' and could be difficult for users to reliably detect and avoid.}. Once in this situation, the only escape is to interrupt the execution of the \gls{thrd}, either directly or through regular preemption; only then can the \gls{proc} take the time to handle the pending request to help. Interrupting \glspl{thrd} for this purpose is far from desirable: the cost is significant and the situation may be hard to detect. However, a more subtle reason why interrupting the \gls{thrd} is not a satisfying solution comes from the fact that the \gls{proc} is not actually using the instance it is tied to. If it were to use it, then helping could be done as part of that usage. Interrupts are needed here entirely because the \gls{proc} is tied to an instance it is not using. Therefore, a more satisfying solution would be for the \gls{thrd} submitting the operation to simply notice that the instance is unused and go ahead and use it. This is the approach presented next.

\subsubsection{Instance borrowing}
Both of the approaches presented above have highly undesirable aspects that stem from too loose or too tight coupling between @io_uring@ and \glspl{proc}. In the first approach, loose coupling means that all operations have synchronization overhead that a tighter coupling can avoid. The second approach, on the other hand, suffers from tight coupling causing problems when the \gls{proc} does not benefit from the coupling. While \glspl{proc} are continuously issuing \io operations, tight coupling is valuable since it avoids synchronization costs. However, in unlikely failure cases or when \glspl{proc} are not making use of their instance, tight coupling is no longer advantageous. A compromise between these approaches would be to allow tight coupling but have the option to revoke this coupling dynamically when failure cases arise. I call this approach ``instance borrowing''\footnote{While it looks similar to work sharing and work stealing, I think it is different enough from either to warrant a different verb to avoid confusion.}.

In this approach, each cluster owns a pool of @io_uring@ instances managed by an arbiter. When a \gls{thrd} attempts to issue an \io operation, it asks for an instance from the arbiter and issues requests to that instance. However, in doing so, it ties the instance to the \gls{proc} it is currently running on. This coupling is kept until the arbiter decides to revoke it, taking back the instance and reverting the \gls{proc} to its initial state with respect to \io. This tight coupling means that synchronization can be minimal, since only one \gls{proc} can use the instance at any given time, akin to the private instances approach. However, where it differs is that revocation by the arbiter means this approach does not suffer from the potential deadlock described above.

% Since private instances appear to work well in the easy case, an intersting option would be keep instances private while it is convenient but


% Arbitration is needed in the following cases
% \begin{enumerate}
%       \item The instance does not have sufficient @sqe@s to satisfy the request.
%       \item The current \gls{proc} does not currently hold an instance.
%       \item The current \gls{proc} has the wrong instance, this happens if the submitting \gls{thrd} context-switched between allocation and submission.
% \end{enumerate}
% In all cases


% Verbs of this design

% Allocation: obtaining an sqe from which to fill in the io request, enforces the io instance to use since it must be the one which provided the sqe. Must interact with the arbiter if the instance does not have enough sqe for the allocation. (Typical allocation will ask for only one sqe, but chained sqe must be allocated from the same context so chains of sqe must be allocated in bulks)

% Submition: simply adds the sqe(s) to some data structure to communicate that they are ready to go. This operation can't fail because there are as many spots in the submit buffer than there are sqes. Must interact with the arbiter only if the thread was moved between the allocation and the submission.

% Flushing: Taking all the sqes that were submitted and making them visible to the kernel, also counting them in order to figure out what to_submit should be. Must be thread-safe with submission. Has to interact with the Arbiter if there are external submissions. Can't simply use a protected queue because adding to the array is not safe if the ring is still available for submitters. Flushing must therefore: check if there are external pending requests if so, ask the arbiter to flush otherwise use the fast flush operation.

% Collect: Once the system call is done, it returns how many sqes were consumed by the system. These must be freed for allocation. Must interact with the arbiter to notify that things are now ready.

% Handle: process all the produced cqe. No need to interact with any of the submission operations or the arbiter.




% alloc():
%       proc.io->in_use = true, __ATOMIC_ACQUIRE
%       if cltr.io.flag || !proc.io || proc.io->flag:
%               return alloc_slow(cltr.io, proc.io)

%       a = alloc_fast(proc.io)
%       if a:
%               proc.io->in_use = false, __ATOMIC_RELEASE
%               return a

%       return alloc_slow(cltr.io)

% alloc_fast()
%       left = proc.io->submit_q.free.tail - proc.io->submit_q.free.head
%       if num_entries - left < want:
%               return None

%       a = ready[head]
%       head = head + 1, __ATOMIC_RELEASE

% alloc_slow()
%       cltr.io.flag = true, __ATOMIC_ACQUIRE
%       while(proc.io && proc.io->in_use) pause;



% submit(a):
%       proc.io->in_use = true, __ATOMIC_ACQUIRE
%       if cltr.io.flag || proc.io != alloc.io || proc.io->flag:
%               return submit_slow(cltr.io)

%       submit_fast(proc.io, a)
%       proc.io->in_use = false, __ATOMIC_RELEASE

% polling()
%       loop:
%               yield
%               flush()
%               io_uring_enter
%               collect
%               handle()

\section{Interface}
Finally, the last important part of the \io subsystem is its interface. There are multiple approaches that can be offered to programmers, each with advantages and disadvantages. The new \io subsystem can replace the C runtime's \glsxtrshort{api} or extend it, and in the latter case, the interface can range from very similar to vastly different. The following sections discuss some useful options using @read@ as an example. The standard Linux interface for C is:

@ssize_t read(int fd, void *buf, size_t count);@
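
To preview the options discussed in the following subsections, the declarations below sketch what each choice might look like for @read@; the names and the future type are purely illustrative:
\begin{lstlisting}[language=C]
// Illustrative shapes of the interface options discussed below.
ssize_t read(int fd, void *buf, size_t count);             // replacement: same signature
ssize_t cfa_read(int fd, void *buf, size_t count);         // synchronous extension
io_future_t *async_read(int fd, void *buf, size_t count);  // asynchronous extension
\end{lstlisting}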

\subsection{Replacement}
Replacing the C \glsxtrshort{api}

\subsection{Synchronous Extension}

\subsection{Asynchronous Extension}

\subsection{Interface directly to \lstinline|io_uring|}