Changes in / [b0ceb72:1260224]


  • doc/theses/thierry_delisle_PhD/thesis/text/io.tex

    rb0ceb72 r1260224  
    11\chapter{User Level \io}\label{userio}
    22As mentioned in Section~\ref{prev:io}, user-level \io requires multiplexing the \io operations of many \ats onto fewer \glspl{proc} using asynchronous \io operations.
    3 Different operating systems offer various forms of asynchronous operations and, as mentioned in Chapter~\ref{intro}, this work is exclusively focused on the Linux operating-system.
     3Different operating systems offer various forms of asynchronous operations and, as mentioned in Chapter~\ref{intro}, this work is exclusively focused on the Linux operating system.
    44
    55\section{Kernel Interface}
     
    1313In this context, ready means \emph{some} operation can be performed without blocking.
    1414It does not mean an operation returning \lstinline{EAGAIN} succeeds on the next try.
    15 For example, a ready read may only return a subset of requested bytes and the read must be issues again for the remaining bytes, at which point it may return \lstinline{EAGAIN}.}
     15For example, a ready read may only return a subset of requested bytes and the read must be issued again for the remaining bytes, at which point it may return \lstinline{EAGAIN}.}
    1616This mechanism is also crucial in determining when all \ats are blocked and the application \glspl{kthrd} can now block.
    1717
    1818There are three options to monitor file descriptors in Linux:\footnote{
    1919For simplicity, this section omits \lstinline{pselect} and \lstinline{ppoll}.
    20 The difference between these system calls and \lstinline{select} and \lstinline{poll}, respectively, is not relevant for this discussion.},
     20The difference between these system calls and \lstinline{select} and \lstinline{poll}, respectively, is not relevant for this discussion.}
    2121@select@~\cite{MAN:select}, @poll@~\cite{MAN:poll} and @epoll@~\cite{MAN:epoll}.
    2222All three of these options offer a system call that blocks a \gls{kthrd} until at least one of many file descriptors becomes ready.
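All three interfaces share the same basic shape: hand the kernel an interest set and block until some member of it is ready. As a concrete illustration, here is a minimal sketch using @poll@; the helper name and the fixed-size array are assumptions for the example, not part of any runtime discussed here.

```c
/* Sketch (hypothetical helper): block until one of several file
 * descriptors becomes ready for reading, the pattern shared by
 * select, poll and epoll_wait. */
#include <poll.h>
#include <stddef.h>

/* Wait until at least one fd in fds[0..n-1] is readable.
 * Returns the index of the first ready fd, or -1 on error/timeout. */
int wait_first_readable(const int *fds, size_t n, int timeout_ms) {
    struct pollfd pfds[64];
    if (n > 64) return -1;
    for (size_t i = 0; i < n; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
        pfds[i].revents = 0;
    }
    int rc = poll(pfds, (nfds_t)n, timeout_ms);
    if (rc <= 0) return -1;                 /* error or timeout */
    for (size_t i = 0; i < n; i++)          /* scan for the ready fd */
        if (pfds[i].revents & POLLIN) return (int)i;
    return -1;
}
```

Note that readiness only means the next operation does not block; as stated above, a ready read may still return fewer bytes than requested.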
     
    3333Often the I/O manager has a timeout, polls, or is sent a signal on changes to mitigate this problem.
    3434
    35 % \begin{comment}
    36 % From: Tim Brecht <brecht@uwaterloo.ca>
    37 % Subject: Re: FD sets
    38 % Date: Wed, 6 Jul 2022 00:29:41 +0000
    39 
    40 % Large number of open files
    41 % --------------------------
    42 
    43 % In order to be able to use more than the default number of open file
    44 % descriptors you may need to:
    45 
    46 % o increase the limit on the total number of open files /proc/sys/fs/file-max
    47 %   (on Linux systems)
    48 
    49 % o increase the size of FD_SETSIZE
    50 %   - the way I often do this is to figure out which include file __FD_SETSIZE
    51 %     is defined in, copy that file into an appropriate directory in ./include,
    52 %     and then modify it so that if you use -DBIGGER_FD_SETSIZE the larger size
    53 %     gets used
    54 
    55 %   For example on a RH 9.0 distribution I've copied
    56 %   /usr/include/bits/typesizes.h into ./include/i386-linux/bits/typesizes.h
    57 
    58 %   Then I modify typesizes.h to look something like:
    59 
    60 %   #ifdef BIGGER_FD_SETSIZE
    61 %   #define __FD_SETSIZE            32767
    62 %   #else
    63 %   #define __FD_SETSIZE            1024
    64 %   #endif
    65 
    66 %   Note that the since I'm moving and testing the userver on may different
    67 %   machines the Makefiles are set up to use -I ./include/$(HOSTTYPE)
    68 
    69 %   This way if you redefine the FD_SETSIZE it will get used instead of the
    70 %   default original file.
    71 % \end{comment}
    72 
    7335\paragraph{\lstinline{poll}} is the next oldest option, and takes as input an array of structures containing the FD numbers rather than their position in an array of bits, allowing a more compact input for interest sets that contain widely spaced FDs.
    7436For small interest sets with densely packed FDs, the @select@ bit mask can take less storage, and hence, copy less information into the kernel.
    75 Furthermore, @poll@ is non-destructive, so the array of structures does not have to be re-initialize on every call.
    76 Like @select@, @poll@ suffers from the limitation that the interest set cannot be changed by other \gls{kthrd}, while a manager thread is blocked in @poll@.
     37Furthermore, @poll@ is non-destructive, so the array of structures does not have to be re-initialized on every call.
     38Like @select@, @poll@ suffers from the limitation that the interest set cannot be changed by other \glspl{kthrd}, while a manager thread is blocked in @poll@.
    7739
    7840\paragraph{\lstinline{epoll}} follows after @poll@, and places the interest set in the kernel rather than the application, where it is managed by an internal \gls{kthrd}.
     
    9052An alternative to @O_NONBLOCK@ is the AIO interface.
    9153Its interface lets programmers enqueue operations to be performed asynchronously by the kernel.
    92 Completions of these operations can be communicated in various ways: either by spawning a new \gls{kthrd}, sending a Linux signal, or by polling for completion of one or more operation.
     54Completions of these operations can be communicated in various ways: either by spawning a new \gls{kthrd}, sending a Linux signal, or polling for completion of one or more operations.
    9355For this work, spawning a new \gls{kthrd} is counter-productive but a related solution is discussed in Section~\ref{io:morethreads}.
    94 Using interrupts handlers can also lead to fairly complicated interactions between subsystems and has non-trivial cost.
     56Using interrupt handlers can also lead to fairly complicated interactions between subsystems and has a non-trivial cost.
    9557This leaves polling for completion, which is similar to the previous system calls.
    9658While AIO only supports read and write operations to file descriptors, it does not have the same limitation as @O_NONBLOCK@, \ie, the file descriptors can be regular files and block devices.
    9759It also supports batching multiple operations in a single system call.
    9860
    99 AIO offers two different approaches to polling: @aio_error@ can be used as a spinning form of polling, returning @EINPROGRESS@ until the operation is completed, and @aio_suspend@ can be used similarly to @select@, @poll@ or @epoll@, to wait until one or more requests have completed.
    100 For the purpose of \io multiplexing, @aio_suspend@ is the best interface.
     61AIO offers two different approaches to polling: @aio_error@ can be used as a spinning form of polling, returning @EINPROGRESS@ until the operation is completed, and @aio_suspend@ can be used similarly to @select@, @poll@ or @epoll@, to wait until one or more requests have been completed.
     62For \io multiplexing, @aio_suspend@ is the best interface.
    10163However, even if AIO requests can be submitted concurrently, @aio_suspend@ suffers from the same limitation as @select@ and @poll@, \ie, the interest set cannot be dynamically changed while a call to @aio_suspend@ is in progress.
    102 AIO also suffers from the limitation of specifying which requests have completed, \ie programmers have to poll each request in the interest set using @aio_error@ to identify the completed requests.
     64AIO also suffers from the limitation of specifying which requests have been completed, \ie programmers have to poll each request in the interest set using @aio_error@ to identify the completed requests.
    10365This limitation means that, like @select@ and @poll@ but not @epoll@, the time needed to examine polling results increases based on the total number of requests monitored, not the number of completed requests.
    10466Finally, AIO does not seem to be a popular interface, which I believe is due in part to this poor polling interface.
     
    12486in
    12587``some kind of arbitrary \textit{queue up asynchronous system call} model''.
    126 This description is actually quite close to the interface described in the next section.
     88This description is quite close to the interface described in the next section.
    12789
    12890\subsection{\lstinline{io_uring}}
     
    13597In addition to supporting reads and writes to any file descriptor like AIO, it supports other operations like @open@, @close@, @fsync@, @accept@, @connect@, @send@, @recv@, @splice@, \etc.
    13698
    137 On top of these, @io_uring@ adds many extras like avoiding copies between the kernel and user-space using shared memory, allowing different mechanisms to communicate with device drivers, and supporting chains of requests, \ie, requests that automatically trigger followup requests on completion.
     99On top of these, @io_uring@ adds many extras like avoiding copies between the kernel and user space using shared memory, allowing different mechanisms to communicate with device drivers, and supporting chains of requests, \ie, requests that automatically trigger follow-up requests on completion.
    138100
    139101\subsection{Extra Kernel Threads}\label{io:morethreads}
     
    143105This approach is used by languages like Go~\cite{GITHUB:go}, frameworks like libuv~\cite{libuv}, and web servers like Apache~\cite{apache} and NGINX~\cite{nginx}, since it has the advantage that it can easily be used across multiple operating systems.
    144106This advantage is especially relevant for languages like Go, which offer a homogeneous \glsxtrshort{api} across all platforms.
    145 As opposed to C, which has a very limited standard api for \io, \eg, the C standard library has no networking.
      107This contrasts with C, which has a very limited standard \glsxtrshort{api} for \io; \eg, the C standard library has no networking.
    146108
    147109\subsection{Discussion}
     
    156118An event engine's responsibility is to use the kernel interface to multiplex many \io operations onto few \glspl{kthrd}.
    157119In concrete terms, this means \ats enter the engine through an interface, the event engine then starts an operation and parks the calling \ats, returning control to the \gls{proc}.
    158 The parked \ats are then rescheduled by the event engine once the desired operation has completed.
     120The parked \ats are then rescheduled by the event engine once the desired operation has been completed.
    159121
    160122\subsection{\lstinline{io_uring} in depth}\label{iouring}
     
    171133        \centering
    172134        \input{io_uring.pstex_t}
    173         \caption[Overview of \lstinline{io_uring}]{Overview of \lstinline{io_uring} \smallskip\newline Two ring buffer are used to communicate with the kernel, one for completions~(right) and one for submissions~(left). The submission ring indexes into a pre-allocated array (denoted \emph{S}) instead.}
      135        \caption[Overview of \lstinline{io_uring}]{Overview of \lstinline{io_uring} \smallskip\newline Two ring buffers are used to communicate with the kernel, one for completions~(right) and one for submissions~(left). Unlike the completion ring, the submission ring does not contain its entries directly; instead, it indexes into a pre-allocated array of entries (denoted \emph{S}).}
    174136        \label{fig:iouring}
    175137\end{figure}
     
    184146\item
    185147The SQE is filled according to the desired operation.
    186 This step is straight forward.
    187 The only detail worth mentioning is that SQEs have a @user_data@ field that must be filled in order to match submission and completion entries.
     148This step is straightforward.
     149The only detail worth mentioning is that SQEs have a @user_data@ field that must be filled to match submission and completion entries.
    188150\item
    189151The SQE is submitted to the submission ring by appending the index of the SQE to the ring following regular ring buffer steps: \lstinline{buffer[head] = item; head++}.
     
    207169
    208170The @io_uring_enter@ system call is protected by a lock inside the kernel.
    209 This protection means that concurrent call to @io_uring_enter@ using the same instance are possible, but there is no performance gained from parallel calls to @io_uring_enter@.
     171This protection means that concurrent calls to @io_uring_enter@ using the same instance are possible, but there is no performance gained from parallel calls to @io_uring_enter@.
    210172It is possible to do the first three submission steps in parallel;
    211173however, doing so requires careful synchronization.
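The careful synchronization mentioned above centres on the ring append itself. A single-producer sketch of publishing one entry (a hypothetical helper, not the actual @io_uring@ ABI): the entry is written first, and the index is only then made visible with a release store, so a concurrent consumer never observes a half-written slot.

```c
/* Sketch (hypothetical): the "buffer[head] = item; head++" step of
 * submission, done safely for one producer and one concurrent reader
 * using a C11 release store to publish the new head. */
#include <stdatomic.h>
#include <stdint.h>

struct subring {
    uint32_t        *array;   /* ring entries: indexes into the SQE array */
    uint32_t         mask;    /* size - 1, size a power of two            */
    _Atomic uint32_t head;    /* written by the producer, read elsewhere  */
};

static void subring_push(struct subring *r, uint32_t sqe_idx) {
    uint32_t h = atomic_load_explicit(&r->head, memory_order_relaxed);
    r->array[h & r->mask] = sqe_idx;          /* buffer[head] = item */
    atomic_store_explicit(&r->head, h + 1,    /* head++, published   */
                          memory_order_release);
}
```

With multiple producers, even this small step needs a lock or an atomic reservation of the slot, which is why sharing a ring is costly.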
     
    216178This restriction means \io request bursts may have to be subdivided and submitted in chunks at a later time.
    217179
    218 An important detail to keep in mind is that just like ``The cloud is just someone else's computer''\cite{xkcd:cloud}, asynchronous operations are just operation using someone else's threads.
    219 Indeed, asynchronous operation can require computation time to complete, which means that if this time is not taken from the thread that triggered the asynchronous operation, it must be taken from some other threads.
     180An important detail to keep in mind is that just like ``The cloud is just someone else's computer''\cite{xkcd:cloud}, asynchronous operations are just operations using someone else's threads.
     181Indeed, asynchronous operations can require computation time to complete, which means that if this time is not taken from the thread that triggered the asynchronous operation, it must be taken from some other threads.
    220182In this case, the @io_uring@ operations that cannot be handled directly in the system call must be delegated to some other \gls{kthrd}.
    221183To this end, @io_uring@ maintains multiple \glspl{kthrd} inside the kernel that are not exposed to the user.
    222 There are three kind of operations that can need the \glspl{kthrd}:
      184Three kinds of operations can need the \glspl{kthrd}:
    223185
    224186\paragraph{Operations using} @IOSQE_ASYNC@.
     
    228190This is also a fairly simple case. As mentioned earlier in this chapter, @O_NONBLOCK@ has no effect for regular files and block devices.
    229191@io_uring@ must also take this reality into account by delegating operations on regular files and block devices.
    230 In fact @io_uring@ maintains a pool of \glspl{kthrd} dedicated to these operations, which are referred to as \newterm{bounded workers}.
     192In fact, @io_uring@ maintains a pool of \glspl{kthrd} dedicated to these operations, which are referred to as \newterm{bounded workers}.
    231193
    232194\paragraph{Unbounded operations that must be retried.}
     
    235197@io_uring@ maintains a separate pool for these operations.
    236198The \glspl{kthrd} in this pool are referred to as \newterm{unbounded workers}.
    237 Unbounded workers are also responsible of handling operations using @IOSQE_ASYNC@.
     199Unbounded workers are also responsible for handling operations using @IOSQE_ASYNC@.
    238200
    239201@io_uring@ implicitly spawns and joins both the bounded and unbounded workers based on its evaluation of the needs of the workload.
    240202This effectively encapsulates the work that is needed when using @epoll@.
    241 Indeed, @io_uring@ does not change Linux's underlying handling of \io opeartions, it simply offers an asynchronous \glsxtrshort{api} on top of the existing system.
     203Indeed, @io_uring@ does not change Linux's underlying handling of \io operations, it simply offers an asynchronous \glsxtrshort{api} on top of the existing system.
    242204
    243205
    244206\subsection{Multiplexing \io: Submission}
    245207
    246 The submission side is the most complicated aspect of @io_uring@ and the completion side effectively follows from the design decisions made in the submission side.
     208The submission side is the most complicated aspect of @io_uring@ and the completion side effectively follows from the design decisions made on the submission side.
    247209While there is freedom in designing the submission side, there are some realities of @io_uring@ that must be taken into account.
    248210It is possible to do the first steps of submission in parallel;
     
    255217As described in Chapter~\ref{practice}, this does not translate into constant CPU usage.}.
    256218Note that once an operation completes, there is nothing that ties it to the @io_uring@ instance that handled it.
    257 There is nothing preventing a new operation with, \eg the same file descriptors to a different @io_uring@ instance.
      219Nothing prevents a new operation, with, for example, the same file descriptor, from using a different @io_uring@ instance.
    258220
    259221A complicating aspect of submission is @io_uring@'s support for chains of operations, where the completion of an operation triggers the submission of the next operation on the link.
     
    263225Support for this feature can be fulfilled simply by supporting arbitrary user code between allocation and submission.
    264226
    265 Similar to scheduling, sharding @io_uring@ instances can be done privately, \ie, one instance per \glspl{proc}, in decoupled pools, \ie, a pool of \glspl{proc} use a pool of @io_uring@ instances without one-to-one coupling between any given instance and any given \gls{proc}, or some mix of the two.
     227Similar to scheduling, sharding @io_uring@ instances can be done privately, \ie, one instance per \proc, in decoupled pools, \ie, a pool of \procs using a pool of @io_uring@ instances without one-to-one coupling between any given instance and any given \gls{proc}, or some mix of the two.
    266228These three sharding approaches are analyzed.
    267229
     
    270232This alleviates the need for synchronization on the submissions, requiring only that \ats are not time-sliced during submission steps.
    271233This requirement is the same as accessing @thread_local@ variables, where a \at is accessing kernel-thread data, is time-sliced, and continues execution on another kernel thread but is now accessing the wrong data.
    272 This failure is the serially reusable problem~\cite{SeriallyReusable}.
    273 Hence, allocated SQEs must be submitted to the same ring on the same \gls{proc}, which effectively forces the application to submit SQEs in allocation order.\footnote{
     234This failure is the \newterm{serially reusable problem}~\cite{SeriallyReusable}.
     235Hence, allocated SQEs must be submitted to the same ring on the same \gls{proc}, which effectively forces the application to submit SQEs in order of allocation.\footnote{
    274236To remove this requirement, a \at needs the ability to ``yield to a specific \gls{proc}'', \ie, \park with the guarantee it unparks on a specific \gls{proc}, \ie the \gls{proc} attached to the correct ring.}
    275237From the subsystem's point of view, the allocation and submission are sequential, greatly simplifying both.
    276238In this design, allocation and submission form a partitioned ring buffer as shown in Figure~\ref{fig:pring}.
    277 Once added to the ring buffer, the attached \gls{proc} has a significant amount of flexibility with regards to when to perform the system call.
     239Once added to the ring buffer, the attached \gls{proc} has a significant amount of flexibility with regard to when to perform the system call.
    278240Possible options are: when the \gls{proc} runs out of \ats to run, after running a given number of \ats, \etc.
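Under these assumptions, the partitioned ring can be sketched with three indices; the structure and helper names below are hypothetical, following Figure~\ref{fig:pring} rather than any existing implementation.

```c
/* Sketch (hypothetical, after the partitioned ring buffer figure):
 * one ring per processor, split by a pivot.  Entries in [pivot, tail)
 * are allocated but not yet visible to the kernel; advancing the
 * pivot to tail submits them all at once.  No locks are needed, as
 * only the owning processor touches pivot and tail. */
#include <stdint.h>
#include <stdbool.h>

struct pring {
    uint32_t mask;   /* capacity - 1, capacity a power of two        */
    uint32_t head;   /* oldest entry the kernel has not yet consumed */
    uint32_t pivot;  /* boundary: entries before it are submitted    */
    uint32_t tail;   /* next free slot; allocation appends here      */
};

/* Allocate one slot; returns false when the ring is full. */
static bool pring_alloc(struct pring *r, uint32_t *slot) {
    if (r->tail - r->head > r->mask) return false;   /* full */
    *slot = r->tail++ & r->mask;
    return true;
}

/* Submit everything allocated so far by advancing the partition;
 * returns the number of entries handed to the kernel. */
static uint32_t pring_submit(struct pring *r) {
    uint32_t n = r->pivot == r->tail ? 0 : r->tail - r->pivot;
    r->pivot = r->tail;
    return n;
}
```

Because allocation and submission share one ring, entries are necessarily submitted in allocation order, which is exactly the restriction noted above.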
    279241
     
    281243        \centering
    282244        \input{pivot_ring.pstex_t}
    283         \caption[Partitioned ring buffer]{Partitioned ring buffer \smallskip\newline Allocated sqes are appending to the first partition.
     245        \caption[Partitioned ring buffer]{Partitioned ring buffer \smallskip\newline Allocated sqes are appended to the first partition.
    284246        When submitting, the partition is advanced.
    285247        The kernel considers the partition as the head of the ring.}
     
    294256A more involved version of this approach tries to solve these problems using a pattern called \newterm{helping}.
    295257\ats that cannot submit \io operations, either because of an allocation failure or \glslink{atmig}{migration} to a different \gls{proc} between allocation and submission, create an \io object and add it to a list of pending submissions per \gls{proc} and a list of pending allocations, probably per cluster.
    296 While there is still the strong coupling between \glspl{proc} and @io_uring@ instances, these data structures allow moving \ats to a specific \gls{proc}, when the current \gls{proc} cannot fulfill the \io request.
     258While there is still a strong coupling between \glspl{proc} and @io_uring@ instances, these data structures allow moving \ats to a specific \gls{proc}, when the current \gls{proc} cannot fulfill the \io request.
    297259
    298260Imagine a simple scenario with two \ats on two \glspl{proc}, where one \at submits an \io operation and then sets a flag, while the other \at spins until the flag is set.
     
    301263No other \gls{proc} can help the \at since @io_uring@ instances are strongly coupled to \glspl{proc}.
    302264However, the \io \gls{proc} is unable to help because it is executing the spinning \at resulting in a deadlock.
    303 While this example is artificial, in the presence of many \ats, it is possible for this problem to arise ``in the wild''.
     265While this example is artificial, in the presence of many \ats, this problem can arise ``in the wild''.
    304266Furthermore, this pattern is difficult to reliably detect and avoid.
    305 Once in this situation, the only escape is to interrupted the spinning \at, either directly or via some regular preemption, \eg time slicing.
     267Once in this situation, the only escape is to interrupt the spinning \at, either directly or via some regular preemption, \eg time slicing.
    306268Having to interrupt \ats for this purpose is costly, the latency can be large between interrupts, and the situation may be hard to detect.
    307269Interrupts are needed here entirely because the \gls{proc} is tied to an instance it is not using.
     
    319281\item
    320282The scheme to route \io requests to specific @io_uring@ instances does not introduce contention.
    321 This aspect has an oversized importance because it comes into play before the sharding of instances, and as such, all \glspl{hthrd} can contend on the routing algorithm.
      283This aspect has outsized importance because it comes into play before the sharding of instances, and as such, all \glspl{hthrd} can contend on the routing algorithm.
    322284\end{itemize}
    323285
    324286Allocation in this scheme is fairly easy.
    325 Free SQEs, \ie, SQEs that are not currently being used to represent a request, can be written to safely and have a field called @user_data@ that the kernel only reads to copy to @cqe@s.
    326 Allocation also requires no ordering guarantee as all free SQEs are interchangeable.
      287Free SQEs, \ie, SQEs that are not currently being used to represent a request, can be written to safely and have a field called @user_data@ that the kernel only reads to copy to CQEs.
     288Allocation also does not require ordering guarantees as all free SQEs are interchangeable.
    327289The only added complexity is that the number of SQEs is fixed, which means allocation can fail.
    328290
     
    330292Furthermore, the routing algorithm should block operations up-front, if none of the instances have available SQEs.
    331293
    332 Once an SQE is allocated, \ats insert the \io request information, and keep track of the SQE index and the instance it belongs to.
     294Once an SQE is allocated, \ats insert the \io request information and keep track of the SQE index and the instance it belongs to.
    333295
    334296Once an SQE is filled in, it is added to the submission ring buffer, an operation that is not thread-safe, and then the kernel must be notified using the @io_uring_enter@ system call.
    335 The submission ring buffer is the same size as the pre-allocated SQE buffer, therefore pushing to the ring buffer cannot fail because it would mean a \lstinline{sqe} multiple times in the ring buffer, which is undefined behaviour.
    336 However, as mentioned, the system call itself can fail with the expectation that it can be retried once some submitted operations complete.
      297The submission ring buffer is the same size as the pre-allocated SQE buffer; therefore, pushing to the ring buffer cannot fail, because failure would mean the same SQE appears multiple times in the ring buffer, which is undefined behaviour.
     298However, as mentioned, the system call itself can fail with the expectation that it can be retried once some submitted operations are complete.
    337299
    338300Since multiple SQEs can be submitted to the kernel at once, it is important to strike a balance between batching and latency.
    339 Operations that are ready to be submitted should be batched together in few system calls, but at the same time, operations should not be left pending for long period of times before being submitted.
    340 Balancing submission can be handled by either designating one of the submitting \ats as the being responsible for the system call for the current batch of SQEs or by having some other party regularly submitting all ready SQEs, \eg, the poller \at mentioned later in this section.
     301Operations that are ready to be submitted should be batched together in few system calls, but at the same time, operations should not be left pending for long periods before being submitted.
     302Balancing submission can be handled by either designating one of the submitting \ats as the \at responsible for the system call for the current batch of SQEs or by having some other party regularly submit all ready SQEs, \eg, the poller \at mentioned later in this section.
    341303
    342304Ideally, when multiple \ats attempt to submit operations to the same @io_uring@ instance, all requests should be batched together and one of the \ats is designated to do the system call on behalf of the others, called the \newterm{submitter}.
    343305However, in practice, \io requests must be handled promptly, so there is a need to guarantee everything missed by the current submitter is seen by the next one.
    344306Indeed, as long as there is a ``next'' submitter, \ats submitting new \io requests can move on, knowing that some future system call includes their request.
    345 Once the system call is done, the submitter must also free SQEs so that the allocator can reused them.
     307Once the system call is done, the submitter must also free SQEs so that the allocator can reuse them.
    346308
    347309Finally, the completion side is much simpler since the @io_uring@ system-call enforces a natural synchronization point.
    348310Polling simply needs to regularly do the system call, go through the produced CQEs and communicate the result back to the originating \ats.
    349 Since CQEs only own a signed 32 bit result, in addition to the copy of the @user_data@ field, all that is needed to communicate the result is a simple future~\cite{wiki:future}.
     311Since CQEs only own a signed 32-bit result, in addition to the copy of the @user_data@ field, all that is needed to communicate the result is a simple future~\cite{wiki:future}.
    350312If the submission side does not designate submitters, polling can also submit all SQEs as it is polling events.
    351313A simple approach to polling is to allocate a \at per @io_uring@ instance and simply let the poller \ats poll their respective instances when scheduled.
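A minimal sketch of such a future follows, assuming plain POSIX threads rather than the runtime's parking mechanism; the type and function names are illustrative only.

```c
/* Sketch (hypothetical helper): a one-shot future carrying the signed
 * 32-bit CQE result.  The poller fulfils it; the submitter blocks
 * until then.  A user-level runtime would park/unpark a user thread
 * here instead of blocking a kernel thread on a condition variable. */
#include <pthread.h>
#include <stdint.h>
#include <stdbool.h>

struct io_future {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    bool            done;
    int32_t         result;   /* copied from the CQE result field */
};

static void future_init(struct io_future *f) {
    pthread_mutex_init(&f->lock, NULL);
    pthread_cond_init(&f->cond, NULL);
    f->done = false;
}

/* Poller side: deliver the CQE result and wake the waiter. */
static void future_fulfil(struct io_future *f, int32_t res) {
    pthread_mutex_lock(&f->lock);
    f->result = res;
    f->done = true;
    pthread_cond_signal(&f->cond);
    pthread_mutex_unlock(&f->lock);
}

/* Submitter side: block until the result arrives. */
static int32_t future_wait(struct io_future *f) {
    pthread_mutex_lock(&f->lock);
    while (!f->done)
        pthread_cond_wait(&f->cond, &f->lock);
    int32_t r = f->result;
    pthread_mutex_unlock(&f->lock);
    return r;
}
```

The @user_data@ field of the SQE would hold a pointer to such a future, letting the poller match each CQE to its waiter.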
     
    354316It does not impose restrictions on what \ats submitting \io operations can and cannot do between allocations and submissions.
    355317It also can gracefully handle running out of resources, SQEs or the kernel returning @EBUSY@.
    356 The down side to this approach is that many of the steps used for submitting need complex synchronization to work properly.
     318The downside to this approach is that many of the steps used for submitting need complex synchronization to work properly.
    357319The routing and allocation algorithm needs to keep track of which ring instances have available SQEs, block incoming requests if no instance is available, prevent barging if \ats are already queued up waiting for SQEs, and handle SQEs being freed.
    358320The submission side needs to safely append SQEs to the ring buffer, correctly handle chains, make sure no SQE is dropped or left pending forever, notify the allocation side when SQEs can be reused, and handle the kernel returning @EBUSY@.
    359 All this synchronization has a significant cost, and compared to the private-instance approach, this synchronization is entirely overhead.
      321Compared to the private-instance approach, all this synchronization has a significant cost and is entirely overhead.
    360322
    361323\subsubsection{Instance borrowing}
    362324Both of the prior approaches have undesirable aspects that stem from tight or loose coupling between @io_uring@ and \glspl{proc}.
    363325The first approach suffers from tight coupling causing problems when a \gls{proc} does not benefit from the coupling.
    364 The second approach suffers from loose coupling causing operations to have synchronization overhead, which tighter coupling avoids.
      326The second approach suffers from loose coupling, causing operations to have synchronization overhead, which tighter coupling avoids.
    365327When \glspl{proc} are continuously issuing \io operations, tight coupling is valuable since it avoids synchronization costs.
    366328However, in unlikely failure cases or when \glspl{proc} are not using their instances, tight coupling is no longer advantageous.
     
    372334When a \at attempts to issue an \io operation, it asks for an instance from the arbiter and issues requests to that instance.
    373335This instance is now bound to the \gls{proc} the \at is running on.
    374 This binding is kept until the arbiter decides to revoke it, taking back the instance and reverting the \gls{proc} to its initial state with respect to \io.
     336This binding is kept until the arbiter decides to revoke it, taking back the instance and reverting the \gls{proc} to its initial \io state.
    375337This tight coupling means that synchronization can be minimal since only one \gls{proc} can use the instance at a time, akin to the private instances approach.
    376338However, it differs in that revocation by the arbiter means this approach does not suffer from the deadlock scenario described above.
     
    383345\end{enumerate}
    384346However, even when the arbiter is not directly needed, \glspl{proc} need to make sure that their instance ownership is not being revoked, which is accomplished by a lock-\emph{less} handshake.\footnote{
    385 Note the handshake is not lock \emph{free} since it lacks the proper progress guarantee.}
     347Note the handshake is not lock-\emph{free} since it lacks the proper progress guarantee.}
    386348A \gls{proc} raises a local flag before using its borrowed instance and checks if the instance is marked as revoked or if the arbiter has raised its flag.
    387349If not, it proceeds, otherwise it delegates the operation to the arbiter.
     
    389351
    390352Correspondingly, before revoking an instance, the arbiter marks the instance and then waits for the \gls{proc} using it to lower its local flag.
    391 Only then does it reclaim the instance and potentially assign it to an other \gls{proc}.
     353Only then does it reclaim the instance and potentially assign it to another \gls{proc}.
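This handshake can be sketched with C11 atomics; the type and function names below are hypothetical.

```c
/* Sketch (hypothetical) of the lock-less handshake: the processor
 * raises its flag before using the borrowed instance; the arbiter
 * marks the instance revoked and waits for the flag to drop before
 * reclaiming it.  Not lock-free: a stalled processor can delay the
 * arbiter indefinitely. */
#include <stdatomic.h>
#include <stdbool.h>

struct borrowed {
    atomic_bool in_use;   /* raised by the processor */
    atomic_bool revoked;  /* raised by the arbiter   */
};

/* Processor side: returns true if the instance may be used;
 * false means the operation must be delegated to the arbiter. */
static bool instance_acquire(struct borrowed *b) {
    atomic_store_explicit(&b->in_use, true, memory_order_seq_cst);
    if (atomic_load_explicit(&b->revoked, memory_order_seq_cst)) {
        atomic_store_explicit(&b->in_use, false, memory_order_seq_cst);
        return false;
    }
    return true;
}

static void instance_release(struct borrowed *b) {
    atomic_store_explicit(&b->in_use, false, memory_order_release);
}

/* Arbiter side: mark the instance, then wait until the processor is
 * no longer inside its critical section. */
static void instance_revoke(struct borrowed *b) {
    atomic_store_explicit(&b->revoked, true, memory_order_seq_cst);
    while (atomic_load_explicit(&b->in_use, memory_order_acquire))
        /* spin; a real implementation would yield here */;
}
```

The sequentially consistent ordering on both flags matters: with weaker ordering, the processor's raise and the arbiter's mark could each miss the other, letting both sides proceed at once.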
    392354
    393355The arbiter maintains four lists around which it makes its decisions:
     

While an arbiter has the potential to solve many of the problems mentioned above, it also introduces a significant amount of complexity.
Tracking which processors are borrowing which instances and which instances have SQEs available ends up adding a significant synchronization prelude to any I/O operation.
Any submission must start with a handshake that pins the currently borrowed instance, if available.
An attempt to allocate is then made; however, the arbiter can concurrently attempt to allocate from the same instance on a different \gls{hthrd}.
Once the allocation is completed, the submission must check that the instance is still borrowed before attempting to flush.
These synchronization steps turn out to have a similar cost to the multiple shared-instances approach.
Furthermore, if the number of instances does not match the number of processors actively submitting I/O, the system can fall into a state where instances are constantly being revoked and end up cycling the processors, which leads to significant cache deterioration.
For these reasons, this approach, which sounds promising on paper, does not improve on the private instance approach in practice.
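The synchronization prelude described above can be condensed into a sketch. Everything here is a deliberately simplified stand-in (the types, the single scratch SQE, and the stubbed arbiter calls are hypothetical); a real implementation operates on @io_uring@ rings with atomic flags, but the control flow of the four steps is the point.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct { int id; } sqe_t;            // stand-in for an io_uring SQE

typedef struct {
	bool revoked;     // set by the arbiter when reclaiming the instance
	int  free_sqes;   // SQEs available for fast-path allocation
} instance_t;

static sqe_t scratch;                        // stand-in storage for one SQE

// Step 1: the handshake that pins the currently borrowed instance.
static bool pin( instance_t * i ) { return ! i->revoked; }

// Step 2: fast-path allocation; may fail if the arbiter raced us.
static sqe_t * try_alloc( instance_t * i ) {
	if ( i->free_sqes == 0 ) return NULL;
	i->free_sqes -= 1;
	return &scratch;
}

// Slow paths delegated to the arbiter (stubbed out here).
static sqe_t * arbiter_alloc( void )       { return &scratch; }
static void    arbiter_submit( sqe_t * s ) { (void)s; }

// Step 4: fast-path flush to the kernel (stubbed out here).
static void flush( instance_t * i ) { (void)i; }

// Returns true if the fast path was used, false if the arbiter was needed.
bool submit_one( instance_t * inst ) {
	if ( ! pin( inst ) ) {                   // step 1 failed: delegate
		arbiter_submit( arbiter_alloc() );
		return false;
	}
	sqe_t * sqe = try_alloc( inst );         // step 2
	if ( sqe == NULL ) {
		arbiter_submit( arbiter_alloc() );
		return false;
	}
	if ( inst->revoked ) {                   // step 3: re-check ownership
		arbiter_submit( sqe );
		return false;
	}
	flush( inst );                           // step 4
	return true;
}
```

Even with every step stubbed out, the sketch makes the cost visible: four checks and two possible arbiter round-trips stand between the \at and the kernel on every submission.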

\section{Interface}
The last important part of the \io subsystem is its interface.
Multiple approaches can be offered to programmers, each with advantages and disadvantages.
The new \io subsystem can replace the C runtime API or extend it, and in the latter case, the interface can go from very similar to vastly different.
The following sections discuss some useful options using @read@ as an example.
     
The goal is to convince the compiler and linker to replace any calls to @read@, directing them to the \CFA implementation instead of glibc's.
This rerouting has the advantage of working transparently and supporting existing binaries without needing recompilation.
It also offers a, presumably, well-known and familiar API that C programmers can simply continue to work with.
However, this approach also entails a plethora of subtle technical challenges, which generally boil down to making a perfect replacement.
If the \CFA interface replaces only \emph{some} of the calls to glibc, then this can easily lead to esoteric concurrency bugs.
Since the gcc ecosystem does not offer a scheme for perfect replacement, this approach was rejected as being laudable but infeasible.
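To make the interposition idea concrete, a minimal Linux/glibc illustration follows; it is not \CFA's mechanism, only a sketch of the general technique. A program (or a preloaded shared library) defines its own @read@, and the linker binds calls to that definition instead of glibc's; here the replacement merely counts calls and forwards to the raw system call, where a runtime would instead issue an asynchronous operation.

```c
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

// Count of calls rerouted through the interposed read().
static int intercepted = 0;

// Interposed read(): within this binary, the linker resolves calls to
// read() to this definition instead of glibc's.  A user-level I/O runtime
// would park the thread and submit an asynchronous operation here; this
// sketch just forwards to the raw Linux system call.
ssize_t read( int fd, void * buf, size_t count ) {
	intercepted += 1;
	return syscall( SYS_read, fd, buf, count );
}
```

The "perfect replacement" problem is visible even in this toy: glibc-internal readers (@fread@, @getline@, the dynamic loader) may or may not pass through the interposed symbol, which is exactly the partial-replacement hazard described above.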

\subsection{Synchronous Extension}
     
It comes with the caveat that any code attempting to use it must be recompiled, which is a problem considering the amount of existing legacy C binaries.
However, it has the advantage of implementation simplicity.
Finally, there is a certain irony to using a blocking synchronous interface for a feature often referred to as ``non-blocking'' \io.

\subsection{Asynchronous Extension}
     
This offers more flexibility to users wanting to fully utilize all of the @io_uring@ features.
However, it is not the most user-friendly option.
It obviously imposes a strong dependency between user code and @io_uring@ but at the same time restricts users to usages that are compatible with how \CFA internally uses @io_uring@.