Timestamp:
Dec 14, 2022, 12:23:42 PM (3 years ago)
Author:
caparson <caparson@…>
Branches:
ADT, ast-experimental, master
Children:
441a6a7
Parents:
7d9598d8 (diff), d8bdf13 (diff)
Note: this is a merge changeset, the changes displayed below correspond to the merge itself.
Use the (diff) links above to see all the changes relative to each parent.
Message:

Merge branch 'master' of plg.uwaterloo.ca:software/cfa/cfa-cc

File:
1 edited

  • doc/theses/thierry_delisle_PhD/thesis/text/io.tex

    r7d9598d8 r2dcd80a  
\chapter{User Level \io}\label{userio}
As mentioned in Section~\ref{prev:io}, user-level \io requires multiplexing the \io operations of many \ats onto fewer \glspl{proc} using asynchronous \io operations.
I/O operations, among others, generally block the \gls{kthrd} when the operation needs to wait for unavailable resources.
When using \gls{uthrding}, this results in the \proc blocking rather than the \at, hindering parallelism and potentially causing deadlocks (see Chapter~\ref{prev:io}).
Different operating systems offer various forms of asynchronous operations and, as mentioned in Chapter~\ref{intro}, this work is exclusively focused on the Linux operating system.

     
This mechanism is also crucial in determining when all \ats are blocked and the application \glspl{kthrd} can now block.

There are three options to monitor file descriptors (FD) in Linux:\footnote{
For simplicity, this section omits \lstinline{pselect} and \lstinline{ppoll}.
The difference between these system calls and \lstinline{select} and \lstinline{poll}, respectively, is not relevant for this discussion.}
     
\paragraph{\lstinline{select}} is the oldest of these options, and takes as input a contiguous array of bits, where each bit represents a file descriptor of interest.
Hence, the array length must be as long as the largest FD currently of interest.
On return, it outputs the set modified in-place to identify which of the file descriptors changed state.
This destructive change means selecting in a loop requires re-initializing the array for each iteration.
Another limitation of @select@ is that calls from different \glspl{kthrd} sharing FDs are independent.
Hence, if one \gls{kthrd} is managing the select calls, other threads can only add/remove to/from the manager's interest set through synchronized calls to update the interest set.
However, these changes are only reflected when the manager makes its next call to @select@.
     
However, all three of these I/O systems have limitations.
The @man@ page for @O_NONBLOCK@ mentions that ``[@O_NONBLOCK@] has no effect for regular files and block devices'', which means none of these three system calls are viable multiplexing strategies for these types of \io operations.
Furthermore, TTYs (FDs connected to standard input and output) can also be tricky to use since they can take different forms based on how the command is executed.
For example, @epoll@ rejects FDs pointing to regular files or block devices, which includes @stdin@ when using shell redirections~\cite[\S~3.6]{MAN:bash}, but does not reject shell pipelines~\cite[\S~3.2.3]{MAN:bash}, which includes pipelines into @stdin@.
Finally, none of these are useful solutions for multiplexing \io operations that do not have a corresponding file descriptor and can be awkward for operations using multiple file descriptors.
     
\subsection{POSIX asynchronous I/O (AIO)}
An alternative to @O_NONBLOCK@ is the AIO interface.
Using AIO, programmers can enqueue operations which are to be performed asynchronously by the kernel.
The kernel can communicate completions of these operations in three ways:
it can spawn a new \gls{kthrd}; send a Linux signal; or userspace can poll for completion of one or more operations.
Spawning a new \gls{kthrd} is not consistent with working at the user-level thread level, but Section~\ref{io:morethreads} discusses a related solution.
Signals and their associated interrupt handlers can also lead to fairly complicated interactions between subsystems, and they have a non-trivial cost.
This leaves a single option: polling for completion, which is similar to the previously discussed system calls.
While AIO only supports read and write operations to file descriptors, it does not have the same limitations as @O_NONBLOCK@, \ie, the file descriptors can be regular files or block devices.
AIO also supports batching multiple operations in a single system call.

AIO offers two different approaches to polling: @aio_error@ can be used as a spinning form of polling, returning @EINPROGRESS@ until the operation is completed, while @aio_suspend@ can be used similarly to @select@, @poll@ or @epoll@, to wait until one or more requests have been completed.
Asynchronous interfaces normally handle more of the complexity than retry-based interfaces, which is convenient for \io multiplexing.
However, even if AIO requests can be submitted concurrently, @aio_suspend@ suffers from the same limitation as @select@ and @poll@: the interest set cannot be dynamically changed while a call to @aio_suspend@ is in progress.
AIO also suffers from the limitation of not specifying which requests have completed, \ie programmers have to poll each request in the interest set using @aio_error@ to identify the completed requests.
This limitation means that, like @select@ and @poll@ but not @epoll@, the time needed to examine polling results increases based on the total number of requests monitored, not the number of completed requests.
     
A very recent addition to Linux, @io_uring@~\cite{MAN:io_uring}, is a framework that aims to solve many of the problems listed in the above interfaces.
Like AIO, it represents \io operations as entries added to a queue.
But like @epoll@, new requests can be submitted while a blocking call waiting for requests to complete is already in progress.
The @io_uring@ interface uses two ring buffers (referred to simply as rings) at its core: a submit ring, to which programmers push \io requests, and a completion ring, from which programmers poll for completion.

One of the big advantages over the prior interfaces is that @io_uring@ also supports a much wider range of operations.
In addition to supporting reads and writes to any file descriptor like AIO, it also supports other operations, like @open@, @close@, @fsync@, @accept@, @connect@, @send@, @recv@, @splice@, \etc.

On top of these, @io_uring@ adds many extras, like avoiding copies between the kernel and user space using shared memory, allowing different mechanisms to communicate with device drivers, and supporting chains of requests, \ie, requests that automatically trigger follow-up requests on completion.
     
This approach is used by languages like Go~\cite{GITHUB:go}, frameworks like libuv~\cite{libuv}, and web servers like Apache~\cite{apache} and NGINX~\cite{nginx}, since it has the advantage that it can easily be used across multiple operating systems.
This advantage is especially relevant for languages like Go, which offer a homogeneous \glsxtrshort{api} across all platforms.
Contrast this to C, which has a very limited standard \glsxtrshort{api} for \io, \eg, the C standard library has no networking.

\subsection{Discussion}
These options effectively fall into two broad camps: waiting for \io to be ready, versus waiting for \io to complete.
All operating systems that support asynchronous \io must offer an interface along at least one of these lines, but the details vary drastically.
For example, FreeBSD offers @kqueue@~\cite{MAN:bsd/kqueue}, which behaves similarly to @epoll@, but with some small quality-of-life improvements, while Windows (Win32)~\cite{win:overlap} offers ``overlapped I/O'', which handles submissions similarly to @O_NONBLOCK@ with extra flags on the synchronous system call, but waits for completion events, similarly to @io_uring@.

For this project, I selected @io_uring@, in large part because of its generality.
While @epoll@ has been shown to be a good solution for socket \io~\cite{Karsten20}, @io_uring@'s transparent support for files, pipes, and more complex operations, like @splice@ and @tee@, makes it a better choice as the foundation for a general \io subsystem.

\section{Event-Engine}
An event engine's responsibility is to use the kernel interface to multiplex many \io operations onto few \glspl{kthrd}.
In concrete terms, this means \ats enter the engine through an interface, the event engine then starts an operation and parks the calling \ats, and then returns control to the \gls{proc}.
The parked \ats are then rescheduled by the event engine once the desired operation has been completed.

     
Figure~\ref{fig:iouring} shows an overview of an @io_uring@ instance.
Two ring buffers are used to communicate with the kernel: one for submissions~(left) and one for completions~(right).
The submission ring contains \newterm{Submit Queue Entries} (SQE), produced (appended) by the application when an operation starts and then consumed by the kernel.
The completion ring contains \newterm{Completion Queue Entries} (CQE), produced (appended) by the kernel when an operation completes and then consumed by the application.
The submission ring contains indexes into the SQE array (denoted \emph{S} in the figure) containing entries describing the I/O operation to start;
the completion ring contains entries for the completed I/O operation.
     
	\centering
	\input{io_uring.pstex_t}
	\caption[Overview of \lstinline{io_uring}]{Overview of \lstinline{io_uring} \smallskip\newline Two ring buffers are used to communicate with the kernel, one for completions~(right) and one for submissions~(left).
	While the completion ring contains plain data, the submission ring contains only references.
	These references are indexes into an array (denoted \emph{S}), which is created at the same time as the two rings and is also readable by the kernel.}
	\label{fig:iouring}
\end{figure}
     
Since the head is visible to the kernel, some memory barriers may be required to prevent the compiler from reordering these operations.
Since the submission ring is a regular ring buffer, more than one SQE can be added at once and the head is updated only after all entries are updated.
Note, SQEs can be filled and submitted in any order, \eg in Figure~\ref{fig:iouring} the submission order is S0, S3, S2, while S1 has not been submitted.
\item
The kernel is notified of the change to the ring using the system call @io_uring_enter@.
The number of elements appended to the submission ring is passed as a parameter and the number of elements consumed is returned.
The @io_uring@ instance can be constructed so this step is not required, but this feature requires that the process have elevated privilege.% and an early version of @io_uring@ had additional restrictions.
\end{enumerate}

     
When operations do complete, the kernel appends a CQE to the completion ring and advances the head of the ring.
Each CQE contains the result of the operation as well as a copy of the @user_data@ field of the SQE that triggered the operation.
The @io_uring_enter@ system call is only needed if the application wants to block waiting for operations to complete or to flush the submission ring.
@io_uring@ supports the option @IORING_SETUP_SQPOLL@ at creation, which can remove the need for the system call for submissions.
\end{sloppypar}

     
This restriction means \io request bursts may have to be subdivided and submitted in chunks at a later time.

An important detail to keep in mind is that just like ``The cloud is just someone else's computer''~\cite{xkcd:cloud}, asynchronous operations are just operations using someone else's threads.
Indeed, asynchronous operations can require computation time to complete, which means that if this time is not taken from the thread that triggered the asynchronous operation, it must be taken from some other threads.
In this case, the @io_uring@ operations that cannot be handled directly in the system call must be delegated to some other \gls{kthrd}.
To this end, @io_uring@ maintains multiple \glspl{kthrd} inside the kernel that are not exposed to the user.
Three kinds of operations that can need the \glspl{kthrd} are:

\paragraph{Operations using} @IOSQE_ASYNC@.
     
\paragraph{Bounded operations.}
This is also a fairly simple case. As mentioned earlier in this chapter, [@O_NONBLOCK@] has no effect for regular files and block devices.
Therefore, @io_uring@ handles this case by delegating operations on regular files and block devices.
In fact, @io_uring@ maintains a pool of \glspl{kthrd} dedicated to these operations, which are referred to as \newterm{bounded workers}.

\paragraph{Unbounded operations that must be retried.}
While operations like reads on sockets can return @EAGAIN@ instead of blocking the \gls{kthrd}, in the case these operations return @EAGAIN@ they must be retried by @io_uring@ once the data is available on the socket.
Since this retry cannot necessarily be done in the system call, \ie, using the application's \gls{kthrd}, @io_uring@ must delegate these calls to \glspl{kthrd} in the kernel.
@io_uring@ maintains a separate pool for these operations.
The \glspl{kthrd} in this pool are referred to as \newterm{unbounded workers}.
Once unbounded operations are ready to be retried, one of the workers is woken up and it handles the retry inside the kernel.
Unbounded workers are also responsible for handling operations using @IOSQE_ASYNC@.

     
however, the duration of the system call scales with the number of entries submitted.
The consequence is that the amount of parallelism used to prepare submissions for the next system call is limited.
Beyond this limit, the length of the system call is the throughput-limiting factor.
I concluded from early experiments that preparing submissions seems to take almost as long as the system call itself, which means that with a single @io_uring@ instance, there is no benefit in terms of \io throughput to having more than two \glspl{hthrd}.
Therefore, the design of the submission engine must manage multiple instances of @io_uring@ running in parallel, effectively sharding @io_uring@ instances.
Since completions are sent to the instance where requests were submitted, all instances with pending operations must be polled continuously\footnote{
As described in Chapter~\ref{practice}, this does not translate into high CPU usage.}.
Note that once an operation completes, there is nothing that ties it to the @io_uring@ instance that handled it; nothing prevents a new operation, with for example the same file descriptor, from using a different @io_uring@ instance.

A complicating aspect of submission is @io_uring@'s support for chains of operations, where the completion of an operation triggers the submission of the next operation on the link.
SQEs forming a chain must be allocated from the same instance and must be contiguous in the Submission Ring (see Figure~\ref{fig:iouring}).
The consequence of this feature is that filling SQEs can be arbitrarily complex, and therefore, users may need to run arbitrary code between allocation and submission.
For this work, supporting chains is not a requirement of the \CFA \io subsystem, but it is still valuable.
Support for this feature can be fulfilled simply by supporting arbitrary user code between allocation and submission.

     
To remove this requirement, a \at needs the ability to ``yield to a specific \gls{proc}'', \ie, \park with the guarantee it unparks on a specific \gls{proc}, \ie the \gls{proc} attached to the correct ring.}
From the subsystem's point of view, the allocation and submission are sequential, greatly simplifying both.
In this design, allocation and submission form a partitioned ring buffer, as shown in Figure~\ref{fig:pring}.
Once added to the ring buffer, the attached \gls{proc} has a significant amount of flexibility with regard to when to perform the system call.
Possible options are: when the \gls{proc} runs out of \ats to run, after running a given number of \ats, \etc.
     
	\centering
	\input{pivot_ring.pstex_t}
	\caption[Partitioned ring buffer]{Partitioned ring buffer \smallskip\newline Allocated SQEs are appended to the first partition.
	When submitting, the partition is advanced.
	The kernel considers the partition as the head of the ring.}
     
However, this benefit means \ats submitting \io operations have less flexibility: they cannot \park or yield, and several exceptional cases are handled poorly.
Instances running out of SQEs cannot run \ats wanting to do \io operations.
In this case, the \io \at needs to be moved to a different \gls{proc}, and the only current way of achieving this is to @yield()@ hoping to be scheduled on a different \gls{proc} with free SQEs, which is not guaranteed to ever occur.

A more involved version of this approach tries to solve these problems using a pattern called \newterm{helping}.
\Glspl{at} that cannot submit \io operations, either because of an allocation failure or \glslink{atmig}{migration} to a different \gls{proc} between allocation and submission, create an \io object and add it to a list of pending submissions per \gls{proc} and a list of pending allocations, probably per cluster.
While there is still a strong coupling between \glspl{proc} and @io_uring@ instances, these data structures allow moving \ats to a specific \gls{proc}, when the current \gls{proc} cannot fulfill the \io request.

     
In this case, the helping solution has the \io \at append an \io object to the submission list of the first \gls{proc}, where the allocation was made.
No other \gls{proc} can help the \at since @io_uring@ instances are strongly coupled to \glspl{proc}.
However, the \io \gls{proc} is unable to help because it is executing the spinning \at.
This results in a deadlock.
While this example is artificial, in the presence of many \ats, this problem can arise ``in the wild''.
Furthermore, this pattern is difficult to reliably detect and avoid.
     
\subsubsection{Public Instances}
The public approach creates decoupled pools of @io_uring@ instances and processors, \ie without one-to-one coupling.
\Glspl{at} attempting an \io operation pick one of the available instances and submit the operation to that instance.
Since there is no coupling between @io_uring@ instances and \glspl{proc} in this approach, \ats running on more than one \gls{proc} can attempt to submit to the same instance concurrently.
Because @io_uring@ effectively sets the amount of sharding needed to avoid contention on its internal locks, performance in this approach is based on two aspects:
     
\item
The scheme to route \io requests to specific @io_uring@ instances does not introduce contention.
This aspect is very important because it comes into play before the sharding of instances, and as such, all \glspl{hthrd} can contend on the routing algorithm.
\end{itemize}

Allocation in this scheme is fairly easy.
Free SQEs, \ie, SQEs that are not currently being used to represent a request, can be written-to safely, and have a field called @user_data@ that the kernel only reads to copy to CQEs.
Allocation also does not require ordering guarantees as all free SQEs are interchangeable.
The only added complexity is that the number of SQEs is fixed, which means allocation can fail.
     
Since CQEs only own a signed 32-bit result, in addition to the copy of the @user_data@ field, all that is needed to communicate the result is a simple future~\cite{wiki:future}.
If the submission side does not designate submitters, polling can also submit all SQEs as it is polling events.
A simple approach to polling is to allocate a user-level \at per @io_uring@ instance and simply let the poller \ats poll their respective instances when scheduled.

The big advantage of the pool of SQE instances approach is that it is fairly flexible.
It does not impose restrictions on what \ats submitting \io operations can and cannot do between allocations and submissions.
It can also gracefully handle running out of SQEs or the kernel returning @EBUSY@.
     
The routing and allocation algorithm needs to keep track of which ring instances have available SQEs, block incoming requests if no instance is available, prevent barging if \ats are already queued up waiting for SQEs, and handle SQEs being freed.
The submission side needs to safely append SQEs to the ring buffer, correctly handle chains, make sure no SQE is dropped or left pending forever, notify the allocation side when SQEs can be reused, and handle the kernel returning @EBUSY@.
All this synchronization has a significant cost, compared to the private-instance approach, which does not have synchronization costs in most cases.

\subsubsection{Instance borrowing}
Both of the prior approaches have undesirable aspects that stem from tight or loose coupling between @io_uring@ and \glspl{proc}.
The first approach suffers from tight coupling, causing problems when a \gls{proc} does not benefit from the coupling.
The second approach suffers from loose coupling, causing operations to have synchronization overhead, which tighter coupling avoids.
When \glspl{proc} are continuously issuing \io operations, tight coupling is valuable since it avoids synchronization costs.
However, in unlikely failure cases or when \glspl{proc} are not using their instances, tight coupling is no longer advantageous.
     
While instance borrowing looks similar to work sharing and stealing, I think it is different enough to warrant a different verb to avoid confusion.}

As mentioned later in this section, this approach is not ultimately used, but here is still a high-level outline of the algorithm.
In this approach, each cluster, see Figure~\ref{fig:system}, owns a pool of @io_uring@ instances managed by an \newterm{arbiter}.
When a \at attempts to issue an \io operation, it asks for an instance from the arbiter, and issues requests to that instance.
This instance is now bound to the \gls{proc} the \at is running on.
This binding is kept until the arbiter decides to revoke it, taking back the instance and reverting the \gls{proc} to its initial \io state.
     
\item The current \gls{proc} does not hold an instance.
\item The current instance does not have sufficient SQEs to satisfy the request.
\item The current \gls{proc} has a wrong instance.
This happens if the submitting \at context-switched between allocation and submission, a case called \newterm{external submissions}.
\end{enumerate}
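
These conditions might be checked on the fast path roughly as follows; this is a hypothetical sketch, and all names (@io_ctx@, @free_sqes@, etc.) are illustrative rather than the actual implementation:
\begin{cfa}
// Hypothetical fast-path test deciding when to involve the arbiter.
bool needs_arbiter( processor * proc, io_context * ctx, unsigned want ) {
	return proc->io_ctx == 0p                // 1: no instance held
	    || free_sqes( proc->io_ctx ) < want  // 2: not enough free SQEs
	    || proc->io_ctx != ctx;              // 3: wrong instance (external submission)
}
\end{cfa}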
However, even when the arbiter is not directly needed, \glspl{proc} need to make sure that their instance ownership is not being revoked, which is accomplished by a lock-\emph{less} handshake.\footnote{
Note the handshake is not lock-\emph{free}~\cite{wiki:lockfree} since it lacks the proper progress guarantee.}
A \gls{proc} raises a local flag before using its borrowed instance and checks if the instance is marked as revoked or if the arbiter has raised its flag.
If not, it proceeds; otherwise, it delegates the operation to the arbiter.
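
This handshake can be sketched with C11 atomics; the flag names and layout below are hypothetical, and the sequentially consistent operations ensure the flag raise is visible before the revocation check:
\begin{cfa}
#include <stdatomic.h>
struct instance_state {
	atomic_bool in_use;    // raised by the processor before using the instance
	atomic_bool revoked;   // raised by the arbiter when reclaiming the instance
};
// Returns true if the borrowed instance may be used directly;
// false means the operation must be delegated to the arbiter.
bool try_use( struct instance_state * s ) {
	atomic_store( &s->in_use, true );       // announce intent first
	if ( atomic_load( &s->revoked ) ) {     // then check for revocation
		atomic_store( &s->in_use, false );  // back off; the arbiter wins
		return false;
	}
	return true;
}
\end{cfa}
In this sketch, the arbiter runs the mirror image: it marks the instance revoked and then checks @in_use@, so the two sides cannot both believe they own the instance.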
     
However, there is no need to immediately revoke the instance.
External submissions must simply be added to the ring before the next system call, \ie, when the submission ring is flushed.
This means whoever is responsible for the system call first checks whether the instance has any external submissions.
If so, it asks the arbiter to revoke the instance and add the external submissions to the ring.
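
A hypothetical outline of this pre-flush check, where @ring_enter@ stands in for a wrapper around the @io_uring_enter@ system call:
\begin{cfa}
// Hypothetical: run by whoever is responsible for the system call.
void flush( io_context * ctx ) {
	if ( has_external_submissions( ctx ) ) {
		// revoke the instance so the arbiter can add the external
		// submissions to the ring before the flush
		arbiter_revoke( ctx );
	}
	ring_enter( ctx );   // wraps the io_uring_enter system call
}
\end{cfa}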
     
\section{Interface}
The final part of the \io subsystem is its interface.
Multiple approaches can be offered to programmers, each with advantages and disadvantages.
The new \CFA \io subsystem can replace the C runtime API or extend it, and in the latter case, the interface can range from very similar to vastly different.
The following sections discuss some useful options, using @read@ as an example.
The standard Linux interface for C is:
\begin{cfa}
ssize_t read(int fd, void *buf, size_t count);
\end{cfa}
\subsection{Replacement}
Replacing the C \io subsystem is the more intrusive and draconian approach.
The goal is to convince the compiler and linker to replace any calls to @read@ with calls to the \CFA implementation instead of glibc's.
This rerouting has the advantage of working transparently and supporting existing binaries without necessarily needing recompilation.
It also offers a presumably well-known and familiar API that C programmers can simply continue to work with.
%However, this approach also entails a plethora of subtle technical challenges, which generally boil down to making a perfect replacement.
However, when using this approach, any and all calls to the C \io subsystem must be rerouted, since using a mix of the C and \CFA \io subsystems can easily lead to esoteric concurrency bugs.
This approach was rejected as being laudable but infeasible.
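
For concreteness, one well-known partial mechanism is ELF symbol interposition via @LD_PRELOAD@, sketched below; this is illustrative only and is not the mechanism adopted, precisely because it cannot guarantee \emph{all} calls are captured (static linking and direct system calls bypass it):
\begin{cfa}
// A shared library loaded with LD_PRELOAD shadows glibc's read().
#define _GNU_SOURCE
#include <dlfcn.h>
#include <unistd.h>
ssize_t read( int fd, void * buf, size_t count ) {
	// look up glibc's read to fall back on
	ssize_t (*libc_read)( int, void *, size_t ) =
		(ssize_t (*)( int, void *, size_t ))dlsym( RTLD_NEXT, "read" );
	// a real runtime would submit asynchronously and park the user-level
	// thread here; this sketch merely forwards the call
	return libc_read( fd, buf, count );
}
\end{cfa}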
\subsection{Synchronous Extension}
Another interface option is to offer an interface different in name only.
In this approach, an alternative call is created for each supported system call.
For example:
\begin{cfa}
ssize_t cfa_read(int fd, void *buf, size_t count);
\end{cfa}
The new @cfa_read@ has the same interface, behaviour, and guarantees as the @read@ system call, but allows the runtime system to use user-level blocking instead of kernel-level blocking.

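
Converting existing code is then a purely mechanical renaming, for example:
\begin{cfa}
char buf[256];
ssize_t n = read( fd, buf, 256 );        // before: may block the kernel thread
ssize_t n2 = cfa_read( fd, buf, 256 );   // after: blocks only the user-level thread
\end{cfa}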
This approach is feasible and still familiar to C programmers.
It comes with the caveat that any code attempting to use it must be modified, which is a problem considering the amount of existing legacy C code.
However, it has the advantage of implementation simplicity.
Finally, there is a certain irony to using a blocking synchronous interface for a feature often referred to as ``non-blocking'' \io.
     
future(ssize_t) read(int fd, void *buf, size_t count);
\end{cfa}
where the generic @future@ is fulfilled when the read completes, with the count of bytes actually read, which may be less than the number of bytes requested.
The data read is placed in @buf@.
The problem is that both the byte count and the data form the synchronization object, not just the byte count.
Hence, the buffer cannot be reused until the operation completes, but the synchronization on the future does not enforce this.
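
The hazard can be illustrated as follows, assuming a blocking @get@ operation on the @future@ (the name is hypothetical):
\begin{cfa}
char buf[512];
future(ssize_t) f = read( fd, buf, 512 );
buf[0] = '\0';              // BUG: the read may still be in flight,
                            // racing with the kernel filling the buffer
ssize_t count = get( f );   // synchronizes on the count, but too late for the buffer
\end{cfa}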
A classical asynchronous API is:
\begin{cfa}
     
However, it is not the most user-friendly option.
It obviously imposes a strong dependency between user code and @io_uring@, but at the same time restricts users to usages that are compatible with how \CFA internally uses @io_uring@.

As of writing this document, \CFA offers both the synchronous extension and the first approach to the asynchronous extension:
\begin{cfa}
ssize_t cfa_read(int fd, void *buf, size_t count);
future(ssize_t) async_read(int fd, void *buf, size_t count);
\end{cfa}
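
For example, assuming a blocking @get@ operation on the @future@ (the name is hypothetical), the two calls can be combined as follows:
\begin{cfa}
char buf1[256], buf2[256];
// synchronous: only the calling user-level thread blocks
ssize_t n1 = cfa_read( fd1, buf1, 256 );
// asynchronous: start the read, overlap other work, then wait
future(ssize_t) f = async_read( fd2, buf2, 256 );
do_other_work();
ssize_t n2 = get( f );   // buf2 must not be touched before this point
\end{cfa}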