Timestamp:
Dec 14, 2022, 12:23:42 PM (3 years ago)
Author:
caparson <caparson@…>
Branches:
ADT, ast-experimental, master
Children:
441a6a7
Parents:
7d9598d8 (diff), d8bdf13 (diff)
Note: this is a merge changeset, the changes displayed below correspond to the merge itself.
Use the (diff) links above to see all the changes relative to each parent.
Message:

Merge branch 'master' of plg.uwaterloo.ca:software/cfa/cfa-cc

File:
1 edited

  • doc/theses/thierry_delisle_PhD/thesis/text/io.tex

    r7d9598d8 r2dcd80a  
\chapter{User Level \io}\label{userio}
As mentioned in Section~\ref{prev:io}, user-level \io requires multiplexing the \io operations of many \ats onto fewer \glspl{proc} using asynchronous \io operations.
I/O operations, among others, generally block the \gls{kthrd} when the operation needs to wait for unavailable resources.
When using \gls{uthrding}, this results in the \proc blocking rather than the \at, hindering parallelism and potentially causing deadlocks (see Chapter~\ref{prev:io}).
Different operating systems offer various forms of asynchronous operations and, as mentioned in Chapter~\ref{intro}, this work is exclusively focused on the Linux operating system.

     
This mechanism is also crucial in determining when all \ats are blocked and the application \glspl{kthrd} can now block.

There are three options to monitor file descriptors (FD) in Linux:\footnote{
For simplicity, this section omits \lstinline{pselect} and \lstinline{ppoll}.
The difference between these system calls and \lstinline{select} and \lstinline{poll}, respectively, is not relevant for this discussion.}
     
\paragraph{\lstinline{select}} is the oldest of these options, and takes as input a contiguous array of bits, where each bit represents a file descriptor of interest.
Hence, the array length must be as long as the largest FD currently of interest.
On return, it outputs the set modified in-place to identify which of the file descriptors changed state.
This destructive change means selecting in a loop requires re-initializing the array for each iteration.
Another limitation of @select@ is that calls from different \glspl{kthrd} sharing FDs are independent.
Hence, if one \gls{kthrd} is managing the select calls, other threads can only add/remove to/from the manager's interest set through synchronized calls to update the interest set.
However, these changes are only reflected when the manager makes its next call to @select@.
     
However, all three of these I/O systems have limitations.
The @man@ page for @O_NONBLOCK@ mentions that ``[@O_NONBLOCK@] has no effect for regular files and block devices'', which means none of these three system calls are viable multiplexing strategies for these types of \io operations.
Furthermore, TTYs (FDs connected to standard input and output) can also be tricky to use since they can take different forms based on how the command is executed.
For example, @epoll@ rejects FDs pointing to regular files or block devices, which includes @stdin@ when using shell redirections~\cite[\S~3.6]{MAN:bash}, but does not reject shell pipelines~\cite[\S~3.2.3]{MAN:bash}, which includes pipelines into @stdin@.
Finally, none of these are useful solutions for multiplexing \io operations that do not have a corresponding file descriptor and can be awkward for operations using multiple file descriptors.
     
\subsection{POSIX asynchronous I/O (AIO)}
An alternative to @O_NONBLOCK@ is the AIO interface.
Using AIO, programmers can enqueue operations which are to be performed asynchronously by the kernel.
The kernel can communicate completions of these operations in three ways:
it can spawn a new \gls{kthrd}; send a Linux signal; or userspace can poll for completion of one or more operations.
Spawning a new \gls{kthrd} is not consistent with working at the user-level thread level, but Section~\ref{io:morethreads} discusses a related solution.
Signals and their associated interrupt handlers can also lead to fairly complicated interactions between subsystems, and they have a non-trivial cost.
This leaves a single option: polling for completion, which is similar to the previously discussed system calls.
While AIO only supports read and write operations to file descriptors, it does not have the same limitations as @O_NONBLOCK@, \ie, the file descriptors can be regular files or block devices.
AIO also supports batching multiple operations in a single system call.

AIO offers two different approaches to polling: @aio_error@ can be used as a spinning form of polling, returning @EINPROGRESS@ until the operation is completed, while @aio_suspend@ can be used similarly to @select@, @poll@ or @epoll@, to wait until one or more requests have been completed.
Asynchronous interfaces normally handle more of the complexity than retry-based interfaces, which is convenient for \io multiplexing.
However, even if AIO requests can be submitted concurrently, @aio_suspend@ suffers from the same limitation as @select@ and @poll@: the interest set cannot be dynamically changed while a call to @aio_suspend@ is in progress.
AIO also suffers from the limitation of not specifying which requests have completed, \ie programmers have to poll each request in the interest set using @aio_error@ to identify the completed requests.
This limitation means that, like @select@ and @poll@ but not @epoll@, the time needed to examine polling results increases based on the total number of requests monitored, not the number of completed requests.
     
A very recent addition to Linux, @io_uring@~\cite{MAN:io_uring}, is a framework that aims to solve many of the problems listed in the above interfaces.
Like AIO, it represents \io operations as entries added to a queue.
But like @epoll@, new requests can be submitted while a blocking call waiting for requests to complete is already in progress.
The @io_uring@ interface uses two ring buffers (referred to simply as rings) at its core: a submit ring, to which programmers push \io requests, and a completion ring, from which programmers poll for completion.

One of the big advantages over the prior interfaces is that @io_uring@ also supports a much wider range of operations.
In addition to supporting reads and writes to any file descriptor like AIO, it also supports other operations, like @open@, @close@, @fsync@, @accept@, @connect@, @send@, @recv@, @splice@, \etc.

On top of these, @io_uring@ adds many extras, like avoiding copies between the kernel and user space using shared memory, allowing different mechanisms to communicate with device drivers, and supporting chains of requests, \ie, requests that automatically trigger follow-up requests on completion.
     
This approach is used by languages like Go~\cite{GITHUB:go}, frameworks like libuv~\cite{libuv}, and web servers like Apache~\cite{apache} and NGINX~\cite{nginx}, since it has the advantage that it can easily be used across multiple operating systems.
This advantage is especially relevant for languages like Go, which offer a homogeneous \glsxtrshort{api} across all platforms.
Contrast this to C, which has a very limited standard \glsxtrshort{api} for \io, \eg, the C standard library has no networking.

\subsection{Discussion}
These options effectively fall into two broad camps: waiting for \io to be ready, versus waiting for \io to complete.
All operating systems that support asynchronous \io must offer an interface along at least one of these lines, but the details vary drastically.
For example, FreeBSD offers @kqueue@~\cite{MAN:bsd/kqueue}, which behaves similarly to @epoll@, but with some small quality-of-life improvements, while Windows (Win32)~\cite{win:overlap} offers ``overlapped I/O'', which handles submissions similarly to @O_NONBLOCK@ with extra flags on the synchronous system call, but waits for completion events, similarly to @io_uring@.

For this project, I selected @io_uring@, in large part because of its generality.
While @epoll@ has been shown to be a good solution for socket \io~\cite{Karsten20}, @io_uring@'s transparent support for files, pipes, and more complex operations, like @splice@ and @tee@, makes it a better choice as the foundation for a general \io subsystem.

\section{Event-Engine}
An event engine's responsibility is to use the kernel interface to multiplex many \io operations onto few \glspl{kthrd}.
In concrete terms, this means \ats enter the engine through an interface, the event engine then starts an operation and parks the calling \ats, and then returns control to the \gls{proc}.
The parked \ats are then rescheduled by the event engine once the desired operation has been completed.

     
Figure~\ref{fig:iouring} shows an overview of an @io_uring@ instance.
Two ring buffers are used to communicate with the kernel: one for submissions~(left) and one for completions~(right).
The submission ring contains \newterm{Submit Queue Entries} (SQE), produced (appended) by the application when an operation starts and then consumed by the kernel.
The completion ring contains \newterm{Completion Queue Entries} (CQE), produced (appended) by the kernel when an operation completes and then consumed by the application.
The submission ring contains indexes into the SQE array (denoted \emph{S} in the figure) containing entries describing the I/O operation to start;
the completion ring contains entries for the completed I/O operation.
     
	\centering
	\input{io_uring.pstex_t}
	\caption[Overview of \lstinline{io_uring}]{Overview of \lstinline{io_uring} \smallskip\newline Two ring buffers are used to communicate with the kernel, one for completions~(right) and one for submissions~(left).
	While the completion ring contains plain data, the submission ring contains only references.
	These references are indexes into an array (denoted \emph{S}), which is created at the same time as the two rings and is also readable by the kernel.}
	\label{fig:iouring}
\end{figure}
     
Since the head is visible to the kernel, some memory barriers may be required to prevent the compiler from reordering these operations.
Since the submission ring is a regular ring buffer, more than one SQE can be added at once and the head is updated only after all entries are updated.
Note, SQEs can be filled and submitted in any order, \eg in Figure~\ref{fig:iouring} the submission order is S0, S3, S2, while S1 has not been submitted.
\item
The kernel is notified of the change to the ring using the system call @io_uring_enter@.
The number of elements appended to the submission ring is passed as a parameter and the number of elements consumed is returned.
The @io_uring@ instance can be constructed so this step is not required, but this feature requires that the process have elevated privilege.% and an early version of @io_uring@ had additional restrictions.
\end{enumerate}

     
When operations do complete, the kernel appends a CQE to the completion ring and advances the head of the ring.
Each CQE contains the result of the operation as well as a copy of the @user_data@ field of the SQE that triggered the operation.
The @io_uring_enter@ system call is only needed if the application wants to block waiting for operations to complete or to flush the submission ring.
@io_uring@ supports the option @IORING_SETUP_SQPOLL@ at creation, which can remove the need for the system call for submissions.
\end{sloppypar}

     
This restriction means \io request bursts may have to be subdivided and submitted in chunks at a later time.

An important detail to keep in mind is that just like ``The cloud is just someone else's computer''~\cite{xkcd:cloud}, asynchronous operations are just operations using someone else's threads.
Indeed, asynchronous operations can require computation time to complete, which means that if this time is not taken from the thread that triggered the asynchronous operation, it must be taken from some other threads.
In this case, the @io_uring@ operations that cannot be handled directly in the system call must be delegated to some other \gls{kthrd}.
To this end, @io_uring@ maintains multiple \glspl{kthrd} inside the kernel that are not exposed to the user.
Three kinds of operations that can need the \glspl{kthrd} are:

\paragraph{Operations using} @IOSQE_ASYNC@.
     
\paragraph{Bounded operations.}
This is also a fairly simple case. As mentioned earlier in this chapter, [@O_NONBLOCK@] has no effect for regular files and block devices.
Therefore, @io_uring@ handles this case by delegating operations on regular files and block devices.
In fact, @io_uring@ maintains a pool of \glspl{kthrd} dedicated to these operations, which are referred to as \newterm{bounded workers}.

\paragraph{Unbounded operations that must be retried.}
While operations like reads on sockets can return @EAGAIN@ instead of blocking the \gls{kthrd}, in the case these operations return @EAGAIN@ they must be retried by @io_uring@ once the data is available on the socket.
Since this retry cannot necessarily be done in the system call, \ie, using the application's \gls{kthrd}, @io_uring@ must delegate these calls to \glspl{kthrd} in the kernel.
@io_uring@ maintains a separate pool for these operations.
The \glspl{kthrd} in this pool are referred to as \newterm{unbounded workers}.
Once unbounded operations are ready to be retried, one of the workers is woken up and it handles the retry inside the kernel.
Unbounded workers are also responsible for handling operations using @IOSQE_ASYNC@.

     
however, the duration of the system call scales with the number of entries submitted.
The consequence is that the amount of parallelism used to prepare submissions for the next system call is limited.
Beyond this limit, the length of the system call is the throughput-limiting factor.
I concluded from early experiments that preparing submissions seems to take almost as long as the system call itself, which means that with a single @io_uring@ instance, there is no benefit in terms of \io throughput to having more than two \glspl{hthrd}.
Therefore, the design of the submission engine must manage multiple instances of @io_uring@ running in parallel, effectively sharding @io_uring@ instances.
Since completions are sent to the instance where requests were submitted, all instances with pending operations must be polled continuously\footnote{
As described in Chapter~\ref{practice}, this does not translate into high CPU usage.}.
Note that once an operation completes, there is nothing that ties it to the @io_uring@ instance that handled it; nothing prevents a new operation, with for example the same file descriptor, from using a different @io_uring@ instance.

A complicating aspect of submission is @io_uring@'s support for chains of operations, where the completion of an operation triggers the submission of the next operation on the link.
SQEs forming a chain must be allocated from the same instance and must be contiguous in the Submission Ring (see Figure~\ref{fig:iouring}).
The consequence of this feature is that filling SQEs can be arbitrarily complex, and therefore, users may need to run arbitrary code between allocation and submission.
For this work, supporting chains is not a requirement of the \CFA \io subsystem, but it is still valuable.
Support for this feature can be fulfilled simply by supporting arbitrary user code between allocation and submission.

     
To remove this requirement, a \at needs the ability to ``yield to a specific \gls{proc}'', \ie, \park with the guarantee it unparks on a specific \gls{proc}, \ie the \gls{proc} attached to the correct ring.}
From the subsystem's point of view, the allocation and submission are sequential, greatly simplifying both.
In this design, allocation and submission form a partitioned ring buffer, as shown in Figure~\ref{fig:pring}.
Once added to the ring buffer, the attached \gls{proc} has a significant amount of flexibility with regard to when to perform the system call.
Possible options are: when the \gls{proc} runs out of \ats to run, after running a given number of \ats, \etc.
     
	\centering
	\input{pivot_ring.pstex_t}
	\caption[Partitioned ring buffer]{Partitioned ring buffer \smallskip\newline Allocated SQEs are appended to the first partition.
	When submitting, the partition is advanced.
	The kernel considers the partition as the head of the ring.}
     
However, this benefit means \ats submitting \io operations have less flexibility: they cannot \park or yield, and several exceptional cases are handled poorly.
Instances running out of SQEs cannot run \ats wanting to do \io operations.
In this case, the \io \at needs to be moved to a different \gls{proc}, and the only current way of achieving this is to @yield()@ hoping to be scheduled on a different \gls{proc} with free SQEs, which is not guaranteed to ever occur.

A more involved version of this approach tries to solve these problems using a pattern called \newterm{helping}.
\Glspl{at} that cannot submit \io operations, either because of an allocation failure or \glslink{atmig}{migration} to a different \gls{proc} between allocation and submission, create an \io object and add it to a list of pending submissions per \gls{proc} and a list of pending allocations, probably per cluster.
While there is still a strong coupling between \glspl{proc} and @io_uring@ instances, these data structures allow moving \ats to a specific \gls{proc}, when the current \gls{proc} cannot fulfill the \io request.

     
In this case, the helping solution has the \io \at append an \io object to the submission list of the first \gls{proc}, where the allocation was made.
No other \gls{proc} can help the \at since @io_uring@ instances are strongly coupled to \glspl{proc}.
However, the \io \gls{proc} is unable to help because it is executing the spinning \at.
This results in a deadlock.
While this example is artificial, in the presence of many \ats, this problem can arise ``in the wild''.
Furthermore, this pattern is difficult to reliably detect and avoid.
     
\subsubsection{Public Instances}
The public approach creates decoupled pools of @io_uring@ instances and processors, \ie without one-to-one coupling.
\Glspl{at} attempting an \io operation pick one of the available instances and submit the operation to that instance.
Since there is no coupling between @io_uring@ instances and \glspl{proc} in this approach, \ats running on more than one \gls{proc} can attempt to submit to the same instance concurrently.
Because @io_uring@ effectively sets the amount of sharding needed to avoid contention on its internal locks, performance in this approach is based on two aspects:
     
\item
The scheme to route \io requests to specific @io_uring@ instances does not introduce contention.
This aspect is very important because it comes into play before the sharding of instances, and as such, all \glspl{hthrd} can contend on the routing algorithm.
\end{itemize}

Allocation in this scheme is fairly easy.
Free SQEs, \ie, SQEs that are not currently being used to represent a request, can be written-to safely, and have a field called @user_data@ that the kernel only reads to copy to CQEs.
Allocation also does not require ordering guarantees as all free SQEs are interchangeable.
The only added complexity is that the number of SQEs is fixed, which means allocation can fail.
     
Since CQEs only own a signed 32-bit result, in addition to the copy of the @user_data@ field, all that is needed to communicate the result is a simple future~\cite{wiki:future}.
If the submission side does not designate submitters, polling can also submit all SQEs as it is polling events.
A simple approach to polling is to allocate a user-level \at per @io_uring@ instance and simply let the poller \ats poll their respective instances when scheduled.

The big advantage of the pool of SQE instances approach is that it is fairly flexible.
It does not impose restrictions on what \ats submitting \io operations can and cannot do between allocations and submissions.
It can also gracefully handle running out of SQEs or the kernel returning @EBUSY@.
     
The routing and allocation algorithm needs to keep track of which ring instances have available SQEs, block incoming requests if no instance is available, prevent barging if \ats are already queued up waiting for SQEs, and handle SQEs being freed.
The submission side needs to safely append SQEs to the ring buffer, correctly handle chains, make sure no SQE is dropped or left pending forever, notify the allocation side when SQEs can be reused, and handle the kernel returning @EBUSY@.
All this synchronization has a significant cost, compared to the private-instance approach, which does not have synchronization costs in most cases.

\subsubsection{Instance borrowing}
Both of the prior approaches have undesirable aspects that stem from tight or loose coupling between @io_uring@ and \glspl{proc}.
The first approach suffers from tight coupling, causing problems when a \gls{proc} does not benefit from the coupling.
The second approach suffers from loose coupling, causing operations to have synchronization overhead, which tighter coupling avoids.
When \glspl{proc} are continuously issuing \io operations, tight coupling is valuable since it avoids synchronization costs.
However, in unlikely failure cases or when \glspl{proc} are not using their instances, tight coupling is no longer advantageous.
     
While instance borrowing looks similar to work sharing and stealing, I think it is different enough to warrant a different verb to avoid confusion.}

As mentioned later in this section, this approach is not ultimately used, but here is still a high-level outline of the algorithm.
In this approach, each cluster, see Figure~\ref{fig:system}, owns a pool of @io_uring@ instances managed by an \newterm{arbiter}.
When a \at attempts to issue an \io operation, it asks for an instance from the arbiter, and issues requests to that instance.
This instance is now bound to the \gls{proc} the \at is running on.
This binding is kept until the arbiter decides to revoke it, taking back the instance and reverting the \gls{proc} to its initial \io state.
     
\item The current \gls{proc} does not hold an instance.
\item The current instance does not have sufficient SQEs to satisfy the request.
\item The current \gls{proc} has a wrong instance.
This happens if the submitting \at context-switched between allocation and submission, a case called \newterm{external submissions}.
\end{enumerate}
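
These conditions might be checked on the fast path roughly as follows; this is a hypothetical sketch, and all names (@io_ctx@, @free_sqes@, etc.) are illustrative rather than the actual implementation:
\begin{cfa}
// Hypothetical fast-path test deciding when to involve the arbiter.
bool needs_arbiter( processor * proc, io_context * ctx, unsigned want ) {
	return proc->io_ctx == 0p                // 1: no instance held
	    || free_sqes( proc->io_ctx ) < want  // 2: not enough free SQEs
	    || proc->io_ctx != ctx;              // 3: wrong instance (external submission)
}
\end{cfa}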
However, even when the arbiter is not directly needed, \glspl{proc} need to make sure that their instance ownership is not being revoked, which is accomplished by a lock-\emph{less} handshake.\footnote{
Note the handshake is not lock-\emph{free}~\cite{wiki:lockfree} since it lacks the proper progress guarantee.}
A \gls{proc} raises a local flag before using its borrowed instance and checks if the instance is marked as revoked or if the arbiter has raised its flag.
If not, it proceeds; otherwise, it delegates the operation to the arbiter.
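
This handshake can be sketched with C11 atomics; the flag names and layout below are hypothetical, and the sequentially consistent operations ensure the flag raise is visible before the revocation check:
\begin{cfa}
#include <stdatomic.h>
struct instance_state {
	atomic_bool in_use;    // raised by the processor before using the instance
	atomic_bool revoked;   // raised by the arbiter when reclaiming the instance
};
// Returns true if the borrowed instance may be used directly;
// false means the operation must be delegated to the arbiter.
bool try_use( struct instance_state * s ) {
	atomic_store( &s->in_use, true );       // announce intent first
	if ( atomic_load( &s->revoked ) ) {     // then check for revocation
		atomic_store( &s->in_use, false );  // back off; the arbiter wins
		return false;
	}
	return true;
}
\end{cfa}
In this sketch, the arbiter runs the mirror image: it marks the instance revoked and then checks @in_use@, so the two sides cannot both believe they own the instance.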
     
However, there is no need to immediately revoke the instance.
External submissions must simply be added to the ring before the next system call, \ie, when the submission ring is flushed.
This means whoever is responsible for the system call first checks whether the instance has any external submissions.
If so, it asks the arbiter to revoke the instance and add the external submissions to the ring.
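
A hypothetical outline of this pre-flush check, where @ring_enter@ stands in for a wrapper around the @io_uring_enter@ system call:
\begin{cfa}
// Hypothetical: run by whoever is responsible for the system call.
void flush( io_context * ctx ) {
	if ( has_external_submissions( ctx ) ) {
		// revoke the instance so the arbiter can add the external
		// submissions to the ring before the flush
		arbiter_revoke( ctx );
	}
	ring_enter( ctx );   // wraps the io_uring_enter system call
}
\end{cfa}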
     
\section{Interface}
The final part of the \io subsystem is its interface.
Multiple approaches can be offered to programmers, each with advantages and disadvantages.
The new \CFA \io subsystem can replace the C runtime API or extend it, and in the latter case, the interface can range from very similar to vastly different.
The following sections discuss some useful options, using @read@ as an example.
The standard Linux interface for C is:
\begin{cfa}
ssize_t read(int fd, void *buf, size_t count);
\end{cfa}
\subsection{Replacement}
Replacing the C \io subsystem is the more intrusive and draconian approach.
The goal is to convince the compiler and linker to replace any calls to @read@ with calls to the \CFA implementation instead of glibc's.
This rerouting has the advantage of working transparently and supporting existing binaries without necessarily needing recompilation.
It also offers a presumably well-known and familiar API that C programmers can simply continue to work with.
%However, this approach also entails a plethora of subtle technical challenges, which generally boil down to making a perfect replacement.
However, when using this approach, any and all calls to the C \io subsystem must be rerouted, since using a mix of the C and \CFA \io subsystems can easily lead to esoteric concurrency bugs.
This approach was rejected as being laudable but infeasible.
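
For concreteness, one well-known partial mechanism is ELF symbol interposition via @LD_PRELOAD@, sketched below; this is illustrative only and is not the mechanism adopted, precisely because it cannot guarantee \emph{all} calls are captured (static linking and direct system calls bypass it):
\begin{cfa}
// A shared library loaded with LD_PRELOAD shadows glibc's read().
#define _GNU_SOURCE
#include <dlfcn.h>
#include <unistd.h>
ssize_t read( int fd, void * buf, size_t count ) {
	// look up glibc's read to fall back on
	ssize_t (*libc_read)( int, void *, size_t ) =
		(ssize_t (*)( int, void *, size_t ))dlsym( RTLD_NEXT, "read" );
	// a real runtime would submit asynchronously and park the user-level
	// thread here; this sketch merely forwards the call
	return libc_read( fd, buf, count );
}
\end{cfa}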
\subsection{Synchronous Extension}
Another interface option is to offer an interface different in name only.
In this approach, an alternative call is created for each supported system call.
For example:
\begin{cfa}
ssize_t cfa_read(int fd, void *buf, size_t count);
\end{cfa}
The new @cfa_read@ has the same interface, behaviour, and guarantees as the @read@ system call, but allows the runtime system to use user-level blocking instead of kernel-level blocking.

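
Converting existing code is then a purely mechanical renaming, for example:
\begin{cfa}
char buf[256];
ssize_t n = read( fd, buf, 256 );        // before: may block the kernel thread
ssize_t n2 = cfa_read( fd, buf, 256 );   // after: blocks only the user-level thread
\end{cfa}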
This approach is feasible and still familiar to C programmers.
It comes with the caveat that any code attempting to use it must be modified, which is a problem considering the amount of existing legacy C code.
However, it has the advantage of implementation simplicity.
Finally, there is a certain irony to using a blocking synchronous interface for a feature often referred to as ``non-blocking'' \io.
     
future(ssize_t) read(int fd, void *buf, size_t count);
\end{cfa}
where the generic @future@ is fulfilled when the read completes, with the count of bytes actually read, which may be less than the number of bytes requested.
The data read is placed in @buf@.
The problem is that both the byte count and the data form the synchronization object, not just the byte count.
Hence, the buffer cannot be reused until the operation completes, but the synchronization on the future does not enforce this.
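
The hazard can be illustrated as follows, assuming a blocking @get@ operation on the @future@ (the name is hypothetical):
\begin{cfa}
char buf[512];
future(ssize_t) f = read( fd, buf, 512 );
buf[0] = '\0';              // BUG: the read may still be in flight,
                            // racing with the kernel filling the buffer
ssize_t count = get( f );   // synchronizes on the count, but too late for the buffer
\end{cfa}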
A classical asynchronous API is:
\begin{cfa}
     
However, it is not the most user-friendly option.
It obviously imposes a strong dependency between user code and @io_uring@, but at the same time restricts users to usages that are compatible with how \CFA internally uses @io_uring@.

As of writing this document, \CFA offers both the synchronous extension and the first approach to the asynchronous extension:
\begin{cfa}
ssize_t cfa_read(int fd, void *buf, size_t count);
future(ssize_t) async_read(int fd, void *buf, size_t count);
\end{cfa}
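
For example, assuming a blocking @get@ operation on the @future@ (the name is hypothetical), the two calls can be combined as follows:
\begin{cfa}
char buf1[256], buf2[256];
// synchronous: only the calling user-level thread blocks
ssize_t n1 = cfa_read( fd1, buf1, 256 );
// asynchronous: start the read, overlap other work, then wait
future(ssize_t) f = async_read( fd2, buf2, 256 );
do_other_work();
ssize_t n2 = get( f );   // buf2 must not be touched before this point
\end{cfa}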