Changeset 2dcd80a for doc/theses/thierry_delisle_PhD/thesis/text/io.tex
- Timestamp:
- Dec 14, 2022, 12:23:42 PM
- Branches:
- ADT, ast-experimental, master
- Children:
- 441a6a7
- Parents:
- 7d9598d8 (diff), d8bdf13 (diff)
Note: this is a merge changeset, the changes displayed below correspond to the merge itself.
Use the (diff) links above to see all the changes relative to each parent.
- Files:
- 1 edited
doc/theses/thierry_delisle_PhD/thesis/text/io.tex
r7d9598d8 r2dcd80a 1 1 \chapter{User Level \io}\label{userio} 2 2 As mentioned in Section~\ref{prev:io}, user-level \io requires multiplexing the \io operations of many \ats onto fewer \glspl{proc} using asynchronous \io operations. 3 I/O operations, among others, generally block the \gls{kthrd} when the operation needs to wait for unavailable resources. 4 When using \gls{uthrding}, this results in the \proc blocking rather than the \at, hindering parallelism and potentially causing deadlocks (see Chapter~\ref{prev:io}). 3 5 Different operating systems offer various forms of asynchronous operations and, as mentioned in Chapter~\ref{intro}, this work is exclusively focused on the Linux operating system. 4 6 … … 16 18 This mechanism is also crucial in determining when all \ats are blocked and the application \glspl{kthrd} can now block. 17 19 18 There are three options to monitor file descriptors in Linux:\footnote{20 There are three options to monitor file descriptors (FD) in Linux:\footnote{ 19 21 For simplicity, this section omits \lstinline{pselect} and \lstinline{ppoll}. 20 22 The difference between these system calls and \lstinline{select} and \lstinline{poll}, respectively, is not relevant for this discussion.} … … 25 27 \paragraph{\lstinline{select}} is the oldest of these options, and takes as input a contiguous array of bits, where each bit represents a file descriptor of interest. 26 28 Hence, the array length must be as long as the largest FD currently of interest. 27 On return, it outputs the set mo tified inplace to identify which of the file descriptors changed state.29 On return, it outputs the set modified in-place to identify which of the file descriptors changed state. 28 30 This destructive change means selecting in a loop requires re-initializing the array for each iteration. 
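The destructive in-place update described above can be illustrated with a short sketch; @wait_readable@ is a hypothetical helper (not part of any runtime discussed here), and a pipe stands in for the monitored FD. Because @select@ overwrites its @fd_set@ argument, the interest set must be rebuilt before every call:

```c
#include <assert.h>
#include <sys/select.h>
#include <unistd.h>

// Returns 1 if `fd` becomes readable within `timeout_ms`, 0 otherwise.
// The fd_set must be re-initialized on every call, because the previous
// select() call destructively modified it in place.
static int wait_readable(int fd, int timeout_ms) {
    fd_set rfds;
    FD_ZERO(&rfds);                 // re-initialize: prior call clobbered it
    FD_SET(fd, &rfds);
    struct timeval tv = { .tv_sec  = timeout_ms / 1000,
                          .tv_usec = (timeout_ms % 1000) * 1000 };
    int n = select(fd + 1, &rfds, NULL, NULL, &tv);
    return n > 0 && FD_ISSET(fd, &rfds);
}
```

Calling @wait_readable@ in a loop works only because it rebuilds the set each time; reusing the clobbered set would silently drop FDs from the interest set.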
29 Another limit of @select@ is that calls from different \glspl{kthrd} sharing FDs are independent. 31 Another limitation of @select@ is that calls from different \glspl{kthrd} sharing FDs are independent. 30 32 Hence, if one \gls{kthrd} is managing the select calls, other threads can only add/remove to/from the manager's interest set through synchronized calls to update the interest set. 31 33 However, these changes are only reflected when the manager makes its next call to @select@. … … 46 48 However, all three of these I/O systems have limitations. 47 49 The @man@ page for @O_NONBLOCK@ mentions that ``[@O_NONBLOCK@] has no effect for regular files and block devices'', which means none of these three system calls are viable multiplexing strategies for these types of \io operations. 48 Furthermore, TTYs can also be tricky to use since they can take different forms based on how the command is executed. 50 Furthermore, TTYs (FDs connected to standard input and output) can also be tricky to use since they can take different forms based on how the command is executed. 49 51 For example, @epoll@ rejects FDs pointing to regular files or block devices, which includes @stdin@ when using shell redirections~\cite[\S~3.6]{MAN:bash}, but does not reject shell pipelines~\cite[\S~3.2.3]{MAN:bash}, which includes pipelines into @stdin@. 50 52 Finally, none of these are useful solutions for multiplexing \io operations that do not have a corresponding file descriptor and can be awkward for operations using multiple file descriptors. … … 52 54 \subsection{POSIX asynchronous I/O (AIO)} 53 55 An alternative to @O_NONBLOCK@ is the AIO interface. 54 Its interface lets programmers enqueue operations to be performed asynchronously by the kernel. 55 Completions of these operations can be communicated in various ways: either by spawning a new \gls{kthrd}, sending a Linux signal, or polling for completion of one or more operations.
56 For this work, spawning a new \gls{kthrd} is counterproductive but a related solution is discussed in Section~\ref{io:morethreads}. 57 Using interrupt handlers can also lead to fairly complicated interactions between subsystems and has a non-trivial cost. 58 Leaving polling for completion, which is similar to the previous system calls. 59 AIO only supports read and write operations to file descriptors, it does not have the same limitation as @O_NONBLOCK@, \ie, the file descriptors can be regular files and blocked devices. 60 It also supports batching multiple operations in a single system call. 61 62 AIO offers two different approaches to polling: @aio_error@ can be used as a spinning form of polling, returning @EINPROGRESS@ until the operation is completed, and @aio_suspend@ can be used similarly to @select@, @poll@ or @epoll@, to wait until one or more requests have been completed. 63 For \io multiplexing, @aio_suspend@ is the best interface. 64 However, even if AIO requests can be submitted concurrently, @aio_suspend@ suffers from the same limitation as @select@ and @poll@, \ie, the interest set cannot be dynamically changed while a call to @aio_suspend@ is in progress. 56 Using AIO, programmers can enqueue operations which are to be performed 57 asynchronously by the kernel. 58 The kernel can communicate 59 completions of these operations in three ways: 60 it can spawn a new \gls{kthrd}; send a Linux signal; or 61 userspace can poll for completion of one or more operations. 62 Spawning a new \gls{kthrd} is not consistent with working at the user-level thread level, but Section~\ref{io:morethreads} discusses a related solution. 63 Signals and their associated interrupt handlers can also lead to fairly complicated 64 interactions between subsystems, and they have a non-trivial cost. 65 This leaves a single option: polling for completion---this is similar to the previously discussed 66 system calls. 
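The spinning form of AIO polling can be made concrete with a small sketch using the POSIX AIO API; @aio_read_poll@ is a hypothetical helper, and a real engine would run other work instead of busy-waiting:

```c
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

// Enqueue one asynchronous read, then poll aio_error() until it stops
// returning EINPROGRESS. Returns the number of bytes read, or -1 on error.
static ssize_t aio_read_poll(int fd, void *buf, size_t len, off_t off) {
    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = len;
    cb.aio_offset = off;
    if (aio_read(&cb) != 0) return -1;      // enqueue the request
    while (aio_error(&cb) == EINPROGRESS)   // spin until the kernel finishes
        ;                                    // placeholder: run other work here
    return aio_return(&cb);                  // fetch the operation's result
}
```

Note that, unlike @O_NONBLOCK@, this works on regular files; the cost is one @aio_error@ poll per outstanding request.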
67 While AIO only supports read and write operations to file descriptors, it does not have the same limitations as @O_NONBLOCK@, \ie, the file 68 descriptors can be regular files or block devices. 69 AIO also supports batching multiple operations in a single system call. 70 71 AIO offers two different approaches to polling: @aio_error@ can be used as a spinning form of polling, returning @EINPROGRESS@ until the operation is completed, while @aio_suspend@ can be used similarly to @select@, @poll@ or @epoll@, to wait until one or more requests have been completed. 72 Asynchronous interfaces normally handle more of the complexity than retry-based interfaces, which is convenient for \io multiplexing. 73 However, even if AIO requests can be submitted concurrently, @aio_suspend@ suffers from the same limitation as @select@ and @poll@: the interest set cannot be dynamically changed while a call to @aio_suspend@ is in progress. 65 74 AIO also suffers from the limitation of specifying which requests have been completed, \ie programmers have to poll each request in the interest set using @aio_error@ to identify the completed requests. 66 75 This limitation means that, like @select@ and @poll@ but not @epoll@, the time needed to examine polling results increases based on the total number of requests monitored, not the number of completed requests. … … 92 101 A very recent addition to Linux, @io_uring@~\cite{MAN:io_uring}, is a framework that aims to solve many of the problems listed in the above interfaces. 93 102 Like AIO, it represents \io operations as entries added to a queue.
94 But like @epoll@, new requests can be submitted , while a blocking call waiting for requests to complete,is already in progress.95 The @io_uring@ interface uses two ring buffers (referred to simply as rings) at its core: a submit ring to which programmers push \io requests and a completion ringfrom which programmers poll for completion.103 But like @epoll@, new requests can be submitted while a blocking call waiting for requests to complete is already in progress. 104 The @io_uring@ interface uses two ring buffers (referred to simply as rings) at its core: a submit ring, to which programmers push \io requests, and a completion ring, from which programmers poll for completion. 96 105 97 106 One of the big advantages over the prior interfaces is that @io_uring@ also supports a much wider range of operations. 98 In addition to supporting reads and writes to any file descriptor like AIO, it supports other operationslike @open@, @close@, @fsync@, @accept@, @connect@, @send@, @recv@, @splice@, \etc.107 In addition to supporting reads and writes to any file descriptor like AIO, it also supports other operations, like @open@, @close@, @fsync@, @accept@, @connect@, @send@, @recv@, @splice@, \etc. 99 108 100 109 On top of these, @io_uring@ adds many extras like avoiding copies between the kernel and user space using shared memory, allowing different mechanisms to communicate with device drivers, and supporting chains of requests, \ie, requests that automatically trigger follow-up requests on completion. … … 106 115 This approach is used by languages like Go~\cite{GITHUB:go}, frameworks like libuv~\cite{libuv}, and web servers like Apache~\cite{apache} and NGINX~\cite{nginx}, since it has the advantage that it can easily be used across multiple operating systems. 107 116 This advantage is especially relevant for languages like Go, which offer a homogeneous \glsxtrshort{api} across all platforms. 
108 As opposedto C, which has a very limited standard \glsxtrshort{api} for \io, \eg, the C standard library has no networking.117 Contrast this to C, which has a very limited standard \glsxtrshort{api} for \io, \eg, the C standard library has no networking. 109 118 110 119 \subsection{Discussion} 111 These options effectively fall into two broad camps: waiting for \io to be ready versus waiting for \io to complete.112 All operating systems that support asynchronous \io must offer an interface along one of these lines, but the details vary drastically.113 For example, Free BSD offers @kqueue@~\cite{MAN:bsd/kqueue}, which behaves similarly to @epoll@, but with some small quality of use improvements, while Windows (Win32)~\cite{win:overlap} offers ``overlapped I/O'', which handles submissions similarly to @O_NONBLOCK@ with extra flags on the synchronous system call, but waits for completion events, similarly to @io_uring@.114 115 For this project, I selected @io_uring@, in large part sbecause of its generality.120 These options effectively fall into two broad camps: waiting for \io to be ready, versus waiting for \io to complete. 121 All operating systems that support asynchronous \io must offer an interface along at least one of these lines, but the details vary drastically. 122 For example, FreeBSD offers @kqueue@~\cite{MAN:bsd/kqueue}, which behaves similarly to @epoll@, but with some small quality of life improvements, while Windows (Win32)~\cite{win:overlap} offers ``overlapped I/O'', which handles submissions similarly to @O_NONBLOCK@ with extra flags on the synchronous system call, but waits for completion events, similarly to @io_uring@. 123 124 For this project, I selected @io_uring@, in large part because of its generality. 
116 125 While @epoll@ has been shown to be a good solution for socket \io (\cite{Karsten20}), @io_uring@'s transparent support for files, pipes, and more complex operations, like @splice@ and @tee@, make it a better choice as the foundation for a general \io subsystem. 117 126 118 127 \section{Event-Engine} 119 128 An event engine's responsibility is to use the kernel interface to multiplex many \io operations onto few \glspl{kthrd}. 120 In concrete terms, this means \ats enter the engine through an interface, the event engine then starts an operation and parks the calling \ats, returningcontrol to the \gls{proc}.129 In concrete terms, this means \ats enter the engine through an interface, the event engine then starts an operation and parks the calling \ats, and then returns control to the \gls{proc}. 121 130 The parked \ats are then rescheduled by the event engine once the desired operation has been completed. 122 131 … … 125 134 Figure~\ref{fig:iouring} shows an overview of an @io_uring@ instance. 126 135 Two ring buffers are used to communicate with the kernel: one for submissions~(left) and one for completions~(right). 127 The submission ring contains entries,\newterm{Submit Queue Entries} (SQE), produced (appended) by the application when an operation starts and then consumed by the kernel.128 The completion ring contains entries,\newterm{Completion Queue Entries} (CQE), produced (appended) by the kernel when an operation completes and then consumed by the application.136 The submission ring contains \newterm{Submit Queue Entries} (SQE), produced (appended) by the application when an operation starts and then consumed by the kernel. 137 The completion ring contains \newterm{Completion Queue Entries} (CQE), produced (appended) by the kernel when an operation completes and then consumed by the application. 
129 138 The submission ring contains indexes into the SQE array (denoted \emph{S} in the figure) containing entries describing the I/O operation to start; 130 139 the completion ring contains entries for the completed I/O operation. … … 134 143 \centering 135 144 \input{io_uring.pstex_t} 136 \caption[Overview of \lstinline{io_uring}]{Overview of \lstinline{io_uring} \smallskip\newline Two ring buffers are used to communicate with the kernel, one for completions~(right) and one for submissions~(left). The submission ring indexes into a pre-allocated array (denoted \emph{S}) instead.} 145 \caption[Overview of \lstinline{io_uring}]{Overview of \lstinline{io_uring} \smallskip\newline Two ring buffers are used to communicate with the kernel, one for completions~(right) and one for submissions~(left). 146 While the completion ring contains plain data, the submission ring contains only references. 147 These references are indexes into an array (denoted \emph{S}), which is created at the same time as the two rings and is also readable by the kernel.} 137 148 \label{fig:iouring} 138 149 \end{figure} … … 153 164 Since the head is visible to the kernel, some memory barriers may be required to prevent the compiler from reordering these operations. 154 165 Since the submission ring is a regular ring buffer, more than one SQE can be added at once and the head is updated only after all entries are updated. 155 Note, SQE can be filled and submitted in any order, \eg in Figure~\ref{fig:iouring} the submission order is S0, S3, S2 andS1 has not been submitted.166 Note, SQE can be filled and submitted in any order, \eg in Figure~\ref{fig:iouring} the submission order is S0, S3, S2. S1 has not been submitted. 156 167 \item 157 168 The kernel is notified of the change to the ring using the system call @io_uring_enter@. 158 169 The number of elements appended to the submission ring is passed as a parameter and the number of elements consumed is returned. 
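The publication protocol for the submission ring can be sketched as a small userspace model. This is NOT the real @io_uring@ memory layout or ABI; it only illustrates the steps above, following the text's convention that the producer advances the head with a release store so the consumer never observes a half-filled entry:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define RING_SZ 8u                        // power of two, as io_uring requires

struct sqe { int opcode; uint64_t user_data; };

// Simplified model: a pre-allocated SQE array ("S" in the figure) plus a
// ring of indexes into it. `head` is the producer position, visible to the
// (simulated) kernel; `tail` is the consumer position.
struct subring {
    struct sqe       sqes[RING_SZ];
    uint32_t         idx [RING_SZ];
    _Atomic uint32_t head;
    uint32_t         tail;
};

// Fill an SQE, then publish it: write the index first, advance head last,
// with release ordering so the consumer never sees a partial entry.
static void submit(struct subring *r, uint32_t slot, int op, uint64_t ud) {
    r->sqes[slot] = (struct sqe){ op, ud };
    uint32_t h = atomic_load_explicit(&r->head, memory_order_relaxed);
    r->idx[h & (RING_SZ - 1)] = slot;
    atomic_store_explicit(&r->head, h + 1, memory_order_release);
}

// Consumer side (the role the kernel plays): pop the next published entry.
static int consume(struct subring *r, struct sqe *out) {
    if (r->tail == atomic_load_explicit(&r->head, memory_order_acquire))
        return 0;                         // ring empty
    *out = r->sqes[r->idx[r->tail++ & (RING_SZ - 1)]];
    return 1;
}
```

Because the ring stores indexes rather than the SQEs themselves, SQE slots can be filled and submitted in any order, exactly as in the figure.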
159 The @io_uring@ instance can be constructed so this step is not required, but this requireselevated privilege.% and an early version of @io_uring@ had additional restrictions.170 The @io_uring@ instance can be constructed so this step is not required, but this feature requires that the process have elevated privilege.% and an early version of @io_uring@ had additional restrictions. 160 171 \end{enumerate} 161 172 … … 165 176 When operations do complete, the kernel appends a CQE to the completion ring and advances the head of the ring. 166 177 Each CQE contains the result of the operation as well as a copy of the @user_data@ field of the SQE that triggered the operation. 167 It is not necessary to call @io_uring_enter@ to get new events because the kernel can directly modify the completion ring.168 The system call is only needed if the application wants to block waiting for operations to complete.178 The @io_uring_enter@ system call is only needed if the application wants to block waiting for operations to complete or to flush the submission ring. 179 @io_uring@ supports option @IORING_SETUP_SQPOLL@ at creation, which can remove the need for the system call for submissions. 169 180 \end{sloppypar} 170 181 … … 179 190 This restriction means \io request bursts may have to be subdivided and submitted in chunks at a later time. 180 191 181 An important detail to keep in mind is that just like ``The cloud is just someone else's computer'' \cite{xkcd:cloud}, asynchronous operations are just operations using someone else's threads.192 An important detail to keep in mind is that just like ``The cloud is just someone else's computer''~\cite{xkcd:cloud}, asynchronous operations are just operations using someone else's threads. 182 193 Indeed, asynchronous operations can require computation time to complete, which means that if this time is not taken from the thread that triggered the asynchronous operation, it must be taken from some other threads. 
183 194 In this case, the @io_uring@ operations that cannot be handled directly in the system call must be delegated to some other \gls{kthrd}. 184 195 To this end, @io_uring@ maintains multiple \glspl{kthrd} inside the kernel that are not exposed to the user. 185 Three kinds of operations that can need the \glspl{kthrd} :196 Three kinds of operations that can need the \glspl{kthrd} are: 186 197 187 198 \paragraph{Operations using} @IOSQE_ASYNC@. … … 190 201 \paragraph{Bounded operations.} 191 202 This is also a fairly simple case. As mentioned earlier in this chapter, [@O_NONBLOCK@] has no effect for regular files and block devices. 192 @io_uring@ must also take this reality into accountby delegating operations on regular files and block devices.203 Therefore, @io_uring@ handles this case by delegating operations on regular files and block devices. 193 204 In fact, @io_uring@ maintains a pool of \glspl{kthrd} dedicated to these operations, which are referred to as \newterm{bounded workers}. 194 205 195 206 \paragraph{Unbounded operations that must be retried.} 196 207 While operations like reads on sockets can return @EAGAIN@ instead of blocking the \gls{kthrd}, in the case these operations return @EAGAIN@ they must be retried by @io_uring@ once the data is available on the socket. 197 Since this retry cannot necessarily be done in the system call, @io_uring@ must delegate these calls to a \gls{kthrd}.208 Since this retry cannot necessarily be done in the system call, \ie, using the application's \gls{kthrd}, @io_uring@ must delegate these calls to \glspl{kthrd} in the kernel. 198 209 @io_uring@ maintains a separate pool for these operations. 199 210 The \glspl{kthrd} in this pool are referred to as \newterm{unbounded workers}. 211 Once unbounded operations are ready to be retried, one of the workers is woken up and it will handle the retry inside the kernel. 200 212 Unbounded workers are also responsible for handling operations using @IOSQE_ASYNC@. 
201 213 … … 212 224 however, the duration of the system call scales with the number of entries submitted. 213 225 The consequence is that the amount of parallelism used to prepare submissions for the next system call is limited. 214 Beyond this limit, the length of the system call is the throughput 226 Beyond this limit, the length of the system call is the throughput-limiting factor. 215 227 I concluded from early experiments that preparing submissions seems to take almost as long as the system call itself, which means that with a single @io_uring@ instance, there is no benefit in terms of \io throughput to having more than two \glspl{hthrd}. 216 228 Therefore, the design of the submission engine must manage multiple instances of @io_uring@ running in parallel, effectively sharding @io_uring@ instances. 217 229 Since completions are sent to the instance where requests were submitted, all instances with pending operations must be polled continuously\footnote{ 218 As described in Chapter~\ref{practice}, this does not translate into constant CPU usage.}. 219 Note that once an operation completes, there is nothing that ties it to the @io_uring@ instance that handled it. 220 Nothing preventing a new operation, with for example the same file descriptor, to use a different @io_uring@ instance. 230 As described in Chapter~\ref{practice}, this does not translate into high CPU usage.}. 231 Note that once an operation completes, there is nothing that ties it to the @io_uring@ instance that handled it --- nothing prevents a new operation, with for example the same file descriptor, from using a different @io_uring@ instance. 221 232 222 233 A complicating aspect of submission is @io_uring@'s support for chains of operations, where the completion of an operation triggers the submission of the next operation on the link. 223 234 SQEs forming a chain must be allocated from the same instance and must be contiguous in the Submission Ring (see Figure~\ref{fig:iouring}). 
224 235 The consequence of this feature is that filling SQEs can be arbitrarily complex, and therefore, users may need to run arbitrary code between allocation and submission. 225 Supporting chains is not a requirement of the\io subsystem, but it is still valuable.236 For this work, supporting chains is not a requirement of the \CFA \io subsystem, but it is still valuable. 226 237 Support for this feature can be fulfilled simply by supporting arbitrary user code between allocation and submission. 227 238 … … 237 248 To remove this requirement, a \at needs the ability to ``yield to a specific \gls{proc}'', \ie, \park with the guarantee it unparks on a specific \gls{proc}, \ie the \gls{proc} attached to the correct ring.} 238 249 From the subsystem's point of view, the allocation and submission are sequential, greatly simplifying both. 239 In this design, allocation and submission form a partitioned ring buffer as shown in Figure~\ref{fig:pring}.250 In this design, allocation and submission form a partitioned ring buffer, as shown in Figure~\ref{fig:pring}. 240 251 Once added to the ring buffer, the attached \gls{proc} has a significant amount of flexibility with regard to when to perform the system call. 241 252 Possible options are: when the \gls{proc} runs out of \ats to run, after running a given number of \ats, \etc. … … 244 255 \centering 245 256 \input{pivot_ring.pstex_t} 246 \caption[Partitioned ring buffer]{Partitioned ring buffer \smallskip\newline Allocated sqes are appended to the first partition.257 \caption[Partitioned ring buffer]{Partitioned ring buffer \smallskip\newline Allocated SQEs are appended to the first partition. 247 258 When submitting, the partition is advanced. 248 259 The kernel considers the partition as the head of the ring.} … … 253 264 However, this benefit means \ats submitting \io operations have less flexibility: they cannot \park or yield, and several exceptional cases are handled poorly. 
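The partitioned ring buffer of Figure~\ref{fig:pring} can be sketched as follows; the names are hypothetical, not the \CFA runtime's. One ring carries two boundaries: entries in [kernel_head, submitted) are visible to the kernel, and entries in [submitted, allocated) have been handed out but not yet submitted:

```c
#include <assert.h>
#include <stdint.h>

#define PRING_SZ 8u

struct pring {
    uint64_t entries[PRING_SZ];
    uint32_t kernel_head;   // consumed by the (simulated) kernel
    uint32_t submitted;     // partition: what the kernel sees as the head
    uint32_t allocated;     // end of the allocation partition
};

// Allocate the next slot, appending to the allocation partition.
// Returns the slot index, or -1 when the ring is full (allocation can fail).
static int pring_alloc(struct pring *r) {
    if (r->allocated - r->kernel_head == PRING_SZ) return -1;
    return (int)(r->allocated++ & (PRING_SZ - 1));
}

// Submit everything allocated so far by advancing the partition boundary.
// Because allocation and submission happen sequentially on one processor,
// no synchronization is needed on this path.
static uint32_t pring_submit(struct pring *r) {
    uint32_t n = r->allocated - r->submitted;
    r->submitted = r->allocated;
    return n;               // number of entries newly visible to the kernel
}
```

The attached processor is free to decide when to call @pring_submit@, matching the flexibility described above (after running out of ready threads, after a batch, \etc).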
254 265 Instances running out of SQEs cannot run \ats wanting to do \io operations. 255 In this case, the \io \at needs to be moved to a different \gls{proc}, and the only current way of achieving this is to @yield()@ hoping to be scheduled on a different \gls{proc} with free SQEs, which is not guaranteed .266 In this case, the \io \at needs to be moved to a different \gls{proc}, and the only current way of achieving this is to @yield()@ hoping to be scheduled on a different \gls{proc} with free SQEs, which is not guaranteed to ever occur. 256 267 257 268 A more involved version of this approach tries to solve these problems using a pattern called \newterm{helping}. 258 \ atsthat cannot submit \io operations, either because of an allocation failure or \glslink{atmig}{migration} to a different \gls{proc} between allocation and submission, create an \io object and add it to a list of pending submissions per \gls{proc} and a list of pending allocations, probably per cluster.269 \Glspl{at} that cannot submit \io operations, either because of an allocation failure or \glslink{atmig}{migration} to a different \gls{proc} between allocation and submission, create an \io object and add it to a list of pending submissions per \gls{proc} and a list of pending allocations, probably per cluster. 259 270 While there is still a strong coupling between \glspl{proc} and @io_uring@ instances, these data structures allow moving \ats to a specific \gls{proc}, when the current \gls{proc} cannot fulfill the \io request. 260 271 … … 263 274 In this case, the helping solution has the \io \at append an \io object to the submission list of the first \gls{proc}, where the allocation was made. 264 275 No other \gls{proc} can help the \at since @io_uring@ instances are strongly coupled to \glspl{proc}. 265 However, the \io \gls{proc} is unable to help because it is executing the spinning \at resulting in a deadlock. 
276 However, the \io \gls{proc} is unable to help because it is executing the spinning \at. 277 This results in a deadlock. 266 278 While this example is artificial, in the presence of many \ats, this problem can arise ``in the wild''. 267 279 Furthermore, this pattern is difficult to reliably detect and avoid. … … 274 286 \subsubsection{Public Instances} 275 287 The public approach creates decoupled pools of @io_uring@ instances and processors, \ie without one-to-one coupling. 276 \ atsattempting an \io operation pick one of the available instances and submit the operation to that instance.288 \Glspl{at} attempting an \io operation pick one of the available instances and submit the operation to that instance. 277 289 Since there is no coupling between @io_uring@ instances and \glspl{proc} in this approach, \ats running on more than one \gls{proc} can attempt to submit to the same instance concurrently. 278 290 Because @io_uring@ effectively sets the amount of sharding needed to avoid contention on its internal locks, performance in this approach is based on two aspects: … … 282 294 \item 283 295 The scheme to route \io requests to specific @io_uring@ instances does not introduce contention. 284 This aspect has oversized importancebecause it comes into play before the sharding of instances, and as such, all \glspl{hthrd} can contend on the routing algorithm.296 This aspect is very important because it comes into play before the sharding of instances, and as such, all \glspl{hthrd} can contend on the routing algorithm. 285 297 \end{itemize} 286 298 287 299 Allocation in this scheme is fairly easy. 288 Free SQEs, \ie, SQEs that are not currently being used to represent a request, can be written-to safely and have a field called @user_data@ that the kernel only reads to copy to CQEs.300 Free SQEs, \ie, SQEs that are not currently being used to represent a request, can be written-to safely, and have a field called @user_data@ that the kernel only reads to copy to CQEs. 
289 301 Allocation also does not require ordering guarantees as all free SQEs are interchangeable. 290 302 The only added complexity is that the number of SQEs is fixed, which means allocation can fail. … … 312 324 Since CQEs only own a signed 32-bit result, in addition to the copy of the @user_data@ field, all that is needed to communicate the result is a simple future~\cite{wiki:future}. 313 325 If the submission side does not designate submitters, polling can also submit all SQEs as it is polling events. 314 A simple approach to polling is to allocate a \at per @io_uring@ instance and simply let the poller \ats poll their respective instances when scheduled.315 316 With the pool of SQE instances approach, the big advantageis that it is fairly flexible.326 A simple approach to polling is to allocate a user-level \at per @io_uring@ instance and simply let the poller \ats poll their respective instances when scheduled. 327 328 The big advantage of the pool of SQE instances approach is that it is fairly flexible. 317 329 It does not impose restrictions on what \ats submitting \io operations can and cannot do between allocations and submissions. 318 330 It also can gracefully handle running out of resources, SQEs or the kernel returning @EBUSY@. … … 320 332 The routing and allocation algorithm needs to keep track of which ring instances have available SQEs, block incoming requests if no instance is available, prevent barging if \ats are already queued up waiting for SQEs and handle SQEs being freed. 321 333 The submission side needs to safely append SQEs to the ring buffer, correctly handle chains, make sure no SQE is dropped or left pending forever, notify the allocation side when SQEs can be reused, and handle the kernel returning @EBUSY@. 
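Since a CQE carries only a signed 32-bit result plus the copied @user_data@, the simple future mentioned above can be very small. The following is a minimal sketch with hypothetical names; a real runtime parks the submitting thread instead of spinning:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

// The SQE's user_data field holds a pointer to this struct; the completion
// poller fulfills it from the matching CQE.
struct io_future {
    _Atomic int done;       // set last, with release ordering
    int32_t     result;     // the CQE's signed 32-bit result
};

// Called by the completion poller for each CQE it drains.
static void future_fulfill(struct io_future *f, int32_t result) {
    f->result = result;
    atomic_store_explicit(&f->done, 1, memory_order_release);
}

// Called by the submitter; a real runtime would park here and be
// rescheduled by the poller, rather than busy-waiting.
static int32_t future_wait(struct io_future *f) {
    while (!atomic_load_explicit(&f->done, memory_order_acquire))
        ;                   // spin (placeholder for parking)
    return f->result;
}
```

The release/acquire pair guarantees the submitter reads a fully written @result@ once it observes @done@.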
322 Compared to the private-instance approach, all this synchronization has a significant cost and this synchronization is entirely overhead. 334 All this synchronization has a significant cost, compared to the private-instance approach which does not have synchronization costs in most cases. 323 335 324 336 \subsubsection{Instance borrowing} 325 337 Both of the prior approaches have undesirable aspects that stem from tight or loose coupling between @io_uring@ and \glspl{proc}. 326 The first approach suffers from tight coupling causing problems when a \gls{proc} does not benefit from the coupling. 327 The second approach suffers from loose couplings causing operations to have synchronization overhead, which tighter coupling avoids. 338 The first approach suffers from tight coupling, causing problems when a \gls{proc} does not benefit from the coupling. 339 The second approach suffers from loose coupling, causing operations to have synchronization overhead, which tighter coupling avoids. 328 340 When \glspl{proc} are continuously issuing \io operations, tight coupling is valuable since it avoids synchronization costs. 329 341 However, in unlikely failure cases or when \glspl{proc} are not using their instances, tight coupling is no longer advantageous. … … 332 344 While instance borrowing looks similar to work sharing and stealing, I think it is different enough to warrant a different verb to avoid confusion.} 333 345 346 As mentioned later in this section, this approach is not ultimately used, but here is still a high-level outline of the algorithm. 334 347 In this approach, each cluster, see Figure~\ref{fig:system}, owns a pool of @io_uring@ instances managed by an \newterm{arbiter}. 335 When a \at attempts to issue an \io operation, it ask for an instance from the arbiterand issues requests to that instance. 348 When a \at attempts to issue an \io operation, it asks for an instance from the arbiter, and issues requests to that instance.
336 349 This instance is now bound to the \gls{proc} the \at is running on. 337 350 This binding is kept until the arbiter decides to revoke it, taking back the instance and reverting the \gls{proc} to its initial \io state. … … 343 356 \item The current \gls{proc} does not hold an instance. 344 357 \item The current instance does not have sufficient SQEs to satisfy the request. 345 \item The current \gls{proc} has a wrong instance, this happens if the submitting \at context-switched between allocation and submission, called \newterm{external submissions}. 358 \item The current \gls{proc} has a wrong instance. 359 This happens if the submitting \at context-switched between allocation and submission: \newterm{external submissions}. 346 360 \end{enumerate} 347 361 However, even when the arbiter is not directly needed, \glspl{proc} need to make sure that their instance ownership is not being revoked, which is accomplished by a lock-\emph{less} handshake.\footnote{ 348 Note the handshake is not lock-\emph{free} since it lacks the proper progress guarantee.}362 Note the handshake is not lock-\emph{free}~\cite{wiki:lockfree} since it lacks the proper progress guarantee.} 349 363 A \gls{proc} raises a local flag before using its borrowed instance and checks if the instance is marked as revoked or if the arbiter has raised its flag. 350 364 If not, it proceeds, otherwise it delegates the operation to the arbiter. … … 365 379 However, there is no need to immediately revoke the instance. 366 380 External submissions must simply be added to the ring before the next system call, \ie, when the submission ring is flushed. 367 This means whoever is responsible for the system call , first checks ifthe instance has any external submissions.381 This means whoever is responsible for the system call first checks whether the instance has any external submissions. 368 382 If so, it asks the arbiter to revoke the instance and add the external submissions to the ring. 
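The lock-less handshake can be sketched with two flags. This is a simplified, hypothetical version, exercised sequentially here for clarity, whereas the real protocol runs concurrently between a processor and the arbiter:

```c
#include <assert.h>
#include <stdatomic.h>

struct borrowed {
    _Atomic int proc_busy;     // raised by the proc before each use
    _Atomic int revoked;       // raised by the arbiter to reclaim the instance
};

// Proc side: returns 1 if the borrowed instance may be used directly,
// 0 if the operation must instead be delegated to the arbiter.
static int try_use(struct borrowed *b) {
    atomic_store_explicit(&b->proc_busy, 1, memory_order_seq_cst);
    if (atomic_load_explicit(&b->revoked, memory_order_seq_cst)) {
        atomic_store_explicit(&b->proc_busy, 0, memory_order_seq_cst);
        return 0;              // arbiter won the race: delegate
    }
    return 1;                  // safe to use; call end_use() when done
}

static void end_use(struct borrowed *b) {
    atomic_store_explicit(&b->proc_busy, 0, memory_order_seq_cst);
}

// Arbiter side: returns 1 if the instance is reclaimed immediately,
// 0 if the proc is mid-use and the arbiter must wait or retry.
static int try_revoke(struct borrowed *b) {
    atomic_store_explicit(&b->revoked, 1, memory_order_seq_cst);
    return !atomic_load_explicit(&b->proc_busy, memory_order_seq_cst);
}
```

As noted in the text, this handshake is lock-\emph{less} but not lock-\emph{free}: neither side has a bounded-step progress guarantee when the other holds its flag.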
369 383
…
382 396
383 397 \section{Interface}
384 The last important part of the \io subsystem is its interface.
398 The final part of the \io subsystem is its interface.
385 399 Multiple approaches can be offered to programmers, each with advantages and disadvantages.
386 The new \io subsystem can replace the C runtime API or extend it, and in the latter case, the interface can go from very similar to vastly different.
387 The following sections discuss some useful options using @read@ as an example.
400 The new \CFA \io subsystem can replace the C runtime API or extend it, and in the latter case, the interface can go from very similar to vastly different.
401 The following sections discuss some useful options, using @read@ as an example.
388 402 The standard Linux interface for C is:
389 403 \begin{cfa}
…
392 406
393 407 \subsection{Replacement}
394 Replacing the C \glsxtrshort{api} is the more intrusive and draconian approach.
395 The goal is to convince the compiler and linker to replace any calls to @read@ to direct them to the \CFA implementation instead of glibc's.
408 Replacing the C \io subsystem is the more intrusive and draconian approach.
409 The goal is to convince the compiler and linker to replace any calls to @read@ by calls to the \CFA implementation instead of glibc's.
410 This rerouting has the advantage of working transparently and supporting existing binaries without necessarily needing recompilation.
411 It also offers a presumably well-known and familiar API that C programmers can simply continue to work with.
412 %However, this approach also entails a plethora of subtle technical challenges, which generally boil down to making a perfect replacement.
413 However, when using this approach, any and all calls to the C \io subsystem must be replaced, since using a mix of the C and \CFA \io subsystems can easily lead to esoteric concurrency bugs.
414 This approach was rejected as being laudable but infeasible.
401 415
402 416 \subsection{Synchronous Extension}
403 417 Another interface option is to offer an interface different in name only.
418 In this approach, an alternative call is created for each supported system call.
404 419 For example:
405 420 \begin{cfa}
406 421 ssize_t cfa_read(int fd, void *buf, size_t count);
407 422 \end{cfa}
423 The new @cfa_read@ would have the same interface, behaviour, and guarantees as the @read@ system call, but allow the runtime system to use user-level blocking instead of kernel-level blocking.
424
408 425 This approach is feasible and still familiar to C programmers.
409 It comes with the caveat that any code attempting to use it must be recompiled, which is a problem considering the amount of existing legacy C binaries.
426 It comes with the caveat that any code attempting to use it must be modified, which is a problem considering the amount of existing legacy C binaries.
410 427 However, it has the advantage of implementation simplicity.
411 428 Finally, there is a certain irony to using a blocking synchronous interface for a feature often referred to as ``non-blocking'' \io.
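To make the name-only idea concrete, a trivial placeholder for @cfa_read@ can be written in plain C: it keeps @read@'s exact signature and error convention, and here simply forwards to the blocking system call. This is a sketch, not the \CFA implementation, which would instead submit the request to @io_uring@ and block only the calling user-level thread.

```c
#include <unistd.h>
#include <sys/types.h>

// Placeholder sketch: same signature and semantics as read().
// A real user-level runtime would submit the request asynchronously
// here and park the calling user-level thread until completion,
// leaving the kernel thread free to run other user-level threads.
ssize_t cfa_read( int fd, void * buf, size_t count ) {
	return read( fd, buf, count );  // stand-in for submit-and-park
}
```

Because the signature matches @read@ exactly, porting existing C code is a textual substitution, which is precisely the appeal (and, as noted above, the limitation) of this approach.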
…
416 433 future(ssize_t) read(int fd, void *buf, size_t count);
417 434 \end{cfa}
418 where the generic @future@ is fulfilled when the read completes and it contains the number of bytes read, which may be less than the number of bytes requested.
435 where the generic @future@ is fulfilled when the read completes, with the count of bytes actually read, which may be less than the number of bytes requested.
419 436 The data read is placed in @buf@.
420 The problem is that both the bytes read and data form the synchronization object, not just the bytes read.
421 Hence, the buffer cannot be reused until the operation completes but the synchronization does not cover the buffer.
437 The problem is that both the byte count and data form the synchronization object, not just the bytes read.
438 Hence, the buffer cannot be reused until the operation completes, but the synchronization on the future does not enforce this.
422 439 A classical asynchronous API is:
423 440 \begin{cfa}
…
438 455 However, it is not the most user-friendly option.
439 456 It obviously imposes a strong dependency between user code and @io_uring@, but at the same time restricts users to usages that are compatible with how \CFA internally uses @io_uring@.
457
458 As of writing this document, \CFA offers both a synchronous extension and the first approach to the asynchronous extension:
459 \begin{cfa}
460 ssize_t cfa_read(int fd, void *buf, size_t count);
461 future(ssize_t) async_read(int fd, void *buf, size_t count);
462 \end{cfa}
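The buffer-ownership caveat discussed above can be illustrated with a minimal C sketch of a future-returning read. All names are hypothetical, and one helper kernel thread per request is used purely for brevity; the \CFA runtime instead multiplexes requests over @io_uring@. The point of the sketch is that only the byte count lives in the future, while the data lands in the caller's buffer, so the buffer is implicitly part of the synchronization object.

```c
#include <pthread.h>
#include <unistd.h>
#include <sys/types.h>

// Minimal stand-in for future(ssize_t): fulfilled with the byte count,
// while the data lands in the caller's buffer. The buffer must NOT be
// read or reused until wait_future() returns, because the helper thread
// is still writing into it; nothing in the type system enforces this.
typedef struct {
	pthread_t thread;
	int fd;
	void * buf;
	size_t count;
	ssize_t result;
} future_ssize_t;

static void * do_read( void * arg ) {
	future_ssize_t * f = arg;
	f->result = read( f->fd, f->buf, f->count );  // writes into caller's buf
	return NULL;
}

// Returns immediately; the read proceeds concurrently.
void async_read( future_ssize_t * f, int fd, void * buf, size_t count ) {
	f->fd = fd;  f->buf = buf;  f->count = count;
	pthread_create( &f->thread, NULL, do_read, f );
}

// Fulfilment point: only after this returns are both the count and the
// contents of the buffer valid.
ssize_t wait_future( future_ssize_t * f ) {
	pthread_join( f->thread, NULL );
	return f->result;
}
```

A caller that frees or overwrites @buf@ between @async_read@ and @wait_future@ races with the in-flight read, which is exactly the hazard the passage above identifies with the future-based interface.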