Context Navigation

-              rf6fd42aa
+              r14533d4
 \subsection{Multiplexing \io: Submission}
+The submission side is the most complicated aspect of @io_uring@ and from the design decisions made in the submission side, the completion side effectively follows.
+While it is possible to do the first steps of submission in parallel, the duration of the system call scales with number of entries submitted. The consequence of this is that how much parallelism can be used to prepare submissions for the next system call is limited. Beyond this limit, the length of the system call will be the throughput limiting factor. I have concluded from early experiments that preparing submissions seems to take about as long as the system call itself, which means that with a single @io_uring@ instance, there is no benefit in terms of \io throughput to having more than two \glspl{hthrd}. Therefore the design of the submission engine must manage multiple instances of @io_uring@ running in parallel, effectively sharding @io_uring@ instances. Similarly to scheduling, this sharding can be done privately, \ie, one instance per \glspl{proc}, or in decoupled pools, \ie, a pool of \glspl{proc} use a pool of @io_uring@ instances without one-to-one coupling between any given instance and any given \gls{proc}.
+\subsubsection{Pool of Instances}
+One approach is to have multiple shared instances. \Glspl{thrd} attempting \io operations pick one of the available instances and submits operations to that instance. Since the completion will be sent to the same instance, all instances with pending operations must be polled continously\footnote{As will be described in Chapter~\ref{practice}, this does not translate into constant cpu usage.}. Since there is no coupling between \glspl{proc} and @io_uring@ instances in this approach, \glspl{thrd} running on more than one \gls{proc} can attempt to submit to the same instance concurrently. Since @io_uring@ effectively sets the amount of sharding needed to avoid contention on its internal locks, performance in this approach is based on two aspects: the synchronization needed to submit does not induce more contention than @io_uring@ already does and the scheme to route \io requests to specific @io_uring@ instances does not introduce contention. This second aspect has an oversized importance because it comes into play before the sharding of instances, and as such, all \glspl{hthrd} can contend on the routing algorithm.
+Allocation in this scheme can be handled fairly easily. Free @sqe@s, \ie, @sqe@s that aren't currently being used to represent a request, can be written to safely and have a field called @user_data@ which the kernel only reads to copy to @cqe@s. Allocation also requires no ordering guarantee as all free @sqe@s are interchangeable. This requires a simple concurrent bag. The only added complexity is that the number of @sqe@s is fixed, which means allocation can fail. This failure needs to be pushed up to the routing algorithm, \glspl{thrd} attempting \io operations must not be directed to @io_uring@ instances without any available @sqe@s. Ideally, the routing algorithm would block operations up-front if none of the instances have available @sqe@s.
+The submission side is the most complicated aspect of @io_uring@ and the completion side effectively follows from the design decisions made in the submission side.
+While it is possible to do the first steps of submission in parallel, the duration of the system call scales with number of entries submitted. The consequence of this is that how much parallelism can be used to prepare submissions for the next system call is limited. Beyond this limit, the length of the system call will be the throughput limiting factor. I have concluded from early experiments that preparing submissions seems to take about as long as the system call itself, which means that with a single @io_uring@ instance, there is no benefit in terms of \io throughput to having more than two \glspl{hthrd}. Therefore the design of the submission engine must manage multiple instances of @io_uring@ running in parallel, effectively sharding @io_uring@ instances. Similarly to scheduling, this sharding can be done privately, \ie, one instance per \glspl{proc}, in decoupled pools, \ie, a pool of \glspl{proc} use a pool of @io_uring@ instances without one-to-one coupling between any given instance and any given \gls{proc}, or some mix of the two. Since completions are sent to the instance where requests were submitted, all instances with pending operations must be polled continously\footnote{As will be described in Chapter~\ref{practice}, this does not translate into constant cpu usage.}.
+\subsubsection{Shared Instances}
+One approach is to have multiple shared instances. \Glspl{thrd} attempting \io operations pick one of the available instances and submit operations to that instance. Since there is no coupling between \glspl{proc} and @io_uring@ instances in this approach, \glspl{thrd} running on more than one \gls{proc} can attempt to submit to the same instance concurrently. Since @io_uring@ effectively sets the amount of sharding needed to avoid contention on its internal locks, performance in this approach is based on two aspects: the synchronization needed to submit does not induce more contention than @io_uring@ already does and the scheme to route \io requests to specific @io_uring@ instances does not introduce contention. This second aspect has an oversized importance because it comes into play before the sharding of instances, and as such, all \glspl{hthrd} can contend on the routing algorithm.
+Allocation in this scheme can be handled fairly easily. Free @sqe@s, \ie, @sqe@s that aren't currently being used to represent a request, can be written to safely and have a field called @user_data@ which the kernel only reads to copy to @cqe@s. Allocation also requires no ordering guarantee as all free @sqe@s are interchangeable. This requires a simple concurrent bag. The only added complexity is that the number of @sqe@s is fixed, which means allocation can fail. This is made worst from the fact that @io_uring@ users can chain together operations, in which case all @sqe@s forming a chain must be allocated from the same instance.
+Allocation failures need to be pushed up to the routing algorithm: \glspl{thrd} attempting \io operations must not be directed to @io_uring@ instances without sufficient @sqe@s available. Furthermore, the routing algorithm should block operations up-front if none of the instances have available @sqe@s.
 Once an @sqe@ is allocated, \glspl{thrd} can fill them normally, they simply need to keep trac of the @sqe@ index and which instance it belongs to.
 …
 Finally, the completion side is much simpler since the @io_uring@ system call enforces a natural synchronization point. Polling simply needs to regularly do the system call, go through the produced @cqe@s and communicate the result back to the originating \glspl{thrd}. Since @cqe@s only own a signed 32 bit result, in addition to the copy of the @user_data@ field, all that is needed to communicate the result is a simple future~\cite{wiki:future}. If the submission side does not designate submitters, polling can also submit all @sqe@s as it is polling events.  A simple approach to polling is to allocate a \gls{thrd} per @io_uring@ instance and simply let the poller \glspl{thrd} poll their respective instances when scheduled. This design is especially convinient for reasons explained in Chapter~\ref{practice}.
 With this pool of instances approach, the big advantage is that it is fairly flexible. It does not impose restrictions on what \glspl{thrd} submitting \io operations can and cannot do between allocations and submissions. It also can gracefully handle running out of ressources, @sqe@s or the kernel returning @EBUSY@. The down side to this is that many of the steps used for submitting need complex synchronization to work properly. The routing and allocation algorithm needs to keep track of which ring instances have available @sqe@s, block incoming requests if no instance is available, prevent barging if \glspl{thrd} are already queued up waiting for @sqe@s and handle @sqe@s being freed. The submission side needs to safely append @sqe@s to the ring buffer, make sure no @sqe@ is dropped or left pending forever, notify the allocation side when @sqe@s can be reused and handle the kernel returning @EBUSY@. Sharding the @io_uring@ instances should alleviate much of the contention caused by this, but all this synchronization may still have non-zero cost.
+With this pool of instances approach, the big advantage is that it is fairly flexible. It does not impose restrictions on what \glspl{thrd} submitting \io operations can and cannot do between allocations and submissions. It also can gracefully handles running out of ressources, @sqe@s or the kernel returning @EBUSY@. The down side to this is that many of the steps used for submitting need complex synchronization to work properly. The routing and allocation algorithm needs to keep track of which ring instances have available @sqe@s, block incoming requests if no instance is available, prevent barging if \glspl{thrd} are already queued up waiting for @sqe@s and handle @sqe@s being freed. The submission side needs to safely append @sqe@s to the ring buffer, make sure no @sqe@ is dropped or left pending forever, notify the allocation side when @sqe@s can be reused and handle the kernel returning @EBUSY@. All this synchronization may have a significant cost and, compare to the next approach presented, this synchronization is entirely overhead.
 \subsubsection{Private Instances}
 Another approach is to simply create one ring instance per \gls{proc}. This alleviate the need for synchronization on the submissions, requiring only that \glspl{thrd} are not interrupted in between two submission steps. This is effectively the same requirement as using @thread_local@ variables. Since @sqe@s that are allocated must be submitted to the same ring, on the same \gls{proc}, this effectively forces the application to submit @sqe@s in allocation order\footnote{The actual requirement is that \glspl{thrd} cannot context switch between allocation and submission. This requirement means that from the subsystem's point of view, the allocation and submission are sequential. To remove this requirement, a \gls{thrd} would need the ability to ``yield to a specific \gls{proc}'', \ie, park with the promise that it will be run next on a specific \gls{proc}, the \gls{proc} attached to the correct ring. This is not a current or planned feature of \CFA.}, greatly simplifying both allocation and submission. In this design, allocation and submission form a ring partitionned ring buffer as shown in Figure~\ref{fig:pring}. Once added to the ring buffer, the attached \gls{proc} has a significant amount of flexibility with regards to when to do the system call. Possible options are: when the \gls{proc} runs out of \glspl{thrd} to run, after running a given number of threads \glspl{thrd}, etc.
+Another approach is to simply create one ring instance per \gls{proc}. This alleviate the need for synchronization on the submissions, requiring only that \glspl{thrd} are not interrupted in between two submission steps. This is effectively the same requirement as using @thread_local@ variables. Since @sqe@s that are allocated must be submitted to the same ring, on the same \gls{proc}, this effectively forces the application to submit @sqe@s in allocation order\footnote{The actual requirement is that \glspl{thrd} cannot context switch between allocation and submission. This requirement means that from the subsystem's point of view, the allocation and submission are sequential. To remove this requirement, a \gls{thrd} would need the ability to ``yield to a specific \gls{proc}'', \ie, park with the promise that it will be run next on a specific \gls{proc}, the \gls{proc} attached to the correct ring.}, greatly simplifying both allocation and submission. In this design, allocation and submission form a ring partitionned ring buffer as shown in Figure~\ref{fig:pring}. Once added to the ring buffer, the attached \gls{proc} has a significant amount of flexibility with regards to when to do the system call. Possible options are: when the \gls{proc} runs out of \glspl{thrd} to run, after running a given number of threads \glspl{thrd}, etc.
 \begin{figure}
 …
 \end{figure}
+This approach has the advantage that it does not require much of the synchronization needed in the shared approach. This comes at the cost that \glspl{thrd} submitting \io operations have less flexibility, they cannot park or yield, and several exceptional cases are handled poorly. Instances running out of @sqe@s cannot run \glspl{thrd} wanting to do \io operations, in such a case the \gls{thrd} needs to be moved to a different \gls{proc}, the only current way of achieving this would be to @yield()@ hoping to be scheduled on a different \gls{proc}, which is not guaranteed. Another problematic case is that \glspl{thrd} that do not park for long periods of time will delay the submission of any @sqe@ not already submitted. This issue is similar to fairness issues which schedulers that use work-stealing mentioned in the previous chapter.
+This approach has the advantage that it does not require much of the synchronization needed in the shared approach. This comes at the cost that \glspl{thrd} submitting \io operations have less flexibility, they cannot park or yield, and several exceptional cases are handled poorly. Instances running out of @sqe@s cannot run \glspl{thrd} wanting to do \io operations, in such a case the \gls{thrd} needs to be moved to a different \gls{proc}, the only current way of achieving this would be to @yield()@ hoping to be scheduled on a different \gls{proc}, which is not guaranteed.
+A more involved version of this approach can seem to solve most of these problems, using a pattern called \newterm{helping}. \Glspl{thrd} that which to submit \io operations but cannot do so, either because of an allocation failure or because they were migrate to a different \gls{proc} between allocation and submission, create an object representing what they wish to achieve and add it to a list somewhere. For this particular problem, one solution would be to have a list per \gls{proc} of submissions that could not be completed because the thread was moved, and another list, probably per cluster, of \glspl{thrd} that where unable to allocate enough @sqe@s. The problem with these ``solutions'' is that they are still bound by the strong coupling between \glspl{proc} and @io_uring@ instances. Imagine a simple case with two \glspl{thrd} on two \glspl{proc}, one \gls{thrd} submits an \io operation and then sets a flag, the other \gls{thrd} spins until the flag is set. If the first \gls{thrd} is preempted between allocation and submission and moves to the other \gls{proc}, the original \gls{proc} could start running the spinning \gls{thrd}. If this happens, the helping ``solution'' is for the \gls{thrd} wanting to submit is \io operation to added append an item to a list belonging to the \gls{proc} where the allocation was made. No other \gls{proc} can help the \gls{thrd} since @io_uring@ instances are strongly coupled to \glspl{proc}. However, in this case, the \gls{proc} is unable to help because it is executing the spinning \gls{thrd} mentioned when first expression this case\footnote{This particular example is completely artificial, but in the presence of many more \glspl{thrd}, it is not impossible that this problem would arise ``in the wild'' and could be difficult for users to reliably detect and avoid.}. Once in this situation, the only escape is to interrupted the execution of the \gls{thrd}, either directly or due to regular preemption, only then can the \gls{proc} take the time to handle the pending request to help. Interrupting \glspl{thrd} for this purpose is far from desireable, the cost is significant and the situation may be hard to detect. However, a more subtle reason why interrupting the \gls{thrd} is not a satisfying solution comes from the fact that the \gls{proc} is not actually using the instance it is tied to. If it were to use it, then helping could be done as part of the usage. Interrupts are needed here entirely because the \gls{proc} is tied to an instance it is not using. Therefore a more satisfying solution would be for the \gls{thrd} submitting the operation to simply notice that the instance is unused and simply go ahead and use it. This is the approach presented next.
+\subsubsection{Instance borrowing}
+Both of the approaches presented above have higly undesirable aspects that stem from too loose coupling or too tight coupling between @io_uring@ and \glspl{proc}. In the first approach, loose coupling meant that all operations have synchronization overhead that a tighter coupling can avoid. The second approach on the other hand suffers from tight coupling causing problems when the \gls{proc} do not benefit from the coupling. While \glspl{proc} are continously issuing \io operations tight coupling is valuable since it avoids synchronization costs. However, in unlikely failure cases or when \glspl{proc} are not making use of their instance, tight coupling is no longer advantageous. A compromise between these approaches would be to allow tight coupling but have the option to revoke this coupling dynamically when failure cases arise. I call this approach ``instance borrowing''\footnote{While it looks similar to work-sharing and work-stealing, I think it is different enough from either to warrant a different verb to avoid confusion.}.
+In this approach, each cluster owns a pool of @io_uring@ instances managed by an arbiter. When a \gls{thrd} attempts to issue an \io operation, it ask for an instance from the arbiter and issues requests to that instance. However, in doing so it ties to the instance to the \gls{proc} it is currently running on. This coupling is kept until the arbiter decides to revoke it, taking back the instance and reverting the \gls{proc} to its initial state with respect to \io. This tight coupling means that synchronization can be minimal since only one \gls{proc} can use the instance at any given time, akin to the private instances approach. However, where it differs is that revocation from the arbiter means that this approach does not suffer from the potential deadlock described above.
+% Since private instances appear to work well in the easy case, an intersting option would be keep instances private while it is convenient but
+% Arbitration is needed in the following cases
+% \begin{enumerate}
+%       \item The instance does not have sufficient @sqe@s to satisfy the request.
+%       \item The current \gls{proc} does not currently hold an instance.
+%       \item The current \gls{proc} has the wrong instance, this happens if the submitting \gls{thrd} context-switched between allocation and submission.
+% \end{enumerate}
+% In all cases
+% Verbs of this design
+% Allocation: obtaining an sqe from which to fill in the io request, enforces the io instance to use since it must be the one which provided the sqe. Must interact with the arbiter if the instance does not have enough sqe for the allocation. (Typical allocation will ask for only one sqe, but chained sqe must be allocated from the same context so chains of sqe must be allocated in bulks)
+% Submition: simply adds the sqe(s) to some data structure to communicate that they are ready to go. This operation can't fail because there are as many spots in the submit buffer than there are sqes. Must interact with the arbiter only if the thread was moved between the allocation and the submission.
+% Flushing: Taking all the sqes that were submitted and making them visible to the kernel, also counting them in order to figure out what to_submit should be. Must be thread-safe with submission. Has to interact with the Arbiter if there are external submissions. Can't simply use a protected queue because adding to the array is not safe if the ring is still available for submitters. Flushing must therefore: check if there are external pending requests if so, ask the arbiter to flush otherwise use the fast flush operation.
+% Collect: Once the system call is done, it returns how many sqes were consumed by the system. These must be freed for allocation. Must interact with the arbiter to notify that things are now ready.
+% Handle: process all the produced cqe. No need to interact with any of the submission operations or the arbiter.
+% alloc():
+%       proc.io->in_use = true, __ATOMIC_ACQUIRE
+%       if cltr.io.flag || !proc.io || proc.io->flag:
+%               return alloc_slow(cltr.io, proc.io)
+%       a = alloc_fast(proc.io)
+%       if a:
+%               proc.io->in_use = false, __ATOMIC_RELEASE
+%               return a
+%       return alloc_slow(cltr.io)
+% alloc_fast()
+%       left = proc.io->submit_q.free.tail - proc.io->submit_q.free.head
+%       if num_entries - left < want:
+%               return None
+%       a = ready[head]
+%       head = head + 1, __ATOMIC_RELEASE
+% alloc_slow()
+%       cltr.io.flag = true, __ATOMIC_ACQUIRE
+%       while(proc.io && proc.io->in_use) pause;
+% submit(a):
+%       proc.io->in_use = true, __ATOMIC_ACQUIRE
+%       if cltr.io.flag || proc.io != alloc.io || proc.io->flag:
+%               return submit_slow(cltr.io)
+%       submit_fast(proc.io, a)
+%       proc.io->in_use = false, __ATOMIC_RELEASE
+% polling()
+%       loop:
+%               yield
+%               flush()
+%               io_uring_enter
+%               collect
+%               handle()
 \section{Interface}

Note: See TracChangeset for help on using the changeset viewer.

Context Navigation

Changeset 14533d4

Legend:

doc/theses/thierry_delisle_PhD/thesis/text/io.tex

Download in other formats: