Changeset 92538ab for doc/theses/thierry_delisle_PhD/thesis/text/io.tex
- Timestamp: Apr 10, 2022, 2:53:18 PM (4 years ago)
- Branches: ADT, ast-experimental, enum, master, pthread-emulation, qualifiedEnum
- Children: d8e2a09
- Parents: 4559b34 (diff), 6256891 (diff)
Note: this is a merge changeset; the changes displayed below correspond to the merge itself. Use the (diff) links above to see all the changes relative to each parent.

Files (1 edited):
- doc/theses/thierry_delisle_PhD/thesis/text/io.tex (modified) (7 diffs)
Legend (as rendered in the text diff below):
- Unmodified: two-space indent
- Added: prefixed with +
- Removed: prefixed with -
doc/theses/thierry_delisle_PhD/thesis/text/io.tex
r4559b34 → r92538ab

@@ -1 +1 @@
  \chapter{User Level \io}
  As mentioned in Section~\ref{prev:io}, User-Level \io requires multiplexing the \io operations of many \glspl{thrd} onto fewer \glspl{proc} using asynchronous \io operations.
- Different operating systems offer various forms of asynchronous operations and as mentioned in Chapter~\ref{intro}, this work is exclusively focused on the Linux operating-system.
+ Different operating systems offer various forms of asynchronous operations and, as mentioned in Chapter~\ref{intro}, this work is exclusively focused on the Linux operating-system.

  \section{Kernel Interface}

@@ -173 +173 @@
  The consequence is that the amount of parallelism used to prepare submissions for the next system call is limited.
  Beyond this limit, the length of the system call is the throughput limiting factor.
- I concluded from early experiments that preparing submissions seems to take about as long as the system call itself, which means that with a single @io_uring@ instance, there is no benefit in terms of \io throughput to having more than two \glspl{hthrd}.
+ I concluded from early experiments that preparing submissions seems to take at most as long as the system call itself, which means that with a single @io_uring@ instance, there is no benefit in terms of \io throughput to having more than two \glspl{hthrd}.
  Therefore the design of the submission engine must manage multiple instances of @io_uring@ running in parallel, effectively sharding @io_uring@ instances.
  Similarly to scheduling, this sharding can be done privately, \ie, one instance per \glspl{proc}, in decoupled pools, \ie, a pool of \glspl{proc} use a pool of @io_uring@ instances without one-to-one coupling between any given instance and any given \gls{proc}, or some mix of the two.
  Since completions are sent to the instance where requests were submitted, all instances with pending operations must be polled continously
  \footnote{As will be described in Chapter~\ref{practice}, this does not translate into constant cpu usage.}.
+ Note that once an operation completes, there is nothing that ties it to the @io_uring@ instance that handled it.
+ There is nothing preventing a new operation with, for example, the same file descriptors to a different @io_uring@ instance.

  A complicating aspect of submission is @io_uring@'s support for chains of operations, where the completion of an operation triggers the submission of the next operation on the link.

@@ -198 +200 @@
  The only added complexity is that the number of SQEs is fixed, which means allocation can fail.

- Allocation failures need to be pushed up to the routing algorithm: \glspl{thrd} attempting \io operations must not be directed to @io_uring@ instances without sufficient SQEs available.
+ Allocation failures need to be pushed up to a routing algorithm: \glspl{thrd} attempting \io operations must not be directed to @io_uring@ instances without sufficient SQEs available.
  Furthermore, the routing algorithm should block operations up-front if none of the instances have available SQEs.
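The hunk above concerns surfacing SQE allocation failures to a routing layer. As a rough illustration of that failure path, here is a minimal C sketch using the liburing convenience wrappers rather than the raw kernel interface the thesis works against; the shard array, its size, and the round-robin routing policy are assumptions made for this example, not details taken from the changeset.

#include <liburing.h>
#include <stddef.h>

#define NSHARDS 4
static struct io_uring shards[NSHARDS];   // sharded io_uring instances

// Try each instance in turn; io_uring_get_sqe() returning NULL is the
// allocation failure described above, so route to the next instance.
// A real runtime would block the thread once every instance is full,
// rather than returning NULL to the caller.
static struct io_uring_sqe * route_sqe(size_t start, struct io_uring ** out) {
	for (size_t n = 0; n < NSHARDS; n++) {
		struct io_uring * ring = &shards[(start + n) % NSHARDS];
		struct io_uring_sqe * sqe = io_uring_get_sqe(ring);
		if (sqe) { *out = ring; return sqe; }
	}
	return NULL; // no instance has a free SQE: caller must block or retry
}

int main(void) {
	for (int i = 0; i < NSHARDS; i++)
		io_uring_queue_init(64, &shards[i], 0); // 64 SQEs per instance
	struct io_uring * ring;
	struct io_uring_sqe * sqe = route_sqe(0, &ring);
	if (sqe) {
		io_uring_prep_nop(sqe);               // placeholder operation
		io_uring_submit(ring);
	}
	for (int i = 0; i < NSHARDS; i++) io_uring_queue_exit(&shards[i]);
	return 0;
}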
@@ -212 +214 @@

  In the case of designating a \gls{thrd}, ideally, when multiple \glspl{thrd} attempt to submit operations to the same @io_uring@ instance, all requests would be batched together and one of the \glspl{thrd} would do the system call on behalf of the others, referred to as the \newterm{submitter}.
- In practice however, it is important that the \io requests are not left pending indefinitely and as such, it may be required to have a current submitter and a next submitter.
+ In practice however, it is important that the \io requests are not left pending indefinitely and as such, it may be required to have a ``next submitter'' that guarentees everything that is missed by the current submitter is seen by the next one.
  Indeed, as long as there is a ``next'' submitter, \glspl{thrd} submitting new \io requests can move on, knowing that some future system call will include their request.
  Once the system call is done, the submitter must also free SQEs so that the allocator can reused them.
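To make the ``current submitter''/``next submitter'' idea in this hunk concrete, here is a hedged C sketch of one possible election, assuming C11 sequentially consistent atomics; the shard type, its field names, and the flush function are hypothetical, and the synchronization needed to prepare SQEs concurrently on a single ring is deliberately left out.

#include <liburing.h>
#include <stdatomic.h>
#include <stdbool.h>

struct shard {
	struct io_uring ring;
	_Atomic(unsigned) pending;    // SQEs prepared but not yet submitted
	_Atomic(bool)     submitting; // true while one thread owns the system call
};

// Producers prepare their SQE(s) on s->ring (synchronization elided), do
// atomic_fetch_add(&s->pending, n), then call flush(s).
static void flush(struct shard * s) {
	for (;;) {
		// Elect ourselves as submitter; if another thread already owns the
		// call, its re-check below guarantees our request is eventually seen.
		if (atomic_exchange(&s->submitting, true)) return;
		if (atomic_exchange(&s->pending, 0) > 0)
			io_uring_submit(&s->ring);        // one system call covers the batch
		atomic_store(&s->submitting, false);
		// A request may have arrived between grabbing the batch and releasing
		// ownership; loop so that no SQE is left pending indefinitely.
		if (atomic_load(&s->pending) == 0) return;
	}
}

The loop plays the role of the ``next submitter'': a thread that loses the election can move on, knowing that some future system call will include its request.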
@@ -221 +223 @@
  If the submission side does not designate submitters, polling can also submit all SQEs as it is polling events.
  A simple approach to polling is to allocate a \gls{thrd} per @io_uring@ instance and simply let the poller \glspl{thrd} poll their respective instances when scheduled.
- This design is especially convenient for reasons explained in Chapter~\ref{practice}.

  With this pool of instances approach, the big advantage is that it is fairly flexible.
  It does not impose restrictions on what \glspl{thrd} submitting \io operations can and cannot do between allocations and submissions.
- It also can gracefully handles running out of ressources, SQEs or the kernel returning @EBUSY@.
+ It also can gracefully handle running out of ressources, SQEs or the kernel returning @EBUSY@.
  The down side to this is that many of the steps used for submitting need complex synchronization to work properly.
  The routing and allocation algorithm needs to keep track of which ring instances have available SQEs, block incoming requests if no instance is available, prevent barging if \glspl{thrd} are already queued up waiting for SQEs and handle SQEs being freed.
  The submission side needs to safely append SQEs to the ring buffer, correctly handle chains, make sure no SQE is dropped or left pending forever, notify the allocation side when SQEs can be reused and handle the kernel returning @EBUSY@.
- All this synchronization may have a significant cost and, compare to the next approach presented, this synchronization is entirely overhead.
+ All this synchronization may have a significant cost and, compared to the next approach presented, this synchronization is entirely overhead.

  \subsubsection{Private Instances}
  Another approach is to simply create one ring instance per \gls{proc}.
- This alleviate the need for synchronization on the submissions, requiring only that \glspl{thrd} are not interrupted in between two submission steps.
+ This alleviates the need for synchronization on the submissions, requiring only that \glspl{thrd} are not interrupted in between two submission steps.
  This is effectively the same requirement as using @thread_local@ variables.
  Since SQEs that are allocated must be submitted to the same ring, on the same \gls{proc}, this effectively forces the application to submit SQEs in allocation order

@@ -240 +241 @@
  To remove this requirement, a \gls{thrd} would need the ability to ``yield to a specific \gls{proc}'', \ie, park with the promise that it will be run next on a specific \gls{proc}, the \gls{proc} attached to the correct ring.}
  , greatly simplifying both allocation and submission.
- In this design, allocation and submission form a ring partitionned ring buffer as shown in Figure~\ref{fig:pring}.
+ In this design, allocation and submission form a partitionned ring buffer as shown in Figure~\ref{fig:pring}.
  Once added to the ring buffer, the attached \gls{proc} has a significant amount of flexibility with regards to when to do the system call.
- Possible options are: when the \gls{proc} runs out of \glspl{thrd} to run, after running a given number of threads, etc.
+ Possible options are: when the \gls{proc} runs out of \glspl{thrd} to run, after running a given number of \glspl{thrd}, etc.

  \begin{figure}

@@ -329 +330 @@
  \paragraph{Pending Allocations} can be more complicated to handle.
  If the arbiter has available instances, the arbiter can attempt to directly hand over the instance and satisfy the request.
- Otherwise
+ Otherwise it must hold onto the list of threads until SQEs are made available again.
+ This handling becomes that much more complex if pending allocation require more than one SQE, since the arbiter must make a decision between statisfying requests in FIFO ordering or satisfy requests for fewer SQEs first.
+
+ While this arbiter has the potential to solve many of the problems mentionned in above, it also introduces a significant amount of complexity.
+ Tracking which processors are borrowing which instances and which instances have SQEs available ends-up adding a significant synchronization prelude to any I/O operation.
+ Any submission must start with a handshake that pins the currently borrowed instance, if available.
+ An attempt to allocate is then made, but the arbiter can concurrently be attempting to allocate from the same instance from a different \gls{hthrd}.
+ Once the allocation is completed, the submission must still check that the instance is still burrowed before attempt to flush.
+ These extra synchronization steps end-up having a similar cost to the multiple shared instances approach.
+ Furthermore, if the number of instances does not match the number of processors actively submitting I/O, the system can fall into a state where instances are constantly being revoked and end-up cycling the processors, which leads to significant cache deterioration.
+ Because of these reasons, this approach, which sounds promising on paper, does not improve on the private instance approach in practice.
+
+ \subsubsection{Private Instances V2}
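The handshake described in the added paragraph can be sketched as the following pin/allocate/re-check sequence in C. Every name here is hypothetical, and the arbiter side, instance revocation, and the actual SQE filling are elided; this illustrates the ordering of the synchronization steps the text describes, not the thesis implementation.

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct instance { _Atomic(unsigned) free_sqes; }; // an io_uring plus its SQE pool

struct proc_io { struct instance * _Atomic borrowed; }; // rewritten by the arbiter on revocation

// Reserve want SQEs; the arbiter may be allocating from the same instance
// concurrently, hence the compare-and-swap loop.
static bool try_alloc(struct instance * inst, unsigned want) {
	unsigned avail = atomic_load(&inst->free_sqes);
	while (avail >= want) {
		if (atomic_compare_exchange_weak(&inst->free_sqes, &avail, avail - want))
			return true; // on failure, avail is reloaded automatically
	}
	return false;
}

// Returns false whenever the fast path fails; the caller then goes through
// the arbiter, which can hand over another instance or queue the request.
static bool submit_private(struct proc_io * p, unsigned want) {
	struct instance * inst = atomic_load(&p->borrowed);  // 1. pin the borrowed instance
	if (!inst) return false;
	if (!try_alloc(inst, want)) return false;            // 2. allocation can race with the arbiter
	/* ... fill in the SQEs ... */
	if (atomic_load(&p->borrowed) != inst) return false; // 3. still borrowed? only then flush
	/* flush would happen here in the real protocol */
	return true;
}

Steps 1 and 3 are the synchronization prelude the added paragraph refers to: even in this stripped-down form, every submission pays for extra atomic operations, which is where the cost comparable to the multiple shared instances approach comes from.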