Changeset f2bc9fa

Ignore:
Timestamp:
Apr 4, 2022, 7:47:48 PM (3 months ago)
Branches:
enum, master
Children:
f134c25
Parents:
1a9592a
Message:

Small changes to io and intro section.

Location:
doc/theses/thierry_delisle_PhD/thesis
Files:
3 edited

Legend:

Unmodified
 r1a9592a Scheduling is the process of assigning resources to incomming requests. A very common form of this is assigning available workers to work-requests. The need for scheduling is very common in Computer Science, \eg Operating Systems and Hypervisors schedule available CPUs, NICs schedule available bamdwith, but it is also common in other fields. For example, assmebly lines are an example of scheduling where parts needed assembly are assigned to line workers. The need for scheduling is very common in Computer Science, \eg Operating Systems and Hypervisors schedule available CPUs, NICs schedule available bamdwith, but scheduling is also common in other fields. For example, in assmebly lines assigning parts in need of assembly to line workers is a form of scheduling. In all these cases, the choice of a scheduling algorithm generally depends first and formost on how much information is available to the scheduler. \section{Naming Convention} Scheduling has been studied by various different communities concentrating on different incarnation of the same problems. As a result, their is no real naming convention for scheduling that is respected across these communities. For this document, I will use the term \newterm{task} to refer to the abstract objects being scheduled and the term \newterm{worker} to refer to the objects which will execute these tasks. Scheduling has been studied by various different communities concentrating on different incarnation of the same problems. As a result, their is no real naming convention for scheduling that is respected across these communities. For this document, I will use the term \newterm{\Gls{at}} to refer to the abstract objects being scheduled and the term \newterm{\Gls{proc}} to refer to the objects which will execute these \glspl{at}. \section{Static Scheduling} Static schedulers require that tasks have their dependencies and costs explicitly and exhaustively specified prior schedule. Static schedulers require that \glspl{at} have their dependencies and costs explicitly and exhaustively specified prior schedule. The scheduler then processes this input ahead of time and producess a \newterm{schedule} to which the system can later adhere. This approach is generally popular in real-time systems since the need for strong guarantees justifies the cost of supplying this information. \section{Dynamic Scheduling} It may be difficult to fulfill the requirements of static scheduler if dependencies are conditionnal. In this case, it may be preferable to detect dependencies at runtime. This detection effectively takes the form of halting or suspending a task with unfulfilled dependencies and adding one or more new task(s) to the system. The new task(s) have the responsability of adding the dependent task back in the system once completed. As a consequence, the scheduler may have an incomplete view of the system, seeing only tasks we no pending dependencies. Schedulers that support this detection at runtime are referred to as \newterm{Dynamic Schedulers}. It may be difficult to fulfill the requirements of static scheduler if dependencies are conditionnal. In this case, it may be preferable to detect dependencies at runtime. This detection effectively takes the form of adding one or more new \gls{at}(s) to the system as their dependencies are resolved. As well as potentially halting or suspending a \gls{at} that dynamically detect unfulfilled dependencies. Each \gls{at} has the responsability of adding the dependent \glspl{at} back in the system once completed. As a consequence, the scheduler may have an incomplete view of the system, seeing only \glspl{at} we no pending dependencies. Schedulers that support this detection at runtime are referred to as \newterm{Dynamic Schedulers}. \subsection{Explicitly Informed Dynamic Schedulers} While dynamic schedulers do not have access to an exhaustive list of dependencies for a task, they may require to provide more or less information about each task, including for example: expected duration, required ressources, relative importance, etc. The scheduler can then use this information to direct the scheduling decisions. \cit{Examples of schedulers with more information} Precisely providing this information can be difficult for programmers, especially \emph{predicted} behaviour, and the scheduler may need to support some amount of imprecision in the provided information. For example, specifying that a tasks takes approximately 5 seconds to complete, rather than exactly 5 seconds. User provided information can also become a significant burden depending how the effort to provide the information scales with the number of tasks and there complexity. For example, providing an exhaustive list of files read by 5 tasks is an easier requirement the providing an exhaustive list of memory addresses accessed by 10'000 distinct tasks. While dynamic schedulers do not have access to an exhaustive list of dependencies for a \gls{at}, they may require to provide more or less information about each \gls{at}, including for example: expected duration, required ressources, relative importance, etc. The scheduler can then use this information to direct the scheduling decisions. \cit{Examples of schedulers with more information} Precisely providing this information can be difficult for programmers, especially \emph{predicted} behaviour, and the scheduler may need to support some amount of imprecision in the provided information. For example, specifying that a \glspl{at} takes approximately 5 seconds to complete, rather than exactly 5 seconds. User provided information can also become a significant burden depending how the effort to provide the information scales with the number of \glspl{at} and there complexity. For example, providing an exhaustive list of files read by 5 \glspl{at} is an easier requirement the providing an exhaustive list of memory addresses accessed by 10'000 distinct \glspl{at}. Since the goal of this thesis is to provide a scheduler as a replacement for \CFA's existing \emph{uninformed} scheduler, Explicitly Informed schedulers are less relevant to this project. Nevertheless, some strategies are worth mentionnding. \subsubsection{Prority Scheduling} A commonly used information that schedulers used to direct the algorithm is priorities. Each Task is given a priority and higher-priority tasks are preferred to lower-priority ones. The simplest priority scheduling algorithm is to simply require that every task have a distinct pre-established priority and always run the available task with the highest priority. Asking programmers to provide an exhaustive set of unique priorities can be prohibitive when the system has a large number of tasks. It can therefore be diserable for schedulers to support tasks with identical priorities and/or automatically setting and adjusting priorites for tasks. A commonly used information that schedulers used to direct the algorithm is priorities. Each Task is given a priority and higher-priority \glspl{at} are preferred to lower-priority ones. The simplest priority scheduling algorithm is to simply require that every \gls{at} have a distinct pre-established priority and always run the available \gls{at} with the highest priority. Asking programmers to provide an exhaustive set of unique priorities can be prohibitive when the system has a large number of \glspl{at}. It can therefore be diserable for schedulers to support \glspl{at} with identical priorities and/or automatically setting and adjusting priorites for \glspl{at}. The most common operating some variation on priorities with overlaps and dynamic priority adjustments. For example, Microsoft Windows uses a pair of priorities \cit{https://docs.microsoft.com/en-us/windows/win32/procthread/scheduling-priorities,https://docs.microsoft.com/en-us/windows/win32/taskschd/taskschedulerschema-priority-settingstype-element}, one specified by users out of ten possible options and one adjusted by the system. \subsection{Uninformed and Self-Informed Dynamic Schedulers} Several scheduling algorithms do not require programmers to provide additionnal information on each task, and instead make scheduling decisions based solely on internal state and/or information implicitly gathered by the scheduler. Several scheduling algorithms do not require programmers to provide additionnal information on each \gls{at}, and instead make scheduling decisions based solely on internal state and/or information implicitly gathered by the scheduler. \subsubsection{Feedback Scheduling} As mentionned, Schedulers may also gather information about each tasks to direct their decisions. This design effectively moves the scheduler to some extent into the realm of \newterm{Control Theory}\cite{wiki:controltheory}. This gathering does not generally involve programmers and as such does not increase programmer burden the same way explicitly provided information may. However, some feedback schedulers do offer the option to programmers to offer additionnal information on certain tasks, in order to direct scheduling decision. The important distinction being whether or not the scheduler can function without this additionnal information. As mentionned, Schedulers may also gather information about each \glspl{at} to direct their decisions. This design effectively moves the scheduler to some extent into the realm of \newterm{Control Theory}\cite{wiki:controltheory}. This gathering does not generally involve programmers and as such does not increase programmer burden the same way explicitly provided information may. However, some feedback schedulers do offer the option to programmers to offer additionnal information on certain \glspl{at}, in order to direct scheduling decision. The important distinction being whether or not the scheduler can function without this additionnal information. \section{Work Stealing}\label{existing:workstealing} One of the most popular scheduling algorithm in practice (see~\ref{existing:prod}) is work-stealing. This idea, introduce by \cite{DBLP:conf/fpca/BurtonS81}, effectively has each worker work on its local tasks first, but allows the possibility for other workers to steal local tasks if they run out of tasks. \cite{DBLP:conf/focs/Blumofe94} introduced the more familiar incarnation of this, where each workers has queue of tasks to accomplish and workers without tasks steal tasks from random workers. (The Burton and Sleep algorithm had trees of tasks and stole only among neighbours). Blumofe and Leiserson also prove worst case space and time requirements for well-structured computations. One of the most popular scheduling algorithm in practice (see~\ref{existing:prod}) is work-stealing. This idea, introduce by \cite{DBLP:conf/fpca/BurtonS81}, effectively has each worker work on its local \glspl{at} first, but allows the possibility for other workers to steal local \glspl{at} if they run out of \glspl{at}. \cite{DBLP:conf/focs/Blumofe94} introduced the more familiar incarnation of this, where each workers has queue of \glspl{at} to accomplish and workers without \glspl{at} steal \glspl{at} from random workers. (The Burton and Sleep algorithm had trees of \glspl{at} and stole only among neighbours). Blumofe and Leiserson also prove worst case space and time requirements for well-structured computations. Many variations of this algorithm have been proposed over the years\cite{DBLP:journals/ijpp/YangH18}, both optmizations of existing implementations and approaches that account for new metrics. \paragraph{Granularity} A significant portion of early Work Stealing research was concentrating on \newterm{Implicit Parellelism}\cite{wiki:implicitpar}. Since the system was responsible to split the work, granularity is a challenge that cannot be left to the programmers (as opposed to \newterm{Explicit Parellelism}\cite{wiki:explicitpar} where the burden can be left to programmers). In general, fine granularity is better for load balancing and coarse granularity reduces communication overhead. The best performance generally means finding a middle ground between the two. Several methods can be employed, but I believe these are less relevant for threads, which are generally explicit and more coarse grained. \paragraph{Task Placement} Since modern computers rely heavily on cache hierarchies\cit{Do I need a citation for this}, migrating tasks from one core to another can be .  \cite{DBLP:journals/tpds/SquillanteL93} \paragraph{Task Placement} Since modern computers rely heavily on cache hierarchies\cit{Do I need a citation for this}, migrating \glspl{at} from one core to another can be .  \cite{DBLP:journals/tpds/SquillanteL93} \todo{The survey is not great on this subject} \subsection{Theoretical Results} There is also a large body of research on the theoretical aspects of work stealing. These evaluate, for example, the cost of migration\cite{DBLP:conf/sigmetrics/SquillanteN91,DBLP:journals/pe/EagerLZ86}, how affinity affects performance\cite{DBLP:journals/tpds/SquillanteL93,DBLP:journals/mst/AcarBB02,DBLP:journals/ipl/SuksompongLS16} and theoretical models for heterogenous systems\cite{DBLP:journals/jpdc/MirchandaneyTS90,DBLP:journals/mst/BenderR02,DBLP:conf/sigmetrics/GastG10}. \cite{DBLP:journals/jacm/BlellochGM99} examine the space bounds of Work Stealing and \cite{DBLP:journals/siamcomp/BerenbrinkFG03} show that for underloaded systems, the scheduler will complete computations in finite time, \ie is \newterm{stable}. Others show that Work-Stealing is applicable to various scheduling contexts\cite{DBLP:journals/mst/AroraBP01,DBLP:journals/anor/TchiboukdjianGT13,DBLP:conf/isaac/TchiboukdjianGTRB10,DBLP:conf/ppopp/AgrawalLS10,DBLP:conf/spaa/AgrawalFLSSU14}. \cite{DBLP:conf/ipps/ColeR13} also studied how Randomized Work Stealing affects false sharing among tasks. There is also a large body of research on the theoretical aspects of work stealing. These evaluate, for example, the cost of migration\cite{DBLP:conf/sigmetrics/SquillanteN91,DBLP:journals/pe/EagerLZ86}, how affinity affects performance\cite{DBLP:journals/tpds/SquillanteL93,DBLP:journals/mst/AcarBB02,DBLP:journals/ipl/SuksompongLS16} and theoretical models for heterogenous systems\cite{DBLP:journals/jpdc/MirchandaneyTS90,DBLP:journals/mst/BenderR02,DBLP:conf/sigmetrics/GastG10}. \cite{DBLP:journals/jacm/BlellochGM99} examine the space bounds of Work Stealing and \cite{DBLP:journals/siamcomp/BerenbrinkFG03} show that for underloaded systems, the scheduler will complete computations in finite time, \ie is \newterm{stable}. Others show that Work-Stealing is applicable to various scheduling contexts\cite{DBLP:journals/mst/AroraBP01,DBLP:journals/anor/TchiboukdjianGT13,DBLP:conf/isaac/TchiboukdjianGTRB10,DBLP:conf/ppopp/AgrawalLS10,DBLP:conf/spaa/AgrawalFLSSU14}. \cite{DBLP:conf/ipps/ColeR13} also studied how Randomized Work Stealing affects false sharing among \glspl{at}. However, as \cite{DBLP:journals/ijpp/YangH18} highlights, it is worth mentionning that this theoretical research has mainly focused on fully-strict'' computations, \ie workloads that can be fully represented with a Direct Acyclic Graph. It is unclear how well these distributions represent workloads in real world scenarios. \section{Preemption} One last aspect of scheduling worth mentionning is preemption since many schedulers rely on it for some of their guarantees. Preemption is the idea of interrupting tasks that have been running for too long, effectively injecting suspend points in the applications. There are multiple techniques to achieve this but they all aim to have the effect of guaranteeing that suspend points in a task are never further apart than some fixed duration. While this helps schedulers guarantee that no tasks will unfairly monopolize a worker, preemption can effectively added to any scheduler. Therefore, the only interesting aspect of preemption for the design of scheduling is whether or not to require it. One last aspect of scheduling worth mentionning is preemption since many schedulers rely on it for some of their guarantees. Preemption is the idea of interrupting \glspl{at} that have been running for too long, effectively injecting suspend points in the applications. There are multiple techniques to achieve this but they all aim to have the effect of guaranteeing that suspend points in a \gls{at} are never further apart than some fixed duration. While this helps schedulers guarantee that no \glspl{at} will unfairly monopolize a worker, preemption can effectively added to any scheduler. Therefore, the only interesting aspect of preemption for the design of scheduling is whether or not to require it. \section{Schedulers in Production}\label{existing:prod} \subsection{Operating System Schedulers} Operating System Schedulers tend to be fairly complex schedulers, they generally support some amount of real-time, aim to balance interactive and non-interactive tasks and support for multiple users sharing hardware without requiring these users to cooperate. Here are more details on a few schedulers used in the common operating systems: Linux, FreeBsd, Microsoft Windows and Apple's OS X. The information is less complete for operating systems behind closed source. Operating System Schedulers tend to be fairly complex schedulers, they generally support some amount of real-time, aim to balance interactive and non-interactive \glspl{at} and support for multiple users sharing hardware without requiring these users to cooperate. Here are more details on a few schedulers used in the common operating systems: Linux, FreeBsd, Microsoft Windows and Apple's OS X. The information is less complete for operating systems behind closed source. \paragraph{Linux's CFS} The default scheduler used by Linux (the Completely Fair Scheduler)\cite{MAN:linux/cfs,MAN:linux/cfs2} is a feedback scheduler based on CPU time. For each processor, it constructs a Red-Black tree of tasks waiting to run, ordering them by amount of CPU time spent. The scheduler schedules the task that has spent the least CPU time. It also supports the concept of \newterm{Nice values}, which are effectively multiplicative factors on the CPU time spent. The ordering of tasks is also impacted by a group based notion of fairness, where tasks belonging to groups having spent less CPU time are preferred to tasks beloning to groups having spent more CPU time. Linux achieves load-balancing by regularly monitoring the system state\cite{MAN:linux/cfs/balancing} and using some heuristic on the load (currently CPU time spent in the last millisecond plus decayed version of the previous time slots\cite{MAN:linux/cfs/pelt}.). The default scheduler used by Linux (the Completely Fair Scheduler)\cite{MAN:linux/cfs,MAN:linux/cfs2} is a feedback scheduler based on CPU time. For each processor, it constructs a Red-Black tree of \glspl{at} waiting to run, ordering them by amount of CPU time spent. The scheduler schedules the \gls{at} that has spent the least CPU time. It also supports the concept of \newterm{Nice values}, which are effectively multiplicative factors on the CPU time spent. The ordering of \glspl{at} is also impacted by a group based notion of fairness, where \glspl{at} belonging to groups having spent less CPU time are preferred to \glspl{at} beloning to groups having spent more CPU time. Linux achieves load-balancing by regularly monitoring the system state\cite{MAN:linux/cfs/balancing} and using some heuristic on the load (currently CPU time spent in the last millisecond plus decayed version of the previous time slots\cite{MAN:linux/cfs/pelt}.). \cite{DBLP:conf/eurosys/LoziLFGQF16} shows that Linux's CFS also does work-stealing to balance the workload of each processors, but the paper argues this aspect can be improved significantly. The issues highlighted sem to stem from Linux's need to support fairness across tasks \emph{and} across users\footnote{Enforcing fairness across users means, for example, that given two users: one with a single task and the other with one thousand tasks, the user with a single task does not receive one one thousandth of the CPU time.}, increasing the complexity. \cite{DBLP:conf/eurosys/LoziLFGQF16} shows that Linux's CFS also does work-stealing to balance the workload of each processors, but the paper argues this aspect can be improved significantly. The issues highlighted sem to stem from Linux's need to support fairness across \glspl{at} \emph{and} across users\footnote{Enforcing fairness across users means, for example, that given two users: one with a single \gls{at} and the other with one thousand \glspl{at}, the user with a single \gls{at} does not receive one one thousandth of the CPU time.}, increasing the complexity. Linux also offers a FIFO scheduler, a real-time schedulerwhich runs the highest-priority task, and a round-robin scheduler, which is an extension of the fifo-scheduler that adds fixed time slices. \cite{MAN:linux/sched} Linux also offers a FIFO scheduler, a real-time schedulerwhich runs the highest-priority \gls{at}, and a round-robin scheduler, which is an extension of the fifo-scheduler that adds fixed time slices. \cite{MAN:linux/sched} \paragraph{FreeBSD} \paragraph{Windows(OS)} Microsoft's Operating System's Scheduler\cite{MAN:windows/scheduler} is a feedback scheduler with priorities. It supports 32 levels of priorities, some of which are reserved for real-time and prviliged applications. It schedules tasks based on the highest priorities (lowest number) and how much cpu time each tasks have used. The scheduler may also temporarily adjust priorities after certain effects like the completion of I/O requests. Microsoft's Operating System's Scheduler\cite{MAN:windows/scheduler} is a feedback scheduler with priorities. It supports 32 levels of priorities, some of which are reserved for real-time and prviliged applications. It schedules \glspl{at} based on the highest priorities (lowest number) and how much cpu time each \glspl{at} have used. The scheduler may also temporarily adjust priorities after certain effects like the completion of I/O requests. \todo{load balancing} \subsection{User-Level Schedulers} By comparison, user level schedulers tend to be simpler, gathering fewer metrics and avoid complex notions of fairness. Part of the simplicity is due to the fact that all tasks have the same user, and therefore cooperation is both feasible and probable. By comparison, user level schedulers tend to be simpler, gathering fewer metrics and avoid complex notions of fairness. Part of the simplicity is due to the fact that all \glspl{at} have the same user, and therefore cooperation is both feasible and probable. \paragraph{Go} Go's scheduler uses a Randomized Work Stealing algorithm that has a global runqueue(\emph{GRQ}) and each processor(\emph{P}) has both a fixed-size runqueue(\emph{LRQ}) and a high-priority next chair'' holding a single element.\cite{GITHUB:go,YTUBE:go} Preemption is present, but only at function call boundaries. \paragraph{Intel\textregistered ~Threading Building Blocks} \newterm{Thread Building Blocks}(TBB) is Intel's task parellelism\cite{wiki:taskparallel} framework. It runs tasks or \newterm{jobs}, schedulable objects that must always run to completion, on a pool of worker threads. TBB's scheduler is a variation of Randomized Work Stealing that also supports higher-priority graph-like dependencies\cite{MAN:tbb/scheduler}. It schedules tasks as follows (where \textit{t} is the last task completed): \newterm{Thread Building Blocks}(TBB) is Intel's task parellelism\cite{wiki:taskparallel} framework. It runs \newterm{jobs}, uninterruptable \glspl{at}, schedulable objects that must always run to completion, on a pool of worker threads. TBB's scheduler is a variation of Randomized Work Stealing that also supports higher-priority graph-like dependencies\cite{MAN:tbb/scheduler}. It schedules \glspl{at} as follows (where \textit{t} is the last \gls{at} completed): \begin{displayquote} \begin{enumerate} \paragraph{Grand Central Dispatch} This is an API produce by Apple\cit{Official GCD source} that offers task parellelism\cite{wiki:taskparallel}. Its distinctive aspect is that it uses multiple Dispatch Queues'', some of which are created by programmers. These queues each have their own local ordering guarantees, \eg tasks on queue $A$ are executed in \emph{FIFO} order. This is an API produce by Apple\cit{Official GCD source} that offers task parellelism\cite{wiki:taskparallel}. Its distinctive aspect is that it uses multiple Dispatch Queues'', some of which are created by programmers. These queues each have their own local ordering guarantees, \eg \glspl{at} on queue $A$ are executed in \emph{FIFO} order. \todo{load balancing and scheduling}
 r1a9592a The consequence is that the amount of parallelism used to prepare submissions for the next system call is limited. Beyond this limit, the length of the system call is the throughput limiting factor. I concluded from early experiments that preparing submissions seems to take about as long as the system call itself, which means that with a single @io_uring@ instance, there is no benefit in terms of \io throughput to having more than two \glspl{hthrd}. I concluded from early experiments that preparing submissions seems to take at most as long as the system call itself, which means that with a single @io_uring@ instance, there is no benefit in terms of \io throughput to having more than two \glspl{hthrd}. Therefore the design of the submission engine must manage multiple instances of @io_uring@ running in parallel, effectively sharding @io_uring@ instances. Similarly to scheduling, this sharding can be done privately, \ie, one instance per \glspl{proc}, in decoupled pools, \ie, a pool of \glspl{proc} use a pool of @io_uring@ instances without one-to-one coupling between any given instance and any given \gls{proc}, or some mix of the two. The only added complexity is that the number of SQEs is fixed, which means allocation can fail. Allocation failures need to be pushed up to the routing algorithm: \glspl{thrd} attempting \io operations must not be directed to @io_uring@ instances without sufficient SQEs available. Allocation failures need to be pushed up to a routing algorithm: \glspl{thrd} attempting \io operations must not be directed to @io_uring@ instances without sufficient SQEs available. Furthermore, the routing algorithm should block operations up-front if none of the instances have available SQEs. In the case of designating a \gls{thrd}, ideally, when multiple \glspl{thrd} attempt to submit operations to the same @io_uring@ instance, all requests would be batched together and one of the \glspl{thrd} would do the system call on behalf of the others, referred to as the \newterm{submitter}. In practice however, it is important that the \io requests are not left pending indefinitely and as such, it may be required to have a current submitter and a next submitter. In practice however, it is important that the \io requests are not left pending indefinitely and as such, it may be required to have a next submitter'' that guarentees everything that is missed by the current submitter is seen by the next one. Indeed, as long as there is a next'' submitter, \glspl{thrd} submitting new \io requests can move on, knowing that some future system call will include their request. Once the system call is done, the submitter must also free SQEs so that the allocator can reused them. If the submission side does not designate submitters, polling can also submit all SQEs as it is polling events. A simple approach to polling is to allocate a \gls{thrd} per @io_uring@ instance and simply let the poller \glspl{thrd} poll their respective instances when scheduled. This design is especially convenient for reasons explained in Chapter~\ref{practice}. With this pool of instances approach, the big advantage is that it is fairly flexible. It does not impose restrictions on what \glspl{thrd} submitting \io operations can and cannot do between allocations and submissions. It also can gracefully handles running out of ressources, SQEs or the kernel returning @EBUSY@. It also can gracefully handle running out of ressources, SQEs or the kernel returning @EBUSY@. The down side to this is that many of the steps used for submitting need complex synchronization to work properly. The routing and allocation algorithm needs to keep track of which ring instances have available SQEs, block incoming requests if no instance is available, prevent barging if \glspl{thrd} are already queued up waiting for SQEs and handle SQEs being freed. The submission side needs to safely append SQEs to the ring buffer, correctly handle chains, make sure no SQE is dropped or left pending forever, notify the allocation side when SQEs can be reused and handle the kernel returning @EBUSY@. All this synchronization may have a significant cost and, compare to the next approach presented, this synchronization is entirely overhead. All this synchronization may have a significant cost and, compared to the next approach presented, this synchronization is entirely overhead. \subsubsection{Private Instances} Another approach is to simply create one ring instance per \gls{proc}. This alleviate the need for synchronization on the submissions, requiring only that \glspl{thrd} are not interrupted in between two submission steps. This alleviates the need for synchronization on the submissions, requiring only that \glspl{thrd} are not interrupted in between two submission steps. This is effectively the same requirement as using @thread_local@ variables. Since SQEs that are allocated must be submitted to the same ring, on the same \gls{proc}, this effectively forces the application to submit SQEs in allocation order