# Changeset a44514e

Timestamp:
Sep 7, 2022, 4:12:00 PM
Children:
e4855f6
Parents:
7a0f798b
Message:

A whole bunch of small changes:
trying to set up a version that I can pass through a spell checker;
fixing a whole bunch of grammar errors.

Location:
doc/theses/thierry_delisle_PhD
Files:
15 edited

• ## doc/theses/thierry_delisle_PhD/.gitignore

 r7a0f798b
 thesis/fig/*.fig.bak
 thesis/thesis.pdf
 thesis/thesis.tty
 thesis/thesis.ps
• ## doc/theses/thierry_delisle_PhD/thesis/Makefile

 r7a0f798b
 LaTeX = TEXINPUTS=${TeXLIB} && export TEXINPUTS && latex -halt-on-error -output-directory=${Build}
 BibTeX = BIBINPUTS=${TeXLIB} && export BIBINPUTS && bibtex
 DeTeX = TEXINPUTS=${TeXLIB} && export TEXINPUTS && detex -r
 MAKEFLAGS = --no-print-directory # --silent
 	${LaTeX} $<
 %.tty: build/%.dvi
 	dvi2tty -w132 $< > $@
 ## Define the default recipes.

• ## doc/theses/thierry_delisle_PhD/thesis/local.bib

 r7a0f798b
 }
 @misc{GITHUB:SchedulingBenchmarks,
   title = {Scheduling Benchmarks},
   author = {Thierry Delisle},
   howpublished = {\href{https://github.com/cforall/SchedulingBenchmarks_PhD22}{https://\-github.com/\-cforall/\-SchedulingBenchmarks\_\-PhD22}},
 }
 % --------------------------------------------------
 % Tech documents
 }
 @manual{MAN:eventfd,
   key   = "eventfd",
   title = "eventfd(2) Linux User's Manual",
   year  = "2019",
   month = "March",
 }
 @manual{MAN:aio,
   key = "aio",
   howpublished = "\href{https://en.wikipedia.org/wiki/Zipf%27s_law}{https://\-en.wikipedia.org/\-wiki/\-Zipf\%27s\-\_law}",
   note = "[Online; accessed 5-August-2022]"
 }
 @misc{wiki:htm,
   author = "{Wikipedia contributors}",
   title = "Transactional memory --- {W}ikipedia{,} The Free Encyclopedia",
   year = "2022",
   howpublished = "\href{https://en.wikipedia.org/wiki/Zipf%27s_law}{https://\-en.wikipedia.org/\-wiki/\-Zipf\%27s\-\_law}",
   note = "[Online; accessed 7-September-2022]"
 }
• ## doc/theses/thierry_delisle_PhD/thesis/text/conclusion.tex

 r7a0f798b
 In both of these examples, some care is needed to ensure that reads to an address \emph{sometime} retire.
 Note, this idea is similar to \newterm{Hardware Transactional Memory}~\cite{HTM}, which allows groups of instructions to be aborted and rolled-back if they encounter memory conflicts when being retired.
 Note, this idea is similar to \newterm{Hardware Transactional Memory}~\cite{wiki:htm}, which allows groups of instructions to be aborted and rolled-back if they encounter memory conflicts when being retired.
 However, I believe this feature is generally aimed at large groups of instructions.
 A more fine-grained approach may be more amenable by carefully picking which aspects of an algorithm require exact correctness and which do not.
• ## doc/theses/thierry_delisle_PhD/thesis/text/core.tex

 r7a0f798b
 Before discussing scheduling in general, where it is important to address systems that are changing states, this document discusses scheduling in a somewhat ideal scenario, where the system has reached a steady state.
 For this purpose, a steady state is loosely defined as a state where there are always \glspl{thrd} ready to run and the system has the resources necessary to accomplish the work, \eg, enough workers.
 For this purpose, a steady state is loosely defined as a state where there are always \ats ready to run and the system has the resources necessary to accomplish the work, \eg, enough workers.
 In short, the system is neither overloaded nor underloaded.
 It is important to discuss the steady state first because it is the easiest case to handle and, relatedly, the case in which the best performance is to be expected.
 As such, when the system is either overloaded or underloaded, a common approach is to try to adapt the system to this new load and return to the steady state, \eg, by adding or removing workers.
 As such, when the system is either overloaded or underloaded, a common approach is to try to adapt the system to this new \gls{load} and return to the steady state, \eg, by adding or removing workers.
 Therefore, flaws in scheduling the steady state tend to be pervasive in all states.
 \end{displayquote}
 Applied to threads, this model states that every ready \gls{thrd} immediately runs in parallel with all other ready \glspl{thrd}.
 While a strict implementation of this model is not feasible, programmers still have expectations about scheduling that come from this model.
 In general, the expectation at the center of this model is that ready \glspl{thrd} do not interfere with each other but simply share the hardware.
 This assumption makes it easier to reason about threading because ready \glspl{thrd} can be thought of in isolation and the effect of the scheduler can be virtually ignored.
 This expectation of \gls{thrd} independence means the scheduler is expected to offer two guarantees:
 Applied to \ats, this model states that every ready \at immediately runs in parallel with all other ready \ats.
 While a strict implementation of this model is not feasible, programmers still have expectations about scheduling that come from this model.
 In general, the expectation at the center of this model is that ready \ats do not interfere with each other but simply share the hardware.
 This assumption makes it easier to reason about threading because ready \ats can be thought of in isolation and the effect of the scheduler can be virtually ignored.
 This expectation of \at independence means the scheduler is expected to offer two guarantees:
 \begin{enumerate}
 \item A fairness guarantee: a \gls{thrd} that is ready to run is not prevented by another thread.
 \item A performance guarantee: a \gls{thrd} that wants to start or stop running is not prevented by other threads wanting to do the same.
 \item A fairness guarantee: a \at that is ready to run is not prevented by another thread.
 \item A performance guarantee: a \at that wants to start or stop running is not prevented by other threads wanting to do the same.
 \end{enumerate}
 It is important to note that these guarantees are expected only up to a point.
 \Glspl{thrd} that are ready to run should not be prevented from doing so, but they still share the limited hardware resources.
 Therefore, the guarantee is considered respected if a \gls{thrd} gets access to a \emph{fair share} of the hardware resources, even if that share is very small.
 \Glspl{at} that are ready to run should not be prevented from doing so, but they still share the limited hardware resources.
 Therefore, the guarantee is considered respected if a \at gets access to a \emph{fair share} of the hardware resources, even if that share is very small.
 Similar to the performance guarantee, the lack of interference among threads is only relevant up to a point.
 For interactive applications that need to run at 60, 90, 120 frames per second, \ats having to wait for several milliseconds to run are effectively starved.
 Therefore load-balancing should be done at a faster pace, one that can detect starvation at the microsecond scale.
 With that said, this is a much fuzzier requirement since it depends on the number of \procs, the number of \ats and the general load of the system.
 With that said, this is a much fuzzier requirement since it depends on the number of \procs, the number of \ats and the general \gls{load} of the system.
 \subsection{Fairness vs Scheduler Locality}
 \label{fairnessvlocal}
 For a scheduler, having good locality, \ie, having the data local to each \gls{hthrd}, generally conflicts with fairness.
 Indeed, good locality often requires avoiding the movement of cache lines, while fairness requires dynamically moving a \gls{thrd}, and as a consequence cache lines, to a \gls{hthrd} that is currently available.
 Indeed, good locality often requires avoiding the movement of cache lines, while fairness requires dynamically moving a \at, and as a consequence cache lines, to a \gls{hthrd} that is currently available.
 Note that this section discusses \emph{internal locality}, \ie, the locality of the data used by the scheduler versus \emph{external locality}, \ie, how the data used by the application is affected by scheduling.
 External locality is a much more complicated subject and is discussed in the next section.
 \input{fairness.pstex_t}
 \vspace*{-10pt}
 \caption[Fairness vs Locality graph]{Rule of thumb Fairness vs Locality graph \smallskip\newline The importance of Fairness and Locality while a ready \gls{thrd} awaits running is shown as the time the ready \gls{thrd} waits increases, Ready Time, the chances that its data is still in cache decreases, Locality. At the same time, the need for fairness increases since other \glspl{thrd} may have the chance to run many times, breaking the fairness model.
 \caption[Fairness vs Locality graph]{Rule of thumb Fairness vs Locality graph \smallskip\newline The importance of Fairness and Locality while a ready \at awaits running is shown as the time the ready \at waits increases, Ready Time, the chances that its data is still in cache decreases, Locality. At the same time, the need for fairness increases since other \ats may have the chance to run many times, breaking the fairness model.
 Since the actual values and curves of this graph can be highly variable, the graph is an idealized representation of the two opposing goals.}
 \label{fig:fair}
 \subsubsection{Migration Cost}
 Another important source of scheduling latency is migration.
 Another important source of scheduling latency is \glslink{atmig}{migration}.
 A \at migrates if it executes on two different \procs consecutively, which is the process discussed in \ref{fairnessvlocal}.
 Migrations can have many different causes, but in certain programs, it can be impossible to limit migration.
 To compare subqueues, the timestamp at the head must be compared to the current time, yielding the best-case wait-time for the \at at the head of the queue.
 This new wait-time is averaged with the stored average.
 To further limit migration, a bias can be added to a local subqueue, where a remote subqueue is helped only if its moving average is more than $X$ times the local subqueue's average.
 To further limit \glslink{atmig}{migrations}, a bias can be added to a local subqueue, where a remote subqueue is helped only if its moving average is more than $X$ times the local subqueue's average.
 Tests for this approach indicate the choice of the weight for the moving average or the bias is not important, \ie weights and biases of similar \emph{magnitudes} have similar effects.
 The good news is that this problem can be mitigated
 \subsection{Redundant Timestamps}\ref{relaxedtimes}
 \subsection{Redundant Timestamps}\label{relaxedtimes}
 The problem with polling remote subqueues is that correctness is critical.
 There must be a consensus among \procs on which subqueues hold which \ats, as the \ats are in constant motion.
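The moving-average-plus-bias rule edited in this hunk is concrete enough to sketch. Below is a minimal C illustration with hypothetical names and an illustrative weight; the hunk itself notes that the exact magnitudes of the weight and bias are unimportant, and this is a sketch of the rule, not the thesis's actual code:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

// Hypothetical subqueue carrying a stored moving average of head wait-times.
typedef struct {
	_Atomic uint64_t head_ts;   // timestamp of the \at at the head
	_Atomic uint64_t avg_wait;  // moving average of observed wait-times
} subqueue;

// Average the newest observed wait into the stored average.
// The weight (8) is illustrative; similar magnitudes behave similarly.
static void update_avg(subqueue *q, uint64_t now) {
	uint64_t wait = now - atomic_load(&q->head_ts);
	uint64_t avg  = atomic_load(&q->avg_wait);
	atomic_store(&q->avg_wait, (avg * 7 + wait) / 8);
}

// Bias against migration: help a remote subqueue only if its average
// wait exceeds X times the local subqueue's average.
static bool should_help(subqueue *local, subqueue *remote, uint64_t X) {
	return atomic_load(&remote->avg_wait) > X * atomic_load(&local->avg_wait);
}
```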

• ## doc/theses/thierry_delisle_PhD/thesis/text/intro.tex

 r7a0f798b
 \Gls{uthrding} (M:N) is gaining popularity over kernel-level threading (1:1) in many programming languages.
 The user threading approach is often a better mechanism to express complex concurrent applications by efficiently running 10,000+ threads on multi-core systems.
 The user threading approach is often a better mechanism to express complex concurrent applications by efficiently running 10,000+ threads on multicore systems.
 Indeed, over-partitioning into small work-units with user threading significantly eases load bal\-ancing, while simultaneously providing advanced synchronization and mutual exclusion capabilities.
 To manage these high levels of concurrency, the underlying runtime must efficiently schedule many user threads across a few kernel threads;
 which begs of the question of how many kernel threads are needed and should the number be dynamically reevaluated.
 which begs the question of how many kernel threads are needed and should the number be dynamically reevaluated.
 Furthermore, scheduling must prevent kernel threads from blocking, otherwise user-thread parallelism drops.
 When user-threading parallelism does drop, how and when should idle kernel-threads be put to sleep to avoid wasting CPU resources?
 This thesis analyses multiple scheduler systems, where each system attempts to fulfill the necessary requirements for \gls{uthrding}.
 The predominant technique for managing high levels of concurrency is sharding the ready-queue with one queue per kernel-thread and using some form of work stealing/sharing to dynamically rebalance workload shifts.
 Preventing kernel blocking is accomplish by transforming kernel locks and I/O operations into user-level operations that do not block the kernel thread or spin up new kernel threads to manage the blocking.
 Preventing kernel blocking is accomplished by transforming kernel locks and I/O operations into user-level operations that do not block the kernel thread or spin up new kernel threads to manage the blocking.
 Fairness is handled through preemption and/or ad-hoc solutions, which leads to coarse-grained fairness with some pathological cases.
 After examining, testing and selecting specific approaches to these scheduling issues, a completely new scheduler was created and tested in the \CFA (C-for-all) user-threading runtime-system.
 The goal of the new scheduler is to offer increased safety and productivity without sacrificing performance.
 The quality of the new scheduler is demonstrated by comparing it with other user-threading work-stealing schedulers with the aim of showing equivalent or better performance while offering better fairness.
 The quality of the new scheduler is demonstrated by comparing it with other user-threading work-stealing schedulers, with the aim of showing equivalent or better performance while offering better fairness.
 Chapter~\ref{intro} defines scheduling and its general goals.
 Chapter~\ref{existing} discusses how scheduler implementations attempt to achieve these goals, but all implementations optimize some workloads better than others.
 Chapter~\ref{cfaruntime} presents the relevant aspects of the \CFA runtime system that have a significant affect on the new scheduler design and implementation.
 Chapter~\ref{cfaruntime} presents the relevant aspects of the \CFA runtime system that have a significant effect on the new scheduler design and implementation.
 Chapter~\ref{core} analyses different scheduler approaches, while looking for scheduler mechanisms that provide both performance and fairness.
 Chapter~\ref{userio} covers the complex mechanisms that must be used to achieve nonblocking I/O to prevent the blocking of \glspl{kthrd}.
 \section{Scheduling}\label{sched}
 Computer systems share multiple resources across many threads of execution, even on single-user computers like laptops or smartphones.
 On a computer system with multiple processors and work units (routines, coroutines, threads, programs, \etc), there exists the problem of mapping many different kinds of work units onto many different kinds of processors in an efficient manner, called \newterm{scheduling}.
 On a computer system with multiple processors and work units (routines, coroutines, threads, programs, \etc), there exists the problem of mapping many different kinds of work units onto many different kinds of processors efficiently, called \newterm{scheduling}.
 Scheduling systems are normally \newterm{open}, meaning new work arrives from an external source or is randomly spawned from an existing work unit.
 In general, work units without threads, like routines and coroutines, are self-scheduling, while work units with threads, like tasks and programs, are scheduled.
 However, optimal solutions are often not required: schedulers often produce excellent solutions, without needing optimality, by taking advantage of regularities in work patterns.
 Scheduling occurs at discreet points when there are transitions in a system.
 For example, a thread cycles through the following transitions during its execution.
 Scheduling occurs at discrete points when there are transitions in a system.
 For example, a \at cycles through the following transitions during its execution.
 \begin{center}
 \input{executionStates.pstex_t}
 entering the system (new $\rightarrow$ ready)
 \item
 scheduler assigns a thread to a computing resource, \eg CPU (ready $\rightarrow$ running)
 scheduler assigns a \at to a computing resource, \eg CPU (ready $\rightarrow$ running)
 \item
 timer alarm for preemption (running $\rightarrow$ ready)
 normal completion or error, \eg segment fault (running $\rightarrow$ halted)
 \end{itemize}
 Key to scheduling is that a thread cannot bypass the ``ready'' state during a transition so the scheduler maintains complete control of the system, \ie no self-scheduling among threads.
 Key to scheduling is that a \at cannot bypass the ``ready'' state during a transition so the scheduler maintains complete control of the system, \ie no self-scheduling among threads.
 When the workload exceeds the capacity of the processors, \ie work cannot be executed immediately, it is placed on a queue for subsequent service, called a \newterm{ready queue}.
 \end{tabular}
 \end{center}
 Beyond these two schedulers are a host of options, \eg adding an global shared queue to MQMS or adding multiple private queues with distinc characteristics.
 Beyond these two schedulers are a host of options, \eg adding a global shared queue to MQMS or adding multiple private queues with distinct characteristics.
 Once there are multiple resources and ready queues, a scheduler is faced with three major optimization criteria:
 Essentially, all multi-processor computers have non-uniform memory access (NUMA), with one or more quantized steps to access data at different levels in the memory hierarchy.
 When a system has a large number of independently executing threads, affinity becomes difficult because of \newterm{thread churn}.
 That is, threads must be scheduled on different processors to obtain high processors utilization because the number of threads $\ggg$ processors.
 That is, threads must be scheduled on different processors to obtain high processor utilization because the number of threads $\ggg$ processors.
 \item
 More specifically, safety and productivity for scheduling means supporting a wide range of workloads so that programmers can rely on progress guarantees (safety) and more easily achieve acceptable performance (productivity).
 The new scheduler also includes support for implicit nonblocking \io, allowing applications to have more user-threads blocking on \io operations than there are \glspl{kthrd}.
 To complete the scheduler, an idle sleep mechanism is implemented that significantly reduces wasted CPU cycles, which are then available outside of the application.
 To complete the scheduler, an idle sleep mechanism is implemented that significantly reduces wasted CPU cycles, which are then available outside the application.
 As a research project, this work builds exclusively on newer versions of the Linux operating-system and gcc/clang compilers.
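The execution-state transitions edited in this hunk can be made concrete. Here is a small C sketch, with hypothetical names, encoding only the transitions the hunk lists (new → ready, ready → running, running → ready, running → halted); the key invariant is that running is only reachable through ready:

```c
#include <stdbool.h>

// Hypothetical encoding of the execution states named above.
typedef enum { NEW, READY, RUNNING, HALTED } thread_state;

// True iff `from -> to` is one of the transitions listed in the hunk.
static bool valid_transition(thread_state from, thread_state to) {
	switch (from) {
	case NEW:     return to == READY;                  // entering the system
	case READY:   return to == RUNNING;                // scheduler assigns a CPU
	case RUNNING: return to == READY || to == HALTED;  // preemption, or completion/error
	default:      return false;                        // halted is terminal
	}
}
```

Because every path into RUNNING goes through READY, the scheduler keeps complete control: there is no self-scheduling among threads.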
• ## doc/theses/thierry_delisle_PhD/thesis/text/io.tex

 r7a0f798b
 \chapter{User Level \io}\label{userio}
 As mentioned in Section~\ref{prev:io}, user-level \io requires multiplexing the \io operations of many \glspl{thrd} onto fewer \glspl{proc} using asynchronous \io operations.
 As mentioned in Section~\ref{prev:io}, user-level \io requires multiplexing the \io operations of many \ats onto fewer \glspl{proc} using asynchronous \io operations.
 Different operating systems offer various forms of asynchronous operations and, as mentioned in Chapter~\ref{intro}, this work is exclusively focused on the Linux operating-system.
 It does not mean an operation returning \lstinline{EAGAIN} succeeds on the next try.
 For example, a ready read may only return a subset of requested bytes and the read must be issued again for the remaining bytes, at which point it may return \lstinline{EAGAIN}.}
 This mechanism is also crucial in determining when all \glspl{thrd} are blocked and the application \glspl{kthrd} can now block.
 This mechanism is also crucial in determining when all \ats are blocked and the application \glspl{kthrd} can now block.
 There are three options to monitor file descriptors in Linux:\footnote{
 Often the I/O manager has a timeout, polls, or is sent a signal on changes to mitigate this problem.
 \begin{comment}
 From: Tim Brecht
 Subject: Re: FD sets
 Date: Wed, 6 Jul 2022 00:29:41 +0000

 Large number of open files
 --------------------------
 In order to be able to use more than the default number of open file
 descriptors you may need to:
 o increase the limit on the total number of open files /proc/sys/fs/file-max
   (on Linux systems)
 o increase the size of FD_SETSIZE
   - the way I often do this is to figure out which include file __FD_SETSIZE
     is defined in, copy that file into an appropriate directory in ./include,
     and then modify it so that if you use -DBIGGER_FD_SETSIZE the larger size
     gets used
   For example on a RH 9.0 distribution I've copied
   /usr/include/bits/typesizes.h into ./include/i386-linux/bits/typesizes.h
   Then I modify typesizes.h to look something like:
   #ifdef BIGGER_FD_SETSIZE
   #define __FD_SETSIZE 32767
   #else
   #define __FD_SETSIZE 1024
   #endif
   Note that the since I'm moving and testing the userver on may different
   machines the Makefiles are set up to use -I ./include/$(HOSTTYPE)
   This way if you redefine the FD_SETSIZE it will get used instead of the
   default original file.
 \end{comment}
 % \begin{comment}
 % From: Tim Brecht
 % Subject: Re: FD sets
 % Date: Wed, 6 Jul 2022 00:29:41 +0000
 % Large number of open files
 % --------------------------
 % In order to be able to use more than the default number of open file
 % descriptors you may need to:
 % o increase the limit on the total number of open files /proc/sys/fs/file-max
 %   (on Linux systems)
 % o increase the size of FD_SETSIZE
 %   - the way I often do this is to figure out which include file __FD_SETSIZE
 %     is defined in, copy that file into an appropriate directory in ./include,
 %     and then modify it so that if you use -DBIGGER_FD_SETSIZE the larger size
 %     gets used
 %   For example on a RH 9.0 distribution I've copied
 %   /usr/include/bits/typesizes.h into ./include/i386-linux/bits/typesizes.h
 %   Then I modify typesizes.h to look something like:
 %   #ifdef BIGGER_FD_SETSIZE
 %   #define __FD_SETSIZE 32767
 %   #else
 %   #define __FD_SETSIZE 1024
 %   #endif
 %   Note that the since I'm moving and testing the userver on may different
 %   machines the Makefiles are set up to use -I ./include/$(HOSTTYPE)
 %   This way if you redefine the FD_SETSIZE it will get used instead of the
 %   default original file.
 % \end{comment}
 \paragraph{\lstinline{poll}} is the next oldest option, and takes as input an array of structures containing the FD numbers rather than their position in an array of bits, allowing a more compact input for interest sets that contain widely spaced FDs.
 \subsection{Extra Kernel Threads}\label{io:morethreads}
 Finally, if the operating system does not offer a satisfactory form of asynchronous \io operations, an ad-hoc solution is to create a pool of \glspl{kthrd} and delegate operations to it to avoid blocking \glspl{proc}, which is a compromise for multiplexing.
 In the worst case, where all \glspl{thrd} are consistently blocking on \io, it devolves into 1-to-1 threading.
 However, regardless of the frequency of \io operations, it achieves the fundamental goal of not blocking \glspl{proc} when \glspl{thrd} are ready to run.
 In the worst case, where all \ats are consistently blocking on \io, it devolves into 1-to-1 threading.
 However, regardless of the frequency of \io operations, it achieves the fundamental goal of not blocking \glspl{proc} when \ats are ready to run.
 This approach is used by languages like Go~\cite{GITHUB:go}, frameworks like libuv~\cite{libuv}, and web servers like Apache~\cite{apache} and NGINX~\cite{nginx}, since it has the advantage that it can easily be used across multiple operating systems.
 This advantage is especially relevant for languages like Go, which offer a homogeneous \glsxtrshort{api} across all platforms.
 \section{Event-Engine}
 An event engine's responsibility is to use the kernel interface to multiplex many \io operations onto few \glspl{kthrd}.
 In concrete terms, this means \glspl{thrd} enter the engine through an interface, the event engine then starts an operation and parks the calling \glspl{thrd}, returning control to the \gls{proc}.
 The parked \glspl{thrd} are then rescheduled by the event engine once the desired operation has completed.
 In concrete terms, this means \ats enter the engine through an interface, the event engine then starts an operation and parks the calling \ats, returning control to the \gls{proc}.
 The parked \ats are then rescheduled by the event engine once the desired operation has completed.
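The footnote corrected earlier in this hunk (a "ready" FD may return only part of the request, and a follow-up read may yield \lstinline{EAGAIN}) implies a specific retry pattern. A minimal C sketch using only standard POSIX calls, with a hypothetical helper name:

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

// Hypothetical helper: read up to `len` bytes from a nonblocking fd.
// A "ready" fd may deliver only part of the request, and reissuing the
// read may return EAGAIN, so readiness must be re-awaited, not assumed.
static ssize_t read_fully(int fd, char *buf, size_t len) {
	size_t done = 0;
	while (done < len) {
		ssize_t n = read(fd, buf + done, len - done);
		if (n > 0) { done += (size_t)n; continue; }
		if (n == 0) break;                            // end of file
		if (errno == EAGAIN || errno == EWOULDBLOCK)
			return (ssize_t)done;  // not ready: wait (poll/epoll) and retry
		if (errno != EINTR) return -1;                // real error
	}
	return (ssize_t)done;
}
```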
 \subsection{\lstinline{io_uring} in depth}\label{iouring}
 \subsubsection{Private Instances}
 The private approach creates one ring instance per \gls{proc}, \ie one-to-one coupling.
 This alleviates the need for synchronization on the submissions, requiring only that \glspl{thrd} are not time-sliced during submission steps.
 This requirement is the same as accessing @thread_local@ variables, where a \gls{thrd} is accessing kernel-thread data, is time-sliced, and continues execution on another kernel thread but is now accessing the wrong data.
 This alleviates the need for synchronization on the submissions, requiring only that \ats are not time-sliced during submission steps.
 This requirement is the same as accessing @thread_local@ variables, where a \at is accessing kernel-thread data, is time-sliced, and continues execution on another kernel thread but is now accessing the wrong data.
 This failure is the serially reusable problem~\cite{SeriallyReusable}.
 Hence, allocated SQEs must be submitted to the same ring on the same \gls{proc}, which effectively forces the application to submit SQEs in allocation order.\footnote{
 To remove this requirement, a \gls{thrd} needs the ability to ``yield to a specific \gls{proc}'', \ie, park with the guarantee it unparks on a specific \gls{proc}, \ie the \gls{proc} attached to the correct ring.}
 To remove this requirement, a \at needs the ability to ``yield to a specific \gls{proc}'', \ie, \park with the guarantee it unparks on a specific \gls{proc}, \ie the \gls{proc} attached to the correct ring.}
 From the subsystem's point of view, the allocation and submission are sequential, greatly simplifying both.
 In this design, allocation and submission form a partitioned ring buffer as shown in Figure~\ref{fig:pring}.
 Once added to the ring buffer, the attached \gls{proc} has a significant amount of flexibility with regards to when to perform the system call.
 Possible options are: when the \gls{proc} runs out of \glspl{thrd} to run, after running a given number of \glspl{thrd}, \etc.
 Possible options are: when the \gls{proc} runs out of \ats to run, after running a given number of \ats, \etc.
 \begin{figure}
 This approach has the advantage that it does not require much of the synchronization needed in a shared approach.
 However, this benefit means \glspl{thrd} submitting \io operations have less flexibility: they cannot park or yield, and several exceptional cases are handled poorly.
 Instances running out of SQEs cannot run \glspl{thrd} wanting to do \io operations.
 In this case, the \io \gls{thrd} needs to be moved to a different \gls{proc}, and the only current way of achieving this is to @yield()@ hoping to be scheduled on a different \gls{proc} with free SQEs, which is not guaranteed.
 However, this benefit means \ats submitting \io operations have less flexibility: they cannot \park or yield, and several exceptional cases are handled poorly.
 Instances running out of SQEs cannot run \ats wanting to do \io operations.
 In this case, the \io \at needs to be moved to a different \gls{proc}, and the only current way of achieving this is to @yield()@ hoping to be scheduled on a different \gls{proc} with free SQEs, which is not guaranteed.
 A more involved version of this approach tries to solve these problems using a pattern called \newterm{helping}.
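For readers unfamiliar with the allocate/fill/submit sequence this hunk keeps referring to, here is a minimal C sketch using the public liburing API; the thesis works with the raw rings, so treat this as a simplified illustration. With one ring per \gls{proc}, these steps need no locking, provided the submitting thread is not time-sliced between them:

```c
#include <liburing.h>

// Sketch of the three submission steps on a per-processor ring.
int submit_read(struct io_uring *ring, int fd, void *buf, unsigned len) {
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);  // 1. allocate an SQE
	if (!sqe) return -1;            // ring full: allocation can fail
	io_uring_prep_read(sqe, fd, buf, len, 0);           // 2. fill in the request
	io_uring_sqe_set_data(sqe, buf);                    //    tag it for completion
	return io_uring_submit(ring);                       // 3. system call (batchable)
}
```

The flexibility the text mentions lives in step 3: the \gls{proc} may defer the system call, batching several SQEs, e.g. until it runs out of \ats to run.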
 \Glspl{thrd} that cannot submit \io operations, either because of an allocation failure or migration to a different \gls{proc} between allocation and submission, create an \io object and add it to a list of pending submissions per \gls{proc} and a list of pending allocations, probably per cluster.
 While there is still the strong coupling between \glspl{proc} and @io_uring@ instances, these data structures allow moving \glspl{thrd} to a specific \gls{proc}, when the current \gls{proc} cannot fulfill the \io request.
 Imagine a simple scenario with two \glspl{thrd} on two \glspl{proc}, where one \gls{thrd} submits an \io operation and then sets a flag, while the other \gls{thrd} spins until the flag is set.
 Assume both \glspl{thrd} are running on the same \gls{proc}, and the \io \gls{thrd} is preempted between allocation and submission, moved to the second \gls{proc}, and the original \gls{proc} starts running the spinning \gls{thrd}.
 In this case, the helping solution has the \io \gls{thrd} append an \io object to the submission list of the first \gls{proc}, where the allocation was made.
 No other \gls{proc} can help the \gls{thrd} since @io_uring@ instances are strongly coupled to \glspl{proc}.
 However, the \io \gls{proc} is unable to help because it is executing the spinning \gls{thrd} resulting in a deadlock.
 While this example is artificial, in the presence of many \glspl{thrd}, it is possible for this problem to arise ``in the wild''.
 \ats that cannot submit \io operations, either because of an allocation failure or \glslink{atmig}{migration} to a different \gls{proc} between allocation and submission, create an \io object and add it to a list of pending submissions per \gls{proc} and a list of pending allocations, probably per cluster.
 While there is still the strong coupling between \glspl{proc} and @io_uring@ instances, these data structures allow moving \ats to a specific \gls{proc}, when the current \gls{proc} cannot fulfill the \io request.
 Imagine a simple scenario with two \ats on two \glspl{proc}, where one \at submits an \io operation and then sets a flag, while the other \at spins until the flag is set.
 Assume both \ats are running on the same \gls{proc}, and the \io \at is preempted between allocation and submission, moved to the second \gls{proc}, and the original \gls{proc} starts running the spinning \at.
 In this case, the helping solution has the \io \at append an \io object to the submission list of the first \gls{proc}, where the allocation was made.
 No other \gls{proc} can help the \at since @io_uring@ instances are strongly coupled to \glspl{proc}.
 However, the \io \gls{proc} is unable to help because it is executing the spinning \at resulting in a deadlock.
 While this example is artificial, in the presence of many \ats, it is possible for this problem to arise ``in the wild''.
 Furthermore, this pattern is difficult to reliably detect and avoid.
 Once in this situation, the only escape is to interrupt the spinning \gls{thrd}, either directly or via some regular preemption, \eg time slicing.
 Having to interrupt \glspl{thrd} for this purpose is costly, the latency can be large between interrupts, and the situation may be hard to detect.
 Once in this situation, the only escape is to interrupt the spinning \at, either directly or via some regular preemption, \eg time slicing.
 Having to interrupt \ats for this purpose is costly, the latency can be large between interrupts, and the situation may be hard to detect.
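The two-thread deadlock described in this hunk can be compressed into a few lines. A purely illustrative C sketch, with hypothetical names (\lstinline{submit_io} stands in for appending an \io object to proc 1's pending-submission list); it is meant to show the dependency cycle, not to be the thesis's code:

```c
#include <stdatomic.h>

void submit_io(void);       // hypothetical: enqueue on proc 1's pending list

_Atomic int flag = 0;

// Thread A: allocated its SQE on proc 1, was preempted and migrated to
// proc 2; its submission now waits for proc 1 to drain its pending list.
void thread_A(void) {
	submit_io();              // blocks until proc 1 helps
	atomic_store(&flag, 1);   // never reached
}

// Thread B: occupies proc 1, so proc 1 never processes its pending list.
void thread_B(void) {
	while (!atomic_load(&flag))
		;                       // spins on A's flag: a dependency cycle
}
```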
 Interrupts are needed here entirely because the \gls{proc} is tied to an instance it is not using.
 Therefore, a more satisfying solution is for the \gls{thrd} submitting the operation to notice that the instance is unused and simply go ahead and use it.
 Therefore, a more satisfying solution is for the \at submitting the operation to notice that the instance is unused and simply go ahead and use it.
 This approach is presented shortly.
 \subsubsection{Public Instances}
 The public approach creates decoupled pools of @io_uring@ instances and processors, \ie without one-to-one coupling.
 \Glspl{thrd} attempting an \io operation pick one of the available instances and submit the operation to that instance.
 Since there is no coupling between @io_uring@ instances and \glspl{proc} in this approach, \glspl{thrd} running on more than one \gls{proc} can attempt to submit to the same instance concurrently.
 \ats attempting an \io operation pick one of the available instances and submit the operation to that instance.
 Since there is no coupling between @io_uring@ instances and \glspl{proc} in this approach, \ats running on more than one \gls{proc} can attempt to submit to the same instance concurrently.
 Because @io_uring@ effectively sets the amount of sharding needed to avoid contention on its internal locks, performance in this approach is based on two aspects:
 \begin{itemize}
 The only added complexity is that the number of SQEs is fixed, which means allocation can fail.
 Allocation failures need to be pushed to a routing algorithm: \glspl{thrd} attempting \io operations must not be directed to @io_uring@ instances without sufficient SQEs available.
 Allocation failures need to be pushed to a routing algorithm: \ats attempting \io operations must not be directed to @io_uring@ instances without sufficient SQEs available.
 Furthermore, the routing algorithm should block operations up-front, if none of the instances have available SQEs.
 Once an SQE is allocated, \glspl{thrd} insert the \io request information, and keep track of the SQE index and the instance it belongs to.
 Once an SQE is allocated, \ats insert the \io request information, and keep track of the SQE index and the instance it belongs to.
 Once an SQE is filled in, it is added to the submission ring buffer, an operation that is not thread-safe, and then the kernel must be notified using the @io_uring_enter@ system call.
 Since multiple SQEs can be submitted to the kernel at once, it is important to strike a balance between batching and latency.
 Operations that are ready to be submitted should be batched together in few system calls, but at the same time, operations should not be left pending for long periods of time before being submitted.
 Balancing submission can be handled by either designating one of the submitting \glspl{thrd} as being responsible for the system call for the current batch of SQEs or by having some other party regularly submitting all ready SQEs, \eg, the poller \gls{thrd} mentioned later in this section.
 Ideally, when multiple \glspl{thrd} attempt to submit operations to the same @io_uring@ instance, all requests should be batched together and one of the \glspl{thrd} is designated to do the system call on behalf of the others, called the \newterm{submitter}.
 Balancing submission can be handled by either designating one of the submitting \ats as being responsible for the system call for the current batch of SQEs or by having some other party regularly submitting all ready SQEs, \eg, the poller \at mentioned later in this section.
 Ideally, when multiple \ats attempt to submit operations to the same @io_uring@ instance, all requests should be batched together and one of the \ats is designated to do the system call on behalf of the others, called the \newterm{submitter}.
 However, in practice, \io requests must be handled promptly so there is a need to guarantee everything missed by the current submitter is seen by the next one.
 Indeed, as long as there is a ``next'' submitter, \glspl{thrd} submitting new \io requests can move on, knowing that some future system call includes their request.
 Indeed, as long as there is a ``next'' submitter, \ats submitting new \io requests can move on, knowing that some future system call includes their request.
 Once the system call is done, the submitter must also free SQEs so that the allocator can reuse them.
 Finally, the completion side is much simpler since the @io_uring@ system-call enforces a natural synchronization point.
 Polling simply needs to regularly do the system call, go through the produced CQEs and communicate the result back to the originating \glspl{thrd}.
 Polling simply needs to regularly do the system call, go through the produced CQEs and communicate the result back to the originating \ats.
 Since CQEs only own a signed 32 bit result, in addition to the copy of the @user_data@ field, all that is needed to communicate the result is a simple future~\cite{wiki:future}.
 If the submission side does not designate submitters, polling can also submit all SQEs as it is polling events.
 A simple approach to polling is to allocate a \gls{thrd} per @io_uring@ instance and simply let the poller \glspl{thrd} poll their respective instances when scheduled.
 A simple approach to polling is to allocate a \at per @io_uring@ instance and simply let the poller \ats poll their respective instances when scheduled.
 With the pool of SQE instances approach, the big advantage is that it is fairly flexible.
 It does not impose restrictions on what \glspl{thrd} submitting \io operations can and cannot do between allocations and submissions.
 It does not impose restrictions on what \ats submitting \io operations can and cannot do between allocations and submissions.
 It also can gracefully handle running out of resources, SQEs or the kernel returning @EBUSY@.
 The down side to this approach is that many of the steps used for submitting need complex synchronization to work properly.
 The routing and allocation algorithm needs to keep track of which ring instances have available SQEs, block incoming requests if no instance is available, prevent barging if \glspl{thrd} are already queued up waiting for SQEs and handle SQEs being freed.
 The routing and allocation algorithm needs to keep track of which ring instances have available SQEs, block incoming requests if no instance is available, prevent barging if \ats are already queued up waiting for SQEs and handle SQEs being freed.
 The submission side needs to safely append SQEs to the ring buffer, correctly handle chains, make sure no SQE is dropped or left pending forever, notify the allocation side when SQEs can be reused, and handle the kernel returning @EBUSY@.
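The "next submitter" guarantee discussed in this hunk (everything missed by the current submitter is seen by a future one) has a standard lock-free shape. A hedged C sketch with hypothetical names; SQEs are assumed already appended to the ring before the pending counter is raised, and the real system call is elided:

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
	_Atomic bool submitting;  // a submitter currently owns the system call
	_Atomic int  pending;     // requests appended since the last submission
} batch_state;

void request_submit(batch_state *b /* plus the ring to submit */) {
	atomic_fetch_add(&b->pending, 1);   // publish the new request
	bool expected = false;
	// Try to become the submitter; if the CAS fails, another submitter is
	// active and its re-check below guarantees this request is picked up.
	while (atomic_compare_exchange_strong(&b->submitting, &expected, true)) {
		atomic_store(&b->pending, 0);
		// io_uring_enter(...) would go here: submit everything appended so far
		atomic_store(&b->submitting, false);
		if (atomic_load(&b->pending) == 0) break;  // nothing missed: done
		expected = false;                          // missed work: go again
	}
}
```

The re-check after releasing \lstinline{submitting} is what closes the race: a request published after the batch was taken but before the release forces another pass, so no request is left pending forever.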
 All this synchronization has a significant cost, and compared to the private-instance approach, this synchronization is entirely overhead.
 In this approach, each cluster, see Figure~\ref{fig:system}, owns a pool of @io_uring@ instances managed by an \newterm{arbiter}.
 When a \gls{thrd} attempts to issue an \io operation, it asks for an instance from the arbiter and issues requests to that instance.
 When a \at attempts to issue an \io operation, it asks for an instance from the arbiter and issues requests to that instance.
 This instance is now bound to the \gls{proc} the \gls{thrd} is running on.
 This instance is now bound to the \gls{proc} the \at is running on.
 This binding is kept until the arbiter decides to revoke it, taking back the instance and reverting the \gls{proc} to its initial state with respect to \io.
 This tight coupling means that synchronization can be minimal since only one \gls{proc} can use the instance at a time, akin to the private instances approach.
 \item The current \gls{proc} does not hold an instance.
 \item The current instance does not have sufficient SQEs to satisfy the request.
 \item The current \gls{proc} has a wrong instance, this happens if the submitting \gls{thrd} context-switched between allocation and submission, called \newterm{external submissions}.
 \item The current \gls{proc} has a wrong instance, this happens if the submitting \at context-switched between allocation and submission, called \newterm{external submissions}.
 \end{enumerate}
 However, even when the arbiter is not directly needed, \glspl{proc} need to make sure that their instance ownership is not being revoked, which is accomplished by a lock-\emph{less} handshake.\footnote{
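The lock-\emph{less} handshake this hunk ends on is not shown in the diff, but its general shape is a Dekker-style flag exchange. A hedged C11 sketch with hypothetical names (the thesis's actual protocol may differ); the default sequentially consistent atomics make the cross-check sound:

```c
#include <stdatomic.h>
#include <stdbool.h>

// Hypothetical ownership state shared by a proc and the arbiter.
typedef struct {
	_Atomic bool in_use;   // set by the owning proc around ring accesses
	_Atomic bool revoked;  // set by the arbiter to take the instance back
} ownership;

// Proc side: returns false if ownership was revoked, in which case the
// proc must return to the arbiter instead of touching the instance.
bool try_enter(ownership *o) {
	atomic_store(&o->in_use, true);
	if (atomic_load(&o->revoked)) {   // re-check after publishing in_use
		atomic_store(&o->in_use, false);
		return false;
	}
	return true;
}

void leave(ownership *o) { atomic_store(&o->in_use, false); }

// Arbiter side: after flagging revocation, wait out any in-flight access.
void revoke(ownership *o) {
	atomic_store(&o->revoked, true);
	while (atomic_load(&o->in_use))
		;  // spin (or yield) until the proc has left the instance
}
```

Neither side takes a lock: each publishes its flag and checks the other's, so in the uncontended case the proc pays only two atomic operations per access.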