Changeset 305cd5c for doc/theses


Ignore:
Timestamp:
Sep 22, 2020, 1:24:26 PM (4 years ago)
Author:
Thierry Delisle <tdelisle@…>
Branches:
ADT, arm-eh, ast-experimental, enum, forall-pointer-decay, jacob/cs343-translation, master, new-ast-unique-expr, pthread-emulation, qualifiedEnum
Children:
2a658e9
Parents:
fe94e708 (diff), 0a945fd (diff)
Note: this is a merge changeset, the changes displayed below correspond to the merge itself.
Use the (diff) links above to see all the changes relative to each parent.
Message:

Merge branch 'master' of plg.uwaterloo.ca:software/cfa/cfa-cc

Location:
doc/theses/thierry_delisle_PhD
Files:
16 added
1 deleted
3 edited

Legend:

Unmodified
Added
Removed
  • doc/theses/thierry_delisle_PhD/.gitignore

    rfe94e708 r305cd5c  
    1111comp_II/comp_II.pdf
    1212comp_II/comp_II.ps
     13comp_II/presentation.pdf
     14
     15thesis/build/
     16thesis/fig/*.fig.bak
     17thesis/thesis.pdf
     18thesis/thesis.ps
    1319
    1420!Makefile
  • doc/theses/thierry_delisle_PhD/comp_II/comp_II.tex

    rfe94e708 r305cd5c  
    6060\section{Introduction}
    6161\subsection{\CFA and the \CFA concurrency package}
    62 \CFA\cite{Moss18} is a modern, polymorphic, non-object-oriented, concurrent, backwards-compatible extension of the C programming language.
     62\CFA~\cite{Moss18} is a modern, polymorphic, non-object-oriented, concurrent, backwards-compatible extension of the C programming language.
    6363It aims to add high-productivity features while maintaining the predictable performance of C.
    64 As such, concurrency in \CFA\cite{Delisle19} aims to offer simple and safe high-level tools while still allowing performant code.
    65 \CFA concurrent code is written in the synchronous programming paradigm but uses \glspl{uthrd} in order to achieve the simplicity and maintainability of synchronous programming without sacrificing the efficiency of asynchronous programming.
     64As such, concurrency in \CFA~\cite{Delisle19} aims to offer simple and safe high-level tools while still allowing performant code.
     65\CFA concurrent code is written in the synchronous programming paradigm but uses \glspl{uthrd} to achieve the simplicity and maintainability of synchronous programming without sacrificing the efficiency of asynchronous programming.
    6666As such, the \CFA \newterm{scheduler} is a preemptive user-level scheduler that maps \glspl{uthrd} onto \glspl{kthrd}.
    6767
     68\subsection{Scheduling}
    6869\newterm{Scheduling} occurs when execution switches from one thread to another, where the second thread is implicitly chosen by the scheduler.
    69 This scheduling is an indirect handoff, as opposed to generators and coroutines which explicitly switch to the next generator and coroutine respectively.
     70This scheduling is an indirect handoff, as opposed to generators and coroutines that explicitly switch to the next generator and coroutine respectively.
    7071The cost of switching between two threads for an indirect handoff has two components:
    7172\begin{enumerate}
     
    7576and the cost of scheduling, \ie deciding which thread to run next among all the threads ready to run.
    7677\end{enumerate}
    77 The first cost is generally constant and fixed\footnote{Affecting the constant context-switch cost is whether it is done in one step, after the scheduling, or in two steps, context-switching to a third fixed thread before scheduling.}, while the scheduling cost can vary based on the system state.
    78 Adding multiple \glspl{kthrd} does not fundamentally change the scheduler semantics or requirements, it simply adds new correctness requirements, \ie \newterm{linearizability}\footnote{Meaning, however fast the CPU threads run, there is an equivalent sequential order that gives the same result.}, and a new dimension to performance: scalability, where scheduling cost now also depends on contention.
     78The first cost is generally constant\footnote{Affecting the constant context-switch cost is whether it is done in one step, where the first thread schedules the second, or in two steps, where the first thread context switches to a third scheduler thread.}, while the scheduling cost can vary based on the system state.
     79Adding multiple \glspl{kthrd} does not fundamentally change the scheduler semantics or requirements, it simply adds new correctness requirements, \ie \newterm{linearizability}\footnote{Meaning however fast the CPU threads run, there is an equivalent sequential order that gives the same result.}, and a new dimension to performance: scalability, where scheduling cost also depends on contention.
    7980The more threads switch, the more the administration cost of scheduling becomes noticeable.
    8081It is therefore important to build a scheduler with the lowest possible cost and latency.
    8182Another important consideration is \newterm{fairness}.
    8283In principle, scheduling should give the illusion of perfect fairness, where all threads ready to run are running \emph{simultaneously}.
     84In practice, there can be advantages to unfair scheduling, similar to the express cash register at a grocery store.
    8385While the illusion of simultaneity is easier to reason about, it can break down if the scheduler allows too much unfairness.
    8486Therefore, the scheduler should offer as much fairness as needed to guarantee eventual progress, but use unfairness to help performance.
    85 In practice, threads must wait in turn but there can be advantages to unfair scheduling, similar to the express cash register at a grocery store.
    86 
    87 The goal of this research is to produce a scheduler that is simple for programmers to understand and offers good performance.
     87
     88\subsection{Research Goal}
     89The goal of this research is to produce a scheduler that is simple for programmers to understand and offers good general performance.
    8890Here understandability does not refer to the API but to how much scheduling concerns programmers need to take into account when writing a \CFA concurrent package.
    89 Therefore, the main goal of this proposal is :
     91Therefore, the main consequence of this goal is :
    9092\begin{quote}
    9193The \CFA scheduler should be \emph{viable} for \emph{any} workload.
    9294\end{quote}
    9395
    94 For a general-purpose scheduler, it is impossible to produce an optimal algorithm as it would require knowledge of the future behaviour of threads.
    95 As such, scheduling performance is generally either defined by the best-case scenario, \ie a workload to which the scheduler is tailored, or the worst-case scenario, \ie the scheduler behaves no worse than \emph{X}.
     96For a general-purpose scheduler, it is impossible to produce an optimal algorithm as that requires knowledge of the future behaviour of threads.
     97As such, scheduling performance is generally either defined by a best-case scenario, \ie a workload to which the scheduler is tailored, or a worst-case scenario, \ie the scheduler behaves no worse than \emph{X}.
    9698For this proposal, the performance is evaluated using the second approach to allow \CFA programmers to rely on scheduling performance.
    9799Because there is no optimal scheduler, ultimately \CFA may allow programmers to write their own scheduler; but that is not the subject of this proposal, which considers only the default scheduler.
     
    103105        \item creating an abstraction layer over the operating system to handle kernel-threads spinning unnecessarily,
    104106        \item scheduling blocking I/O operations,
    105         \item and writing sufficient library tools to allow developers to indirectly use the scheduler, either through tuning knobs or replacing the default scheduler.
     107        \item and writing sufficient library tools to allow developers to indirectly use the scheduler, either through tuning knobs in the default scheduler or replacing the default scheduler.
    106108\end{enumerate}
    107109
     
    119121\paragraph{Performance} The performance of a scheduler can generally be measured in terms of scheduling cost, scalability and latency.
    120122\newterm{Scheduling cost} is the cost to switch from one thread to another, as mentioned above.
    121 For simple applications, where a single kernel thread does most of the scheduling, it is generally the dominating cost.
    122 \newterm{Scalability} is the cost of adding multiple kernel threads because it increases the time for context switching because of contention by multiple threads accessing shared resources, \eg the ready queue.
     123For compute-bound concurrent applications with little context switching, the scheduling cost is negligible.
     124For applications with high context-switch rates, scheduling cost can begin to dominating the cost.
     125\newterm{Scalability} is the cost of adding multiple kernel threads.
     126It can increase the time for scheduling because of contention from the multiple threads accessing shared resources, \eg a single ready queue.
    123127Finally, \newterm{tail latency} is service delay and relates to thread fairness.
    124 Specifically, latency measures how long a thread waits to run once scheduled and is evaluated in the worst case.
     128Specifically, latency measures how long a thread waits to run once scheduled and is evaluated by the worst case.
    125129The \CFA scheduler should offer good performance for all three metrics.
    126130
     
    128132\newterm{Eventual progress} guarantees every scheduled thread is eventually run, \ie prevent starvation.
    129133As a hard requirement, the \CFA scheduler must guarantee eventual progress, otherwise the above-mentioned illusion of simultaneous execution is broken and the scheduler becomes much more complex to reason about.
    130 \newterm{Predictability} and \newterm{reliability} mean similar workloads achieve similar performance and programmer execution intuition is respected.
    131 For example, a thread that yields aggressively should not run more often than other tasks.
     134\newterm{Predictability} and \newterm{reliability} mean similar workloads achieve similar performance so programmer execution intuition is respected.
     135For example, a thread that yields aggressively should not run more often than other threads.
    132136While this is intuitive, it does not hold true for many work-stealing or feedback based schedulers.
    133 The \CFA scheduler must guarantee eventual progress and should be predictable and offer reliable performance.
     137The \CFA scheduler must guarantee eventual progress, should be predictable, and offer reliable performance.
    134138
    135139\paragraph{Efficiency} Finally, efficient usage of CPU resources is also an important requirement and is discussed in depth towards the end of the proposal.
    136 \newterm{Efficiency} means avoiding using CPU cycles when there are no threads to run, and conversely, use all CPUs available when the workload can benefit from it.
     140\newterm{Efficiency} means avoiding using CPU cycles when there are no threads to run (to conserve energy), and conversely, using as many available CPU cycles when the workload can benefit from it.
    137141Balancing these two states is where the complexity lies.
    138142The \CFA scheduler should be efficient with respect to the underlying (shared) computer.
     
    146150\begin{enumerate}
    147151        \item Threads live long enough for useful feedback information to be gathered.
    148         \item Threads belong to multiple users so fairness across threads is insufficient.
     152        \item Threads belong to multiple users so fairness across users is important.
    149153\end{enumerate}
    150154
     
    159163In the case of the \CFA scheduler, every thread runs in the same user space and is controlled by the same user.
    160164Fairness across users is therefore a given and it is then possible to safely ignore the possibility that threads are malevolent.
    161 This approach allows for a much simpler fairness metric and in this proposal \emph{fairness} is defined as: when multiple threads are cycling through the system, the total ordering of threads being scheduled, \ie pushed onto the ready queue, should not differ much from the total ordering of threads being executed, \ie popped from the ready queue.
     165This approach allows for a much simpler fairness metric, and in this proposal, \emph{fairness} is defined as:
     166\begin{quote}
     167When multiple threads are cycling through the system, the total ordering of threads being scheduled, \ie pushed onto the ready queue, should not differ much from the total ordering of threads being executed, \ie popped from the ready queue.
     168\end{quote}
    162169
    163170Since feedback is not necessarily feasible within the lifetime of all threads and a simple fairness metric can be used, the scheduling strategy proposed for the \CFA runtime does not use per-threads feedback.
     
    169176Threads with equal priority are scheduled using a secondary strategy, often something simple like round robin or FIFO.
    170177A consequence of priority is that, as long as there is a thread with a higher priority that desires to run, a thread with a lower priority does not run.
    171 This possible starving of threads can dramatically increase programming complexity since starving threads and priority inversion (prioritizing a lower priority thread) can both lead to serious problems.
     178The potential for thread starvation dramatically increases programming complexity since starving threads and priority inversion (prioritizing a lower priority thread) can both lead to serious problems.
    172179
    173180An important observation is that threads do not need to have explicit priorities for problems to occur.
    174 Indeed, any system with multiple ready queues that attempts to exhaust one queue before accessing the other queues, essentially provide implicit priority, which can encounter starvation problems.
     181Indeed, any system with multiple ready queues that attempts to exhaust one queue before accessing the other queues, essentially provides implicit priority, which can encounter starvation problems.
    175182For example, a popular scheduling strategy that suffers from implicit priorities is work stealing.
    176183\newterm{Work stealing} is generally presented as follows:
     
    180187        \item If a processor's ready queue is empty, attempt to run threads from some other processor's ready queue.
    181188\end{enumerate}
    182 
    183189In a loaded system\footnote{A \newterm{loaded system} is a system where threads are being run at the same rate they are scheduled.}, if a thread does not yield, block, or preempt for an extended period of time, threads on the same processor's list starve if no other processors exhaust their list.
    184190
    185 Since priorities can be complex for programmers to incorporate into their execution intuition, the scheduling strategy proposed for the \CFA runtime does not use a strategy with either implicit or explicit thread priorities.
     191Since priorities can be complex for programmers to incorporate into their execution intuition, the \CFA scheduling strategy does not provided explicit priorities and attempts to eliminate implicit priorities.
    186192
    187193\subsection{Schedulers without feedback or priorities}
     
    191197Thankfully, strict FIFO is not needed for sufficient fairness.
    192198Since concurrency is inherently non-deterministic, fairness concerns in scheduling are only a problem if a thread repeatedly runs before another thread can run.
    193 Some relaxation is possible because non-determinism means programmers already handle ordering problems to produce correct code and hence rely on weak guarantees, \eg that a specific thread will \emph{eventually} run.
     199Some relaxation is possible because non-determinism means programmers already handle ordering problems to produce correct code and hence rely on weak guarantees, \eg that a thread \emph{eventually} runs.
    194200Since some reordering does not break correctness, the FIFO fairness guarantee can be significantly relaxed without causing problems.
    195201For this proposal, the target guarantee is that the \CFA scheduler provides \emph{probable} FIFO ordering, which allows reordering but makes it improbable that threads are reordered far from their position in total ordering.
    196202
    197203The \CFA scheduler fairness is defined as follows:
    198 \begin{itemize}
    199         \item Given two threads $X$ and $Y$, the odds that thread $X$ runs $N$ times \emph{after} thread $Y$ is scheduled but \emph{before} it is run, decreases exponentially with regard to $N$.
    200 \end{itemize}
     204\begin{quote}
     205Given two threads $X$ and $Y$, the odds that thread $X$ runs $N$ times \emph{after} thread $Y$ is scheduled but \emph{before} it is run, decreases exponentially with regard to $N$.
     206\end{quote}
    201207While this is not a bounded guarantee, the probability that unfairness persist for long periods of times decreases exponentially, making persisting unfairness virtually impossible.
    202208
     
    210216The described queue uses an array of underlying strictly FIFO queues as shown in Figure~\ref{fig:base}\footnote{For this section, the number of underlying queues is assumed to be constant.
    211217Section~\ref{sec:resize} discusses resizing the array.}.
    212 Pushing new data is done by selecting one of these underlying queues at random, recording a timestamp for the operation and pushing to the selected queue.
     218Pushing new data is done by selecting one of the underlying queues at random, recording a timestamp for the operation, and pushing to the selected queue.
    213219Popping is done by selecting two queues at random and popping from the queue with the oldest timestamp.
    214 A higher number of underlying queues lead to less contention on each queue and therefore better performance.
    215 In a loaded system, it is highly likely the queues are non-empty, \ie several tasks are on each of the underlying queues.
    216 This means that selecting a queue at random to pop from is highly likely to yield a queue with available items.
     220A higher number of underlying queues leads to less contention on each queue and therefore better performance.
     221In a loaded system, it is highly likely the queues are non-empty, \ie several threads are on each of the underlying queues.
     222For this case, selecting a queue at random to pop from is highly likely to yield a queue with available items.
    217223In Figure~\ref{fig:base}, ignoring the ellipsis, the chances of getting an empty queue is 2/7 per pick, meaning two random picks yield an item approximately 9 times out of 10.
    218224
     
    221227                \input{base.pstex_t}
    222228        \end{center}
    223         \caption{Relaxed FIFO list at the base of the scheduler: an array of strictly FIFO lists.
    224         The timestamp is in all nodes and cell arrays.}
     229        \caption{Loaded relaxed FIFO list base on an array of strictly FIFO lists.
     230        A timestamp appears in each node and array cell.}
    225231        \label{fig:base}
    226232\end{figure}
     
    230236                \input{empty.pstex_t}
    231237        \end{center}
    232         \caption{``More empty'' state of the queue: the array contains many empty cells.}
     238        \caption{Underloaded relaxed FIFO list where the array contains many empty cells.}
    233239        \label{fig:empty}
    234240\end{figure}
    235241
    236 When the ready queue is \emph{more empty}, \ie several of the queues are empty, selecting a random queue for popping is less likely to yield a successful selection and more attempts are needed, resulting in a performance degradation.
     242In an underloaded system, several of the queues are empty, so selecting a random queue for popping is less likely to yield a successful selection and more attempts are needed, resulting in a performance degradation.
    237243Figure~\ref{fig:empty} shows an example with fewer elements, where the chances of getting an empty queue is 5/7 per pick, meaning two random picks yield an item only half the time.
    238244Since the ready queue is not empty, the pop operation \emph{must} find an element before returning and therefore must retry.
     
    262268\end{table}
    263269
    264 Performance can be improved in case~D (Table~\ref{tab:perfcases}) by adding information to help processors find which inner queues are used.
     270Performance can be improved in Table~\ref{tab:perfcases} case~D by adding information to help processors find which inner queues are used.
    265271This addition aims to avoid the cost of retrying the pop operation but does not affect contention on the underlying queues and can incur some management cost for both push and pop operations.
    266272The approach used to encode this information can vary in density and be either global or local.
     
    273279With a multi-word bitmask, this maximum limit can be increased arbitrarily, but it is not possible to check if the queue is empty by reading the bitmask atomically.
    274280
    275 Finally, a dense bitmap, either single or multi-word, causes additional problems in case C (Table 1), because many processors are continuously scanning the bitmask to find the few available threads.
     281Finally, a dense bitmap, either single or multi-word, causes additional problems in Table~\ref{tab:perfcases} case C, because many processors are continuously scanning the bitmask to find the few available threads.
    276282This increased contention on the bitmask(s) reduces performance because of cache misses after updates and the bitmask is updated more frequently by the scanning processors racing to read and/or update that information.
    277283This increased update frequency means the information in the bitmask is more often stale before a processor can use it to find an item, \ie mask read says there are available user threads but none on queue.
     
    279285\begin{figure}
    280286        \begin{center}
    281                 {\resizebox{0.8\textwidth}{!}{\input{emptybit}}}
    282         \end{center}
    283         \caption{``More empty'' queue with added bitmask to indicate which array cells have items.}
     287                {\resizebox{0.73\textwidth}{!}{\input{emptybit}}}
     288        \end{center}
     289        \vspace*{-5pt}
     290        \caption{Underloaded queue with added bitmask to indicate which array cells have items.}
    284291        \label{fig:emptybit}
     292        \begin{center}
     293                {\resizebox{0.73\textwidth}{!}{\input{emptytree}}}
     294        \end{center}
     295        \vspace*{-5pt}
     296        \caption{Underloaded queue with added binary search tree indicate which array cells have items.}
     297        \label{fig:emptytree}
     298        \begin{center}
     299                {\resizebox{0.9\textwidth}{!}{\input{emptytls}}}
     300        \end{center}
     301        \vspace*{-5pt}
     302        \caption{Underloaded queue with added per processor bitmask to indicate which array cells have items.}
     303        \label{fig:emptytls}
    285304\end{figure}
    286305
    287 Figure~\ref{fig:emptytree} shows another approach using a hierarchical tree data-structure to reduce contention and has been shown to work in similar cases~\cite{ellen2007snzi}\footnote{This particular paper seems to be patented in the US.
    288 How does that affect \CFA? Can I use it in my work?}.
    289 However, this approach may lead to poorer performance in case~B (Table~\ref{tab:perfcases}) due to the inherent pointer chasing cost and already low contention cost in that case.
    290 
    291 \begin{figure}
    292         \begin{center}
    293                 {\resizebox{0.8\textwidth}{!}{\input{emptytree}}}
    294         \end{center}
    295         \caption{``More empty'' queue with added binary search tree indicate which array cells have items.}
    296         \label{fig:emptytree}
    297 \end{figure}
    298 
    299 Finally, a third approach is to use dense information, similar to the bitmap, but have each thread keep its own independent copy of it.
     306Figure~\ref{fig:emptytree} shows an approach using a hierarchical tree data-structure to reduce contention and has been shown to work in similar cases~\cite{ellen2007snzi}.
     307However, this approach may lead to poorer performance in Table~\ref{tab:perfcases} case~B due to the inherent pointer chasing cost and already low contention cost in that case.
     308
     309Figure~\ref{fig:emptytls} shows an approach using dense information, similar to the bitmap, but have each thread keep its own independent copy of it.
    300310While this approach can offer good scalability \emph{and} low latency, the liveliness of the information can become a problem.
    301 In the simple cases, local copies of which underlying queues are empty can become stale and end-up not being useful for the pop operation.
     311In the simple cases, local copies can become stale and end-up not being useful for the pop operation.
    302312A more serious problem is that reliable information is necessary for some parts of this algorithm to be correct.
    303313As mentioned in this section, processors must know \emph{reliably} whether the list is empty or not to decide if they can return \texttt{NULL} or if they must keep looking during a pop operation.
    304314Section~\ref{sec:sleep} discusses another case where reliable information is required for the algorithm to be correct.
    305315
    306 \begin{figure}
    307         \begin{center}
    308                 \input{emptytls}
    309         \end{center}
    310         \caption{``More empty'' queue with added per processor bitmask to indicate which array cells have items.}
    311         \label{fig:emptytls}
    312 \end{figure}
    313 
    314316There is a fundamental tradeoff among these approach.
    315 Dense global information about empty underlying queues helps zero-contention cases at the cost of high-contention case.
    316 Sparse global information helps high-contention cases but increases latency in zero-contention-cases, to read and ``aggregate'' the information\footnote{Hierarchical structures, \eg binary search tree, effectively aggregate information but follow pointer chains, learning information at each node.
     317Dense global information about empty underlying queues helps zero-contention cases at the cost of the high-contention case.
     318Sparse global information helps high-contention cases but increases latency in zero-contention cases to read and ``aggregate'' the information\footnote{Hierarchical structures, \eg binary search tree, effectively aggregate information but follow pointer chains, learning information at each node.
    317319Similarly, other sparse schemes need to read multiple cachelines to acquire all the information needed.}.
    318 Finally, dense local information has both the advantages of low latency in zero-contention cases and scalability in high-contention cases. However the information can become stale making it difficult to use to ensure correctness.
     320Finally, dense local information has both the advantages of low latency in zero-contention cases and scalability in high-contention cases.
     321However, the information can become stale making it difficult to use to ensure correctness.
    319322The fact that these solutions have these fundamental limits suggest to me a better solution that attempts to combine these properties in an interesting way.
    320323Also, the lock discussed in Section~\ref{sec:resize} allows for solutions that adapt to the number of processors, which could also prove useful.
     
    323326
    324327How much scalability is actually needed is highly debatable.
    325 \emph{libfibre}\cite{libfibre} has compared favourably to other schedulers in webserver tests\cite{Karsten20} and uses a single atomic counter in its scheduling algorithm similarly to the proposed bitmask.
     328\emph{libfibre}~\cite{libfibre} has compared favourably to other schedulers in webserver tests~\cite{Karsten20} and uses a single atomic counter in its scheduling algorithm similarly to the proposed bitmask.
    326329As such, the single atomic instruction on a shared cacheline may be sufficiently performant.
    327330
    328 I have built a prototype of this ready queue in the shape of a data queue, \ie nodes on the queue are structures with a single int representing a thread and intrusive data fields.
    329 Using this prototype, I ran preliminary performance experiments that confirm the expected performance in Table~\ref{tab:perfcases}.
    330 However, these experiments only offer a hint at the actual performance of the scheduler since threads form more complex operations than simple integer nodes, \eg threads are not independent of each other, when a thread blocks some other thread must intervene to wake it.
     331I have built a prototype of this ready queue in the shape of a data queue, \ie nodes on the queue are structures with a single $int$ representing a thread and intrusive data fields.
     332Using this prototype, preliminary performance experiments confirm the expected performance in Table~\ref{tab:perfcases}.
     333However, these experiments only offer a hint at the actual performance of the scheduler since threads are involved in more complex operations, \eg threads are not independent of each other: when a thread blocks some other thread must intervene to wake it.
    331334
    332335I have also integrated this prototype into the \CFA runtime, but have not yet created performance experiments to compare results, as creating one-to-one comparisons between the prototype and the \CFA runtime will be complex.
     
    345348Threads on a cluster are always scheduled on one of the processors of the cluster.
    346349Currently, the runtime handles dynamically adding and removing processors from clusters at any time.
    347 Since this is part of the existing design, the proposed scheduler must also support this behaviour.
     350Since this feature is part of the existing design, the proposed scheduler must also support this behaviour.
    348351However, dynamically resizing a cluster is considered a rare event associated with setup, tear down and major configuration changes.
    349352This assumption is made both in the design of the proposed scheduler as well as in the original design of the \CFA runtime system.
    350353As such, the proposed scheduler must honour the correctness of this behaviour but does not have any performance objectives with regard to resizing a cluster.
    351 How long adding or removing processors take and how much this disrupts the performance of other threads is considered a secondary concern since it should be amortized over long periods of times.
     354That is, the time to add or remove processors and how much this disrupts the performance of other threads is considered a secondary concern since it should be amortized over long periods of times.
    352355However, as mentioned in Section~\ref{sec:queue}, contention on the underlying queues can have a direct impact on performance.
    353356The number of underlying queues must therefore be adjusted as the number of processors grows or shrinks.
     
    371374
    372375There are possible alternatives to the reader-writer lock solution.
    373 This problem is effectively a memory reclamation problem and as such there is a large body of research on the subject\cite{michael2004hazard, brown2015reclaiming}.
     376This problem is effectively a memory reclamation problem and as such there is a large body of research on the subject~\cite{brown2015reclaiming, michael2004hazard}.
    374377However, the reader-write lock-solution is simple and can be leveraged to solve other problems (\eg processor ordering and memory reclamation of threads), which makes it an attractive solution.
    375378
     
    401404Individual processors always finish scheduling user threads before looking for new work, which means that the last processor to go to sleep cannot miss threads scheduled from inside the cluster (if they do, that demonstrates the ready queue is not linearizable).
    402405However, this guarantee does not hold if threads are scheduled from outside the cluster, either due to an external event like timers and I/O, or due to a user (or kernel) thread migrating from a different cluster.
    403 In this case, missed signals can lead to the cluster deadlocking\footnote{Clusters should only deadlock in cases where a \CFA programmer \emph{actually} write \CFA code that leads to a deadlock.}.
     406In this case, missed signals can lead to the cluster deadlocking\footnote{Clusters should only deadlock in cases where a \CFA programmer \emph{actually} writes \CFA code that leads to a deadlock.}.
    404407Therefore, it is important that the scheduling of threads include a mechanism where signals \emph{cannot} be missed.
    405408For performance reasons, it can be advantageous to have a secondary mechanism that allows signals to be missed in cases where it cannot lead to a deadlock.
    406 To be safe, this process must include a ``handshake'' where it is guaranteed that either~: the sleeping processor notices that a user thread is scheduled after the sleeping processor signalled its intent to block or code scheduling threads sees the intent to sleep before scheduling and be able to wake-up the processor.
     409To be safe, this process must include a ``handshake'' where it is guaranteed that either:
     410\begin{enumerate}
     411\item
     412the sleeping processor notices that a user thread is scheduled after the sleeping processor signalled its intent to block or
     413\item
     414code scheduling threads sees the intent to sleep before scheduling and be able to wake-up the processor.
     415\end{enumerate}
    407416This matter is complicated by the fact that pthreads and Linux offer few tools to implement this solution and no guarantee of ordering of threads waking up for most of these tools.
    408417
    409418Another important issue is avoiding kernel threads sleeping and waking frequently because there is a significant operating-system cost.
    410 This scenario happens when a program oscillates between high and low activity, needing most and then fewer processors.
     419This scenario happens when a program oscillates between high and low activity, needing most and then few processors.
    411420A possible partial solution is to order the processors so that the one which most recently went to sleep is woken up.
    412421This allows other sleeping processors to reach deeper sleep state (when these are available) while keeping ``hot'' processors warmer.
     
    417426Processors that are unnecessarily unblocked lead to unnecessary contention, CPU usage, and power consumption, while too many sleeping processors can lead to suboptimal throughput.
    418427Furthermore, transitions from sleeping to awake and vice versa also add unnecessary latency.
    419 There is already a wealth of research on the subject\cite{schillings1996engineering, wiki:thunderherd} and I may use an existing approach for the idle-sleep heuristic in this project, \eg\cite{Karsten20}.
     428There is already a wealth of research on the subject~\cite{schillings1996engineering, wiki:thunderherd} and I may use an existing approach for the idle-sleep heuristic in this project, \eg~\cite{Karsten20}.
    420429
    421430\subsection{Asynchronous I/O}
     
    432441an event-engine to (de)multiplex the operations,
    433442\item
    434 and a synchronous interface for users to use.
     443and a synchronous interface for users.
    435444\end{enumerate}
    436445None of these components currently exist in \CFA and I will need to build all three for this project.
    437446
    438 \paragraph{OS Abstraction}
    439 One fundamental part for converting blocking I/O operations into non-blocking ones is having an underlying asynchronous I/O interface to direct the I/O operations.
     447\paragraph{OS Asynchronous Abstraction}
     448One fundamental part for converting blocking I/O operations into non-blocking is having an underlying asynchronous I/O interface to direct the I/O operations.
    440449While there exists many different APIs for asynchronous I/O, it is not part of this proposal to create a novel API.
    441450It is sufficient to make one work in the complex context of the \CFA runtime.
    442 \uC uses the $select$\cite{select} as its interface, which handles ttys, pipes and sockets, but not disk.
     451\uC uses the $select$~\cite{select} as its interface, which handles ttys, pipes and sockets, but not disk.
    443452$select$ entails significant complexity and is being replaced in UNIX operating systems, which make it a less interesting alternative.
    444 Another popular interface is $epoll$\cite{epoll}, which is supposed to be cheaper than $select$.
    445 However, $epoll$ also does not handle the file system and anecdotal evidence suggest it has problems with Linux pipes and $TTY$s.
    446 A popular cross-platform alternative is $libuv$\cite{libuv}, which offers asynchronous sockets and asynchronous file system operations (among other features).
     453Another popular interface is $epoll$~\cite{epoll}, which is supposed to be cheaper than $select$.
     454However, $epoll$ also does not handle the file system and anecdotal evidence suggest it has problems with Linux pipes and ttys.
     455A popular cross-platform alternative is $libuv$~\cite{libuv}, which offers asynchronous sockets and asynchronous file system operations (among other features).
    447456However, as a full-featured library it includes much more than I need and could conflict with other features of \CFA unless significant effort is made to merge them together.
    448 A very recent alternative that I am investigating is $io_uring$\cite{io_uring}.
     457A very recent alternative that I am investigating is $io_uring$~\cite{io_uring}.
    449458It claims to address some of the issues with $epoll$ and my early investigating suggests that the claim is accurate.
    450 $io_uring$ uses a much more general approach where system calls are registered to a queue and later executed by the kernel, rather than relying on system calls to return an error instead of blocking and subsequently waiting for changes on file descriptors.
    451 I believe this approach allows for fewer problems, \eg the manpage for $open$\cite{open} states:
     459$io_uring$ uses a much more general approach where system calls are registered to a queue and later executed by the kernel, rather than relying on system calls to support returning an error instead of blocking.
     460I believe this approach allows for fewer problems, \eg the manpage for $open$~\cite{open} states:
    452461\begin{quote}
    453462Note that [the $O_NONBLOCK$ flag] has no effect for regular files and block devices;
     
    455464Since $O_NONBLOCK$ semantics might eventually be implemented, applications should not depend upon blocking behaviour when specifying this flag for regular files and block devices.
    456465\end{quote}
    457 This makes approach based on $epoll$/$select$ less reliable since they may not work for every file descriptors.
    458 For this reason, I plan to use $io_uring$ as the OS abstraction for the \CFA runtime unless further work shows problems I haven't encountered yet.
    459 However, only a small subset of the features are available in Ubuntu as of April 2020\cite{wiki:ubuntu-linux}, which will limit performance comparisons.
     466This makes approaches based on $select$/$epoll$ less reliable since they may not work for every file descriptors.
     467For this reason, I plan to use $io_uring$ as the OS abstraction for the \CFA runtime unless further work encounters a fatal problem.
     468However, only a small subset of the features are available in Ubuntu as of April 2020~\cite{wiki:ubuntu-linux}, which will limit performance comparisons.
    460469I do not believe this will affect the comparison result.
    461470
    462471\paragraph{Event Engine}
    463 Laying on top of the asynchronous interface layer is the event engine.
     472Above the OS asynchronous abstraction is the event engine.
    464473This engine is responsible for multiplexing (batching) the synchronous I/O requests into asynchronous I/O requests and demultiplexing the results to appropriate blocked user threads.
    465474This step can be straightforward for simple cases, but becomes quite complex when there are thousands of user threads performing both reads and writes, possibly on overlapping file descriptors.
     
    478487The interface can be novel but it is preferable to match the existing POSIX interface when possible to be compatible with existing code.
    479488Matching allows C programs written using this interface to be transparently converted to \CFA with minimal effort.
    480 Where new functionality is needed, I will create a novel interface to fill gaps and provide advanced features.
     489Where new functionality is needed, I will add novel interface extensions to fill gaps and provide advanced features.
    481490
    482491
     
    485494\section{Discussion}
    486495I believe that runtime system and scheduling are still open topics.
    487 Many ``state of the art'' production frameworks still use single-threaded event loops because of performance considerations, \eg \cite{nginx-design}, and, to my knowledge, no widely available system language offers modern threading facilities.
     496Many ``state of the art'' production frameworks still use single-threaded event loops because of performance considerations, \eg~\cite{nginx-design}, and, to my knowledge, no widely available system language offers modern threading facilities.
    488497I believe the proposed work offers a novel runtime and scheduling package, where existing work only offers fragments that users must assemble themselves when possible.
    489498
  • doc/theses/thierry_delisle_PhD/comp_II/img/system.fig

    rfe94e708 r305cd5c  
    1 #FIG 3.2  Produced by xfig version 3.2.5c
     1#FIG 3.2  Produced by xfig version 3.2.7b
    22Landscape
    33Center
     
    36361 3 0 1 -1 -1 0 0 20 0.000 1 0.0000 4500 3600 15 15 4500 3600 4515 3615
    3737-6
    38 6 3225 4125 4650 4425
    39 6 4350 4200 4650 4350
    40 1 3 0 1 -1 -1 0 0 20 0.000 1 0.0000 4425 4275 15 15 4425 4275 4440 4290
    41 1 3 0 1 -1 -1 0 0 20 0.000 1 0.0000 4500 4275 15 15 4500 4275 4515 4290
    42 1 3 0 1 -1 -1 0 0 20 0.000 1 0.0000 4575 4275 15 15 4575 4275 4590 4290
    43 -6
    44 1 1 0 1 -1 -1 0 0 -1 0.000 1 0.0000 3450 4275 225 150 3450 4275 3675 4425
    45 1 1 0 1 -1 -1 0 0 -1 0.000 1 0.0000 4050 4275 225 150 4050 4275 4275 4425
    46 -6
    47 6 6675 4125 7500 4425
    48 6 7200 4200 7500 4350
    49 1 3 0 1 -1 -1 0 0 20 0.000 1 0.0000 7275 4275 15 15 7275 4275 7290 4290
    50 1 3 0 1 -1 -1 0 0 20 0.000 1 0.0000 7350 4275 15 15 7350 4275 7365 4290
    51 1 3 0 1 -1 -1 0 0 20 0.000 1 0.0000 7425 4275 15 15 7425 4275 7440 4290
    52 -6
    53 1 1 0 1 -1 -1 0 0 -1 0.000 1 0.0000 6900 4275 225 150 6900 4275 7125 4425
    54 -6
    55386 6675 3525 8025 3975
    56392 1 0 1 -1 -1 0 0 -1 0.000 0 0 -1 1 0 2
     
    79621 3 0 1 -1 -1 0 0 -1 0.000 1 0.0000 3975 2850 150 150 3975 2850 4125 2850
    80631 3 0 1 -1 -1 0 0 -1 0.000 1 0.0000 7200 2775 150 150 7200 2775 7350 2775
    81 1 3 0 1 0 0 0 0 0 0.000 1 0.0000 2250 4830 30 30 2250 4830 2280 4860
     641 3 0 1 0 0 0 0 0 0.000 1 0.0000 2250 4830 30 30 2250 4830 2280 4830
    82651 3 0 1 0 0 0 0 0 0.000 1 0.0000 7200 2775 30 30 7200 2775 7230 2805
    83661 3 0 1 -1 -1 0 0 -1 0.000 1 0.0000 3525 3600 150 150 3525 3600 3675 3600
    84 1 3 0 1 -1 -1 0 0 -1 0.000 1 0.0000 3875 4800 100 100 3875 4800 3975 4800
    85 1 1 0 1 -1 -1 0 0 -1 0.000 1 0.0000 4650 4800 150 75 4650 4800 4800 4875
     671 3 0 1 -1 -1 0 0 -1 0.000 1 0.0000 4625 4838 100 100 4625 4838 4725 4838
    86682 2 0 1 -1 -1 0 0 -1 0.000 0 0 0 0 0 5
    8769         2400 4200 2400 3750 1950 3750 1950 4200 2400 4200
     
    153135        1 1 1.00 45.00 90.00
    154136         7875 3750 7875 2325 7200 2325 7200 2550
     1372 2 1 1 -1 -1 0 0 -1 3.000 0 0 0 0 0 5
     138         6975 4950 6750 4950 6750 4725 6975 4725 6975 4950
    1551392 2 0 1 -1 -1 0 0 -1 0.000 0 0 0 0 0 5
    156140         5850 4950 5850 4725 5625 4725 5625 4950 5850 4950
    157 2 2 1 1 -1 -1 0 0 -1 3.000 0 0 0 0 0 5
    158          6975 4950 6750 4950 6750 4725 6975 4725 6975 4950
    159 4 1 -1 0 0 0 10 0.0000 2 105 720 5550 4425 Processors\001
    160 4 1 -1 0 0 0 10 0.0000 2 120 1005 4200 3225 Blocked Tasks\001
    161 4 1 -1 0 0 0 10 0.0000 2 150 870 4200 3975 Ready Tasks\001
    162 4 1 -1 0 0 0 10 0.0000 2 135 1095 7350 1725 Other Cluster(s)\001
    163 4 1 -1 0 0 0 10 0.0000 2 105 840 4650 1725 User Cluster\001
    164 4 1 -1 0 0 0 10 0.0000 2 150 615 2175 3675 Manager\001
    165 4 1 -1 0 0 0 10 0.0000 2 105 990 2175 3525 Discrete-event\001
    166 4 1 -1 0 0 0 10 0.0000 2 135 795 2175 4350 preemption\001
    167 4 0 -1 0 0 0 10 0.0000 2 150 1290 2325 4875 generator/coroutine\001
    168 4 0 -1 0 0 0 10 0.0000 2 120 270 4050 4875 task\001
    169 4 0 -1 0 0 0 10 0.0000 2 105 450 7050 4875 cluster\001
    170 4 0 -1 0 0 0 10 0.0000 2 105 660 5925 4875 processor\001
    171 4 0 -1 0 0 0 10 0.0000 2 105 555 4875 4875 monitor\001
     1414 1 -1 0 0 0 10 0.0000 2 135 900 5550 4425 Processors\001
     1424 1 -1 0 0 0 10 0.0000 2 165 1170 4200 3975 Ready Threads\001
     1434 1 -1 0 0 0 10 0.0000 2 165 1440 7350 1725 Other Cluster(s)\001
     1444 1 -1 0 0 0 10 0.0000 2 135 1080 4650 1725 User Cluster\001
     1454 1 -1 0 0 0 10 0.0000 2 165 630 2175 3675 Manager\001
     1464 1 -1 0 0 0 10 0.0000 2 135 1260 2175 3525 Discrete-event\001
     1474 1 -1 0 0 0 10 0.0000 2 150 900 2175 4350 preemption\001
     1484 0 -1 0 0 0 10 0.0000 2 135 630 7050 4875 cluster\001
     1494 1 -1 0 0 0 10 0.0000 2 135 1350 4200 3225 Blocked Threads\001
     1504 0 -1 0 0 0 10 0.0000 2 135 540 4800 4875 thread\001
     1514 0 -1 0 0 0 10 0.0000 2 120 810 5925 4875 processor\001
     1524 0 -1 0 0 0 10 0.0000 2 165 1710 2325 4875 generator/coroutine\001
Note: See TracChangeset for help on using the changeset viewer.