Ignore:
Timestamp:
Aug 5, 2022, 4:20:16 PM (2 years ago)
Author:
Thierry Delisle <tdelisle@…>
Branches:
ADT, ast-experimental, master, pthread-emulation
Children:
181036c
Parents:
8040286 (diff), 878be178 (diff)
Note: this is a merge changeset, the changes displayed below correspond to the merge itself.
Use the (diff) links above to see all the changes relative to each parent.
Message:

Merge branch 'master' of plg.uwaterloo.ca:software/cfa/cfa-cc

Location:
doc/theses/thierry_delisle_PhD/thesis/text
Files:
4 edited

Legend:

Unmodified
Added
Removed
  • doc/theses/thierry_delisle_PhD/thesis/text/front.tex

    r8040286 r62c5a55  
    129129which begs of the question of how many kernel threads are needed and should the number be dynamically reevaluated.
    130130Furthermore, scheduling must prevent kernel threads from blocking, otherwise user-thread parallelism drops.
    131 When user-threading parallelism does drop, how and when should idle kernel-threads be put to sleep to avoid wasted CPU resources.
     131When user-threading parallelism does drop, how and when should idle kernel-threads be put to sleep to avoid wasting CPU resources.
    132132Finally, the scheduling system must provide fairness to prevent a user thread from monopolizing a kernel thread;
    133133otherwise other user threads can experience short/long term starvation or kernel threads can deadlock waiting for events to occur on busy kernel threads.
     
    138138Fairness is handled through preemption and/or ad-hoc solutions, which leads to coarse-grained fairness with some pathological cases.
    139139
    140 After testing and selecting specific approaches to these scheduling issues, a complete implementation was created and tested in the \CFA (C-for-all) runtime system.
     140After examining, selecting and testing specific approaches to these scheduling issues, a complete implementation was created and tested in the \CFA (C-for-all) runtime system.
    141141\CFA is a modern extension of C using user-level threading as its fundamental threading model.
    142142As one of its primary goals, \CFA aims to offer increased safety and productivity without sacrificing performance.
  • doc/theses/thierry_delisle_PhD/thesis/text/intro.tex

    r8040286 r62c5a55  
    11\chapter{Introduction}\label{intro}
    2 \section{\CFA programming language}
    32
    4 The \CFA programming language~\cite{cfa:frontpage,cfa:typesystem} extends the C programming language by adding modern safety and productivity features, while maintaining backwards compatibility.
    5 Among its productivity features, \CFA supports user-level threading~\cite{Delisle21} allowing programmers to write modern concurrent and parallel programs.
    6 My previous master's thesis on concurrent in \CFA focused on features and interfaces.
    7 This Ph.D.\ thesis focuses on performance, introducing \glsxtrshort{api} changes only when required by performance considerations.
    8 Specifically, this work concentrates on scheduling and \glsxtrshort{io}.
    9 Prior to this work, the \CFA runtime used a strict \glsxtrshort{fifo} \gls{rQ} and no \glsxtrshort{io} capabilities at the user-thread level\footnote{C supports \glsxtrshort{io} capabilities at the kernel level, which means blocking operations block kernel threads where blocking user-level threads whould be more appropriate for \CFA.}.
     3\Gls{uthrding} (M:N) is gaining popularity over kernel-level threading (1:1) in many programming languages.
     4The user threading approach is often a better mechanism to express complex concurrent applications by efficiently running 10,000+ threads on multi-core systems.
     5Indeed, over-partitioning into small work-units with user threading significantly eases load bal\-ancing, while simultaneously providing advanced synchronization and mutual exclusion capabilities.
     6To manage these high levels of concurrency, the underlying runtime must efficiently schedule many user threads across a few kernel threads;
     7which begs of the question of how many kernel threads are needed and should the number be dynamically reevaluated.
     8Furthermore, scheduling must prevent kernel threads from blocking, otherwise user-thread parallelism drops.
     9When user-threading parallelism does drop, how and when should idle kernel-threads be put to sleep to avoid wasting CPU resources.
     10Finally, the scheduling system must provide fairness to prevent a user thread from monopolizing a kernel thread;
     11otherwise other user threads can experience short/long term starvation or kernel threads can deadlock waiting for events to occur on busy kernel threads.
    1012
    11 As a research project, this work builds exclusively on newer versions of the Linux operating-system and gcc/clang compilers.
    12 While \CFA is released, supporting older versions of Linux ($<$~Ubuntu 16.04) and gcc/clang compilers ($<$~gcc 6.0) is not a goal of this work.
     13This thesis analyses multiple scheduler systems, where each system attempts to fulfill the necessary requirements for \gls{uthrding}.
     14The predominant technique for managing high levels of concurrency is sharding the ready-queue with one queue per kernel-thread and using some form of work stealing/sharing to dynamically rebalance workload shifts.
     15Preventing kernel blocking is accomplish by transforming kernel locks and I/O operations into user-level operations that do not block the kernel thread or spin up new kernel threads to manage the blocking.
     16Fairness is handled through preemption and/or ad-hoc solutions, which leads to coarse-grained fairness with some pathological cases.
     17
     18After examining, testing and selecting specific approaches to these scheduling issues, a completely new scheduler was created and tested in the \CFA (C-for-all) user-threading runtime-system.
     19The goal of the new scheduler to offer increased safety and productivity without sacrificing performance.
     20The quality of the new scheduler is demonstrated by comparing it with other user-threading work-stealing schedulers with the aim of showing equivalent or better performance while offering better fairness.
     21
     22Chapter~\ref{intro} defines scheduling and its general goals.
     23Chapter~\ref{existing} discusses how scheduler implementations attempt to achieve these goals, but all implementations optimize some workloads better than others.
     24Chapter~\ref{s:CFARuntime} presents the relevant aspects of the \CFA runtime system that have a significant affect on the new scheduler design and implementation.
     25Chapter~\ref{core} analyses different scheduler approaches, while looking for scheduler mechanisms that provide both performance and fairness.
     26Chapter~\ref{s:UserLevelIO} covers the complex mechanisms that must be used to achieve nonblocking I/O to prevent the blocking of \glspl{kthrd}.
     27Chapters~\ref{microbench} and~\ref{macrobench} present micro and macro benchmarks used to evaluate and compare the new scheduler with similar schedulers.
     28
    1329
    1430\section{Scheduling}
    15 Computer systems share multiple resources across many threads of execution, even on single user computers like laptops or smartphones.
    16 On a computer system with multiple processors and work units, there exists the problem of mapping work onto processors in an efficient manner, called \newterm{scheduling}.
    17 These systems are normally \newterm{open}, meaning new work arrives from an external source or is spawned from an existing work unit.
    18 On a computer system, the scheduler takes a sequence of work requests in the form of threads and attempts to complete the work, subject to performance objectives, such as resource utilization.
    19 A general-purpose dynamic-scheduler for an open system cannot anticipate future work requests, so its performance is rarely optimal.
    20 With complete knowledge of arrive order and work, creating an optimal solution still effectively needs solving the bin packing problem\cite{wiki:binpak}.
    21 However, optimal solutions are often not required.
    22 Schedulers do produce excellent solutions, whitout needing optimality, by taking advantage of regularities in work patterns.
     31Computer systems share multiple resources across many threads of execution, even on single-user computers like laptops or smartphones.
     32On a computer system with multiple processors and work units (routines, coroutines, threads, programs, \etc), there exists the problem of mapping many different kinds of work units onto many different kinds of processors in an efficient manner, called \newterm{scheduling}.
     33Scheduling systems are normally \newterm{open}, meaning new work arrives from an external source or is randomly spawned from an existing work unit.
     34In general, work units without threads, like routines and coroutines, are self-scheduling, while work units with threads, like tasks and programs, are scheduled.
     35For scheduled work-units, a scheduler takes a sequence of threads and attempts to run them to completion, subject to shared resource restrictions and utilization.
     36A general-purpose dynamic-scheduler for an open system cannot anticipate (future) work requests, so its performance is rarely optimal.
     37Even with complete knowledge of arrive order and work, creating an optimal solution is a bin packing problem~\cite{wiki:binpak}.
     38However, optimal solutions are often not required: schedulers often produce excellent solutions, without needing optimality, by taking advantage of regularities in work patterns.
    2339
    2440Scheduling occurs at discreet points when there are transitions in a system.
     
    2743\input{executionStates.pstex_t}
    2844\end{center}
    29 These \newterm{state transition}s are initiated in response to events (\Index{interrupt}s):
     45These \newterm{state transition}s are initiated in response to events, \eg blocking, interrupts, errors:
    3046\begin{itemize}
    3147\item
    3248entering the system (new $\rightarrow$ ready)
     49\item
     50scheduler assigns a thread to a computing resource, \eg CPU (ready $\rightarrow$ running)
    3351\item
    3452timer alarm for preemption (running $\rightarrow$ ready)
     
    3654long term delay versus spinning (running $\rightarrow$ blocked)
    3755\item
    38 blocking ends, \ie network or I/O completion (blocked $\rightarrow$ ready)
     56completion of delay, \eg network or I/O completion (blocked $\rightarrow$ ready)
    3957\item
    40 normal completion or error, \ie segment fault (running $\rightarrow$ halted)
    41 \item
    42 scheduler assigns a thread to a resource (ready $\rightarrow$ running)
     58normal completion or error, \eg segment fault (running $\rightarrow$ halted)
    4359\end{itemize}
    44 Key to scheduling is that a thread cannot bypass the ``ready'' state during a transition so the scheduler maintains complete control of the system.
     60Key to scheduling is that a thread cannot bypass the ``ready'' state during a transition so the scheduler maintains complete control of the system, \ie no self-scheduling among threads.
    4561
    4662When the workload exceeds the capacity of the processors, \ie work cannot be executed immediately, it is placed on a queue for subsequent service, called a \newterm{ready queue}.
    4763Ready queues organize threads for scheduling, which indirectly organizes the work to be performed.
    48 The structure of ready queues can take many different forms.
    49 Where simple examples include single-queue multi-server (SQMS) and the multi-queue multi-server (MQMS).
     64The structure of ready queues can take many different forms, where the basic two are the single-queue multi-server (SQMS) and the multi-queue multi-server (MQMS).
    5065\begin{center}
    5166\begin{tabular}{l|l}
     
    5772Beyond these two schedulers are a host of options, \ie adding an optional global, shared queue to MQMS.
    5873
    59 The three major optimization criteria for a scheduler are:
     74Once there are multiple resources and ready queues, a scheduler is faced with three major optimization criteria:
    6075\begin{enumerate}[leftmargin=*]
    6176\item
     
    7085Essentially, all multi-processor computers have non-uniform memory access (NUMA), with one or more quantized steps to access data at different levels in the memory hierarchy.
    7186When a system has a large number of independently executing threads, affinity becomes difficult because of \newterm{thread churn}.
    72 That is, threads must be scheduled on multiple processors to obtain high processors utilization because the number of threads $\ggg$ processors.
     87That is, threads must be scheduled on different processors to obtain high processors utilization because the number of threads $\ggg$ processors.
    7388
    7489\item
    75 \newterm{contention}: safe access of shared objects by multiple processors requires mutual exclusion in some form, generally locking\footnote{
    76 Lock-free data-structures do not involve locking but incurr similar costs to achieve mutual exclusion.}
     90\newterm{contention}: safe access of shared objects by multiple processors requires mutual exclusion in some form, generally locking.\footnote{
     91Lock-free data-structures do not involve locking but incur similar costs to achieve mutual exclusion.}
    7792
    7893\noindent
    79 Mutual exclusion cost and latency increases significantly with the number of processors accessing a shared object.
     94Mutual exclusion cost and latency increases significantly with the number of processors access\-ing a shared object.
    8095\end{enumerate}
    8196
    82 Nevertheless, schedulers are a series of compromises, occasionally with some static or dynamic tuning parameters to enhance specific patterns.
    83 Scheduling is a zero-sum game as computer processors normally have a fixed, maximum number of cycles per unit time\footnote{Frequency scaling and turbot boost add a degree of complexity that can be ignored in this discussion without loss of generality.}.
    84 SQMS has perfect load-balancing but poor affinity and high contention by the processors, because of the single queue.
    85 MQMS has poor load-balancing but perfect affinity and no contention, because each processor has its own queue.
     97Scheduling is a zero-sum game as computer processors normally have a fixed, maximum number of cycles per unit time.\footnote{
     98Frequency scaling and turbo-boost add a degree of complexity that can be ignored in this discussion without loss of generality.}
     99Hence, schedulers are a series of compromises, occasionally with some static or dynamic tuning parameters to enhance specific workload patterns.
     100For example, SQMS has perfect load-balancing but poor affinity and high contention by the processors, because of the single queue.
     101While MQMS has poor load-balancing but perfect affinity and no contention, because each processor has its own queue.
    86102
    87 Significant research effort has also looked at load sharing/stealing among queues, when a ready queue is too long or short, respectively.
     103Significant research effort has looked at load balancing by stealing/sharing work units among queues: when a ready queue is too short or long, respectively, load stealing/sharing schedulers attempt to push/pull work units to/from other ready queues.
    88104These approaches attempt to perform better load-balancing at the cost of affinity and contention.
    89 Load sharing/stealing schedulers attempt to push/pull work units to/from other ready queues
     105However, \emph{all} approaches come at a cost (zero-sum game), but not all compromises are necessarily equivalent, especially across workloads.
     106Hence, some schedulers perform very well for specific workloads, while others offer acceptable performance over a wider range of workloads.
    90107
    91 Note however that while any change comes at a cost, hence the zero-sum game, not all compromises are necessarily equivalent.
    92 Some schedulers can perform very well only in very specific workload scenarios, others might offer acceptable performance but be applicable to a wider range of workloads.
    93 Since \CFA attempts to improve the safety and productivity of C, the scheduler presented in this thesis attempts to achieve the same goals.
     108\section{\CFA programming language}
     109
     110The \CFA programming language~\cite{Cforall,Moss18} extends the C programming language by adding modern safety and productivity features, while maintaining backwards compatibility.
     111Among its productivity features, \CFA supports \gls{uthrding}~\cite{Delisle21} as its fundamental threading model allowing programmers to easily write modern concurrent and parallel programs.
     112My previous master's thesis on concurrency in \CFA focused on features and interfaces~\cite{Delisle18}.
     113This Ph.D.\ thesis focuses on performance, introducing \glsxtrshort{api} changes only when required by performance considerations.
     114Specifically, this work concentrates on advanced thread and \glsxtrshort{io} scheduling.
     115Prior to this work, the \CFA runtime used a strict SQMS \gls{rQ} and provided no nonblocking \glsxtrshort{io} capabilities at the user-thread level.\footnote{
     116C/\CC only support \glsxtrshort{io} capabilities at the kernel level, which means many \io operations block \glspl{kthrd} reducing parallelism at the user level.}
     117
     118Since \CFA attempts to improve the safety and productivity of C, the new scheduler presented in this thesis attempts to achieve the same goals.
    94119More specifically, safety and productivity for scheduling means supporting a wide range of workloads so that programmers can rely on progress guarantees (safety) and more easily achieve acceptable performance (productivity).
     120The new scheduler also includes support for implicit nonblocking \io, allowing applications to have more user-threads blocking on \io operations than there are \glspl{kthrd}.
     121To complete the scheduler, an idle sleep mechanism is implemented that significantly reduces wasted CPU cycles, which are then available outside of the application.
    95122
     123As a research project, this work builds exclusively on newer versions of the Linux operating-system and gcc/clang compilers.
     124The new scheduler implementation uses several optimizations to successfully balance the cost of fairness against performance;
     125some of these optimizations rely on interesting hardware optimizations only present on modern CPUs.
     126The \io implementation is based on the @io_uring@ kernel-interface, a recent addition to the Linux kernel, because it purports to handle nonblocking \emph{file} and network \io.
     127This decision allowed an interesting performance and fairness comparison with other threading systems using @select@, @epoll@, \etc.
     128While the current \CFA release supports older versions of Linux ($\ge$~Ubuntu 16.04) and gcc/clang compilers ($\ge$~gcc 6.0), it is not the purpose of this project to find workarounds in these older systems to provide backwards compatibility.
     129The hope is that these new features will soon become mainstream features.
    96130
    97131\section{Contributions}\label{s:Contributions}
    98 This work provides the following contributions in the area of user-level scheduling in an advanced programming-language runtime-system:
     132This work provides the following scheduling contributions for advanced \gls{uthrding} runtime-systems:
    99133\begin{enumerate}[leftmargin=*]
    100134\item
    101135A scalable scheduling algorithm that offers progress guarantees.
    102136\item
     137Support for user-level \glsxtrshort{io} capabilities based on Linux's @io_uring@.
     138\item
    103139An algorithm for load-balancing and idle sleep of processors, including NUMA awareness.
    104140\item
    105 Support for user-level \glsxtrshort{io} capabilities based on Linux's @io_uring@.
     141A mechanism for adding fairness on top of work-stealing through helping (used both for the I/O and ready-queue).
     142\item
     143An optimization of the work-stealing helping-mechanism for load balancing to reduce scheduling costs.
     144\item
     145An optimization for the alternative relaxed-list for load balancing to reduce scheduling costs in embarrassingly parallel cases.
    106146\end{enumerate}
  • doc/theses/thierry_delisle_PhD/thesis/text/io.tex

    r8040286 r62c5a55  
    1 \chapter{User Level \io}
     1\chapter{User Level \io}\label{s:UserLevelIO}
    22As mentioned in Section~\ref{prev:io}, user-level \io requires multiplexing the \io operations of many \glspl{thrd} onto fewer \glspl{proc} using asynchronous \io operations.
    33Different operating systems offer various forms of asynchronous operations and, as mentioned in Chapter~\ref{intro}, this work is exclusively focused on the Linux operating-system.
  • doc/theses/thierry_delisle_PhD/thesis/text/runtime.tex

    r8040286 r62c5a55  
    1 \chapter{\CFA Runtime}
     1\chapter{\CFA Runtime}\label{s:CFARuntime}
    22This chapter presents an overview of the capabilities of the \CFA runtime prior to this thesis work.
    33
Note: See TracChangeset for help on using the changeset viewer.