source: doc/theses/thierry_delisle_PhD/thesis/text/runtime.tex @ 1d5e5601

Last change on this file since 1d5e5601 was ddcaff6, checked in by Thierry Delisle <tdelisle@…>, 20 months ago

Last corrections to my thesis... hopefully

  • Property mode set to 100644
File size: 8.4 KB
1\chapter{\CFA Runtime}\label{cfaruntime}
2This chapter presents an overview of the capabilities of the \CFA runtime prior to this thesis work.
4\section{C Threading}
6\Celeven introduced threading features, such as the @_Thread_local@ storage class, and libraries @stdatomic.h@ and @threads.h@.
7Interestingly, almost a decade after the \Celeven standard, the most recent versions of gcc, clang, and msvc do not support the \Celeven include @threads.h@, indicating no interest in the C11 concurrency approach (possibly because of the recent effort to add concurrency to \CC).
8While the \Celeven standard does not state a threading model, the historical association with pthreads suggests implementations would adopt kernel-level threading (1:1)~\cite{ThreadModel}, as \CC does.
9This model uses \glspl{kthrd} to achieve parallelism and concurrency. In this model, every thread of computation maps to an object in the kernel.
10The kernel then has the responsibility of managing these threads, \eg creating, scheduling, blocking.
11A consequence of this approach is that the kernel has a perfect view of every thread executing in the system\footnote{This is not completely true due to primitives like \lstinline|futex|es, which have a significant portion of their logic in user space.}.
13\section{M:N Threading}\label{prev:model}
15Threading in \CFA is based on \Gls{uthrding}, where \ats are the representation of a unit of work. As such, \CFA programmers should expect these units to be fairly inexpensive, \ie programmers should be able to create a large number of \ats and switch among \ats liberally without many performance concerns.
17The \CFA M:N threading model is implemented using many user-level threads mapped onto fewer \glspl{kthrd}.
18The user-level threads have the same semantic meaning as a \glspl{kthrd} in the 1:1 model: they represent an independent thread of execution with its own stack.
19The difference is that user-level threads do not have a corresponding object in the kernel; they are handled by the runtime in user space and scheduled onto \glspl{kthrd}, referred to as \glspl{proc} in this document. \Glspl{proc} run a \at until it context switches out, it then chooses a different \at to run.
22\CFA allows the option to group user-level threading, in the form of clusters.
23Both \ats and \glspl{proc} belong to a specific cluster.
24\Glspl{at} are only scheduled onto \glspl{proc} in the same cluster and scheduling is done independently of other clusters.
25Figure~\ref{fig:system} shows an overview of the \CFA runtime, which allows programmers to tightly control parallelism.
26It also opens the door to handling effects like NUMA, by pinning clusters to a specific NUMA node\footnote{This capability is not currently implemented in \CFA, but the only hurdle left is creating a generic interface for CPU masks.}.
29        \begin{center}
30                \input{system.pstex_t}
31        \end{center}
32        \caption[Overview of the \CFA runtime]{Overview of the \CFA runtime \newline \Glspl{at} are scheduled inside a particular cluster and run on the \glspl{proc} that belong to the cluster. The discrete-event manager, which handles preemption and timeout, is a \gls{proc} that lives outside any cluster and does not run \ats.}
33        \label{fig:system}
37The \CFA runtime previously used a \glsxtrshort{fifo} ready-queue with a single lock. This setup offers perfect fairness in terms of opportunities to run. However, it offers poor scalability, since the performance of the ready queue can never be improved by adding more \glspl{hthrd} and contention can cause significant performance degradation.
40Prior to this work, the \CFA runtime did not have any particular support for \glsxtrshort{io} operations. While all \glsxtrshort{io} operations available in C are available in \CFA, \glsxtrshort{io} operations are designed for the POSIX threading model~\cite{pthreads}. Using these 1:1 threading operations in an M:N threading model means \glsxtrshort{io} operations block \glspl{proc} instead of \ats. While this can work in certain cases, it limits the number of concurrent operations to the number of \glspl{proc} rather than \ats. It also means deadlock can occur because all \glspl{proc} are blocked even if at least one \at is ready to run. A simple example of this type of deadlock would be as follows:
43Given a simple network program with 2 \ats and a single \gls{proc}, one \at sends network requests to a server and the other \at waits for a response from the server.
44If the second \at races ahead, it may wait for responses to requests that have not been sent yet.
45In theory, this should not be a problem, even if the second \at waits, because the first \at is still ready to run and should be able to get CPU time to send the request.
46With M:N threading, while the first \at is ready, the lone \gls{proc} \emph{cannot} run the first \at if it is blocked in the \glsxtrshort{io} operation of the second \at.
47If this happens, the system is in a synchronization deadlock\footnote{In this example, the deadlock could be resolved if the server sends unprompted messages to the client.
48However, this solution is neither general nor appropriate even in this simple case.}.
51Therefore, one of the objectives of this work is to introduce \emph{User-Level \glsxtrshort{io}}, which, like \glslink{uthrding}{User-Level \emph{Threading}}, blocks \ats rather than \glspl{proc} when doing \glsxtrshort{io} operations.
52This feature entails multiplexing the \glsxtrshort{io} operations of many \ats onto fewer \glspl{proc}.
53The multiplexing requires a single \gls{proc} to execute multiple \glsxtrshort{io} operations in parallel.
54This requirement cannot be done with operations that block \glspl{proc}, \ie \glspl{kthrd}, since the first operation would prevent starting new operations for its blocking duration.
55Executing \glsxtrshort{io} operations in parallel requires \emph{asynchronous} \glsxtrshort{io}, sometimes referred to as \emph{non-blocking}, since the \gls{kthrd} does not block.
57\section{Interoperating with C}
58While \glsxtrshort{io} operations are the classical example of operations that block \glspl{kthrd}, the non-blocking challenge extends to all blocking system-calls.
59The POSIX standard states~\cite[\S~2.9.1]{POSIX17}:
61All functions defined by this volume of POSIX.1-2017 shall be thread-safe, except that the following functions need not be thread-safe. ... (list of 70+ excluded functions)
63Only UNIX @man@ pages identify whether a library function is thread-safe, and hence, may block on a pthreads lock or system call; hence, interoperability with UNIX library functions is a challenge for an M:N threading model.
65Languages like Go and Java, which have strict interoperability with C\cite{wiki:jni,go:cgo}, can control operations in C by ``sandboxing'' them, \eg a blocking function may be delegated to a \gls{kthrd}.
66Sandboxing may help towards guaranteeing that the kind of deadlock mentioned above does not occur.
68As mentioned in Section~\ref{intro}, \CFA is binary compatible with C and, as such, must support all C library functions.
69Furthermore, interoperability can happen at the function-call level, inline code, or C and \CFA translation units linked together.
70This fine-grained interoperability between C and \CFA has two consequences:
72        \item Precisely identifying blocking C calls is difficult.
73        \item Introducing safe-point code (see Go~page~\pageref{GoSafePoint}) can have a significant impact on general performance.
75Because of these consequences, this work does not attempt to ``sandbox'' calls to C.
76Therefore, it is possible calls to an unknown library function can block a \gls{kthrd} leading to deadlocks in \CFA's M:N threading model, which would not occur in a traditional 1:1 threading model.
77Since the blocking call is not known to the runtime, it is not necessarily possible to distinguish whether or not a deadlock occurs.
78Currently, all M:N thread systems interacting with UNIX without sandboxing suffer from this problem but manage to work very well in the majority of applications.
79Therefore, a complete solution to this problem is outside the scope of this thesis.
80\footnote{\CFA does provide a pthreads emulation, so any library function using embedded pthreads locks is redirected to \CFA user-level locks.
81This capability further reduces the chances of blocking a \gls{kthrd}.}
82Chapter~\ref{userio} discusses the interoperability with C chosen and used for the evaluation in Chapter~\ref{macrobench}.
Note: See TracBrowser for help on using the repository browser.