Ignore:
Timestamp:
Jun 22, 2018, 1:51:56 PM (4 years ago)
Author:
Peter A. Buhr <pabuhr@…>
Branches:
aaron-thesis, arm-eh, cleanup-dtors, deferred_resn, demangler, enum, forall-pointer-decay, jacob/cs343-translation, jenkins-sandbox, master, new-ast, new-ast-unique-expr, new-env, no_list, persistent-indexer, with_gc
Children:
3d56d15b
Parents:
999c700
Message:

more updates

File:
1 edited

Legend:

Unmodified
Added
Removed
  • doc/papers/concurrency/Paper.tex

    r999c700 r6d43cc57  
    15371537External scheduling allows users to wait for events from other threads without concern of unrelated events occurring.
    15381538The mechnaism can be done in terms of control flow, \eg Ada @accept@ or \uC @_Accept@, or in terms of data, \eg Go channels.
    1539 Of course, both of these paradigms have their own strengths and weaknesses, but for this project, control-flow semantics was chosen to stay consistent with the rest of the languages semantics.
    1540 Two challenges specific to \CFA arise when trying to add external scheduling with loose object definitions and multiple-monitor routines.
    1541 The previous example shows a simple use @_Accept@ versus @wait@/@signal@ and its advantages.
    1542 Note that while other languages often use @accept@/@select@ as the core external scheduling keyword, \CFA uses @waitfor@ to prevent name collisions with existing socket \textbf{api}s.
     1539While both mechanisms have strengths and weaknesses, this project uses a control-flow mechanism to stay consistent with other language semantics.
     1540Two challenges specific to \CFA for external scheduling are loose object-definitions (see Section~\ref{s:LooseObjectDefinitions}) and multiple-monitor routines (see Section~\ref{s:Multi-MonitorScheduling}).
    15431541
    15441542For internal scheduling, non-blocking signalling (as in the producer/consumer example) is used when the signaller is providing the cooperation for a waiting thread;
     
    16541652\end{cfa}
    16551653must have acquired monitor locks that are greater than or equal to the number of locks for the waiting thread signalled from the condition queue.
    1656 {\color{red}In general, the signaller does not know the order of waiting threads, so in general, it must acquire the maximum number of mutex locks for the worst-case waiting thread.}
    16571654
    16581655Similarly, for @waitfor( rtn )@, the default semantics is to atomically block the acceptor and release all acquired mutex types in the parameter list, \ie @waitfor( rtn, m1, m2 )@.
     
    18541851The extra challenge is that this dependency graph is effectively post-mortem, but the runtime system needs to be able to build and solve these graphs as the dependencies unfold.
    18551852Resolving dependency graphs being a complex and expensive endeavour, this solution is not the preferred one.
    1856 
    1857 \subsubsection{Partial Signalling} \label{partial-sig}
    18581853\end{comment}
    18591854
     
    19321927
    19331928\subsection{Loose Object Definitions}
    1934 
    1935 In \uC, a monitor class declaration includes an exhaustive list of monitor operations.
    1936 Since \CFA is not object oriented, monitors become both more difficult to implement and less clear for a user:
    1937 
    1938 \begin{cfa}
    1939 monitor A {};
    1940 
    1941 void f(A & mutex a);
    1942 void g(A & mutex a) {
    1943         waitfor(f); // Obvious which f() to wait for
    1944 }
    1945 
    1946 void f(A & mutex a, int); // New different F added in scope
    1947 void h(A & mutex a) {
    1948         waitfor(f); // Less obvious which f() to wait for
    1949 }
    1950 \end{cfa}
    1951 
    1952 Furthermore, external scheduling is an example where implementation constraints become visible from the interface.
    1953 Here is the cfa-code for the entering phase of a monitor:
    1954 \begin{center}
    1955 \begin{tabular}{l}
    1956 \begin{cfa}
    1957         if monitor is free
    1958                 enter
    1959         elif already own the monitor
    1960                 continue
    1961         elif monitor accepts me
    1962                 enter
    1963         else
    1964                 block
    1965 \end{cfa}
    1966 \end{tabular}
    1967 \end{center}
     1929\label{s:LooseObjectDefinitions}
     1930
     1931In an object-oriented programming-language, a class includes an exhaustive list of operations.
     1932However, new members can be added via static inheritance or dynaic members, \eg JavaScript~\cite{JavaScript}.
     1933Similarly, monitor routines can be added at any time in \CFA, making it less clear for programmers and more difficult to implement.
     1934\begin{cfa}
     1935monitor M {};
     1936void `f`( M & mutex m );
     1937void g( M & mutex m ) { waitfor( `f` ); }       $\C{// clear which f}$
     1938void `f`( M & mutex m, int );                           $\C{// different f}$
     1939void h( M & mutex m ) { waitfor( `f` ); }       $\C{// unclear which f}$
     1940\end{cfa}
     1941Hence, the cfa-code for the entering a monitor looks like:
     1942\begin{cfa}
     1943if ( $\textrm{\textit{monitor is free}}$ ) $\LstCommentStyle{// \color{red}enter}$
     1944else if ( $\textrm{\textit{already own monitor}}$ ) $\LstCommentStyle{// \color{red}continue}$
     1945else if ( $\textrm{\textit{monitor accepts me}}$ ) $\LstCommentStyle{// \color{red}enter}$
     1946else $\LstCommentStyle{// \color{red}block}$
     1947\end{cfa}
    19681948For the first two conditions, it is easy to implement a check that can evaluate the condition in a few instructions.
    1969 However, a fast check for @monitor accepts me@ is much harder to implement depending on the constraints put on the monitors.
    1970 Indeed, monitors are often expressed as an entry queue and some acceptor queue as in Figure~\ref{fig:ClassicalMonitor}.
     1949However, a fast check for \emph{monitor accepts me} is much harder to implement depending on the constraints put on the monitors.
     1950Figure~\ref{fig:ClassicalMonitor} shows monitors are often expressed as an entry (calling) queue, some acceptor queues, and an urgent stack/queue.
    19711951
    19721952\begin{figure}
    19731953\centering
    1974 \subfloat[Classical Monitor] {
     1954\subfloat[Classical monitor] {
    19751955\label{fig:ClassicalMonitor}
    1976 {\resizebox{0.45\textwidth}{!}{\input{monitor}}}
     1956{\resizebox{0.45\textwidth}{!}{\input{monitor.pstex_t}}}
    19771957}% subfloat
    1978 \qquad
    1979 \subfloat[bulk acquire Monitor] {
     1958\quad
     1959\subfloat[Bulk acquire monitor] {
    19801960\label{fig:BulkMonitor}
    1981 {\resizebox{0.45\textwidth}{!}{\input{ext_monitor}}}
     1961{\resizebox{0.45\textwidth}{!}{\input{ext_monitor.pstex_t}}}
    19821962}% subfloat
    1983 \caption{External Scheduling Monitor}
     1963\caption{Monitor Implementation}
     1964\label{f:MonitorImplementation}
    19841965\end{figure}
    19851966
    1986 There are other alternatives to these pictures, but in the case of the left picture, implementing a fast accept check is relatively easy.
    1987 Restricted to a fixed number of mutex members, N, the accept check reduces to updating a bitmask when the acceptor queue changes, a check that executes in a single instruction even with a fairly large number (\eg 128) of mutex members.
    1988 This approach requires a unique dense ordering of routines with an upper-bound and that ordering must be consistent across translation units.
    1989 For OO languages these constraints are common, since objects only offer adding member routines consistently across translation units via inheritance.
    1990 However, in \CFA users can extend objects with mutex routines that are only visible in certain translation unit.
    1991 This means that establishing a program-wide dense-ordering among mutex routines can only be done in the program linking phase, and still could have issues when using dynamically shared objects.
    1992 
    1993 The alternative is to alter the implementation as in Figure~\ref{fig:BulkMonitor}.
    1994 Here, the mutex routine called is associated with a thread on the entry queue while a list of acceptable routines is kept separate.
    1995 Generating a mask dynamically means that the storage for the mask information can vary between calls to @waitfor@, allowing for more flexibility and extensions.
    1996 Storing an array of accepted routine pointers replaces the single instruction bitmask comparison with dereferencing a pointer followed by a linear search.
    1997 Furthermore, supporting nested external scheduling (\eg listing \ref{f:nest-ext}) may now require additional searches for the @waitfor@ statement to check if a routine is already queued.
    1998 
     1967For a fixed (small) number of mutex routines (\eg 128), the accept check reduces to a bitmask of allowed callers, which can be checked with a single instruction.
     1968This approach requires a unique dense ordering of routines with a small upper-bound and the ordering must be consistent across translation units.
     1969For object-oriented languages these constraints are common, but \CFA mutex routines can be added in any scope and are only visible in certain translation unit, precluding program-wide dense-ordering among mutex routines.
     1970
     1971Figure~\ref{fig:BulkMonitor} shows the \CFA monitor implementation.
     1972The mutex routine called is associated with each thread on the entry queue, while a list of acceptable routines is kept separately.
     1973The accepted list is a variable-sized array of accepted routine pointers, so the single instruction bitmask comparison is replaced by dereferencing a pointer followed by a linear search.
     1974
     1975\begin{comment}
    19991976\begin{figure}
    20001977\begin{cfa}[caption={Example of nested external scheduling},label={f:nest-ext}]
     
    20201997In the end, the most flexible approach has been chosen since it allows users to write programs that would otherwise be  hard to write.
    20211998This decision is based on the assumption that writing fast but inflexible locks is closer to a solved problem than writing locks that are as flexible as external scheduling in \CFA.
    2022 
    2023 % ======================================================================
    2024 % ======================================================================
     1999\end{comment}
     2000
     2001
    20252002\subsection{Multi-Monitor Scheduling}
    2026 % ======================================================================
    2027 % ======================================================================
     2003\label{s:Multi-MonitorScheduling}
    20282004
    20292005External scheduling, like internal scheduling, becomes significantly more complex when introducing multi-monitor syntax.
    2030 Even in the simplest possible case, some new semantics needs to be established:
     2006Even in the simplest possible case, new semantics needs to be established:
    20312007\begin{cfa}
    20322008monitor M {};
    2033 
    2034 void f(M & mutex a);
    2035 
    2036 void g(M & mutex b, M & mutex c) {
    2037         waitfor(f); // two monitors M => unknown which to pass to f(M & mutex)
    2038 }
    2039 \end{cfa}
    2040 The obvious solution is to specify the correct monitor as follows:
    2041 
     2009void f( M & mutex m1 );
     2010void g( M & mutex m1, M & mutex m2 ) {
     2011        waitfor( f );                                                   $\C{// pass m1 or m2 to f?}$
     2012}
     2013\end{cfa}
     2014The solution is for the programmer to disambiguate:
     2015\begin{cfa}
     2016        waitfor( f, m2 );                                               $\C{// wait for call to f with argument m2}$
     2017\end{cfa}
     2018Routine @g@ has acquired both locks, so when routine @f@ is called, the lock for monitor @m2@ is passed from @g@ to @f@ (while @g@ still holds lock @m1@).
     2019This behaviour can be extended to the multi-monitor @waitfor@ statement.
    20422020\begin{cfa}
    20432021monitor M {};
    2044 
    2045 void f(M & mutex a);
    2046 
    2047 void g(M & mutex a, M & mutex b) {
    2048         // wait for call to f with argument b
    2049         waitfor(f, b);
    2050 }
    2051 \end{cfa}
    2052 This syntax is unambiguous.
    2053 Both locks are acquired and kept by @g@.
    2054 When routine @f@ is called, the lock for monitor @b@ is temporarily transferred from @g@ to @f@ (while @g@ still holds lock @a@).
    2055 This behaviour can be extended to the multi-monitor @waitfor@ statement as follows.
    2056 
    2057 \begin{cfa}
    2058 monitor M {};
    2059 
    2060 void f(M & mutex a, M & mutex b);
    2061 
    2062 void g(M & mutex a, M & mutex b) {
    2063         // wait for call to f with arguments a and b
    2064         waitfor(f, a, b);
    2065 }
    2066 \end{cfa}
    2067 
    2068 Note that the set of monitors passed to the @waitfor@ statement must be entirely contained in the set of monitors already acquired in the routine. @waitfor@ used in any other context is undefined behaviour.
     2022void f( M & mutex m1, M & mutex m2 );
     2023void g( M & mutex m1, M & mutex m2 ) {
     2024        waitfor( f, m1, m2 );                                   $\C{// wait for call to f with arguments m1 and m2}$
     2025}
     2026\end{cfa}
     2027Again, the set of monitors passed to the @waitfor@ statement must be entirely contained in the set of monitors already acquired by accepting routine.
    20692028
    20702029An important behaviour to note is when a set of monitors only match partially:
    2071 
    20722030\begin{cfa}
    20732031mutex struct A {};
    2074 
    20752032mutex struct B {};
    2076 
    2077 void g(A & mutex a, B & mutex b) {
    2078         waitfor(f, a, b);
    2079 }
    2080 
     2033void g( A & mutex m1, B & mutex m2 ) {
     2034        waitfor( f, m1, m2 );
     2035}
    20812036A a1, a2;
    20822037B b;
    2083 
    20842038void foo() {
    2085         g(a1, b); // block on accept
    2086 }
    2087 
     2039        g( a1, b ); // block on accept
     2040}
    20882041void bar() {
    2089         f(a2, b); // fulfill cooperation
     2042        f( a2, b ); // fulfill cooperation
    20902043}
    20912044\end{cfa}
     
    20942047It is also important to note that in the case of external scheduling the order of parameters is irrelevant; @waitfor(f,a,b)@ and @waitfor(f,b,a)@ are indistinguishable waiting condition.
    20952048
    2096 % ======================================================================
    2097 % ======================================================================
     2049
    20982050\subsection{\protect\lstinline|waitfor| Semantics}
    2099 % ======================================================================
    2100 % ======================================================================
    21012051
    21022052Syntactically, the @waitfor@ statement takes a routine identifier and a set of monitors.
     
    21972147\end{figure}
    21982148
    2199 % ======================================================================
    2200 % ======================================================================
     2149
    22012150\subsection{Waiting For The Destructor}
    2202 % ======================================================================
    2203 % ======================================================================
     2151
    22042152An interesting use for the @waitfor@ statement is destructor semantics.
    22052153Indeed, the @waitfor@ statement can accept any @mutex@ routine, which includes the destructor (see section \ref{data}).
     
    22282176
    22292177
    2230 % ######     #    ######     #    #       #       ####### #       ###  #####  #     #
    2231 % #     #   # #   #     #   # #   #       #       #       #        #  #     # ##   ##
    2232 % #     #  #   #  #     #  #   #  #       #       #       #        #  #       # # # #
    2233 % ######  #     # ######  #     # #       #       #####   #        #   #####  #  #  #
    2234 % #       ####### #   #   ####### #       #       #       #        #        # #     #
    2235 % #       #     # #    #  #     # #       #       #       #        #  #     # #     #
    2236 % #       #     # #     # #     # ####### ####### ####### ####### ###  #####  #     #
    22372178\section{Parallelism}
     2179
    22382180Historically, computer performance was about processor speeds and instruction counts.
    22392181However, with heat dissipation being a direct consequence of speed increase, parallelism has become the new source for increased performance~\cite{Sutter05, Sutter05b}.
     
    22452187While there are many variations of the presented paradigms, most of these variations do not actually change the guarantees or the semantics, they simply move costs in order to achieve better performance for certain workloads.
    22462188
     2189
    22472190\section{Paradigms}
     2191
     2192
    22482193\subsection{User-Level Threads}
     2194
    22492195A direct improvement on the \textbf{kthread} approach is to use \textbf{uthread}.
    22502196These threads offer most of the same features that the operating system already provides but can be used on a much larger scale.
     
    22552201Examples of languages that support \textbf{uthread} are Erlang~\cite{Erlang} and \uC~\cite{uC++book}.
    22562202
     2203
    22572204\subsection{Fibers : User-Level Threads Without Preemption} \label{fibers}
     2205
    22582206A popular variant of \textbf{uthread} is what is often referred to as \textbf{fiber}.
    22592207However, \textbf{fiber} do not present meaningful semantic differences with \textbf{uthread}.
     
    22642212An example of a language that uses fibers is Go~\cite{Go}
    22652213
     2214
    22662215\subsection{Jobs and Thread Pools}
     2216
    22672217An approach on the opposite end of the spectrum is to base parallelism on \textbf{pool}.
    22682218Indeed, \textbf{pool} offer limited flexibility but at the benefit of a simpler user interface.
     
    22752225The gold standard of this implementation is Intel's TBB library~\cite{TBB}.
    22762226
     2227
    22772228\subsection{Paradigm Performance}
     2229
    22782230While the choice between the three paradigms listed above may have significant performance implications, it is difficult to pin down the performance implications of choosing a model at the language level.
    22792231Indeed, in many situations one of these paradigms may show better performance but it all strongly depends on the workload.
     
    22832235Finally, if the units of uninterrupted work are large, enough the paradigm choice is largely amortized by the actual work done.
    22842236
     2237
    22852238\section{The \protect\CFA\ Kernel : Processors, Clusters and Threads}\label{kernel}
     2239
    22862240A \textbf{cfacluster} is a group of \textbf{kthread} executed in isolation. \textbf{uthread} are scheduled on the \textbf{kthread} of a given \textbf{cfacluster}, allowing organization between \textbf{uthread} and \textbf{kthread}.
    22872241It is important that \textbf{kthread} belonging to a same \textbf{cfacluster} have homogeneous settings, otherwise migrating a \textbf{uthread} from one \textbf{kthread} to the other can cause issues.
     
    22912245Currently \CFA only supports one \textbf{cfacluster}, the initial one.
    22922246
     2247
    22932248\subsection{Future Work: Machine Setup}\label{machine}
     2249
    22942250While this was not done in the context of this paper, another important aspect of clusters is affinity.
    22952251While many common desktop and laptop PCs have homogeneous CPUs, other devices often have more heterogeneous setups.
     
    22972253OS support for CPU affinity is now common~\cite{affinityLinux, affinityWindows, affinityFreebsd, affinityNetbsd, affinityMacosx}, which means it is both possible and desirable for \CFA to offer an abstraction mechanism for portable CPU affinity.
    22982254
     2255
    22992256\subsection{Paradigms}\label{cfaparadigms}
     2257
    23002258Given these building blocks, it is possible to reproduce all three of the popular paradigms.
    23012259Indeed, \textbf{uthread} is the default paradigm in \CFA.
     
    23052263
    23062264
    2307 
    23082265\section{Behind the Scenes}
     2266
    23092267There are several challenges specific to \CFA when implementing concurrency.
    23102268These challenges are a direct result of bulk acquire and loose object definitions.
     
    23232281Note that since the major contributions of this paper are extending monitor semantics to bulk acquire and loose object definitions, any challenges that are not resulting of these characteristics of \CFA are considered as solved problems and therefore not discussed.
    23242282
    2325 % ======================================================================
    2326 % ======================================================================
     2283
    23272284\section{Mutex Routines}
    2328 % ======================================================================
    2329 % ======================================================================
    23302285
    23312286The first step towards the monitor implementation is simple @mutex@ routines.
     
    23622317\end{figure}
    23632318
     2319
    23642320\subsection{Details: Interaction with polymorphism}
     2321
    23652322Depending on the choice of semantics for when monitor locks are acquired, interaction between monitors and \CFA's concept of polymorphism can be more complex to support.
    23662323However, it is shown that entry-point locking solves most of the issues.
     
    24422399Furthermore, entry-point locking requires less code generation since any useful routine is called multiple times but there is only one entry point for many call sites.
    24432400
    2444 % ======================================================================
    2445 % ======================================================================
     2401
    24462402\section{Threading} \label{impl:thread}
    2447 % ======================================================================
    2448 % ======================================================================
    24492403
    24502404Figure \ref{fig:system1} shows a high-level picture if the \CFA runtime system in regards to concurrency.
     
    24592413\end{figure}
    24602414
     2415
    24612416\subsection{Processors}
     2417
    24622418Parallelism in \CFA is built around using processors to specify how much parallelism is desired. \CFA processors are object wrappers around kernel threads, specifically @pthread@s in the current implementation of \CFA.
    24632419Indeed, any parallelism must go through operating-system libraries.
     
    24672423Processors internally use coroutines to take advantage of the existing context-switching semantics.
    24682424
     2425
    24692426\subsection{Stack Management}
     2427
    24702428One of the challenges of this system is to reduce the footprint as much as possible.
    24712429Specifically, all @pthread@s created also have a stack created with them, which should be used as much as possible.
     
    24742432In order to respect C user expectations, the stack of the initial kernel thread, the main stack of the program, is used by the main user thread rather than the main processor, which can grow very large.
    24752433
     2434
    24762435\subsection{Context Switching}
     2436
    24772437As mentioned in section \ref{coroutine}, coroutines are a stepping stone for implementing threading, because they share the same mechanism for context-switching between different stacks.
    24782438To improve performance and simplicity, context-switching is implemented using the following assumption: all context-switches happen inside a specific routine call.
     
    24882448This option is not currently present in \CFA, but the changes required to add it are strictly additive.
    24892449
     2450
    24902451\subsection{Preemption} \label{preemption}
     2452
    24912453Finally, an important aspect for any complete threading system is preemption.
    24922454As mentioned in section \ref{basics}, preemption introduces an extra degree of uncertainty, which enables users to have multiple threads interleave transparently, rather than having to cooperate among threads for proper scheduling and CPU distribution.
     
    25222484Indeed, @sigwait@ can differentiate signals sent from @pthread_sigqueue@ from signals sent from alarms or the kernel.
    25232485
     2486
    25242487\subsection{Scheduler}
    25252488Finally, an aspect that was not mentioned yet is the scheduling algorithm.
     
    25272490Further discussion on scheduling is present in section \ref{futur:sched}.
    25282491
    2529 % ======================================================================
    2530 % ======================================================================
     2492
    25312493\section{Internal Scheduling} \label{impl:intsched}
    2532 % ======================================================================
    2533 % ======================================================================
     2494
    25342495The following figure is the traditional illustration of a monitor (repeated from page~\pageref{fig:ClassicalMonitor} for convenience):
    25352496
    25362497\begin{figure}
    25372498\begin{center}
    2538 {\resizebox{0.4\textwidth}{!}{\input{monitor}}}
     2499{\resizebox{0.4\textwidth}{!}{\input{monitor.pstex_t}}}
    25392500\end{center}
    25402501\caption{Traditional illustration of a monitor}
Note: See TracChangeset for help on using the changeset viewer.