Ignore:
File:
1 edited

Legend:

Unmodified
Added
Removed
  • doc/papers/concurrency/Paper.tex

    r6d43cc57 r251454a0  
    5656\newcommand{\Textbf}[2][red]{{\color{#1}{\textbf{#2}}}}
    5757\newcommand{\Emph}[2][red]{{\color{#1}\textbf{\emph{#2}}}}
     58\newcommand{\R}[1]{\Textbf{#1}}
     59\newcommand{\B}[1]{{\Textbf[blue]{#1}}}
     60\newcommand{\G}[1]{{\Textbf[OliveGreen]{#1}}}
    5861\newcommand{\uC}{$\mu$\CC}
    59 \newcommand{\TODO}[1]{{\Textbf{#1}}}
     62\newcommand{\cit}{\textsuperscript{[Citation Needed]}\xspace}
     63\newcommand{\TODO}{{\Textbf{TODO}}}
    6064
    6165%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
     
    254258\section{Introduction}
    255259
    256 This paper provides a minimal concurrency \newterm{Application Program Interface} (API) that is simple, efficient and can be used to build other concurrency features.
     260This paper provides a minimal concurrency \newterm{Abstract Program Interface} (API) that is simple, efficient and can be used to build other concurrency features.
    257261While the simplest concurrency system is a thread and a lock, this low-level approach is hard to master.
    258262An easier approach for programmers is to support higher-level constructs as the basis of concurrency.
     
    271275Hence, there are two problems to be solved: concurrency and parallelism.
    272276While these two concepts are often combined, they are distinct, requiring different tools~\cite[\S~2]{Buhr05a}.
    273 Concurrency tools handle mutual exclusion and synchronization, while parallelism tools handle performance, cost, and resource utilization.
     277Concurrency tools handle synchronization and mutual exclusion, while parallelism tools handle performance, cost and resource utilization.
    274278
    275279The proposed concurrency API is implemented in a dialect of C, called \CFA.
     
    282286Extended versions and explanation of the following code examples are available at the \CFA website~\cite{Cforall} or in Moss~\etal~\cite{Moss18}.
    283287
    284 \CFA is a non-object-oriented extension of ISO-C, and hence, supports all C paradigms.
     288\CFA is an extension of ISO-C, and hence, supports all C paradigms.
    285289%It is a non-object-oriented system-language, meaning most of the major abstractions have either no runtime overhead or can be opted out easily.
    286 Like C, the building blocks of \CFA are structures and routines.
     290Like C, the basics of \CFA revolve around structures and functions.
    287291Virtually all of the code generated by the \CFA translator respects C memory layouts and calling conventions.
    288292While \CFA is not an object-oriented language, lacking the concept of a receiver (\eg @this@) and nominal inheritance-relationships, C does have a notion of objects: ``region of data storage in the execution environment, the contents of which can represent values''~\cite[3.15]{C11}.
     
    296300int x = 1, y = 2, z = 3;
    297301int * p1 = &x, ** p2 = &p1,  *** p3 = &p2,      $\C{// pointers to x}$
    298     `&` r1 = x,   `&&` r2 = r1,   `&&&` r3 = r2;        $\C{// references to x}$
     302        `&` r1 = x,  `&&` r2 = r1,  `&&&` r3 = r2;      $\C{// references to x}$
    299303int * p4 = &z, `&` r4 = z;
    300304
     
    345349        'with' '(' $\emph{expression-list}$ ')' $\emph{compound-statement}$
    346350\end{cfa}
    347 and may appear as the body of a routine or nested within a routine body.
     351and may appear as the body of a function or nested within a function body.
    348352Each expression in the expression-list provides a type and object.
    349353The type must be an aggregate type.
     
    356360
    357361\CFA maximizes the ability to reuse names via overloading to aggressively address the naming problem.
    358 Both variables and routines may be overloaded, where selection is based on types, and number of returns (as in Ada~\cite{Ada}) and arguments.
     362Both variables and functions may be overloaded, where selection is based on types, and number of returns (as in Ada~\cite{Ada}) and arguments.
    359363\begin{cquote}
    360364\vspace*{-\baselineskip}%???
     
    411415\end{cquote}
    412416Overloading is important for \CFA concurrency since the runtime system relies on creating different types to represent concurrency objects.
    413 Therefore, overloading eliminates long prefixes and other naming conventions to prevent name clashes.
    414 As seen in Section~\ref{basics}, routine @main@ is heavily overloaded.
    415 For example, variable overloading is useful in the parallel semantics of the @with@ statement for fields with the same name:
     417Therefore, overloading is necessary to prevent the need for long prefixes and other naming conventions to prevent name clashes.
     418As seen in Section~\ref{basics}, function @main@ is heavily overloaded.
     419
     420Variable overloading is useful in the parallel semantics of the @with@ statement for fields with the same name:
    416421\begin{cfa}
    417422struct S { int `i`; int j; double m; } s;
     
    427432}
    428433\end{cfa}
    429 For parallel semantics, both @s.i@ and @t.i@ are visible with the same type, so only @i@ is ambiguous without qualification.
     434For parallel semantics, both @s.i@ and @t.i@ are visible the same type, so only @i@ is ambiguous without qualification.
    430435
    431436
     
    433438
    434439Overloading also extends to operators.
    435 Operator-overloading syntax creates a routine name with an operator symbol and question marks for the operands:
     440Operator-overloading syntax names a routine with the operator symbol and question marks for the operands:
    436441\begin{cquote}
    437442\lstDeleteShortInline@%
     
    467472\end{cquote}
    468473While concurrency does not use operator overloading directly, it provides an introduction for the syntax of constructors.
     474
     475
     476\subsection{Parametric Polymorphism}
     477\label{s:ParametricPolymorphism}
     478
     479The signature feature of \CFA is parametric-polymorphic functions~\cite{} with functions generalized using a @forall@ clause (giving the language its name), which allow separately compiled routines to support generic usage over multiple types.
     480For example, the following sum function works for any type that supports construction from 0 and addition:
     481\begin{cfa}
     482forall( otype T | { void `?{}`( T *, zero_t ); T `?+?`( T, T ); } ) // constraint type, 0 and +
     483T sum( T a[$\,$], size_t size ) {
     484        `T` total = { `0` };                                    $\C{// initialize by 0 constructor}$
     485        for ( size_t i = 0; i < size; i += 1 )
     486                total = total `+` a[i];                         $\C{// select appropriate +}$
     487        return total;
     488}
     489S sa[5];
     490int i = sum( sa, 5 );                                           $\C{// use S's 0 construction and +}$
     491\end{cfa}
     492
     493\CFA provides \newterm{traits} to name a group of type assertions, where the trait name allows specifying the same set of assertions in multiple locations, preventing repetition mistakes at each function declaration:
     494\begin{cfa}
     495trait `sumable`( otype T ) {
     496        void `?{}`( T &, zero_t );                              $\C{// 0 literal constructor}$
     497        T `?+?`( T, T );                                                $\C{// assortment of additions}$
     498        T ?+=?( T &, T );
     499        T ++?( T & );
     500        T ?++( T & );
     501};
     502forall( otype T `| sumable( T )` )                      $\C{// use trait}$
     503T sum( T a[$\,$], size_t size );
     504\end{cfa}
     505
     506Assertions can be @otype@ or @dtype@.
     507@otype@ refers to a ``complete'' object, \ie an object has a size, default constructor, copy constructor, destructor and an assignment operator.
     508@dtype@ only guarantees an object has a size and alignment.
     509
     510Using the return type for discrimination, it is possible to write a type-safe @alloc@ based on the C @malloc@:
     511\begin{cfa}
     512forall( dtype T | sized(T) ) T * alloc( void ) { return (T *)malloc( sizeof(T) ); }
     513int * ip = alloc();                                                     $\C{// select type and size from left-hand side}$
     514double * dp = alloc();
     515struct S {...} * sp = alloc();
     516\end{cfa}
     517where the return type supplies the type/size of the allocation, which is impossible in most type systems.
    469518
    470519
     
    495544\CFA also provides @new@ and @delete@, which behave like @malloc@ and @free@, in addition to constructing and destructing objects:
    496545\begin{cfa}
    497 {
    498         ... struct S s = {10}; ...                              $\C{// allocation, call constructor}$
     546{       struct S s = {10};                                              $\C{// allocation, call constructor}$
     547        ...
    499548}                                                                                       $\C{// deallocation, call destructor}$
    500549struct S * s = new();                                           $\C{// allocation, call constructor}$
     
    502551delete( s );                                                            $\C{// deallocation, call destructor}$
    503552\end{cfa}
    504 \CFA concurrency uses object lifetime as a means of mutual exclusion and/or synchronization.
    505 
    506 
    507 \subsection{Parametric Polymorphism}
    508 \label{s:ParametricPolymorphism}
    509 
    510 The signature feature of \CFA is parametric-polymorphic routines~\cite{} with routines generalized using a @forall@ clause (giving the language its name), which allow separately compiled routines to support generic usage over multiple types.
    511 For example, the following sum routine works for any type that supports construction from 0 and addition:
    512 \begin{cfa}
    513 forall( otype T | { void `?{}`( T *, zero_t ); T `?+?`( T, T ); } ) // constraint type, 0 and +
    514 T sum( T a[$\,$], size_t size ) {
    515         `T` total = { `0` };                                    $\C{// initialize by 0 constructor}$
    516         for ( size_t i = 0; i < size; i += 1 )
    517                 total = total `+` a[i];                         $\C{// select appropriate +}$
    518         return total;
    519 }
    520 S sa[5];
    521 int i = sum( sa, 5 );                                           $\C{// use S's 0 construction and +}$
    522 \end{cfa}
    523 The builtin type @zero_t@ (and @one_t@) overload constant 0 (and 1) for a new types, where both 0 and 1 have special meaning in C.
    524 
    525 \CFA provides \newterm{traits} to name a group of type assertions, where the trait name allows specifying the same set of assertions in multiple locations, preventing repetition mistakes at each routine declaration:
    526 \begin{cfa}
    527 trait `sumable`( otype T ) {
    528         void `?{}`( T &, zero_t );                              $\C{// 0 literal constructor}$
    529         T `?+?`( T, T );                                                $\C{// assortment of additions}$
    530         T ?+=?( T &, T );
    531         T ++?( T & );
    532         T ?++( T & );
    533 };
    534 forall( otype T `| sumable( T )` )                      $\C{// use trait}$
    535 T sum( T a[$\,$], size_t size );
    536 \end{cfa}
    537 
    538 Assertions can be @otype@ or @dtype@.
    539 @otype@ refers to a ``complete'' object, \ie an object has a size, default constructor, copy constructor, destructor and an assignment operator.
    540 @dtype@ only guarantees an object has a size and alignment.
    541 
    542 Using the return type for discrimination, it is possible to write a type-safe @alloc@ based on the C @malloc@:
    543 \begin{cfa}
    544 forall( dtype T | sized(T) ) T * alloc( void ) { return (T *)malloc( sizeof(T) ); }
    545 int * ip = alloc();                                                     $\C{// select type and size from left-hand side}$
    546 double * dp = alloc();
    547 struct S {...} * sp = alloc();
    548 \end{cfa}
    549 where the return type supplies the type/size of the allocation, which is impossible in most type systems.
     553\CFA concurrency uses object lifetime as a means of synchronization and/or mutual exclusion.
    550554
    551555
     
    580584\subsection{\protect\CFA's Thread Building Blocks}
    581585
    582 An important missing feature in C is threading\footnote{While the C11 standard defines a \protect\lstinline@threads.h@ header, it is minimal and defined as optional.
     586An important missing feature in C is threading\footnote{While the C11 standard defines a ``threads.h'' header, it is minimal and defined as optional.
    583587As such, library support for threading is far from widespread.
    584 At the time of writing the paper, neither \protect\lstinline@gcc@ nor \protect\lstinline@clang@ support \protect\lstinline@threads.h@ in their standard libraries.}.
    585 In modern programming languages, a lack of threading is unacceptable~\cite{Sutter05, Sutter05b}, and therefore existing and new programming languages must have tools for writing efficient concurrent programs to take advantage of parallelism.
     588At the time of writing the paper, neither \protect\lstinline|gcc| nor \protect\lstinline|clang| support ``threads.h'' in their standard libraries.}.
     589On modern architectures, a lack of threading is unacceptable~\cite{Sutter05, Sutter05b}, and therefore existing and new programming languages must have tools for writing efficient concurrent programs to take advantage of parallelism.
    586590As an extension of C, \CFA needs to express these concepts in a way that is as natural as possible to programmers familiar with imperative languages.
    587591Furthermore, because C is a system-level language, programmers expect to choose precisely which features they need and which cost they are willing to pay.
     
    621625\newbox\myboxA
    622626\begin{lrbox}{\myboxA}
    623 \begin{cfa}[aboveskip=0pt,belowskip=0pt]
     627\begin{lstlisting}[aboveskip=0pt,belowskip=0pt]
    624628`int f1, f2, state = 1;`   // single global variables
    625629int fib() {
     
    638642        }
    639643}
    640 \end{cfa}
     644\end{lstlisting}
    641645\end{lrbox}
    642646
    643647\newbox\myboxB
    644648\begin{lrbox}{\myboxB}
    645 \begin{cfa}[aboveskip=0pt,belowskip=0pt]
     649\begin{lstlisting}[aboveskip=0pt,belowskip=0pt]
    646650#define FIB_INIT `{ 0, 1 }`
    647651typedef struct { int f2, f1; } Fib;
     
    660664        }
    661665}
    662 \end{cfa}
     666\end{lstlisting}
    663667\end{lrbox}
    664668
     
    673677\newbox\myboxA
    674678\begin{lrbox}{\myboxA}
    675 \begin{cfa}[aboveskip=0pt,belowskip=0pt]
     679\begin{lstlisting}[aboveskip=0pt,belowskip=0pt]
    676680`coroutine` Fib { int fn; };
    677681void main( Fib & fib ) with( fib ) {
     
    693697        }
    694698}
    695 \end{cfa}
     699\end{lstlisting}
    696700\end{lrbox}
    697701\newbox\myboxB
    698702\begin{lrbox}{\myboxB}
    699 \begin{cfa}[aboveskip=0pt,belowskip=0pt]
     703\begin{lstlisting}[aboveskip=0pt,belowskip=0pt]
    700704`coroutine` Fib { int ret; };
    701705void main( Fib & f ) with( fib ) {
     
    717721
    718722
    719 \end{cfa}
     723\end{lstlisting}
    720724\end{lrbox}
    721725\subfloat[3 States, internal variables]{\label{f:Coroutine3States}\usebox\myboxA}
     
    727731
    728732Using a coroutine, it is possible to express the Fibonacci formula directly without any of the C problems.
    729 Figure~\ref{f:Coroutine3States} creates a @coroutine@ type, @`coroutine` Fib { int fn; }@, which provides communication, @fn@, for the \newterm{coroutine main}, @main@, which runs on the coroutine stack, and possibly multiple interface routines, \eg @next@.
     733Figure~\ref{f:Coroutine3States} creates a @coroutine@ type:
     734\begin{cfa}
     735`coroutine` Fib { int fn; };
     736\end{cfa}
     737which provides communication, @fn@, for the \newterm{coroutine main}, @main@, which runs on the coroutine stack, and possibly multiple interface functions, @next@.
    730738Like the structure in Figure~\ref{f:ExternalState}, the coroutine type allows multiple instances, where instances of this type are passed to the (overloaded) coroutine main.
    731 The coroutine main's stack holds the state for the next generation, @f1@ and @f2@, and the code has the three suspend points, representing the three states in the Fibonacci formula, to context switch back to the caller's @resume@.
    732 The interface routine @next@, takes a Fibonacci instance and context switches to it using @resume@;
     739The coroutine main's stack holds the state for the next generation, @f1@ and @f2@, and the code has the three suspend points, representing the three states in the Fibonacci formula, to context switch back to the caller's resume.
     740The interface function, @next@, takes a Fibonacci instance and context switches to it using @resume@;
    733741on restart, the Fibonacci field, @fn@, contains the next value in the sequence, which is returned.
    734742The first @resume@ is special because it cocalls the coroutine at its coroutine main and allocates the stack;
     
    761769\newbox\myboxA
    762770\begin{lrbox}{\myboxA}
    763 \begin{cfa}[aboveskip=0pt,belowskip=0pt]
     771\begin{lstlisting}[aboveskip=0pt,belowskip=0pt]
    764772`coroutine` Format {
    765773        char ch;   // used for communication
     
    793801        }
    794802}
    795 \end{cfa}
     803\end{lstlisting}
    796804\end{lrbox}
    797805
    798806\newbox\myboxB
    799807\begin{lrbox}{\myboxB}
    800 \begin{cfa}[aboveskip=0pt,belowskip=0pt]
     808\begin{lstlisting}[aboveskip=0pt,belowskip=0pt]
    801809struct Format {
    802810        char ch;
     
    830838        format( &fmt );
    831839}
    832 \end{cfa}
     840\end{lstlisting}
    833841\end{lrbox}
    834842\subfloat[\CFA Coroutine]{\label{f:CFAFmt}\usebox\myboxA}
     
    839847\end{figure}
    840848
    841 The previous examples are \newterm{asymmetric (semi) coroutine}s because one coroutine always calls a resuming routine for another coroutine, and the resumed coroutine always suspends back to its last resumer, similar to call/return for normal routines.
    842 However,@resume@/@suspend@ context switch to existing stack-frames rather than create new ones so there is no stack growth.
    843 \newterm{Symmetric (full) coroutine}s have a coroutine call a resuming routine for another coroutine, which eventually forms a resuming-call cycle.
     849The previous examples are \newterm{asymmetric (semi) coroutine}s because one coroutine always calls a resuming function for another coroutine, and the resumed coroutine always suspends back to its last resumer, similar to call/return for normal functions.
     850However, there is no stack growth because @resume@/@suspend@ context switch to existing stack-frames rather than create new ones.
     851\newterm{Symmetric (full) coroutine}s have a coroutine call a resuming function for another coroutine, which eventually forms a resuming-call cycle.
    844852(The trivial cycle is a coroutine resuming itself.)
    845853This control flow is similar to recursion for normal routines, but again there is no stack growth from the context switch.
     
    923931Figure~\ref{f:ProdCons} shows a producer/consumer symmetric-coroutine performing bi-directional communication.
    924932Since the solution involves a full-coroutining cycle, the program main creates one coroutine in isolation, passes this coroutine to its partner, and closes the cycle at the call to @start@.
    925 The @start@ routine communicates both the number of elements to be produced and the consumer into the producer's coroutine structure.
     933The @start@ function communicates both the number of elements to be produced and the consumer into the producer's coroutine structure.
    926934Then the @resume@ to @prod@ creates @prod@'s stack with a frame for @prod@'s coroutine main at the top, and context switches to it.
    927935@prod@'s coroutine main starts, creates local variables that are retained between coroutine activations, and executes $N$ iterations, each generating two random values, calling the consumer to deliver the values, and printing the status returned from the consumer.
     
    929937The producer call to @delivery@ transfers values into the consumer's communication variables, resumes the consumer, and returns the consumer status.
    930938For the first resume, @cons@'s stack is initialized, creating local variables retained between subsequent activations of the coroutine.
    931 The consumer iterates until the @done@ flag is set, prints the values delivered by the producer, increments status, and calls back to the producer via @payment@, and on return from @payment@, prints the receipt from the producer and increments @money@ (inflation).
     939The consumer iterates until the @done@ flag is set, prints, increments status, and calls back to the producer via @payment@, and on return from @payment@, prints the receipt from the producer and increments @money@ (inflation).
    932940The call from the consumer to the @payment@ introduces the cycle between producer and consumer.
    933941When @payment@ is called, the consumer copies values into the producer's communication variable and a resume is executed.
     
    959967\end{cfa}
    960968and the programming language (and possibly its tool set, \eg debugger) may need to understand @baseCoroutine@ because of the stack.
    961 Furthermore, the execution of constructs/destructors is in the wrong order for certain operations.
    962 For example, for threads if the thread is implicitly started, it must start \emph{after} all constructors, because the thread relies on a completely initialized object, but the inherited constructor runs \emph{before} the derived.
     969Furthermore, the execution of constructs/destructors is in the wrong order for certain operations, \eg for threads;
     970\eg, if the thread is implicitly started, it must start \emph{after} all constructors, because the thread relies on a completely initialized object, but the inherited constructor runs \emph{before} the derived.
    963971
    964972An alternatively is composition:
     
    972980However, there is nothing preventing wrong placement or multiple declarations.
    973981
    974 For coroutines as for threads, many implementations are based on routine pointers or routine objects~\cite{Butenhof97, C++14, MS:VisualC++, BoostCoroutines15}.
     982For coroutines as for threads, many implementations are based on routine pointers or function objects~\cite{Butenhof97, C++14, MS:VisualC++, BoostCoroutines15}.
    975983For example, Boost implements coroutines in terms of four functor object-types:
    976984\begin{cfa}
     
    980988symmetric_coroutine<>::yield_type
    981989\end{cfa}
    982 Similarly, the canonical threading paradigm is often based on routine pointers, \eg @pthreads@~\cite{pthreads}, \Csharp~\cite{Csharp}, Go~\cite{Go}, and Scala~\cite{Scala}.
     990Similarly, the canonical threading paradigm is often based on function pointers, \eg @pthread@~\cite{pthreads}, \Csharp~\cite{Csharp}, Go~\cite{Go}, and Scala~\cite{Scala}.
    983991However, the generic thread-handle (identifier) is limited (few operations), unless it is wrapped in a custom type.
    984992\begin{cfa}
     
    9941002\end{cfa}
    9951003Since the custom type is simple to write in \CFA and solves several issues, added support for routine/lambda-based coroutines adds very little.
    996 
    997 Note, the type @coroutine_t@ must be an abstract handle to the coroutine, because the coroutine descriptor and its stack are non-copyable.
    998 Copying the coroutine descriptor results in copies being out of date with the current state of the stack.
    999 Correspondingly, copying the stack results is copies being out of date with the coroutine descriptor, and pointers in the stack being out of date to data on the stack.
    1000 (There is no mechanism in C to find all stack-specific pointers and update them as part of a copy.)
    10011004
    10021005The selected approach is to use language support by introducing a new kind of aggregate (structure):
     
    10111014Furthermore, implementing coroutines without language supports also displays the power of a programming language.
    10121015While this is ultimately the option used for idiomatic \CFA code, coroutines and threads can still be constructed without using the language support.
    1013 The reserved keyword simply eases use for the common cases.
    1014 
    1015 Part of the mechanism to generalize coroutines is using a \CFA trait, which defines a coroutine as anything satisfying the trait @is_coroutine@, and this trait is used to restrict coroutine-manipulation routines:
    1016 \begin{cfa}
    1017 trait is_coroutine( `dtype` T ) {
    1018         void main( T & );
    1019         coroutine_desc * get_coroutine( T & );
     1016The reserved keyword eases use for the common cases.
     1017
     1018Part of the mechanism to generalize coroutines is using a \CFA trait, which defines a coroutine as anything satisfying the trait @is_coroutine@, and this trait is used to restrict coroutine-manipulation functions:
     1019\begin{cfa}
     1020trait is_coroutine( dtype T ) {
     1021      void main( T & this );
     1022      coroutine_desc * get_coroutine( T & this );
    10201023};
    1021 forall( `dtype` T | is_coroutine(T) ) void suspend( T & );
    1022 forall( `dtype` T | is_coroutine(T) ) void resume( T & );
    1023 \end{cfa}
    1024 The @dtype@ property of the trait ensures the coroutine descriptor is non-copyable, so all coroutines must be passed by reference (pointer).
    1025 The routine definitions ensures there is a statically-typed @main@ routine that is the starting point (first stack frame) of a coroutine, and a mechanism to get (read) the currently executing coroutine handle.
    1026 The @main@ routine has no return value or additional parameters because the coroutine type allows an arbitrary number of interface routines with corresponding arbitrary typed input/output values versus fixed ones.
    1027 The generic routines @suspend@ and @resume@ can be redefined, but any object passed to them is a coroutine since it must satisfy the @is_coroutine@ trait to compile.
    1028 The advantage of this approach is that users can easily create different types of coroutines, \eg changing the memory layout of a coroutine is trivial when implementing the @get_coroutine@ routine, and possibly redefining @suspend@ and @resume@.
     1024forall( dtype T | is_coroutine(T) ) void get_coroutine( T & );
     1025forall( dtype T | is_coroutine(T) ) void suspend( T & );
     1026forall( dtype T | is_coroutine(T) ) void resume( T & );
     1027\end{cfa}
     1028This definition ensures there is a statically-typed @main@ function that is the starting point (first stack frame) of a coroutine.
     1029No return value or additional parameters are necessary for this function, because the coroutine type allows an arbitrary number of interface functions with corresponding arbitrary typed input/output values.
     1030As well, any object passed to @suspend@ and @resume@ is a coroutine since it must satisfy the @is_coroutine@ trait to compile.
     1031The advantage of this approach is that users can easily create different types of coroutines, for example, changing the memory layout of a coroutine is trivial when implementing the @get_coroutine@ routine.
    10291032The \CFA keyword @coroutine@ implicitly implements the getter and forward declarations required for implementing the coroutine main:
    10301033\begin{cquote}
     
    10361039};
    10371040\end{cfa}
    1038 &
    1039 {\Large $\Rightarrow$}
    1040 &
     1041& {\Large $\Rightarrow$} &
    10411042\begin{tabular}{@{}ccc@{}}
    10421043\begin{cfa}
     
    10721073Like coroutines and for the same design reasons, the selected approach for user threads is to use language support by introducing a new kind of aggregate (structure) and a \CFA trait:
    10731074\begin{cquote}
    1074 \begin{tabular}{@{}c@{\hspace{3\parindentlnth}}c@{}}
     1075\begin{tabular}{@{}c@{\hspace{2\parindentlnth}}c@{}}
    10751076\begin{cfa}
    10761077thread myThread {
     
    10821083&
    10831084\begin{cfa}
    1084 trait is_thread( `dtype` T ) {
    1085       void main( T & );
    1086       thread_desc * get_thread( T & );
    1087       void ^?{}( T & `mutex` );
     1085trait is_thread( dtype T ) {
     1086      void main( T & this );
     1087      thread_desc * get_thread( T & this );
     1088      void ^?{}( T & `mutex` this );
    10881089};
    10891090\end{cfa}
     
    10911092\end{cquote}
    10921093(The qualifier @mutex@ for the destructor parameter is discussed in Section~\ref{s:Monitors}.)
    1093 Like a coroutine, the statically-typed @main@ routine is the starting point (first stack frame) of a user thread.
     1094Like a coroutine, the statically-typed @main@ function is the starting point (first stack frame) of a user thread.
    10941095The difference is that a coroutine borrows a thread from its caller, so the first thread resuming a coroutine creates an instance of @main@;
    10951096whereas, a user thread receives its own thread from the runtime system, which starts in @main@ as some point after the thread constructor is run.\footnote{
    1096 The \lstinline@main@ routine is already a special routine in C, \ie where the program's initial thread begins, so it is a natural extension of this semantics to use overloading to declare \lstinline@main@s for user coroutines and threads.}
    1097 No return value or additional parameters are necessary for this routine because the task type allows an arbitrary number of interface routines with corresponding arbitrary typed input/output values.
     1097The \lstinline@main@ function is already a special routine in C (where the program begins), so it is a natural extension of the semantics to use overloading to declare mains for different coroutines/threads (the normal main being the main of the initial thread).}
     1098No return value or additional parameters are necessary for this function, because the task type allows an arbitrary number of interface functions with corresponding arbitrary typed input/output values.
    10981099
    10991100\begin{comment} % put in appendix with coroutine version ???
     
    11081109
    11091110In this example, threads of type @foo@ start execution in the @void main(foo &)@ routine, which prints @"Hello World!".@ While this paper encourages this approach to enforce strongly typed programming, users may prefer to use the routine-based thread semantics for the sake of simplicity.
    1110 With the static semantics it is trivial to write a thread type that takes a routine pointer as a parameter and executes it on its stack asynchronously.
    1111 \begin{cfa}
    1112 typedef void (*voidRtn)(int);
    1113 
    1114 thread RtnRunner {
    1115         voidRtn func;
     1111With the static semantics it is trivial to write a thread type that takes a function pointer as a parameter and executes it on its stack asynchronously.
     1112\begin{cfa}
     1113typedef void (*voidFunc)(int);
     1114
     1115thread FuncRunner {
     1116        voidFunc func;
    11161117        int arg;
    11171118};
    11181119
    1119 void ?{}(RtnRunner & this, voidRtn inRtn, int arg) {
    1120         this.func = inRtn;
     1120void ?{}(FuncRunner & this, voidFunc inFunc, int arg) {
     1121        this.func = inFunc;
    11211122        this.arg  = arg;
    11221123}
    11231124
    1124 void main(RtnRunner & this) {
    1125         // thread starts here and runs the routine
     1125void main(FuncRunner & this) {
     1126        // thread starts here and runs the function
    11261127        this.func( this.arg );
    11271128}
     
    11321133
    11331134int main() {
    1134         RtnRunner f = {hello, 42};
     1135        FuncRunner f = {hello, 42};
    11351136        return 0?
    11361137}
    11371138\end{cfa}
    1138 A consequence of the strongly typed approach to main is that memory layout of parameters and return values to/from a thread are now explicitly specified in the \textbf{API}.
     1139A consequence of the strongly typed approach to main is that memory layout of parameters and return values to/from a thread are now explicitly specified in the \textbf{api}.
    11391140\end{comment}
    11401141
     
    11851186void main( Adder & adder ) with( adder ) {
    11861187    subtotal = 0;
    1187     for ( int c = 0; c < cols; c += 1 ) { subtotal += row[c]; }
     1188    for ( int c = 0; c < cols; c += 1 ) {
     1189                subtotal += row[c];
     1190    }
    11881191}
    11891192int main() {
     
    12071210
    12081211
    1209 \section{Mutual Exclusion / Synchronization}
     1212\section{Synchronization / Mutual Exclusion}
    12101213
    12111214Uncontrolled non-deterministic execution is meaningless.
    1212 To reestablish meaningful execution requires mechanisms to reintroduce determinism, \ie restrict non-determinism, called mutual exclusion and synchronization, where mutual exclusion is an access-control mechanism on data shared by threads, and synchronization is a timing relationship among threads~\cite[\S~4]{Buhr05a}.
    1213 Since many deterministic challenges appear with the use of mutable shared state, some languages/libraries disallow it, \eg Erlang~\cite{Erlang}, Haskell~\cite{Haskell}, Akka~\cite{Akka} (Scala).
    1214 In these paradigms, interaction among concurrent objects is performed by stateless message-passing~\cite{Thoth,Harmony,V-Kernel} or other paradigms closely relate to networking concepts, \eg channels~\cite{CSP,Go}.
    1215 However, in call/return-based languages, these approaches force a clear distinction, \ie introduce a new programming paradigm, between regular and concurrent computation, \eg routine call versus message passing.
    1216 Hence, a programmer must learn and manipulate two sets of design patterns.
     1215To reestablish meaningful execution requires mechanisms to reintroduce determinism (control non-determinism), called synchronization and mutual exclusion, where synchronization is a timing relationship among threads and mutual exclusion is an access-control mechanism on data shared by threads.
     1216Since many deterministic challenges appear with the use of mutable shared state, some languages/libraries disallow it (Erlang~\cite{Erlang}, Haskell~\cite{Haskell}, Akka~\cite{Akka} (Scala)).
     1217In these paradigms, interaction among concurrent objects is performed by stateless message-passing~\cite{Thoth,Harmony,V-Kernel} or other paradigms closely relate to networking concepts (\eg channels~\cite{CSP,Go}).
     1218However, in call/return-based languages, these approaches force a clear distinction (\ie introduce a new programming paradigm) between non-concurrent and concurrent computation (\ie function call versus message passing).
     1219This distinction means a programmers needs to learn two sets of design patterns.
    12171220While this distinction can be hidden away in library code, effective use of the library still has to take both paradigms into account.
    12181221In contrast, approaches based on statefull models more closely resemble the standard call/return programming-model, resulting in a single programming paradigm.
    12191222
    1220 At the lowest level, concurrent control is implemented by atomic operations, upon which different kinds of locks mechanism are constructed, \eg semaphores~\cite{Dijkstra68b}, barriers, and path expressions~\cite{Campbell74}.
     1223At the lowest level, concurrent control is implemented as atomic operations, upon which different kinds of locks mechanism are constructed, \eg semaphores~\cite{Dijkstra68b} and path expressions~\cite{Campbell74}.
    12211224However, for productivity it is always desirable to use the highest-level construct that provides the necessary efficiency~\cite{Hochstein05}.
    1222 A newer approach for restricting non-determinism is transactional memory~\cite{Herlihy93}.
     1225A newer approach is transactional memory~\cite{Herlihy93}.
    12231226While this approach is pursued in hardware~\cite{Nakaike15} and system languages, like \CC~\cite{Cpp-Transactions}, the performance and feature set is still too restrictive to be the main concurrency paradigm for system languages, which is why it was rejected as the core paradigm for concurrency in \CFA.
    12241227
    1225 One of the most natural, elegant, and efficient mechanisms for mutual exclusion and synchronization for shared-memory systems is the \emph{monitor}.
    1226 First proposed by Brinch Hansen~\cite{Hansen73} and later described and extended by C.A.R.~Hoare~\cite{Hoare74}, many concurrent programming-languages provide monitors as an explicit language construct: \eg Concurrent Pascal~\cite{ConcurrentPascal}, Mesa~\cite{Mesa}, Modula~\cite{Modula-2}, Turing~\cite{Turing:old}, Modula-3~\cite{Modula-3}, NeWS~\cite{NeWS}, Emerald~\cite{Emerald}, \uC~\cite{Buhr92a} and Java~\cite{Java}.
    1227 In addition, operating-system kernels and device drivers have a monitor-like structure, although they often use lower-level primitives such as mutex locks or semaphores to simulate monitors.
    1228 For these reasons, \CFA selected monitors as the core high-level concurrency-construct, upon which higher-level approaches can be easily constructed.
     1228One of the most natural, elegant, and efficient mechanisms for synchronization and mutual exclusion for shared-memory systems is the \emph{monitor}.
     1229Monitors were first proposed by Brinch Hansen~\cite{Hansen73} and later described and extended by C.A.R.~Hoare~\cite{Hoare74}.
     1230Many programming languages -- \eg Concurrent Pascal~\cite{ConcurrentPascal}, Mesa~\cite{Mesa}, Modula~\cite{Modula-2}, Turing~\cite{Turing:old}, Modula-3~\cite{Modula-3}, NeWS~\cite{NeWS}, Emerald~\cite{Emerald}, \uC~\cite{Buhr92a} and Java~\cite{Java} -- provide monitors as explicit language constructs.
     1231In addition, operating-system kernels and device drivers have a monitor-like structure, although they often use lower-level primitives such as semaphores or locks to simulate monitors.
     1232For these reasons, this project proposes monitors as the core concurrency construct, upon which even higher-level approaches can be easily constructed..
    12291233
    12301234
     
    12321236
    12331237A group of instructions manipulating a specific instance of shared data that must be performed atomically is called an (individual) \newterm{critical-section}~\cite{Dijkstra65}.
    1234 The generalization is called a \newterm{group critical-section}~\cite{Joung00}, where multiple tasks with the same session may use the resource simultaneously, but different sessions may not use the resource simultaneously.
     1238A generalization is a \newterm{group critical-section}~\cite{Joung00}, where multiple tasks with the same session may use the resource simultaneously, but different sessions may not use the resource simultaneously.
    12351239The readers/writer problem~\cite{Courtois71} is an instance of a group critical-section, where readers have the same session and all writers have a unique session.
    1236 \newterm{Mutual exclusion} enforces that the correct kind and number of threads are using a critical section.
     1240\newterm{Mutual exclusion} enforces the correction number of threads are using a critical section at the same time.
    12371241
    12381242However, many solutions exist for mutual exclusion, which vary in terms of performance, flexibility and ease of use.
    12391243Methods range from low-level locks, which are fast and flexible but require significant attention for correctness, to higher-level concurrency techniques, which sacrifice some performance to improve ease of use.
    1240 Ease of use comes by either guaranteeing some problems cannot occur, \eg deadlock free, or by offering a more explicit coupling between shared data and critical section.
    1241 For example, the \CC @std::atomic<T>@ offers an easy way to express mutual-exclusion on a restricted set of operations, \eg reading/writing, for numerical types.
    1242 However, a significant challenge with locks is composability because it takes careful organization for multiple locks to be used while preventing deadlock.
    1243 Easing composability is another feature higher-level mutual-exclusion mechanisms can offer.
     1244Ease of use comes by either guaranteeing some problems cannot occur (\eg deadlock free), or by offering a more explicit coupling between shared data and critical section.
     1245For example, the \CC @std::atomic<T>@ offers an easy way to express mutual-exclusion on a restricted set of operations (\eg reading/writing large types atomically).
     1246However, a significant challenge with (low-level) locks is composability because it takes careful organization for multiple locks to be used while preventing deadlock.
     1247Easing composability is another feature higher-level mutual-exclusion mechanisms offer.
    12441248
    12451249
     
    12471251
    12481252Synchronization enforces relative ordering of execution, and synchronization tools provide numerous mechanisms to establish these timing relationships.
    1249 Low-level synchronization primitives offer good performance and flexibility at the cost of ease of use;
    1250 higher-level mechanisms often simplify usage by adding better coupling between synchronization and data, \eg message passing, or offering a simpler solution to otherwise involved challenges, \eg barrier lock.
    1251 Often synchronization is used to order access to a critical section, \eg ensuring a reader thread is the next kind of thread to enter a critical section.
    1252 If a writer thread is scheduled for next access, but another reader thread acquires the critical section first, that reader has \newterm{barged}.
    1253 Barging can result in staleness/freshness problems, where a reader barges ahead of a writer and reads temporally stale data, or a writer barges ahead of another writer overwriting data with a fresh value preventing the previous value from ever being read (lost computation).
     1253Low-level synchronization primitives offer good performance and flexibility at the cost of ease of use.
     1254Higher-level mechanisms often simplify usage by adding better coupling between synchronization and data (\eg message passing), or offering a simpler solution to otherwise involved challenges, \eg barrier lock.
     1255As mentioned above, synchronization can be expressed as guaranteeing that event \textit{X} always happens before \textit{Y}.
     1256Often synchronization is used to order access to a critical section, \eg ensuring the next kind of thread to enter a critical section is a reader thread
     1257If a writer thread is scheduled for next access, but another reader thread acquires the critical section first, the reader has \newterm{barged}.
     1258Barging can result in staleness/freshness problems, where a reader barges ahead of a write and reads temporally stale data, or a writer barges ahead of another writer overwriting data with a fresh value preventing the previous value from having an opportunity to be read.
    12541259Preventing or detecting barging is an involved challenge with low-level locks, which can be made much easier by higher-level constructs.
    1255 This challenge is often split into two different approaches: barging avoidance and barging prevention.
    1256 Algorithms that allow a barger, but divert it until later using current synchronization state (flags), are avoiding the barger;
    1257 algorithms that preclude a barger from entering during synchronization in the critical section prevent barging completely.
    1258 Techniques like baton-pass locks~\cite{Andrews89} between threads instead of unconditionally releasing locks is an example of barging prevention.
     1260This challenge is often split into two different approaches, barging avoidance and barging prevention.
     1261Algorithms that allow a barger but divert it until later are avoiding the barger, while algorithms that preclude a barger from entering during synchronization in the critical section prevent the barger completely.
     1262baton-pass locks~\cite{Andrews89} between threads instead of releasing the locks are said to be using barging prevention.
    12591263
    12601264
     
    12621266\label{s:Monitors}
    12631267
    1264 A \textbf{monitor} is a set of routines that ensure mutual exclusion when accessing shared state.
    1265 More precisely, a monitor is a programming technique that binds mutual exclusion to routine scope, as opposed to locks, where mutual-exclusion is defined by acquire/release calls, independent of lexical context (analogous to block and heap storage allocation).
    1266 The strong association with the call/return paradigm eases programmability, readability and maintainability, at a slight cost in flexibility and efficiency.
    1267 
    1268 Note, like coroutines/threads, both locks and monitors require an abstract handle to reference them, because at their core, both mechanisms are manipulating non-copyable shared-state.
    1269 Copying a lock is insecure because it is possible to copy an open lock and then use the open copy when the original lock is closed to simultaneously access the shared data.
    1270 Copying a monitor is secure because both the lock and shared data are copies, but copying the shared data is meaningless because it no longer represents a unique entity.
    1271 As for coroutines/tasks, a non-copyable (@dtype@) trait is used to capture this requirement, so all locks/monitors must be passed by reference (pointer).
    1272 \begin{cfa}
    1273 trait is_monitor( `dtype` T ) {
     1268A \textbf{monitor} is a set of routines that ensure mutual-exclusion when accessing shared state.
     1269More precisely, a monitor is a programming technique that associates mutual-exclusion to routine scopes, as opposed to mutex locks, where mutual-exclusion is defined by lock/release calls independently of any scoping of the calling routine.
     1270This strong association eases readability and maintainability, at the cost of flexibility.
     1271Note that both monitors and mutex locks, require an abstract handle to identify them.
     1272This concept is generally associated with object-oriented languages like Java~\cite{Java} or \uC~\cite{uC++book} but does not strictly require OO semantics.
     1273The only requirement is the ability to declare a handle to a shared object and a set of routines that act on it:
     1274\begin{cfa}
     1275typedef /*some monitor type*/ monitor;
     1276int f(monitor & m);
     1277
     1278int main() {
     1279        monitor m;  // Handle m
     1280        f(m);       // Routine using handle
     1281}
     1282\end{cfa}
     1283
     1284% ======================================================================
     1285% ======================================================================
     1286\subsection{Call Semantics} \label{call}
     1287% ======================================================================
     1288% ======================================================================
     1289The above monitor example displays some of the intrinsic characteristics.
     1290First, it is necessary to use pass-by-reference over pass-by-value for monitor routines.
     1291This semantics is important, because at their core, monitors are implicit mutual-exclusion objects (locks), and these objects cannot be copied.
     1292Therefore, monitors are non-copy-able objects (@dtype@).
     1293
     1294Another aspect to consider is when a monitor acquires its mutual exclusion.
     1295For example, a monitor may need to be passed through multiple helper routines that do not acquire the monitor mutual-exclusion on entry.
     1296Passthrough can occur for generic helper routines (@swap@, @sort@, \etc) or specific helper routines like the following to implement an atomic counter:
     1297
     1298\begin{cfa}
     1299monitor counter_t { /*...see section $\ref{data}$...*/ };
     1300
     1301void ?{}(counter_t & nomutex this); // constructor
     1302size_t ++?(counter_t & mutex this); // increment
     1303
     1304// need for mutex is platform dependent
     1305void ?{}(size_t * this, counter_t & mutex cnt); // conversion
     1306\end{cfa}
     1307This counter is used as follows:
     1308\begin{center}
     1309\begin{tabular}{c @{\hskip 0.35in} c @{\hskip 0.35in} c}
     1310\begin{cfa}
     1311// shared counter
     1312counter_t cnt1, cnt2;
     1313
     1314// multiple threads access counter
     1315thread 1 : cnt1++; cnt2++;
     1316thread 2 : cnt1++; cnt2++;
     1317thread 3 : cnt1++; cnt2++;
     1318        ...
     1319thread N : cnt1++; cnt2++;
     1320\end{cfa}
     1321\end{tabular}
     1322\end{center}
     1323Notice how the counter is used without any explicit synchronization and yet supports thread-safe semantics for both reading and writing, which is similar in usage to the \CC template @std::atomic@.
     1324
     1325Here, the constructor (@?{}@) uses the @nomutex@ keyword to signify that it does not acquire the monitor mutual-exclusion when constructing.
     1326This semantics is because an object not yet constructed should never be shared and therefore does not require mutual exclusion.
     1327Furthermore, it allows the implementation greater freedom when it initializes the monitor locking.
     1328The prefix increment operator uses @mutex@ to protect the incrementing process from race conditions.
     1329Finally, there is a conversion operator from @counter_t@ to @size_t@.
     1330This conversion may or may not require the @mutex@ keyword depending on whether or not reading a @size_t@ is an atomic operation.
     1331
     1332For maximum usability, monitors use \textbf{multi-acq} semantics, which means a single thread can acquire the same monitor multiple times without deadlock.
     1333For example, listing \ref{fig:search} uses recursion and \textbf{multi-acq} to print values inside a binary tree.
     1334\begin{figure}
     1335\begin{cfa}[caption={Recursive printing algorithm using \textbf{multi-acq}.},label={fig:search}]
     1336monitor printer { ... };
     1337struct tree {
     1338        tree * left, right;
     1339        char * value;
     1340};
     1341void print(printer & mutex p, char * v);
     1342
     1343void print(printer & mutex p, tree * t) {
     1344        print(p, t->value);
     1345        print(p, t->left );
     1346        print(p, t->right);
     1347}
     1348\end{cfa}
     1349\end{figure}
     1350
     1351Having both @mutex@ and @nomutex@ keywords can be redundant, depending on the meaning of a routine having neither of these keywords.
     1352For example, it is reasonable that it should default to the safest option (@mutex@) when given a routine without qualifiers @void foo(counter_t & this)@, whereas assuming @nomutex@ is unsafe and may cause subtle errors.
     1353On the other hand, @nomutex@ is the ``normal'' parameter behaviour, it effectively states explicitly that ``this routine is not special''.
     1354Another alternative is making exactly one of these keywords mandatory, which provides the same semantics but without the ambiguity of supporting routines with neither keyword.
     1355Mandatory keywords would also have the added benefit of being self-documented but at the cost of extra typing.
     1356While there are several benefits to mandatory keywords, they do bring a few challenges.
     1357Mandatory keywords in \CFA would imply that the compiler must know without doubt whether or not a parameter is a monitor or not.
     1358Since \CFA relies heavily on traits as an abstraction mechanism, the distinction between a type that is a monitor and a type that looks like a monitor can become blurred.
     1359For this reason, \CFA only has the @mutex@ keyword and uses no keyword to mean @nomutex@.
     1360
     1361The next semantic decision is to establish when @mutex@ may be used as a type qualifier.
     1362Consider the following declarations:
     1363\begin{cfa}
     1364int f1(monitor & mutex m);
     1365int f2(const monitor & mutex m);
     1366int f3(monitor ** mutex m);
     1367int f4(monitor * mutex m []);
     1368int f5(graph(monitor *) & mutex m);
     1369\end{cfa}
     1370The problem is to identify which object(s) should be acquired.
     1371Furthermore, each object needs to be acquired only once.
     1372In the case of simple routines like @f1@ and @f2@ it is easy to identify an exhaustive list of objects to acquire on entry.
     1373Adding indirections (@f3@) still allows the compiler and programmer to identify which object is acquired.
     1374However, adding in arrays (@f4@) makes it much harder.
     1375Array lengths are not necessarily known in C, and even then, making sure objects are only acquired once becomes none-trivial.
     1376This problem can be extended to absurd limits like @f5@, which uses a graph of monitors.
     1377To make the issue tractable, this project imposes the requirement that a routine may only acquire one monitor per parameter and it must be the type of the parameter with at most one level of indirection (ignoring potential qualifiers).
     1378Also note that while routine @f3@ can be supported, meaning that monitor @**m@ is acquired, passing an array to this routine would be type-safe and yet result in undefined behaviour because only the first element of the array is acquired.
     1379However, this ambiguity is part of the C type-system with respects to arrays.
     1380For this reason, @mutex@ is disallowed in the context where arrays may be passed:
     1381\begin{cfa}
     1382int f1(monitor & mutex m);    // Okay : recommended case
     1383int f2(monitor * mutex m);    // Not Okay : Could be an array
     1384int f3(monitor mutex m []);  // Not Okay : Array of unknown length
     1385int f4(monitor ** mutex m);   // Not Okay : Could be an array
     1386int f5(monitor * mutex m []); // Not Okay : Array of unknown length
     1387\end{cfa}
     1388Note that not all array functions are actually distinct in the type system.
     1389However, even if the code generation could tell the difference, the extra information is still not sufficient to extend meaningfully the monitor call semantic.
     1390
     1391Unlike object-oriented monitors, where calling a mutex member \emph{implicitly} acquires mutual-exclusion of the receiver object, \CFA uses an explicit mechanism to specify the object that acquires mutual-exclusion.
     1392A consequence of this approach is that it extends naturally to multi-monitor calls.
     1393\begin{cfa}
     1394int f(MonitorA & mutex a, MonitorB & mutex b);
     1395
     1396MonitorA a;
     1397MonitorB b;
     1398f(a,b);
     1399\end{cfa}
     1400While OO monitors could be extended with a mutex qualifier for multiple-monitor calls, no example of this feature could be found.
     1401The capability to acquire multiple locks before entering a critical section is called \emph{\textbf{bulk-acq}}.
     1402In practice, writing multi-locking routines that do not lead to deadlocks is tricky.
     1403Having language support for such a feature is therefore a significant asset for \CFA.
     1404In the case presented above, \CFA guarantees that the order of acquisition is consistent across calls to different routines using the same monitors as arguments.
     1405This consistent ordering means acquiring multiple monitors is safe from deadlock when using \textbf{bulk-acq}.
     1406However, users can still force the acquiring order.
     1407For example, notice which routines use @mutex@/@nomutex@ and how this affects acquiring order:
     1408\begin{cfa}
     1409void foo(A& mutex a, B& mutex b) { // acquire a & b
     1410        ...
     1411}
     1412
     1413void bar(A& mutex a, B& /*nomutex*/ b) { // acquire a
     1414        ... foo(a, b); ... // acquire b
     1415}
     1416
     1417void baz(A& /*nomutex*/ a, B& mutex b) { // acquire b
     1418        ... foo(a, b); ... // acquire a
     1419}
     1420\end{cfa}
     1421The \textbf{multi-acq} monitor lock allows a monitor lock to be acquired by both @bar@ or @baz@ and acquired again in @foo@.
     1422In the calls to @bar@ and @baz@ the monitors are acquired in opposite order.
     1423
     1424However, such use leads to lock acquiring order problems.
     1425In the example above, the user uses implicit ordering in the case of function @foo@ but explicit ordering in the case of @bar@ and @baz@.
     1426This subtle difference means that calling these routines concurrently may lead to deadlock and is therefore undefined behaviour.
     1427As shown~\cite{Lister77}, solving this problem requires:
     1428\begin{enumerate}
     1429        \item Dynamically tracking the monitor-call order.
     1430        \item Implement rollback semantics.
     1431\end{enumerate}
     1432While the first requirement is already a significant constraint on the system, implementing a general rollback semantics in a C-like language is still prohibitively complex~\cite{Dice10}.
     1433In \CFA, users simply need to be careful when acquiring multiple monitors at the same time or only use \textbf{bulk-acq} of all the monitors.
     1434While \CFA provides only a partial solution, most systems provide no solution and the \CFA partial solution handles many useful cases.
     1435
     1436For example, \textbf{multi-acq} and \textbf{bulk-acq} can be used together in interesting ways:
     1437\begin{cfa}
     1438monitor bank { ... };
     1439
     1440void deposit( bank & mutex b, int deposit );
     1441
     1442void transfer( bank & mutex mybank, bank & mutex yourbank, int me2you) {
     1443        deposit( mybank, -me2you );
     1444        deposit( yourbank, me2you );
     1445}
     1446\end{cfa}
     1447This example shows a trivial solution to the bank-account transfer problem~\cite{BankTransfer}.
     1448Without \textbf{multi-acq} and \textbf{bulk-acq}, the solution to this problem is much more involved and requires careful engineering.
     1449
     1450
     1451\subsection{\protect\lstinline|mutex| statement} \label{mutex-stmt}
     1452
     1453The call semantics discussed above have one software engineering issue: only a routine can acquire the mutual-exclusion of a set of monitor. \CFA offers the @mutex@ statement to work around the need for unnecessary names, avoiding a major software engineering problem~\cite{2FTwoHardThings}.
     1454Table \ref{f:mutex-stmt} shows an example of the @mutex@ statement, which introduces a new scope in which the mutual-exclusion of a set of monitor is acquired.
     1455Beyond naming, the @mutex@ statement has no semantic difference from a routine call with @mutex@ parameters.
     1456
     1457\begin{table}
     1458\begin{center}
     1459\begin{tabular}{|c|c|}
     1460function call & @mutex@ statement \\
     1461\hline
     1462\begin{cfa}[tabsize=3]
     1463monitor M {};
     1464void foo( M & mutex m1, M & mutex m2 ) {
     1465        // critical section
     1466}
     1467
     1468void bar( M & m1, M & m2 ) {
     1469        foo( m1, m2 );
     1470}
     1471\end{cfa}&\begin{cfa}[tabsize=3]
     1472monitor M {};
     1473void bar( M & m1, M & m2 ) {
     1474        mutex(m1, m2) {
     1475                // critical section
     1476        }
     1477}
     1478
     1479
     1480\end{cfa}
     1481\end{tabular}
     1482\end{center}
     1483\caption{Regular call semantics vs. \protect\lstinline|mutex| statement}
     1484\label{f:mutex-stmt}
     1485\end{table}
     1486
     1487% ======================================================================
     1488% ======================================================================
     1489\subsection{Data semantics} \label{data}
     1490% ======================================================================
     1491% ======================================================================
     1492Once the call semantics are established, the next step is to establish data semantics.
     1493Indeed, until now a monitor is used simply as a generic handle but in most cases monitors contain shared data.
     1494This data should be intrinsic to the monitor declaration to prevent any accidental use of data without its appropriate protection.
     1495For example, here is a complete version of the counter shown in section \ref{call}:
     1496\begin{cfa}
     1497monitor counter_t {
     1498        int value;
     1499};
     1500
     1501void ?{}(counter_t & this) {
     1502        this.cnt = 0;
     1503}
     1504
     1505int ?++(counter_t & mutex this) {
     1506        return ++this.value;
     1507}
     1508
     1509// need for mutex is platform dependent here
     1510void ?{}(int * this, counter_t & mutex cnt) {
     1511        *this = (int)cnt;
     1512}
     1513\end{cfa}
     1514
     1515Like threads and coroutines, monitors are defined in terms of traits with some additional language support in the form of the @monitor@ keyword.
     1516The monitor trait is:
     1517\begin{cfa}
     1518trait is_monitor(dtype T) {
    12741519        monitor_desc * get_monitor( T & );
    12751520        void ^?{}( T & mutex );
    12761521};
    12771522\end{cfa}
    1278 
    1279 
    1280 \subsection{Mutex Acquisition}
    1281 \label{s:MutexAcquisition}
    1282 
    1283 While correctness implicitly implies a monitor's mutual exclusion is acquired and released, there are implementation options about when and where the locking/unlocking occurs.
    1284 (Much of this discussion also applies to basic locks.)
    1285 For example, a monitor may need to be passed through multiple helper routines before it becomes necessary to acquire the monitor mutual-exclusion.
    1286 \begin{cfa}[morekeywords=nomutex]
    1287 monitor Aint { int cnt; };                                      $\C{// atomic integer counter}$
    1288 void ?{}( Aint & `nomutex` this ) with( this ) { cnt = 0; } $\C{// constructor}$
    1289 int ?=?( Aint & `mutex`$\(_{opt}\)$ lhs, int rhs ) with( lhs ) { cnt = rhs; } $\C{// conversions}$
    1290 void ?{}( int & this, Aint & `mutex`$\(_{opt}\)$ v ) { this = v.cnt; }
    1291 int ?=?( int & lhs, Aint & `mutex`$\(_{opt}\)$ rhs ) with( rhs ) { lhs = cnt; }
    1292 int ++?( Aint & `mutex`$\(_{opt}\)$ this ) with( this ) { return ++cnt; } $\C{// increment}$
    1293 \end{cfa}
    1294 The @Aint@ constructor, @?{}@, uses the \lstinline[morekeywords=nomutex]@nomutex@ qualifier indicating mutual exclusion is unnecessary during construction because an object is inaccessible (private) until after it is initialized.
    1295 (While a constructor may publish its address into a global variable, doing so generates a race-condition.)
    1296 The conversion operators for initializing and assigning with a normal integer only need @mutex@, if reading/writing the implementation type is not atomic.
    1297 Finally, the prefix increment operato, @++?@, is normally @mutex@ to protect the incrementing from race conditions, unless there is an atomic increment instruction for the implementation type.
    1298 
    1299 The atomic counter is used without any explicit mutual-exclusion and provides thread-safe semantics, which is similar to the \CC template @std::atomic@.
    1300 \begin{cfa}
    1301 Aint x, y, z;
    1302 ++x; ++y; ++z;                                                          $\C{// safe increment by multiple threads}$
    1303 x = 2; y = 2; z = 2;                                            $\C{// conversions}$
    1304 int i = x, j = y, k = z;
    1305 i = x; j = y; k = z;
    1306 \end{cfa}
    1307 
    1308 For maximum usability, monitors have \newterm{multi-acquire} semantics allowing a thread to acquire it multiple times without deadlock.
    1309 For example, atomically printing the contents of a binary tree:
    1310 \begin{cfa}
    1311 monitor Tree {
    1312         Tree * left, right;
    1313         // value
    1314 };
    1315 void print( Tree & mutex tree ) {                       $\C{// prefix traversal}$
    1316         // write value
    1317         print( tree->left );                                    $\C{// multiply acquire monitor lock on each recursion}$
    1318         print( tree->right );
    1319 }
    1320 \end{cfa}
    1321 
    1322 Mandatory monitor qualifiers have the benefit of being self-documented, but requiring both @mutex@ and \lstinline[morekeywords=nomutex]@nomutex@ for all monitor parameter is redundant.
    1323 Instead, one of qualifier semantics can be the default, and the other required.
    1324 For example, assume the safe @mutex@ option for a monitor parameter because assuming \lstinline[morekeywords=nomutex]@nomutex@ may cause subtle errors.
    1325 On the other hand, assuming \lstinline[morekeywords=nomutex]@nomutex@ is the \emph{normal} parameter behaviour, stating explicitly ``this parameter is not special''.
    1326 Providing a default qualifier implies knowing whether a parameter is a monitor.
    1327 Since \CFA relies heavily on traits as an abstraction mechanism, the distinction between a type that is a monitor and a type that looks like a monitor can become blurred.
    1328 For this reason, \CFA requires programmers to identify the kind of parameter with the @mutex@ keyword and uses no keyword to mean \lstinline[morekeywords=nomutex]@nomutex@.
    1329 
    1330 The next semantic decision is establishing which parameter \emph{types} may be qualified with @mutex@.
    1331 Given:
    1332 \begin{cfa}
    1333 monitor M { ... }
    1334 int f1( M & mutex m );
    1335 int f2( M * mutex m );
    1336 int f3( M * mutex m[] );
    1337 int f4( stack( M * ) & mutex m );
    1338 \end{cfa}
    1339 the issue is that some of these parameter types are composed of multiple objects.
    1340 For @f1@, there is only a single parameter object.
    1341 Adding indirection in @f2@ still identifies a single object.
    1342 However, the matrix in @f3@ introduces multiple objects.
    1343 While shown shortly, multiple acquisition is possible;
    1344 however array lengths are often unknown in C.
    1345 This issue is exacerbated in @f4@, where the data structure must be safely traversed to acquire all of its elements.
    1346 
    1347 To make the issue tractable, \CFA only acquires one monitor per parameter with at most one level of indirection.
    1348 However, the C type-system has an ambiguity with respects to arrays.
    1349 Is the argument for @f2@ a single object or an array of objects?
    1350 If it is an array, only the first element of the array is acquired, which seems unsafe;
    1351 hence, @mutex@ is disallowed for array parameters.
    1352 \begin{cfa}
    1353 int f1( M & mutex m );                                          $\C{// allowed: recommended case}$
    1354 int f2( M * mutex m );                                          $\C{// disallowed: could be an array}$
    1355 int f3( M mutex m[$\,$] );                                      $\C{// disallowed: array length unknown}$
    1356 int f4( M ** mutex m );                                         $\C{// disallowed: could be an array}$
    1357 int f5( M * mutex m[$\,$] );                            $\C{// disallowed: array length unknown}$
    1358 \end{cfa}
    1359 % Note, not all array routines have distinct types: @f2@ and @f3@ have the same type, as do @f4@ and @f5@.
    1360 % However, even if the code generation could tell the difference, the extra information is still not sufficient to extend meaningfully the monitor call semantic.
    1361 
    1362 For object-oriented monitors, calling a mutex member \emph{implicitly} acquires mutual exclusion of the receiver object, @`rec`.foo(...)@.
    1363 \CFA has no receiver, and hence, must use an explicit mechanism to specify which object has mutual exclusion acquired.
    1364 A positive consequence of this design decision is the ability to support multi-monitor routines.
    1365 \begin{cfa}
    1366 int f( M & mutex x, M & mutex y );              $\C{// multiple monitor parameter of any type}$
    1367 M m1, m2;
    1368 f( m1, m2 );
    1369 \end{cfa}
    1370 (While object-oriented monitors can be extended with a mutex qualifier for multiple-monitor members, no prior example of this feature could be found.)
    1371 In practice, writing multi-locking routines that do not deadlock is tricky.
    1372 Having language support for such a feature is therefore a significant asset for \CFA.
    1373 
    1374 The capability to acquire multiple locks before entering a critical section is called \newterm{bulk acquire}.
    1375 In the previous example, \CFA guarantees the order of acquisition is consistent across calls to different routines using the same monitors as arguments.
    1376 This consistent ordering means acquiring multiple monitors is safe from deadlock.
    1377 However, users can force the acquiring order.
    1378 For example, notice the use of @mutex@/\lstinline[morekeywords=nomutex]@nomutex@ and how this affects the acquiring order:
    1379 \begin{cfa}
    1380 void foo( M & mutex m1, M & mutex m2 );         $\C{// acquire m1 and m2}$
    1381 void bar( M & mutex m1, M & /* nomutex */ m2 ) { $\C{// acquire m1}$
    1382         ... foo( m1, m2 ); ...                                  $\C{// acquire m2}$
    1383 }
    1384 void baz( M & /* nomutex */ m1, M & mutex m2 ) { $\C{// acquire m2}$
    1385         ... foo( m1, m2 ); ...                                  $\C{// acquire m1}$
    1386 }
    1387 \end{cfa}
    1388 The multi-acquire semantics allows @bar@ or @baz@ to acquire a monitor lock and reacquire it in @foo@.
    1389 In the calls to @bar@ and @baz@, the monitors are acquired in opposite order.
    1390 
    1391 However, such use leads to lock acquiring order problems resulting in deadlock~\cite{Lister77}, where detecting it requires dynamically tracking of monitor calls, and dealing with it requires rollback semantics~\cite{Dice10}.
    1392 In \CFA, safety is guaranteed by using bulk acquire of all monitors to shared objects, whereas other monitor systems provide no aid.
    1393 While \CFA provides only a partial solution, the \CFA partial solution handles many useful cases.
    1394 \begin{cfa}
    1395 monitor Bank { ... };
    1396 void deposit( Bank & `mutex` b, int deposit );
    1397 void transfer( Bank & `mutex` mybank, Bank & `mutex` yourbank, int me2you) {
    1398         deposit( mybank, `-`me2you );                   $\C{// debit}$
    1399         deposit( yourbank, me2you );                    $\C{// credit}$
    1400 }
    1401 \end{cfa}
    1402 This example shows a trivial solution to the bank-account transfer problem~\cite{BankTransfer}.
    1403 Without multi- and bulk acquire, the solution to this problem requires careful engineering.
    1404 
    1405 
    1406 \subsection{\protect\lstinline|mutex| statement} \label{mutex-stmt}
    1407 
    1408 The monitor call-semantics associate all locking semantics to routines.
    1409 Like Java, \CFA offers an alternative @mutex@ statement to reduce refactoring and naming.
    1410 \begin{cquote}
    1411 \begin{tabular}{@{}c|@{\hspace{\parindentlnth}}c@{}}
    1412 routine call & @mutex@ statement \\
    1413 \begin{cfa}
    1414 monitor M {};
    1415 void foo( M & mutex m1, M & mutex m2 ) {
    1416         // critical section
    1417 }
    1418 void bar( M & m1, M & m2 ) {
    1419         foo( m1, m2 );
    1420 }
    1421 \end{cfa}
    1422 &
    1423 \begin{cfa}
    1424 
    1425 void bar( M & m1, M & m2 ) {
    1426         mutex( m1, m2 ) {       // remove refactoring and naming
    1427                 // critical section
     1523Note that the destructor of a monitor must be a @mutex@ routine to prevent deallocation while a thread is accessing the monitor.
     1524As with any object, calls to a monitor, using @mutex@ or otherwise, is undefined behaviour after the destructor has run.
     1525
     1526% ======================================================================
     1527% ======================================================================
     1528\section{Internal Scheduling} \label{intsched}
     1529% ======================================================================
     1530% ======================================================================
     1531In addition to mutual exclusion, the monitors at the core of \CFA's concurrency can also be used to achieve synchronization.
     1532With monitors, this capability is generally achieved with internal or external scheduling as in~\cite{Hoare74}.
     1533With \textbf{scheduling} loosely defined as deciding which thread acquires the critical section next, \textbf{internal scheduling} means making the decision from inside the critical section (\ie with access to the shared state), while \textbf{external scheduling} means making the decision when entering the critical section (\ie without access to the shared state).
     1534Since internal scheduling within a single monitor is mostly a solved problem, this paper concentrates on extending internal scheduling to multiple monitors.
     1535Indeed, like the \textbf{bulk-acq} semantics, internal scheduling extends to multiple monitors in a way that is natural to the user but requires additional complexity on the implementation side.
     1536
     1537First, here is a simple example of internal scheduling:
     1538
     1539\begin{cfa}
     1540monitor A {
     1541        condition e;
     1542}
     1543
     1544void foo(A& mutex a1, A& mutex a2) {
     1545        ...
     1546        // Wait for cooperation from bar()
     1547        wait(a1.e);
     1548        ...
     1549}
     1550
     1551void bar(A& mutex a1, A& mutex a2) {
     1552        // Provide cooperation for foo()
     1553        ...
     1554        // Unblock foo
     1555        signal(a1.e);
     1556}
     1557\end{cfa}
     1558There are two details to note here.
     1559First, @signal@ is a delayed operation; it only unblocks the waiting thread when it reaches the end of the critical section.
     1560This semantics is needed to respect mutual-exclusion, \ie the signaller and signalled thread cannot be in the monitor simultaneously.
     1561The alternative is to return immediately after the call to @signal@, which is significantly more restrictive.
     1562Second, in \CFA, while it is common to store a @condition@ as a field of the monitor, a @condition@ variable can be stored/created independently of a monitor.
     1563Here routine @foo@ waits for the @signal@ from @bar@ before making further progress, ensuring a basic ordering.
     1564
     1565An important aspect of the implementation is that \CFA does not allow barging, which means that once function @bar@ releases the monitor, @foo@ is guaranteed to be the next thread to acquire the monitor (unless some other thread waited on the same condition).
     1566This guarantee offers the benefit of not having to loop around waits to recheck that a condition is met.
     1567The main reason \CFA offers this guarantee is that users can easily introduce barging if it becomes a necessity but adding barging prevention or barging avoidance is more involved without language support.
     1568Supporting barging prevention as well as extending internal scheduling to multiple monitors is the main source of complexity in the design and implementation of \CFA concurrency.
     1569
     1570% ======================================================================
     1571% ======================================================================
     1572\subsection{Internal Scheduling - Multi-Monitor}
     1573% ======================================================================
     1574% ======================================================================
     1575It is easy to understand the problem of multi-monitor scheduling using a series of pseudo-code examples.
     1576Note that for simplicity in the following snippets of pseudo-code, waiting and signalling is done using an implicit condition variable, like Java built-in monitors.
     1577Indeed, @wait@ statements always use the implicit condition variable as parameters and explicitly name the monitors (A and B) associated with the condition.
     1578Note that in \CFA, condition variables are tied to a \emph{group} of monitors on first use (called branding), which means that using internal scheduling with distinct sets of monitors requires one condition variable per set of monitors.
     1579The example below shows the simple case of having two threads (one for each column) and a single monitor A.
     1580
     1581\begin{multicols}{2}
     1582thread 1
     1583\begin{cfa}
     1584acquire A
     1585        wait A
     1586release A
     1587\end{cfa}
     1588
     1589\columnbreak
     1590
     1591thread 2
     1592\begin{cfa}
     1593acquire A
     1594        signal A
     1595release A
     1596\end{cfa}
     1597\end{multicols}
     1598One thread acquires before waiting (atomically blocking and releasing A) and the other acquires before signalling.
     1599It is important to note here that both @wait@ and @signal@ must be called with the proper monitor(s) already acquired.
     1600This semantic is a logical requirement for barging prevention.
     1601
     1602A direct extension of the previous example is a \textbf{bulk-acq} version:
     1603\begin{multicols}{2}
     1604\begin{cfa}
     1605acquire A & B
     1606        wait A & B
     1607release A & B
     1608\end{cfa}
     1609\columnbreak
     1610\begin{cfa}
     1611acquire A & B
     1612        signal A & B
     1613release A & B
     1614\end{cfa}
     1615\end{multicols}
     1616\noindent This version uses \textbf{bulk-acq} (denoted using the {\sf\&} symbol), but the presence of multiple monitors does not add a particularly new meaning.
     1617Synchronization happens between the two threads in exactly the same way and order.
     1618The only difference is that mutual exclusion covers a group of monitors.
     1619On the implementation side, handling multiple monitors does add a degree of complexity as the next few examples demonstrate.
     1620
     1621While deadlock issues can occur when nesting monitors, these issues are only a symptom of the fact that locks, and by extension monitors, are not perfectly composable.
     1622For monitors, a well-known deadlock problem is the Nested Monitor Problem~\cite{Lister77}, which occurs when a @wait@ is made by a thread that holds more than one monitor.
     1623For example, the following cfa-code runs into the nested-monitor problem:
     1624\begin{multicols}{2}
     1625\begin{cfa}
     1626acquire A
     1627        acquire B
     1628                wait B
     1629        release B
     1630release A
     1631\end{cfa}
     1632
     1633\columnbreak
     1634
     1635\begin{cfa}
     1636acquire A
     1637        acquire B
     1638                signal B
     1639        release B
     1640release A
     1641\end{cfa}
     1642\end{multicols}
     1643\noindent The @wait@ only releases monitor @B@ so the signalling thread cannot acquire monitor @A@ to get to the @signal@.
     1644Attempting release of all acquired monitors at the @wait@ introduces a different set of problems, such as releasing monitor @C@, which has nothing to do with the @signal@.
     1645
     1646However, for monitors as for locks, it is possible to write a program using nesting without encountering any problems if nesting is done correctly.
     1647For example, the next cfa-code snippet acquires monitors {\sf A} then {\sf B} before waiting, while only acquiring {\sf B} when signalling, effectively avoiding the Nested Monitor Problem~\cite{Lister77}.
     1648
     1649\begin{multicols}{2}
     1650\begin{cfa}
     1651acquire A
     1652        acquire B
     1653                wait B
     1654        release B
     1655release A
     1656\end{cfa}
     1657
     1658\columnbreak
     1659
     1660\begin{cfa}
     1661
     1662acquire B
     1663        signal B
     1664release B
     1665
     1666\end{cfa}
     1667\end{multicols}
     1668
     1669\noindent However, this simple refactoring may not be possible, forcing more complex restructuring.
     1670
     1671% ======================================================================
     1672% ======================================================================
     1673\subsection{Internal Scheduling - In Depth}
     1674% ======================================================================
     1675% ======================================================================
     1676
     1677A larger example is presented to show complex issues for \textbf{bulk-acq} and its implementation options are analyzed.
     1678Figure~\ref{f:int-bulk-cfa} shows an example where \textbf{bulk-acq} adds a significant layer of complexity to the internal signalling semantics, and listing \ref{f:int-bulk-cfa} shows the corresponding \CFA code to implement the cfa-code in listing \ref{f:int-bulk-cfa}.
     1679For the purpose of translating the given cfa-code into \CFA-code, any method of introducing a monitor is acceptable, \eg @mutex@ parameters, global variables, pointer parameters, or using locals with the @mutex@ statement.
     1680
     1681\begin{figure}
     1682\begin{multicols}{2}
     1683Waiting thread
     1684\begin{cfa}[numbers=left]
     1685acquire A
     1686        // Code Section 1
     1687        acquire A & B
     1688                // Code Section 2
     1689                wait A & B
     1690                // Code Section 3
     1691        release A & B
     1692        // Code Section 4
     1693release A
     1694\end{cfa}
     1695\columnbreak
     1696Signalling thread
     1697\begin{cfa}[numbers=left, firstnumber=10,escapechar=|]
     1698acquire A
     1699        // Code Section 5
     1700        acquire A & B
     1701                // Code Section 6
     1702                |\label{line:signal1}|signal A & B
     1703                // Code Section 7
     1704        |\label{line:releaseFirst}|release A & B
     1705        // Code Section 8
     1706|\label{line:lastRelease}|release A
     1707\end{cfa}
     1708\end{multicols}
     1709\begin{cfa}[caption={Internal scheduling with \textbf{bulk-acq}},label={f:int-bulk-cfa}]
     1710\end{cfa}
     1711\begin{center}
     1712\begin{cfa}[xleftmargin=.4\textwidth]
     1713monitor A a;
     1714monitor B b;
     1715condition c;
     1716\end{cfa}
     1717\end{center}
     1718\begin{multicols}{2}
     1719Waiting thread
     1720\begin{cfa}
     1721mutex(a) {
     1722        // Code Section 1
     1723        mutex(a, b) {
     1724                // Code Section 2
     1725                wait(c);
     1726                // Code Section 3
    14281727        }
    1429 }
    1430 
    1431 \end{cfa}
    1432 \end{tabular}
    1433 \end{cquote}
    1434 
    1435 
    1436 \section{Scheduling}
    1437 \label{s:Scheduling}
    1438 
    1439 While monitor mutual-exclusion provides safe access to shared data, the monitor data may indicate that a thread accessing it cannot proceed.
    1440 For example, Figure~\ref{f:GenericBoundedBuffer} shows a bounded buffer that may be full/empty so produce/consumer threads must block.
    1441 Leaving the monitor and trying again (busy waiting) is impractical for high-level programming.
    1442 Monitors eliminate busy waiting by providing internal synchronization to schedule threads needing access to the shared data, where the synchronization is blocking (threads are parked) versus spinning.
    1443 Synchronization is generally achieved with internal~\cite{Hoare74} or external~\cite[\S~2.9.2]{uC++} scheduling, where \newterm{scheduling} defines which thread acquires the critical section next.
    1444 \newterm{Internal scheduling} is characterized by each thread entering the monitor and making an individual decision about proceeding or blocking, while \newterm{external scheduling} is characterized by an entering thread making a decision about proceeding for itself and on behalf of other threads attempting entry.
    1445 
    1446 Figure~\ref{f:BBInt} shows a \CFA bounded-buffer with internal scheduling, where producers/consumers enter the monitor, see the buffer is full/empty, and block on an appropriate condition lock, @full@/@empty@.
    1447 The @wait@ routine atomically blocks the calling thread and implicitly releases the monitor lock(s) for all monitors in the routine's parameter list.
    1448 The appropriate condition lock is signalled to unblock an opposite kind of thread after an element is inserted/removed from the buffer.
    1449 Signalling is unconditional, because signalling an empty condition lock does nothing.
    1450 
    1451 Signalling semantics cannot have the signaller and signalled thread in the monitor simultaneously, which means:
    1452 \begin{enumerate}
    1453 \item
    1454 The signalling thread returns immediately, and the signalled thread continues.
    1455 \item
    1456 The signalling thread continues and the signalled thread is marked for urgent unblocking at the next scheduling point (exit/wait).
    1457 \item
    1458 The signalling thread blocks but is marked for urgrent unblocking at the next scheduling point and the signalled thread continues.
    1459 \end{enumerate}
    1460 The first approach is too restrictive, as it precludes solving a reasonable class of problems, \eg dating service.
    1461 \CFA supports the next two semantics as both are useful.
    1462 Finally, while it is common to store a @condition@ as a field of the monitor, in \CFA, a @condition@ variable can be created/stored independently.
    1463 Furthermore, a condition variable is tied to a \emph{group} of monitors on first use (called \newterm{branding}), which means that using internal scheduling with distinct sets of monitors requires one condition variable per set of monitors.
    1464 
    1465 \begin{figure}
    1466 \centering
    1467 \newbox\myboxA
    1468 \begin{lrbox}{\myboxA}
    1469 \begin{cfa}[aboveskip=0pt,belowskip=0pt]
    1470 forall( otype T ) { // distribute forall
    1471         monitor Buffer {
    1472                 `condition` full, empty;
    1473                 int front, back, count;
    1474                 T elements[10];
    1475         };
    1476         void ?{}( Buffer(T) & buffer ) with(buffer) {
    1477                 [front, back, count] = 0;
     1728        // Code Section 4
     1729}
     1730\end{cfa}
     1731\columnbreak
     1732Signalling thread
     1733\begin{cfa}
     1734mutex(a) {
     1735        // Code Section 5
     1736        mutex(a, b) {
     1737                // Code Section 6
     1738                signal(c);
     1739                // Code Section 7
    14781740        }
    1479 
    1480         void insert( Buffer(T) & mutex buffer, T elem )
    1481                                 with(buffer) {
    1482                 if ( count == 10 ) `wait( empty )`;
    1483                 // insert elem into buffer
    1484                 `signal( full )`;
    1485         }
    1486         T remove( Buffer(T) & mutex buffer ) with(buffer) {
    1487                 if ( count == 0 ) `wait( full )`;
    1488                 // remove elem from buffer
    1489                 `signal( empty )`;
    1490                 return elem;
    1491         }
    1492 }
    1493 \end{cfa}
    1494 \end{lrbox}
    1495 
    1496 \newbox\myboxB
    1497 \begin{lrbox}{\myboxB}
    1498 \begin{cfa}[aboveskip=0pt,belowskip=0pt]
    1499 forall( otype T ) { // distribute forall
    1500         monitor Buffer {
    1501 
    1502                 int front, back, count;
    1503                 T elements[10];
    1504         };
    1505         void ?{}( Buffer(T) & buffer ) with(buffer) {
    1506                 [front, back, count] = 0;
    1507         }
    1508         T remove( Buffer(T) & mutex buffer ); // forward
    1509         void insert( Buffer(T) & mutex buffer, T elem )
    1510                                 with(buffer) {
    1511                 if ( count == 10 ) `waitfor( remove, buffer )`;
    1512                 // insert elem into buffer
    1513 
    1514         }
    1515         T remove( Buffer(T) & mutex buffer ) with(buffer) {
    1516                 if ( count == 0 ) `waitfor( insert, buffer )`;
    1517                 // remove elem from buffer
    1518 
    1519                 return elem;
    1520         }
    1521 }
    1522 \end{cfa}
    1523 \end{lrbox}
    1524 
    1525 \subfloat[Internal Scheduling]{\label{f:BBInt}\usebox\myboxA}
    1526 %\qquad
    1527 \subfloat[External Scheduling]{\label{f:BBExt}\usebox\myboxB}
    1528 \caption{Generic Bounded-Buffer}
    1529 \label{f:GenericBoundedBuffer}
     1741        // Code Section 8
     1742}
     1743\end{cfa}
     1744\end{multicols}
     1745\begin{cfa}[caption={Equivalent \CFA code for listing \ref{f:int-bulk-cfa}},label={f:int-bulk-cfa}]
     1746\end{cfa}
     1747\begin{multicols}{2}
     1748Waiter
     1749\begin{cfa}[numbers=left]
     1750acquire A
     1751        acquire A & B
     1752                wait A & B
     1753        release A & B
     1754release A
     1755\end{cfa}
     1756
     1757\columnbreak
     1758
     1759Signaller
     1760\begin{cfa}[numbers=left, firstnumber=6,escapechar=|]
     1761acquire A
     1762        acquire A & B
     1763                signal A & B
     1764        release A & B
     1765        |\label{line:secret}|// Secretly keep B here
     1766release A
     1767// Wakeup waiter and transfer A & B
     1768\end{cfa}
     1769\end{multicols}
     1770\begin{cfa}[caption={Figure~\ref{f:int-bulk-cfa}, with delayed signalling comments},label={f:int-secret}]
     1771\end{cfa}
    15301772\end{figure}
    15311773
    1532 Figure~\ref{f:BBExt} shows a \CFA bounded-buffer with external scheduling, where producers/consumers detecting a full/empty buffer block and prevent more producers/consumers from entering the monitor until the buffer has a free/empty slot.
    1533 External scheduling is controlled by the @waitfor@ statement, which atomically blocks the calling thread, releases the monitor lock, and restricts the routine calls that can next acquire mutual exclusion.
    1534 If the buffer is full, only calls to @remove@ can acquire the buffer, and if the buffer is empty, only calls to @insert@ can acquire the buffer.
    1535 Threads making calls to routines that are currently excluded block outside (external) of the monitor on a calling queue, versus blocking on condition queues inside (internal) of the monitor.
    1536 % External scheduling is more constrained and explicit, which helps programmers reduce the non-deterministic nature of concurrency.
    1537 External scheduling allows users to wait for events from other threads without concern of unrelated events occurring.
    1538 The mechnaism can be done in terms of control flow, \eg Ada @accept@ or \uC @_Accept@, or in terms of data, \eg Go channels.
    1539 While both mechanisms have strengths and weaknesses, this project uses a control-flow mechanism to stay consistent with other language semantics.
    1540 Two challenges specific to \CFA for external scheduling are loose object-definitions (see Section~\ref{s:LooseObjectDefinitions}) and multiple-monitor routines (see Section~\ref{s:Multi-MonitorScheduling}).
    1541 
    1542 For internal scheduling, non-blocking signalling (as in the producer/consumer example) is used when the signaller is providing the cooperation for a waiting thread;
    1543 the signaller enters the monitor and changes state, detects a waiting threads that can use the state, performs a non-blocking signal on the condition queue for the waiting thread, and exits the monitor to run concurrently.
    1544 The waiter unblocks next, uses/takes the state, and exits the monitor.
    1545 Blocking signalling is the reverse, where the waiter is providing the cooperation for the signalling thread;
    1546 the signaller enters the monitor, detects a waiting thread providing the necessary state, performs a blocking signal to place it on the urgent queue and unblock the waiter.
    1547 The waiter changes state and exits the monitor, and the signaller unblocks next from the urgent queue to use/take the state.
    1548 
    1549 Figure~\ref{f:DatingService} shows a dating service demonstrating the two forms of signalling: non-blocking and blocking.
    1550 The dating service matches girl and boy threads with matching compatibility codes so they can exchange phone numbers.
    1551 A thread blocks until an appropriate partner arrives.
    1552 The complexity is exchanging phone number in the monitor because the monitor mutual-exclusion property prevents exchanging numbers.
    1553 For internal scheduling, the @exchange@ condition is necessary to block the thread finding the match, while the matcher unblocks to take the oppose number, post its phone number, and unblock the partner.
    1554 For external scheduling, the implicit urgent-condition replaces the explict @exchange@-condition and @signal_block@ puts the finding thread on the urgent condition and unblocks the matcher..
    1555 
    1556 The dating service is an example of a monitor that cannot be written using external scheduling because it requires knowledge of calling parameters to make scheduling decisions, and parameters of waiting threads are unavailable;
    1557 as well, an arriving thread may not find a partner and must wait, which requires a condition variable, and condition variables imply internal scheduling.
    1558 
    1559 \begin{figure}
    1560 \centering
    1561 \newbox\myboxA
    1562 \begin{lrbox}{\myboxA}
    1563 \begin{cfa}[aboveskip=0pt,belowskip=0pt]
    1564 enum { CCodes = 20 };
    1565 monitor DS {
    1566         int GirlPhNo, BoyPhNo;
    1567         condition Girls[CCodes], Boys[CCodes];
    1568         condition exchange;
    1569 };
    1570 int girl( DS & mutex ds, int phNo, int ccode ) {
    1571         if ( is_empty( Boys[ccode] ) ) {
    1572                 wait( Girls[ccode] );
    1573                 GirlPhNo = phNo;
    1574                 exchange.signal();
    1575         } else {
    1576                 GirlPhNo = phNo;
    1577                 signal( Boys[ccode] );
    1578                 exchange.wait();
    1579         } // if
    1580         return BoyPhNo;
    1581 }
    1582 int boy( DS & mutex ds, int phNo, int ccode ) {
    1583         // as above with boy/girl interchanged
    1584 }
    1585 \end{cfa}
    1586 \end{lrbox}
    1587 
    1588 \newbox\myboxB
    1589 \begin{lrbox}{\myboxB}
    1590 \begin{cfa}[aboveskip=0pt,belowskip=0pt]
    1591 
    1592 monitor DS {
    1593         int GirlPhNo, BoyPhNo;
    1594         condition Girls[CCodes], Boys[CCodes];
    1595 
    1596 };
    1597 int girl( DS & mutex ds, int phNo, int ccode ) {
    1598         if ( is_empty( Boys[ccode] ) ) { // no compatible
    1599                 wait( Girls[ccode] ); // wait for boy
    1600                 GirlPhNo = phNo; // make phone number available
    1601 
    1602         } else {
    1603                 GirlPhNo = phNo; // make phone number available
    1604                 signal_block( Boys[ccode] ); // restart boy
    1605 
    1606         } // if
    1607         return BoyPhNo;
    1608 }
    1609 int boy( DS & mutex ds, int phNo, int ccode ) {
    1610         // as above with boy/girl interchanged
    1611 }
    1612 \end{cfa}
    1613 \end{lrbox}
    1614 
    1615 \subfloat[\lstinline@signal@]{\label{f:DatingSignal}\usebox\myboxA}
    1616 \qquad
    1617 \subfloat[\lstinline@signal_block@]{\label{f:DatingSignalBlock}\usebox\myboxB}
    1618 \caption{Dating service. }
    1619 \label{f:DatingService}
    1620 \end{figure}
    1621 
    1622 Both internal and external scheduling extend to multiple monitors in a natural way.
    1623 \begin{cquote}
    1624 \begin{tabular}{@{}l@{\hspace{3\parindentlnth}}l@{}}
    1625 \begin{cfa}
    1626 monitor M { `condition e`; ... };
    1627 void foo( M & mutex m1, M & mutex m2 ) {
    1628         ... wait( `e` ); ...   // wait( e, m1, m2 )
    1629         ... wait( `e, m1` ); ...
    1630         ... wait( `e, m2` ); ...
    1631 }
    1632 \end{cfa}
    1633 &
    1634 \begin{cfa}
    1635 void rtn$\(_1\)$( M & mutex m1, M & mutex m2 );
    1636 void rtn$\(_2\)$( M & mutex m1 );
    1637 void bar( M & mutex m1, M & mutex m2 ) {
    1638         ... waitfor( `rtn` ); ...       // $\LstCommentStyle{waitfor( rtn\(_1\), m1, m2 )}$
    1639         ... waitfor( `rtn, m1` ); ... // $\LstCommentStyle{waitfor( rtn\(_2\), m1 )}$
    1640 }
    1641 \end{cfa}
    1642 \end{tabular}
    1643 \end{cquote}
    1644 For @wait( e )@, the default semantics is to atomically block the signaller and release all acquired mutex types in the parameter list, \ie @wait( e, m1, m2 )@.
    1645 To override the implicit multi-monitor wait, specific mutex parameter(s) can be specified, \eg @wait( e, m1 )@.
    1646 Wait statically verifies the released monitors are the acquired mutex-parameters so unconditional release is safe.
    1647 Finally, a signaller,
    1648 \begin{cfa}
    1649 void baz( M & mutex m1, M & mutex m2 ) {
    1650         ... signal( e ); ...
    1651 }
    1652 \end{cfa}
    1653 must have acquired monitor locks that are greater than or equal to the number of locks for the waiting thread signalled from the condition queue.
    1654 
    1655 Similarly, for @waitfor( rtn )@, the default semantics is to atomically block the acceptor and release all acquired mutex types in the parameter list, \ie @waitfor( rtn, m1, m2 )@.
    1656 To override the implicit multi-monitor wait, specific mutex parameter(s) can be specified, \eg @waitfor( rtn, m1 )@.
    1657 Waitfor statically verifies the released monitors are the same as the acquired mutex-parameters of the given routine or routine pointer.
    1658 To statically verify the released monitors match with the accepted routine's mutex parameters, the routine (pointer) prototype must be accessible.
    1659 
    1660 Given the ability to release a subset of acquired monitors can result in a \newterm{nested monitor}~\cite{Lister77} deadlock.
    1661 \begin{cfa}
    1662 void foo( M & mutex m1, M & mutex m2 ) {
    1663         ... wait( `e, m1` ); ...                                $\C{// release m1, keeping m2 acquired )}$
    1664 void bar( M & mutex m1, M & mutex m2 ) {        $\C{// must acquire m1 and m2 )}$
    1665         ... signal( `e` ); ...
    1666 \end{cfa}
    1667 The @wait@ only releases @m1@ so the signalling thread cannot acquire both @m1@ and @m2@ to  enter @bar@ to get to the @signal@.
    1668 While deadlock issues can occur with multiple/nesting acquisition, this issue results from the fact that locks, and by extension monitors, are not perfectly composable.
    1669 
    1670 Finally, an important aspect of monitor implementation is barging, \ie can calling threads barge ahead of signalled threads?
    1671 If barging is allowed, synchronization between a singller and signallee is difficult, often requiring multiple unblock/block cycles (looping around a wait rechecking if a condition is met).
    1672 \begin{quote}
    1673 However, we decree that a signal operation be followed immediately by resumption of a waiting program, without possibility of an intervening procedure call from yet a third program.
    1674 It is only in this way that a waiting program has an absolute guarantee that it can acquire the resource just released by the signalling program without any danger that a third program will interpose a monitor entry and seize the resource instead.~\cite[p.~550]{Hoare74}
    1675 \end{quote}
    1676 \CFA scheduling \emph{precludes} barging, which simplifies synchronization among threads in the monitor and increases correctness.
    1677 For example, there are no loops in either bounded buffer solution in Figure~\ref{f:GenericBoundedBuffer}.
    1678 Supporting barging prevention as well as extending internal scheduling to multiple monitors is the main source of complexity in the design and implementation of \CFA concurrency.
    1679 
    1680 
    1681 \subsection{Barging Prevention}
    1682 
    1683 Figure~\ref{f:BargingPrevention} shows \CFA code where bulk acquire adds complexity to the internal-signalling semantics.
    1684 The complexity begins at the end of the inner @mutex@ statement, where the semantics of internal scheduling need to be extended for multiple monitors.
    1685 The problem is that bulk acquire is used in the inner @mutex@ statement where one of the monitors is already acquired.
    1686 When the signalling thread reaches the end of the inner @mutex@ statement, it should transfer ownership of @m1@ and @m2@ to the waiting threads to prevent barging into the outer @mutex@ statement by another thread.
    1687 However, both the signalling and waiting thread W1 still need monitor @m1@.
    1688 
    1689 \begin{figure}
    1690 \newbox\myboxA
    1691 \begin{lrbox}{\myboxA}
    1692 \begin{cfa}[aboveskip=0pt,belowskip=0pt]
    1693 monitor M m1, m2;
    1694 condition c;
    1695 mutex( m1 ) { // $\LstCommentStyle{\color{red}outer}$
    1696         ...
    1697         mutex( m1, m2 ) { // $\LstCommentStyle{\color{red}inner}$
    1698                 ... `signal( c )`; ...
    1699                 // m1, m2 acquired
    1700         } // $\LstCommentStyle{\color{red}release m2}$
    1701         // m1 acquired
    1702 } // release m1
    1703 \end{cfa}
    1704 \end{lrbox}
    1705 
    1706 \newbox\myboxB
    1707 \begin{lrbox}{\myboxB}
    1708 \begin{cfa}[aboveskip=0pt,belowskip=0pt]
    1709 
    1710 
    1711 mutex( m1 ) {
    1712         ...
    1713         mutex( m1, m2 ) {
    1714                 ... `wait( c )`; // block and release m1, m2
    1715                 // m1, m2 acquired
    1716         } // $\LstCommentStyle{\color{red}release m2}$
    1717         // m1 acquired
    1718 } // release m1
    1719 \end{cfa}
    1720 \end{lrbox}
    1721 
    1722 \newbox\myboxC
    1723 \begin{lrbox}{\myboxC}
    1724 \begin{cfa}[aboveskip=0pt,belowskip=0pt]
    1725 
    1726 
    1727 mutex( m2 ) {
    1728         ... `wait( c )`; ...
    1729         // m2 acquired
    1730 } // $\LstCommentStyle{\color{red}release m2}$
    1731 
    1732 
    1733 
    1734 
    1735 \end{cfa}
    1736 \end{lrbox}
    1737 
    1738 \begin{cquote}
    1739 \subfloat[Signalling Thread]{\label{f:SignallingThread}\usebox\myboxA}
    1740 \hspace{2\parindentlnth}
    1741 \subfloat[Waiting Thread (W1)]{\label{f:WaitingThread}\usebox\myboxB}
    1742 \hspace{2\parindentlnth}
    1743 \subfloat[Waiting Thread (W2)]{\label{f:OtherWaitingThread}\usebox\myboxC}
    1744 \end{cquote}
    1745 \caption{Barging Prevention}
    1746 \label{f:BargingPrevention}
    1747 \end{figure}
    1748 
    1749 One scheduling solution is for the signaller to keep ownership of all locks until the last lock is ready to be transferred, because this semantics fits most closely to the behaviour of single-monitor scheduling.
    1750 However, Figure~\ref{f:OtherWaitingThread} shows this solution is complex depending on other waiters, resulting is choices when the signaller finishes the inner mutex-statement.
    1751 The singaller can retain @m2@ until completion of the outer mutex statement and pass the locks to waiter W1, or it can pass @m2@ to waiter W2 after completing the inner mutex-statement, while continuing to hold @m1@.
    1752 In the latter case, waiter W2 must eventually pass @m2@ to waiter W1, which is complex because W1 may have waited before W2, so W2 is unaware of it.
    1753 Furthermore, there is an execution sequence where the signaller always finds waiter W2, and hence, waiter W1 starves.
    1754 
    1755 While a number of approaches were examined~\cite[\S~4.3]{Delisle18}, the solution chosen for \CFA is a novel techique called \newterm{partial signalling}.
    1756 Signalled threads are moved to an urgent queue and the waiter at the front defines the set of monitors necessary for it to unblock.
    1757 Partial signalling transfers ownership of monitors to the front waiter.
    1758 When the signaller thread exits or waits in the monitor the front waiter is unblocked if all its monitors are released.
    1759 This solution has the benefit that complexity is encapsulated into only two actions: passing monitors to the next owner when they should be released and conditionally waking threads if all conditions are met.
    1760 
    1761 \begin{comment}
     1774The complexity begins at code sections 4 and 8 in listing \ref{f:int-bulk-cfa}, which are where the existing semantics of internal scheduling needs to be extended for multiple monitors.
     1775The root of the problem is that \textbf{bulk-acq} is used in a context where one of the monitors is already acquired, which is why it is important to define the behaviour of the previous cfa-code.
     1776When the signaller thread reaches the location where it should ``release @A & B@'' (listing \ref{f:int-bulk-cfa} line \ref{line:releaseFirst}), it must actually transfer ownership of monitor @B@ to the waiting thread.
     1777This ownership transfer is required in order to prevent barging into @B@ by another thread, since both the signalling and signalled threads still need monitor @A@.
     1778There are three options:
     1779
     1780\subsubsection{Delaying Signals}
     1781The obvious solution to the problem of multi-monitor scheduling is to keep ownership of all locks until the last lock is ready to be transferred.
     1782It can be argued that that moment is when the last lock is no longer needed, because this semantics fits most closely to the behaviour of single-monitor scheduling.
     1783This solution has the main benefit of transferring ownership of groups of monitors, which simplifies the semantics from multiple objects to a single group of objects, effectively making the existing single-monitor semantic viable by simply changing monitors to monitor groups.
     1784This solution releases the monitors once every monitor in a group can be released.
     1785However, since some monitors are never released (\eg the monitor of a thread), this interpretation means a group might never be released.
     1786A more interesting interpretation is to transfer the group until all its monitors are released, which means the group is not passed further and a thread can retain its locks.
     1787
     1788However, listing \ref{f:int-secret} shows this solution can become much more complicated depending on what is executed while secretly holding B at line \ref{line:secret}, while avoiding the need to transfer ownership of a subset of the condition monitors.
    17621789Figure~\ref{f:dependency} shows a slightly different example where a third thread is waiting on monitor @A@, using a different condition variable.
    17631790Because the third thread is signalled when secretly holding @B@, the goal  becomes unreachable.
     
    17731800In both cases, the threads need to be able to distinguish, on a per monitor basis, which ones need to be released and which ones need to be transferred, which means knowing when to release a group becomes complex and inefficient (see next section) and therefore effectively precludes this approach.
    17741801
    1775 
    17761802\subsubsection{Dependency graphs}
     1803
    17771804
    17781805\begin{figure}
     
    18511878The extra challenge is that this dependency graph is effectively post-mortem, but the runtime system needs to be able to build and solve these graphs as the dependencies unfold.
    18521879Resolving dependency graphs being a complex and expensive endeavour, this solution is not the preferred one.
    1853 \end{comment}
    1854 
    1855 
    1856 \begin{comment}
     1880
     1881\subsubsection{Partial Signalling} \label{partial-sig}
     1882Finally, the solution that is chosen for \CFA is to use partial signalling.
     1883Again using listing \ref{f:int-bulk-cfa}, the partial signalling solution transfers ownership of monitor @B@ at lines \ref{line:signal1} to the waiter but does not wake the waiting thread since it is still using monitor @A@.
     1884Only when it reaches line \ref{line:lastRelease} does it actually wake up the waiting thread.
     1885This solution has the benefit that complexity is encapsulated into only two actions: passing monitors to the next owner when they should be released and conditionally waking threads if all conditions are met.
     1886This solution has a much simpler implementation than a dependency graph solving algorithms, which is why it was chosen.
     1887Furthermore, after being fully implemented, this solution does not appear to have any significant downsides.
     1888
     1889Using partial signalling, listing \ref{f:dependency} can be solved easily:
     1890\begin{itemize}
     1891        \item When thread $\gamma$ reaches line \ref{line:release-ab} it transfers monitor @B@ to thread $\alpha$ and continues to hold monitor @A@.
     1892        \item When thread $\gamma$ reaches line \ref{line:release-a}  it transfers monitor @A@ to thread $\beta$  and wakes it up.
     1893        \item When thread $\beta$  reaches line \ref{line:release-aa} it transfers monitor @A@ to thread $\alpha$ and wakes it up.
     1894\end{itemize}
     1895
     1896% ======================================================================
     1897% ======================================================================
     1898\subsection{Signalling: Now or Later}
     1899% ======================================================================
     1900% ======================================================================
     1901\begin{table}
     1902\begin{tabular}{|c|c|}
     1903@signal@ & @signal_block@ \\
     1904\hline
     1905\begin{cfa}[tabsize=3]
     1906monitor DatingService {
     1907        // compatibility codes
     1908        enum{ CCodes = 20 };
     1909
     1910        int girlPhoneNo
     1911        int boyPhoneNo;
     1912};
     1913
     1914condition girls[CCodes];
     1915condition boys [CCodes];
     1916condition exchange;
     1917
     1918int girl(int phoneNo, int cfa) {
     1919        // no compatible boy ?
     1920        if(empty(boys[cfa])) {
     1921                wait(girls[cfa]);               // wait for boy
     1922                girlPhoneNo = phoneNo;          // make phone number available
     1923                signal(exchange);               // wake boy from chair
     1924        } else {
     1925                girlPhoneNo = phoneNo;          // make phone number available
     1926                signal(boys[cfa]);              // wake boy
     1927                wait(exchange);         // sit in chair
     1928        }
     1929        return boyPhoneNo;
     1930}
     1931int boy(int phoneNo, int cfa) {
     1932        // same as above
     1933        // with boy/girl interchanged
     1934}
     1935\end{cfa}&\begin{cfa}[tabsize=3]
     1936monitor DatingService {
     1937
     1938        enum{ CCodes = 20 };    // compatibility codes
     1939
     1940        int girlPhoneNo;
     1941        int boyPhoneNo;
     1942};
     1943
     1944condition girls[CCodes];
     1945condition boys [CCodes];
     1946// exchange is not needed
     1947
     1948int girl(int phoneNo, int cfa) {
     1949        // no compatible boy ?
     1950        if(empty(boys[cfa])) {
     1951                wait(girls[cfa]);               // wait for boy
     1952                girlPhoneNo = phoneNo;          // make phone number available
     1953                signal(exchange);               // wake boy from chair
     1954        } else {
     1955                girlPhoneNo = phoneNo;          // make phone number available
     1956                signal_block(boys[cfa]);                // wake boy
     1957
     1958                // second handshake unnecessary
     1959
     1960        }
     1961        return boyPhoneNo;
     1962}
     1963
     1964int boy(int phoneNo, int cfa) {
     1965        // same as above
     1966        // with boy/girl interchanged
     1967}
     1968\end{cfa}
     1969\end{tabular}
     1970\caption{Dating service example using \protect\lstinline|signal| and \protect\lstinline|signal_block|. }
     1971\label{tbl:datingservice}
     1972\end{table}
     1973An important note is that, until now, signalling a monitor was a delayed operation.
     1974The ownership of the monitor is transferred only when the monitor would have otherwise been released, not at the point of the @signal@ statement.
     1975However, in some cases, it may be more convenient for users to immediately transfer ownership to the thread that is waiting for cooperation, which is achieved using the @signal_block@ routine.
     1976
     1977The example in table \ref{tbl:datingservice} highlights the difference in behaviour.
     1978As mentioned, @signal@ only transfers ownership once the current critical section exits; this behaviour requires additional synchronization when a two-way handshake is needed.
     1979To avoid this explicit synchronization, the @condition@ type offers the @signal_block@ routine, which handles the two-way handshake as shown in the example.
     1980This feature removes the need for a second condition variables and simplifies programming.
     1981Like every other monitor semantic, @signal_block@ uses barging prevention, which means mutual-exclusion is baton-passed both on the front end and the back end of the call to @signal_block@, meaning no other thread can acquire the monitor either before or after the call.
     1982
     1983% ======================================================================
     1984% ======================================================================
    18571985\section{External scheduling} \label{extsched}
    1858 
     1986% ======================================================================
     1987% ======================================================================
     1988An alternative to internal scheduling is external scheduling (see Table~\ref{tbl:sched}).
    18591989\begin{table}
    18601990\begin{tabular}{|c|c|c|}
     
    19202050\label{tbl:sched}
    19212051\end{table}
     2052This method is more constrained and explicit, which helps users reduce the non-deterministic nature of concurrency.
     2053Indeed, as the following examples demonstrate, external scheduling allows users to wait for events from other threads without the concern of unrelated events occurring.
     2054External scheduling can generally be done either in terms of control flow (\eg Ada with @accept@, \uC with @_Accept@) or in terms of data (\eg Go with channels).
     2055Of course, both of these paradigms have their own strengths and weaknesses, but for this project, control-flow semantics was chosen to stay consistent with the rest of the languages semantics.
     2056Two challenges specific to \CFA arise when trying to add external scheduling with loose object definitions and multiple-monitor routines.
     2057The previous example shows a simple use @_Accept@ versus @wait@/@signal@ and its advantages.
     2058Note that while other languages often use @accept@/@select@ as the core external scheduling keyword, \CFA uses @waitfor@ to prevent name collisions with existing socket \textbf{api}s.
    19222059
    19232060For the @P@ member above using internal scheduling, the call to @wait@ only guarantees that @V@ is the last routine to access the monitor, allowing a third routine, say @isInUse()@, acquire mutual exclusion several times while routine @P@ is waiting.
    19242061On the other hand, external scheduling guarantees that while routine @P@ is waiting, no other routine than @V@ can acquire the monitor.
    1925 \end{comment}
    1926 
    1927 
     2062
     2063% ======================================================================
     2064% ======================================================================
    19282065\subsection{Loose Object Definitions}
    1929 \label{s:LooseObjectDefinitions}
    1930 
    1931 In an object-oriented programming-language, a class includes an exhaustive list of operations.
    1932 However, new members can be added via static inheritance or dynaic members, \eg JavaScript~\cite{JavaScript}.
    1933 Similarly, monitor routines can be added at any time in \CFA, making it less clear for programmers and more difficult to implement.
    1934 \begin{cfa}
    1935 monitor M {};
    1936 void `f`( M & mutex m );
    1937 void g( M & mutex m ) { waitfor( `f` ); }       $\C{// clear which f}$
    1938 void `f`( M & mutex m, int );                           $\C{// different f}$
    1939 void h( M & mutex m ) { waitfor( `f` ); }       $\C{// unclear which f}$
    1940 \end{cfa}
    1941 Hence, the cfa-code for the entering a monitor looks like:
    1942 \begin{cfa}
    1943 if ( $\textrm{\textit{monitor is free}}$ ) $\LstCommentStyle{// \color{red}enter}$
    1944 else if ( $\textrm{\textit{already own monitor}}$ ) $\LstCommentStyle{// \color{red}continue}$
    1945 else if ( $\textrm{\textit{monitor accepts me}}$ ) $\LstCommentStyle{// \color{red}enter}$
    1946 else $\LstCommentStyle{// \color{red}block}$
    1947 \end{cfa}
     2066% ======================================================================
     2067% ======================================================================
     2068In \uC, a monitor class declaration includes an exhaustive list of monitor operations.
     2069Since \CFA is not object oriented, monitors become both more difficult to implement and less clear for a user:
     2070
     2071\begin{cfa}
     2072monitor A {};
     2073
     2074void f(A & mutex a);
     2075void g(A & mutex a) {
     2076        waitfor(f); // Obvious which f() to wait for
     2077}
     2078
     2079void f(A & mutex a, int); // New different F added in scope
     2080void h(A & mutex a) {
     2081        waitfor(f); // Less obvious which f() to wait for
     2082}
     2083\end{cfa}
     2084
     2085Furthermore, external scheduling is an example where implementation constraints become visible from the interface.
     2086Here is the cfa-code for the entering phase of a monitor:
     2087\begin{center}
     2088\begin{tabular}{l}
     2089\begin{cfa}
     2090        if monitor is free
     2091                enter
     2092        elif already own the monitor
     2093                continue
     2094        elif monitor accepts me
     2095                enter
     2096        else
     2097                block
     2098\end{cfa}
     2099\end{tabular}
     2100\end{center}
    19482101For the first two conditions, it is easy to implement a check that can evaluate the condition in a few instructions.
    1949 However, a fast check for \emph{monitor accepts me} is much harder to implement depending on the constraints put on the monitors.
    1950 Figure~\ref{fig:ClassicalMonitor} shows monitors are often expressed as an entry (calling) queue, some acceptor queues, and an urgent stack/queue.
     2102However, a fast check for @monitor accepts me@ is much harder to implement depending on the constraints put on the monitors.
     2103Indeed, monitors are often expressed as an entry queue and some acceptor queue as in Figure~\ref{fig:ClassicalMonitor}.
    19512104
    19522105\begin{figure}
    19532106\centering
    1954 \subfloat[Classical monitor] {
     2107\subfloat[Classical Monitor] {
    19552108\label{fig:ClassicalMonitor}
    1956 {\resizebox{0.45\textwidth}{!}{\input{monitor.pstex_t}}}
     2109{\resizebox{0.45\textwidth}{!}{\input{monitor}}}
    19572110}% subfloat
    1958 \quad
    1959 \subfloat[Bulk acquire monitor] {
     2111\qquad
     2112\subfloat[\textbf{bulk-acq} Monitor] {
    19602113\label{fig:BulkMonitor}
    1961 {\resizebox{0.45\textwidth}{!}{\input{ext_monitor.pstex_t}}}
     2114{\resizebox{0.45\textwidth}{!}{\input{ext_monitor}}}
    19622115}% subfloat
    1963 \caption{Monitor Implementation}
    1964 \label{f:MonitorImplementation}
     2116\caption{External Scheduling Monitor}
    19652117\end{figure}
    19662118
    1967 For a fixed (small) number of mutex routines (\eg 128), the accept check reduces to a bitmask of allowed callers, which can be checked with a single instruction.
    1968 This approach requires a unique dense ordering of routines with a small upper-bound and the ordering must be consistent across translation units.
    1969 For object-oriented languages these constraints are common, but \CFA mutex routines can be added in any scope and are only visible in certain translation unit, precluding program-wide dense-ordering among mutex routines.
    1970 
    1971 Figure~\ref{fig:BulkMonitor} shows the \CFA monitor implementation.
    1972 The mutex routine called is associated with each thread on the entry queue, while a list of acceptable routines is kept separately.
    1973 The accepted list is a variable-sized array of accepted routine pointers, so the single instruction bitmask comparison is replaced by dereferencing a pointer followed by a linear search.
    1974 
    1975 \begin{comment}
     2119There are other alternatives to these pictures, but in the case of the left picture, implementing a fast accept check is relatively easy.
     2120Restricted to a fixed number of mutex members, N, the accept check reduces to updating a bitmask when the acceptor queue changes, a check that executes in a single instruction even with a fairly large number (\eg 128) of mutex members.
     2121This approach requires a unique dense ordering of routines with an upper-bound and that ordering must be consistent across translation units.
     2122For OO languages these constraints are common, since objects only offer adding member routines consistently across translation units via inheritance.
     2123However, in \CFA users can extend objects with mutex routines that are only visible in certain translation unit.
     2124This means that establishing a program-wide dense-ordering among mutex routines can only be done in the program linking phase, and still could have issues when using dynamically shared objects.
     2125
     2126The alternative is to alter the implementation as in Figure~\ref{fig:BulkMonitor}.
     2127Here, the mutex routine called is associated with a thread on the entry queue while a list of acceptable routines is kept separate.
     2128Generating a mask dynamically means that the storage for the mask information can vary between calls to @waitfor@, allowing for more flexibility and extensions.
     2129Storing an array of accepted function pointers replaces the single instruction bitmask comparison with dereferencing a pointer followed by a linear search.
     2130Furthermore, supporting nested external scheduling (\eg listing \ref{f:nest-ext}) may now require additional searches for the @waitfor@ statement to check if a routine is already queued.
     2131
    19762132\begin{figure}
    19772133\begin{cfa}[caption={Example of nested external scheduling},label={f:nest-ext}]
     
    19892145\end{figure}
    19902146
    1991 Note that in the right picture, tasks need to always keep track of the monitors associated with mutex routines, and the routine mask needs to have both a routine pointer and a set of monitors, as is discussed in the next section.
     2147Note that in the right picture, tasks need to always keep track of the monitors associated with mutex routines, and the routine mask needs to have both a function pointer and a set of monitors, as is discussed in the next section.
    19922148These details are omitted from the picture for the sake of simplicity.
    19932149
     
    19972153In the end, the most flexible approach has been chosen since it allows users to write programs that would otherwise be  hard to write.
    19982154This decision is based on the assumption that writing fast but inflexible locks is closer to a solved problem than writing locks that are as flexible as external scheduling in \CFA.
    1999 \end{comment}
    2000 
    2001 
     2155
     2156% ======================================================================
     2157% ======================================================================
    20022158\subsection{Multi-Monitor Scheduling}
    2003 \label{s:Multi-MonitorScheduling}
     2159% ======================================================================
     2160% ======================================================================
    20042161
    20052162External scheduling, like internal scheduling, becomes significantly more complex when introducing multi-monitor syntax.
    2006 Even in the simplest possible case, new semantics needs to be established:
     2163Even in the simplest possible case, some new semantics needs to be established:
    20072164\begin{cfa}
    20082165monitor M {};
    2009 void f( M & mutex m1 );
    2010 void g( M & mutex m1, M & mutex m2 ) {
    2011         waitfor( f );                                                   $\C{// pass m1 or m2 to f?}$
    2012 }
    2013 \end{cfa}
    2014 The solution is for the programmer to disambiguate:
    2015 \begin{cfa}
    2016         waitfor( f, m2 );                                               $\C{// wait for call to f with argument m2}$
    2017 \end{cfa}
    2018 Routine @g@ has acquired both locks, so when routine @f@ is called, the lock for monitor @m2@ is passed from @g@ to @f@ (while @g@ still holds lock @m1@).
    2019 This behaviour can be extended to the multi-monitor @waitfor@ statement.
     2166
     2167void f(M & mutex a);
     2168
     2169void g(M & mutex b, M & mutex c) {
     2170        waitfor(f); // two monitors M => unknown which to pass to f(M & mutex)
     2171}
     2172\end{cfa}
     2173The obvious solution is to specify the correct monitor as follows:
     2174
    20202175\begin{cfa}
    20212176monitor M {};
    2022 void f( M & mutex m1, M & mutex m2 );
    2023 void g( M & mutex m1, M & mutex m2 ) {
    2024         waitfor( f, m1, m2 );                                   $\C{// wait for call to f with arguments m1 and m2}$
    2025 }
    2026 \end{cfa}
    2027 Again, the set of monitors passed to the @waitfor@ statement must be entirely contained in the set of monitors already acquired by accepting routine.
     2177
     2178void f(M & mutex a);
     2179
     2180void g(M & mutex a, M & mutex b) {
     2181        // wait for call to f with argument b
     2182        waitfor(f, b);
     2183}
     2184\end{cfa}
     2185This syntax is unambiguous.
     2186Both locks are acquired and kept by @g@.
     2187When routine @f@ is called, the lock for monitor @b@ is temporarily transferred from @g@ to @f@ (while @g@ still holds lock @a@).
     2188This behaviour can be extended to the multi-monitor @waitfor@ statement as follows.
     2189
     2190\begin{cfa}
     2191monitor M {};
     2192
     2193void f(M & mutex a, M & mutex b);
     2194
     2195void g(M & mutex a, M & mutex b) {
     2196        // wait for call to f with arguments a and b
     2197        waitfor(f, a, b);
     2198}
     2199\end{cfa}
     2200
     2201Note that the set of monitors passed to the @waitfor@ statement must be entirely contained in the set of monitors already acquired in the routine. @waitfor@ used in any other context is undefined behaviour.
    20282202
    20292203An important behaviour to note is when a set of monitors only match partially:
     2204
    20302205\begin{cfa}
    20312206mutex struct A {};
     2207
    20322208mutex struct B {};
    2033 void g( A & mutex m1, B & mutex m2 ) {
    2034         waitfor( f, m1, m2 );
    2035 }
     2209
     2210void g(A & mutex a, B & mutex b) {
     2211        waitfor(f, a, b);
     2212}
     2213
    20362214A a1, a2;
    20372215B b;
     2216
    20382217void foo() {
    2039         g( a1, b ); // block on accept
    2040 }
     2218        g(a1, b); // block on accept
     2219}
     2220
    20412221void bar() {
    2042         f( a2, b ); // fulfill cooperation
     2222        f(a2, b); // fulfill cooperation
    20432223}
    20442224\end{cfa}
     
    20472227It is also important to note that in the case of external scheduling the order of parameters is irrelevant; @waitfor(f,a,b)@ and @waitfor(f,b,a)@ are indistinguishable waiting condition.
    20482228
    2049 
     2229% ======================================================================
     2230% ======================================================================
    20502231\subsection{\protect\lstinline|waitfor| Semantics}
    2051 
    2052 Syntactically, the @waitfor@ statement takes a routine identifier and a set of monitors.
    2053 While the set of monitors can be any list of expressions, the routine name is more restricted because the compiler validates at compile time the validity of the routine type and the parameters used with the @waitfor@ statement.
    2054 It checks that the set of monitors passed in matches the requirements for a routine call.
     2232% ======================================================================
     2233% ======================================================================
     2234
     2235Syntactically, the @waitfor@ statement takes a function identifier and a set of monitors.
     2236While the set of monitors can be any list of expressions, the function name is more restricted because the compiler validates at compile time the validity of the function type and the parameters used with the @waitfor@ statement.
     2237It checks that the set of monitors passed in matches the requirements for a function call.
    20552238Figure~\ref{f:waitfor} shows various usages of the waitfor statement and which are acceptable.
    2056 The choice of the routine type is made ignoring any non-@mutex@ parameter.
     2239The choice of the function type is made ignoring any non-@mutex@ parameter.
    20572240One limitation of the current implementation is that it does not handle overloading, but overloading is possible.
    20582241\begin{figure}
     
    20802263        waitfor(f2, a1, a2); // Incorrect : Mutex arguments don't match
    20812264        waitfor(f1, 1);      // Incorrect : 1 not a mutex argument
    2082         waitfor(f9, a1);     // Incorrect : f9 routine does not exist
     2265        waitfor(f9, a1);     // Incorrect : f9 function does not exist
    20832266        waitfor(*fp, a1 );   // Incorrect : fp not an identifier
    20842267        waitfor(f4, a1);     // Incorrect : f4 ambiguous
     
    20902273
    20912274Finally, for added flexibility, \CFA supports constructing a complex @waitfor@ statement using the @or@, @timeout@ and @else@.
    2092 Indeed, multiple @waitfor@ clauses can be chained together using @or@; this chain forms a single statement that uses baton pass to any routine that fits one of the routine+monitor set passed in.
    2093 To enable users to tell which accepted routine executed, @waitfor@s are followed by a statement (including the null statement @;@) or a compound statement, which is executed after the clause is triggered.
    2094 A @waitfor@ chain can also be followed by a @timeout@, to signify an upper bound on the wait, or an @else@, to signify that the call should be non-blocking, which checks for a matching routine call already arrived and otherwise continues.
     2275Indeed, multiple @waitfor@ clauses can be chained together using @or@; this chain forms a single statement that uses baton pass to any function that fits one of the function+monitor set passed in.
     2276To enable users to tell which accepted function executed, @waitfor@s are followed by a statement (including the null statement @;@) or a compound statement, which is executed after the clause is triggered.
     2277A @waitfor@ chain can also be followed by a @timeout@, to signify an upper bound on the wait, or an @else@, to signify that the call should be non-blocking, which checks for a matching function call already arrived and otherwise continues.
    20952278Any and all of these clauses can be preceded by a @when@ condition to dynamically toggle the accept clauses on or off based on some current state.
    20962279Figure~\ref{f:waitfor2} demonstrates several complex masks and some incorrect ones.
     
    21472330\end{figure}
    21482331
    2149 
     2332% ======================================================================
     2333% ======================================================================
    21502334\subsection{Waiting For The Destructor}
    2151 
     2335% ======================================================================
     2336% ======================================================================
    21522337An interesting use for the @waitfor@ statement is destructor semantics.
    21532338Indeed, the @waitfor@ statement can accept any @mutex@ routine, which includes the destructor (see section \ref{data}).
     
    21762361
    21772362
     2363% ######     #    ######     #    #       #       ####### #       ###  #####  #     #
     2364% #     #   # #   #     #   # #   #       #       #       #        #  #     # ##   ##
     2365% #     #  #   #  #     #  #   #  #       #       #       #        #  #       # # # #
     2366% ######  #     # ######  #     # #       #       #####   #        #   #####  #  #  #
     2367% #       ####### #   #   ####### #       #       #       #        #        # #     #
     2368% #       #     # #    #  #     # #       #       #       #        #  #     # #     #
     2369% #       #     # #     # #     # ####### ####### ####### ####### ###  #####  #     #
    21782370\section{Parallelism}
    2179 
    21802371Historically, computer performance was about processor speeds and instruction counts.
    21812372However, with heat dissipation being a direct consequence of speed increase, parallelism has become the new source for increased performance~\cite{Sutter05, Sutter05b}.
     
    21872378While there are many variations of the presented paradigms, most of these variations do not actually change the guarantees or the semantics, they simply move costs in order to achieve better performance for certain workloads.
    21882379
    2189 
    21902380\section{Paradigms}
    2191 
    2192 
    21932381\subsection{User-Level Threads}
    2194 
    21952382A direct improvement on the \textbf{kthread} approach is to use \textbf{uthread}.
    21962383These threads offer most of the same features that the operating system already provides but can be used on a much larger scale.
     
    22012388Examples of languages that support \textbf{uthread} are Erlang~\cite{Erlang} and \uC~\cite{uC++book}.
    22022389
    2203 
    22042390\subsection{Fibers : User-Level Threads Without Preemption} \label{fibers}
    2205 
    22062391A popular variant of \textbf{uthread} is what is often referred to as \textbf{fiber}.
    22072392However, \textbf{fiber} do not present meaningful semantic differences with \textbf{uthread}.
     
    22122397An example of a language that uses fibers is Go~\cite{Go}
    22132398
    2214 
    22152399\subsection{Jobs and Thread Pools}
    2216 
    22172400An approach on the opposite end of the spectrum is to base parallelism on \textbf{pool}.
    22182401Indeed, \textbf{pool} offer limited flexibility but at the benefit of a simpler user interface.
     
    22252408The gold standard of this implementation is Intel's TBB library~\cite{TBB}.
    22262409
    2227 
    22282410\subsection{Paradigm Performance}
    2229 
    22302411While the choice between the three paradigms listed above may have significant performance implications, it is difficult to pin down the performance implications of choosing a model at the language level.
    22312412Indeed, in many situations one of these paradigms may show better performance but it all strongly depends on the workload.
     
    22352416Finally, if the units of uninterrupted work are large, enough the paradigm choice is largely amortized by the actual work done.
    22362417
    2237 
    22382418\section{The \protect\CFA\ Kernel : Processors, Clusters and Threads}\label{kernel}
    2239 
    22402419A \textbf{cfacluster} is a group of \textbf{kthread} executed in isolation. \textbf{uthread} are scheduled on the \textbf{kthread} of a given \textbf{cfacluster}, allowing organization between \textbf{uthread} and \textbf{kthread}.
    22412420It is important that \textbf{kthread} belonging to a same \textbf{cfacluster} have homogeneous settings, otherwise migrating a \textbf{uthread} from one \textbf{kthread} to the other can cause issues.
     
    22452424Currently \CFA only supports one \textbf{cfacluster}, the initial one.
    22462425
    2247 
    22482426\subsection{Future Work: Machine Setup}\label{machine}
    2249 
    22502427While this was not done in the context of this paper, another important aspect of clusters is affinity.
    22512428While many common desktop and laptop PCs have homogeneous CPUs, other devices often have more heterogeneous setups.
     
    22532430OS support for CPU affinity is now common~\cite{affinityLinux, affinityWindows, affinityFreebsd, affinityNetbsd, affinityMacosx}, which means it is both possible and desirable for \CFA to offer an abstraction mechanism for portable CPU affinity.
    22542431
    2255 
    22562432\subsection{Paradigms}\label{cfaparadigms}
    2257 
    22582433Given these building blocks, it is possible to reproduce all three of the popular paradigms.
    22592434Indeed, \textbf{uthread} is the default paradigm in \CFA.
     
    22632438
    22642439
     2440
    22652441\section{Behind the Scenes}
    2266 
    22672442There are several challenges specific to \CFA when implementing concurrency.
    2268 These challenges are a direct result of bulk acquire and loose object definitions.
     2443These challenges are a direct result of \textbf{bulk-acq} and loose object definitions.
    22692444These two constraints are the root cause of most design decisions in the implementation.
    22702445Furthermore, to avoid contention from dynamically allocating memory in a concurrent environment, the internal-scheduling design is (almost) entirely free of mallocs.
     
    22742449The main memory concern for concurrency is queues.
    22752450All blocking operations are made by parking threads onto queues and all queues are designed with intrusive nodes, where each node has pre-allocated link fields for chaining, to avoid the need for memory allocation.
    2276 Since several concurrency operations can use an unbound amount of memory (depending on bulk acquire), statically defining information in the intrusive fields of threads is insufficient.The only way to use a variable amount of memory without requiring memory allocation is to pre-allocate large buffers of memory eagerly and store the information in these buffers.
     2451Since several concurrency operations can use an unbound amount of memory (depending on \textbf{bulk-acq}), statically defining information in the intrusive fields of threads is insufficient.The only way to use a variable amount of memory without requiring memory allocation is to pre-allocate large buffers of memory eagerly and store the information in these buffers.
    22772452Conveniently, the call stack fits that description and is easy to use, which is why it is used heavily in the implementation of internal scheduling, particularly variable-length arrays.
    22782453Since stack allocation is based on scopes, the first step of the implementation is to identify the scopes that are available to store the information, and which of these can have a variable-length array.
    22792454The threads and the condition both have a fixed amount of memory, while @mutex@ routines and blocking calls allow for an unbound amount, within the stack size.
    22802455
    2281 Note that since the major contributions of this paper are extending monitor semantics to bulk acquire and loose object definitions, any challenges that are not resulting of these characteristics of \CFA are considered as solved problems and therefore not discussed.
    2282 
    2283 
     2456Note that since the major contributions of this paper are extending monitor semantics to \textbf{bulk-acq} and loose object definitions, any challenges that are not resulting of these characteristics of \CFA are considered as solved problems and therefore not discussed.
     2457
     2458% ======================================================================
     2459% ======================================================================
    22842460\section{Mutex Routines}
     2461% ======================================================================
     2462% ======================================================================
    22852463
    22862464The first step towards the monitor implementation is simple @mutex@ routines.
     
    23172495\end{figure}
    23182496
    2319 
    23202497\subsection{Details: Interaction with polymorphism}
    2321 
    23222498Depending on the choice of semantics for when monitor locks are acquired, interaction between monitors and \CFA's concept of polymorphism can be more complex to support.
    23232499However, it is shown that entry-point locking solves most of the issues.
     
    23882564void foo(T * mutex t);
    23892565
    2390 // Correct: this routine only works on monitors (any monitor)
     2566// Correct: this function only works on monitors (any monitor)
    23912567forall(dtype T | is_monitor(T))
    23922568void bar(T * mutex t));
     
    23952571Both entry point and \textbf{callsite-locking} are feasible implementations.
    23962572The current \CFA implementation uses entry-point locking because it requires less work when using \textbf{raii}, effectively transferring the burden of implementation to object construction/destruction.
    2397 It is harder to use \textbf{raii} for call-site locking, as it does not necessarily have an existing scope that matches exactly the scope of the mutual exclusion, \ie the routine body.
     2573It is harder to use \textbf{raii} for call-site locking, as it does not necessarily have an existing scope that matches exactly the scope of the mutual exclusion, \ie the function body.
    23982574For example, the monitor call can appear in the middle of an expression.
    23992575Furthermore, entry-point locking requires less code generation since any useful routine is called multiple times but there is only one entry point for many call sites.
    24002576
    2401 
     2577% ======================================================================
     2578% ======================================================================
    24022579\section{Threading} \label{impl:thread}
     2580% ======================================================================
     2581% ======================================================================
    24032582
    24042583Figure \ref{fig:system1} shows a high-level picture if the \CFA runtime system in regards to concurrency.
     
    24132592\end{figure}
    24142593
    2415 
    24162594\subsection{Processors}
    2417 
    24182595Parallelism in \CFA is built around using processors to specify how much parallelism is desired. \CFA processors are object wrappers around kernel threads, specifically @pthread@s in the current implementation of \CFA.
    24192596Indeed, any parallelism must go through operating-system libraries.
     
    24232600Processors internally use coroutines to take advantage of the existing context-switching semantics.
    24242601
    2425 
    24262602\subsection{Stack Management}
    2427 
    24282603One of the challenges of this system is to reduce the footprint as much as possible.
    24292604Specifically, all @pthread@s created also have a stack created with them, which should be used as much as possible.
     
    24322607In order to respect C user expectations, the stack of the initial kernel thread, the main stack of the program, is used by the main user thread rather than the main processor, which can grow very large.
    24332608
    2434 
    24352609\subsection{Context Switching}
    2436 
    24372610As mentioned in section \ref{coroutine}, coroutines are a stepping stone for implementing threading, because they share the same mechanism for context-switching between different stacks.
    2438 To improve performance and simplicity, context-switching is implemented using the following assumption: all context-switches happen inside a specific routine call.
     2611To improve performance and simplicity, context-switching is implemented using the following assumption: all context-switches happen inside a specific function call.
    24392612This assumption means that the context-switch only has to copy the callee-saved registers onto the stack and then switch the stack registers with the ones of the target coroutine/thread.
    2440 Note that the instruction pointer can be left untouched since the context-switch is always inside the same routine
     2613Note that the instruction pointer can be left untouched since the context-switch is always inside the same function.
    24412614Threads, however, do not context-switch between each other directly.
    24422615They context-switch to the scheduler.
     
    24482621This option is not currently present in \CFA, but the changes required to add it are strictly additive.
    24492622
    2450 
    24512623\subsection{Preemption} \label{preemption}
    2452 
    24532624Finally, an important aspect for any complete threading system is preemption.
    24542625As mentioned in section \ref{basics}, preemption introduces an extra degree of uncertainty, which enables users to have multiple threads interleave transparently, rather than having to cooperate among threads for proper scheduling and CPU distribution.
     
    24772648As a result, a signal handler can start on one kernel thread and terminate on a second kernel thread (but the same user thread).
    24782649It is important to note that signal handlers save and restore signal masks because user-thread migration can cause a signal mask to migrate from one kernel thread to another.
    2479 This behaviour is only a problem if all kernel threads, among which a user thread can migrate, differ in terms of signal masks\footnote{Sadly, official POSIX documentation is silent on what distinguishes ``async-signal-safe'' routines from other routines}.
     2650This behaviour is only a problem if all kernel threads, among which a user thread can migrate, differ in terms of signal masks\footnote{Sadly, official POSIX documentation is silent on what distinguishes ``async-signal-safe'' functions from other functions.}.
    24802651However, since the kernel thread handling preemption requires a different signal mask, executing user threads on the kernel-alarm thread can cause deadlocks.
    24812652For this reason, the alarm thread is in a tight loop around a system call to @sigwaitinfo@, requiring very little CPU time for preemption.
     
    24842655Indeed, @sigwait@ can differentiate signals sent from @pthread_sigqueue@ from signals sent from alarms or the kernel.
    24852656
    2486 
    24872657\subsection{Scheduler}
    24882658Finally, an aspect that was not mentioned yet is the scheduling algorithm.
     
    24902660Further discussion on scheduling is present in section \ref{futur:sched}.
    24912661
    2492 
     2662% ======================================================================
     2663% ======================================================================
    24932664\section{Internal Scheduling} \label{impl:intsched}
    2494 
     2665% ======================================================================
     2666% ======================================================================
    24952667The following figure is the traditional illustration of a monitor (repeated from page~\pageref{fig:ClassicalMonitor} for convenience):
    24962668
    24972669\begin{figure}
    24982670\begin{center}
    2499 {\resizebox{0.4\textwidth}{!}{\input{monitor.pstex_t}}}
     2671{\resizebox{0.4\textwidth}{!}{\input{monitor}}}
    25002672\end{center}
    25012673\caption{Traditional illustration of a monitor}
     
    25062678
    25072679For \CFA, this picture does not have support for blocking multiple monitors on a single condition.
    2508 To support bulk acquire two changes to this picture are required.
     2680To support \textbf{bulk-acq} two changes to this picture are required.
    25092681First, it is no longer helpful to attach the condition to \emph{a single} monitor.
    25102682Secondly, the thread waiting on the condition has to be separated across multiple monitors, seen in figure \ref{fig:monitor_cfa}.
     
    25552727\end{figure}
    25562728
    2557 The solution discussed in \ref{s:InternalScheduling} can be seen in the exit routine of listing \ref{f:entry2}.
     2729The solution discussed in \ref{intsched} can be seen in the exit routine of listing \ref{f:entry2}.
    25582730Basically, the solution boils down to having a separate data structure for the condition queue and the AS-stack, and unconditionally transferring ownership of the monitors but only unblocking the thread when the last monitor has transferred ownership.
    25592731This solution is deadlock safe as well as preventing any potential barging.
     
    27912963}
    27922964\end{cfa}
    2793 This routine is called by the kernel to fetch the default preemption rate, where 0 signifies an infinite time-slice, \ie no preemption.
     2965This function is called by the kernel to fetch the default preemption rate, where 0 signifies an infinite time-slice, \ie no preemption.
    27942966However, once clusters are fully implemented, it will be possible to create fibers and \textbf{uthread} in the same system, as in listing \ref{f:fiber-uthread}
    27952967\begin{figure}
     
    29763148For monitors, the simplest approach is to measure how long it takes to enter and leave a monitor routine.
    29773149Figure~\ref{f:mutex} shows the code for \CFA.
    2978 To put the results in context, the cost of entering a non-inline routine and the cost of acquiring and releasing a @pthread_mutex@ lock is also measured.
     3150To put the results in context, the cost of entering a non-inline function and the cost of acquiring and releasing a @pthread_mutex@ lock is also measured.
    29793151The results can be shown in table \ref{tab:mutex}.
    29803152
     
    32273399Therefore, there is still significant work to improve performance.
    32283400Many of the data structures and algorithms may change in the future to more efficient versions.
    3229 For example, the number of monitors in a single bulk acquire is only bound by the stack size, this is probably unnecessarily generous.
     3401For example, the number of monitors in a single \textbf{bulk-acq} is only bound by the stack size, this is probably unnecessarily generous.
    32303402It may be possible that limiting the number helps increase performance.
    32313403However, it is not obvious that the benefit would be significant.
Note: See TracChangeset for help on using the changeset viewer.