Ignore:
File:
1 edited

Legend:

Unmodified
Added
Removed
  • doc/papers/concurrency/Paper.tex

    r251454a0 r48b9b36  
    7070%\DeclareTextCommandDefault{\textunderscore}{\leavevmode\makebox[1.2ex][c]{\rule{1ex}{0.1ex}}}
    7171\renewcommand{\textunderscore}{\leavevmode\makebox[1.2ex][c]{\rule{1ex}{0.075ex}}}
     72%\def\myCHarFont{\fontencoding{T1}\selectfont}%
     73% \def\{{\ttfamily\upshape\myCHarFont \char`\}}}%
    7274
    7375\renewcommand*{\thefootnote}{\Alph{footnote}} % hack because fnsymbol does not work
     
    739741The coroutine main's stack holds the state for the next generation, @f1@ and @f2@, and the code has the three suspend points, representing the three states in the Fibonacci formula, to context switch back to the caller's resume.
    740742The interface function, @next@, takes a Fibonacci instance and context switches to it using @resume@;
    741 on restart, the Fibonacci field, @fn@, contains the next value in the sequence, which is returned.
     743on return, the Fibonacci field, @fn@, contains the next value in the sequence, which is returned.
    742744The first @resume@ is special because it cocalls the coroutine at its coroutine main and allocates the stack;
    743745when the coroutine main returns, its stack is deallocated.
    744746Hence, @Fib@ is an object at creation, transitions to a coroutine on its first resume, and transitions back to an object when the coroutine main finishes.
    745747Figure~\ref{f:Coroutine1State} shows the coroutine version of the C version in Figure~\ref{f:ExternalState}.
    746 Coroutine generators are called \newterm{output coroutines} because values are only returned.
    747 
    748 Figure~\ref{f:CFAFmt} shows an \newterm{input coroutine}, @Format@, for restructuring text into groups of characters of fixed-size blocks.
     748Coroutine generators are called \newterm{output coroutines} because values are returned by the coroutine.
     749
     750Figure~\ref{f:CFAFmt} shows an \newterm{input coroutine}, @Format@, for restructuring text into groups of character blocks of fixed size.
    749751For example, the input of the left is reformatted into the output on the right.
    750752\begin{quote}
     
    761763\end{tabular}
    762764\end{quote}
    763 The example takes advantage of resuming a coroutine in the constructor to prime the loops so the first character sent for formatting appears inside the nested loops.
     765The example takes advantage of resuming coroutines in the constructor to prime the coroutine loops so the first character sent for formatting appears inside the nested loops.
    764766The destruction provides a newline if formatted text ends with a full line.
    765767Figure~\ref{f:CFmt} shows the C equivalent formatter, where the loops of the coroutine are flatten (linearized) and rechecked on each call because execution location is not retained between calls.
     
    776778void main( Format & fmt ) with( fmt ) {
    777779        for ( ;; ) {   
    778                 for ( g = 0; g < 5; g += 1 ) {      // group
     780                for ( g = 0; g < 5; g += 1 ) {  // group
    779781                        for ( b = 0; b < 4; b += 1 ) { // block
    780782                                `suspend();`
     
    812814};
    813815void format( struct Format * fmt ) {
    814         if ( fmt->ch != -1 ) {      // not EOF ?
     816        if ( fmt->ch != -1 ) { // not EOF
    815817                printf( "%c", fmt->ch );
    816818                fmt->b += 1;
     
    821823                }
    822824                if ( fmt->g == 5 ) {  // group
    823                         printf( "\n" );     // separator
     825                        printf( "\n" );      // separator
    824826                        fmt->g = 0;
    825827                }
     
    848850
    849851The previous examples are \newterm{asymmetric (semi) coroutine}s because one coroutine always calls a resuming function for another coroutine, and the resumed coroutine always suspends back to its last resumer, similar to call/return for normal functions.
    850 However, there is no stack growth because @resume@/@suspend@ context switch to existing stack-frames rather than create new ones.
    851 \newterm{Symmetric (full) coroutine}s have a coroutine call a resuming function for another coroutine, which eventually forms a resuming-call cycle.
     852However, there is no stack growth because @resume@/@suspend@ context switch to an existing stack frames rather than create a new one.
     853\newterm{Symmetric (full) coroutine}s have a coroutine call a resuming function for another coroutine, which eventually forms a cycle.
    852854(The trivial cycle is a coroutine resuming itself.)
    853855This control flow is similar to recursion for normal routines, but again there is no stack growth from the context switch.
     
    933935The @start@ function communicates both the number of elements to be produced and the consumer into the producer's coroutine structure.
    934936Then the @resume@ to @prod@ creates @prod@'s stack with a frame for @prod@'s coroutine main at the top, and context switches to it.
    935 @prod@'s coroutine main starts, creates local variables that are retained between coroutine activations, and executes $N$ iterations, each generating two random values, calling the consumer to deliver the values, and printing the status returned from the consumer.
     937@prod@'s coroutine main starts, creates local variables that are retained between coroutine activations, and executes $N$ iterations, each generating two random vales, calling the consumer to deliver the values, and printing the status returned from the consumer.
    936938
    937939The producer call to @delivery@ transfers values into the consumer's communication variables, resumes the consumer, and returns the consumer status.
    938940For the first resume, @cons@'s stack is initialized, creating local variables retained between subsequent activations of the coroutine.
    939 The consumer iterates until the @done@ flag is set, prints, increments status, and calls back to the producer via @payment@, and on return from @payment@, prints the receipt from the producer and increments @money@ (inflation).
    940 The call from the consumer to the @payment@ introduces the cycle between producer and consumer.
     941The consumer iterates until the @done@ flag is set, prints, increments status, and calls back to the producer's @payment@ member, and on return prints the receipt from the producer and increments the money for the next payment.
     942The call from the consumer to the producer's @payment@ member introduces the cycle between producer and consumer.
    941943When @payment@ is called, the consumer copies values into the producer's communication variable and a resume is executed.
    942 The context switch restarts the producer at the point where it was last context switched, so it continues in @delivery@ after the resume.
    943 
    944 @delivery@ returns the status value in @prod@'s coroutine main, where the status is printed.
     944The context switch restarts the producer at the point where it was last context switched and it continues in member @delivery@ after the resume.
     945
     946The @delivery@ member returns the status value in @prod@'s @main@ member, where the status is printed.
    945947The loop then repeats calling @delivery@, where each call resumes the consumer coroutine.
    946948The context switch to the consumer continues in @payment@.
    947 The consumer increments and returns the receipt to the call in @cons@'s coroutine main.
     949The consumer increments and returns the receipt to the call in @cons@'s @main@ member.
    948950The loop then repeats calling @payment@, where each call resumes the producer coroutine.
    949951
     
    952954The context switch restarts @cons@ in @payment@ and it returns with the last receipt.
    953955The consumer terminates its loops because @done@ is true, its @main@ terminates, so @cons@ transitions from a coroutine back to an object, and @prod@ reactivates after the resume in @stop@.
    954 @stop@ returns and @prod@'s coroutine main terminates.
     956The @stop@ member returns and @prod@'s @main@ member terminates.
    955957The program main restarts after the resume in @start@.
    956 @start@ returns and the program main terminates.
    957 
    958 
    959 \subsection{Coroutine Implementation}
    960 
    961 A significant implementation challenge for coroutines (and threads, see section \ref{threads}) is adding extra fields and executing code after/before the coroutine constructor/destructor and coroutine main to create/initialize/de-initialize/destroy extra fields and the stack.
    962 There are several solutions to this problem and the chosen option forced the \CFA coroutine design.
    963 
    964 Object-oriented inheritance provides extra fields and code in a restricted context, but it requires programmers to explicitly perform the inheritance:
    965 \begin{cfa}
    966 struct mycoroutine $\textbf{\textsf{inherits}}$ baseCoroutine { ... }
    967 \end{cfa}
    968 and the programming language (and possibly its tool set, \eg debugger) may need to understand @baseCoroutine@ because of the stack.
    969 Furthermore, the execution of constructs/destructors is in the wrong order for certain operations, \eg for threads;
    970 \eg, if the thread is implicitly started, it must start \emph{after} all constructors, because the thread relies on a completely initialized object, but the inherited constructor runs \emph{before} the derived.
    971 
    972 An alternatively is composition:
    973 \begin{cfa}
    974 struct mycoroutine {
    975         ... // declarations
    976         baseCoroutine dummy; // composition, last declaration
    977 }
    978 \end{cfa}
    979 which also requires an explicit declaration that must be the last one to ensure correct initialization order.
    980 However, there is nothing preventing wrong placement or multiple declarations.
     958The @start@ member returns and the program main terminates.
     959
     960
     961\subsubsection{Construction}
     962
     963One important design challenge for implementing coroutines and threads (shown in section \ref{threads}) is that the runtime system needs to run code after the user-constructor runs to connect the fully constructed object into the system.
     964In the case of coroutines, this challenge is simpler since there is no non-determinism from preemption or scheduling.
     965However, the underlying challenge remains the same for coroutines and threads.
     966
     967The runtime system needs to create the coroutine's stack and, more importantly, prepare it for the first resumption.
     968The timing of the creation is non-trivial since users expect both to have fully constructed objects once execution enters the coroutine main and to be able to resume the coroutine from the constructor.
     969There are several solutions to this problem but the chosen option effectively forces the design of the coroutine.
     970
     971Furthermore, \CFA faces an extra challenge as polymorphic routines create invisible thunks when cast to non-polymorphic routines and these thunks have function scope.
     972For example, the following code, while looking benign, can run into undefined behaviour because of thunks:
     973
     974\begin{cfa}
     975// async: Runs function asynchronously on another thread
     976forall(otype T)
     977extern void async(void (*func)(T*), T* obj);
     978
     979forall(otype T)
     980void noop(T*) {}
     981
     982void bar() {
     983        int a;
     984        async(noop, &a); // start thread running noop with argument a
     985}
     986\end{cfa}
     987
     988The generated C code\footnote{Code trimmed down for brevity} creates a local thunk to hold type information:
     989
     990\begin{cfa}
     991extern void async(/* omitted */, void (*func)(void*), void* obj);
     992
     993void noop(/* omitted */, void* obj){}
     994
     995void bar(){
     996        int a;
     997        void _thunk0(int* _p0){
     998                /* omitted */
     999                noop(/* omitted */, _p0);
     1000        }
     1001        /* omitted */
     1002        async(/* omitted */, ((void (*)(void*))(&_thunk0)), (&a));
     1003}
     1004\end{cfa}
     1005The problem in this example is a storage management issue, the function pointer @_thunk0@ is only valid until the end of the block, which limits the viable solutions because storing the function pointer for too long causes undefined behaviour; \ie the stack-based thunk being destroyed before it can be used.
     1006This challenge is an extension of challenges that come with second-class routines.
     1007Indeed, GCC nested routines also have the limitation that nested routine cannot be passed outside of the declaration scope.
     1008The case of coroutines and threads is simply an extension of this problem to multiple call stacks.
     1009
     1010
     1011\subsubsection{Alternative: Composition}
     1012
     1013One solution to this challenge is to use composition/containment, where coroutine fields are added to manage the coroutine.
     1014
     1015\begin{cfa}
     1016struct Fibonacci {
     1017        int fn; // used for communication
     1018        coroutine c; // composition
     1019};
     1020
     1021void FibMain(void*) {
     1022        //...
     1023}
     1024
     1025void ?{}(Fibonacci& this) {
     1026        this.fn = 0;
     1027        // Call constructor to initialize coroutine
     1028        (this.c){myMain};
     1029}
     1030\end{cfa}
     1031The downside of this approach is that users need to correctly construct the coroutine handle before using it.
     1032Like any other objects, the user must carefully choose construction order to prevent usage of objects not yet constructed.
     1033However, in the case of coroutines, users must also pass to the coroutine information about the coroutine main, like in the previous example.
     1034This opens the door for user errors and requires extra runtime storage to pass at runtime information that can be known statically.
     1035
     1036
     1037\subsubsection{Alternative: Reserved keyword}
     1038
     1039The next alternative is to use language support to annotate coroutines as follows:
     1040\begin{cfa}
     1041coroutine Fibonacci {
     1042        int fn; // used for communication
     1043};
     1044\end{cfa}
     1045The @coroutine@ keyword means the compiler can find and inject code where needed.
     1046The downside of this approach is that it makes coroutine a special case in the language.
     1047Users wanting to extend coroutines or build their own for various reasons can only do so in ways offered by the language.
     1048Furthermore, implementing coroutines without language supports also displays the power of the programming language used.
     1049While this is ultimately the option used for idiomatic \CFA code, coroutines and threads can still be constructed by users without using the language support.
     1050The reserved keywords are only present to improve ease of use for the common cases.
     1051
     1052
     1053\subsubsection{Alternative: Lambda Objects}
    9811054
    9821055For coroutines as for threads, many implementations are based on routine pointers or function objects~\cite{Butenhof97, C++14, MS:VisualC++, BoostCoroutines15}.
    983 For example, Boost implements coroutines in terms of four functor object-types:
     1056For example, Boost implements coroutines in terms of four functor object types:
    9841057\begin{cfa}
    9851058asymmetric_coroutine<>::pull_type
     
    9881061symmetric_coroutine<>::yield_type
    9891062\end{cfa}
    990 Similarly, the canonical threading paradigm is often based on function pointers, \eg @pthread@~\cite{pthreads}, \Csharp~\cite{Csharp}, Go~\cite{Go}, and Scala~\cite{Scala}.
    991 However, the generic thread-handle (identifier) is limited (few operations), unless it is wrapped in a custom type.
    992 \begin{cfa}
    993 void mycor( coroutine_t cid, void * arg ) {
    994         int * value = (int *)arg;                               $\C{// type unsafe, pointer-size only}$
     1063Often, the canonical threading paradigm in languages is based on function pointers, @pthread@ being one of the most well-known examples.
     1064The main problem of this approach is that the thread usage is limited to a generic handle that must otherwise be wrapped in a custom type.
     1065Since the custom type is simple to write in \CFA and solves several issues, added support for routine/lambda based coroutines adds very little.
     1066
     1067A variation of this would be to use a simple function pointer in the same way @pthread@ does for threads:
     1068\begin{cfa}
     1069void foo( coroutine_t cid, void* arg ) {
     1070        int* value = (int*)arg;
    9951071        // Coroutine body
    9961072}
     1073
    9971074int main() {
    998         int input = 0, output;
    999         coroutine_t cid = coroutine_create( &mycor, (void *)&input ); $\C{// type unsafe, pointer-size only}$
    1000         coroutine_resume( cid, (void *)input, (void **)&output ); $\C{// type unsafe, pointer-size only}$
    1001 }
    1002 \end{cfa}
    1003 Since the custom type is simple to write in \CFA and solves several issues, added support for routine/lambda-based coroutines adds very little.
    1004 
    1005 The selected approach is to use language support by introducing a new kind of aggregate (structure):
    1006 \begin{cfa}
    1007 coroutine Fibonacci {
    1008         int fn; // communication variables
     1075        int value = 0;
     1076        coroutine_t cid = coroutine_create( &foo, (void*)&value );
     1077        coroutine_resume( &cid );
     1078}
     1079\end{cfa}
     1080This semantics is more common for thread interfaces but coroutines work equally well.
     1081As discussed in section \ref{threads}, this approach is superseded by static approaches in terms of expressivity.
     1082
     1083
     1084\subsubsection{Alternative: Trait-Based Coroutines}
     1085
     1086Finally, the underlying approach, which is the one closest to \CFA idioms, is to use trait-based lazy coroutines.
     1087This approach defines a coroutine as anything that satisfies the trait @is_coroutine@ (as defined below) and is used as a coroutine.
     1088
     1089\begin{cfa}
     1090trait is_coroutine(dtype T) {
     1091      void main(T& this);
     1092      coroutine_desc* get_coroutine(T& this);
    10091093};
    1010 \end{cfa}
    1011 The @coroutine@ keyword means the compiler (and tool set) can find and inject code where needed.
    1012 The downside of this approach is that it makes coroutine a special case in the language.
    1013 Users wanting to extend coroutines or build their own for various reasons can only do so in ways offered by the language.
    1014 Furthermore, implementing coroutines without language supports also displays the power of a programming language.
    1015 While this is ultimately the option used for idiomatic \CFA code, coroutines and threads can still be constructed without using the language support.
    1016 The reserved keyword eases use for the common cases.
    1017 
    1018 Part of the mechanism to generalize coroutines is using a \CFA trait, which defines a coroutine as anything satisfying the trait @is_coroutine@, and this trait is used to restrict coroutine-manipulation functions:
    1019 \begin{cfa}
    1020 trait is_coroutine( dtype T ) {
    1021       void main( T & this );
    1022       coroutine_desc * get_coroutine( T & this );
     1094
     1095forall( dtype T | is_coroutine(T) ) void suspend(T&);
     1096forall( dtype T | is_coroutine(T) ) void resume (T&);
     1097\end{cfa}
     1098This ensures that an object is not a coroutine until @resume@ is called on the object.
     1099Correspondingly, any object that is passed to @resume@ is a coroutine since it must satisfy the @is_coroutine@ trait to compile.
     1100The advantage of this approach is that users can easily create different types of coroutines, for example, changing the memory layout of a coroutine is trivial when implementing the @get_coroutine@ routine.
     1101The \CFA keyword @coroutine@ simply has the effect of implementing the getter and forward declarations required for users to implement the main routine.
     1102
     1103\begin{center}
     1104\begin{tabular}{c c c}
     1105\begin{cfa}[tabsize=3]
     1106coroutine MyCoroutine {
     1107        int someValue;
    10231108};
    1024 forall( dtype T | is_coroutine(T) ) void get_coroutine( T & );
    1025 forall( dtype T | is_coroutine(T) ) void suspend( T & );
    1026 forall( dtype T | is_coroutine(T) ) void resume( T & );
    1027 \end{cfa}
    1028 This definition ensures there is a statically-typed @main@ function that is the starting point (first stack frame) of a coroutine.
    1029 No return value or additional parameters are necessary for this function, because the coroutine type allows an arbitrary number of interface functions with corresponding arbitrary typed input/output values.
    1030 As well, any object passed to @suspend@ and @resume@ is a coroutine since it must satisfy the @is_coroutine@ trait to compile.
    1031 The advantage of this approach is that users can easily create different types of coroutines, for example, changing the memory layout of a coroutine is trivial when implementing the @get_coroutine@ routine.
    1032 The \CFA keyword @coroutine@ implicitly implements the getter and forward declarations required for implementing the coroutine main:
    1033 \begin{cquote}
    1034 \begin{tabular}{@{}ccc@{}}
    1035 \begin{cfa}
    1036 coroutine MyCor {
    1037         int value;
    1038 
     1109\end{cfa} & == & \begin{cfa}[tabsize=3]
     1110struct MyCoroutine {
     1111        int someValue;
     1112        coroutine_desc __cor;
    10391113};
    1040 \end{cfa}
    1041 & {\Large $\Rightarrow$} &
    1042 \begin{tabular}{@{}ccc@{}}
    1043 \begin{cfa}
    1044 struct MyCor {
    1045         int value;
    1046         coroutine_desc cor;
     1114
     1115static inline
     1116coroutine_desc* get_coroutine(
     1117        struct MyCoroutine& this
     1118) {
     1119        return &this.__cor;
     1120}
     1121
     1122void main(struct MyCoroutine* this);
     1123\end{cfa}
     1124\end{tabular}
     1125\end{center}
     1126
     1127The combination of these two approaches allows users new to coroutining and concurrency to have an easy and concise specification, while more advanced users have tighter control on memory layout and initialization.
     1128
     1129\subsection{Thread Interface}\label{threads}
     1130The basic building blocks of multithreading in \CFA are \textbf{cfathread}.
     1131Both user and kernel threads are supported, where user threads are the concurrency mechanism and kernel threads are the parallel mechanism.
     1132User threads offer a flexible and lightweight interface.
     1133A thread can be declared using a struct declaration @thread@ as follows:
     1134
     1135\begin{cfa}
     1136thread foo {};
     1137\end{cfa}
     1138
     1139As for coroutines, the keyword is a thin wrapper around a \CFA trait:
     1140
     1141\begin{cfa}
     1142trait is_thread(dtype T) {
     1143      void ^?{}(T & mutex this);
     1144      void main(T & this);
     1145      thread_desc* get_thread(T & this);
    10471146};
    10481147\end{cfa}
    1049 &
    1050 \begin{cfa}
    1051 static inline coroutine_desc *
    1052 get_coroutine( MyCor & this ) {
    1053         return &this.cor;
    1054 }
    1055 \end{cfa}
    1056 &
    1057 \begin{cfa}
    1058 void main( MyCor * this );
    1059 
    1060 
    1061 
    1062 \end{cfa}
    1063 \end{tabular}
    1064 \end{tabular}
    1065 \end{cquote}
    1066 The combination of these two approaches allows an easy and concise specification to coroutining (and concurrency) for normal users, while more advanced users have tighter control on memory layout and initialization.
    1067 
    1068 
    1069 \subsection{Thread Interface}
    1070 \label{threads}
    1071 
    1072 Both user and kernel threads are supported, where user threads provide concurrency and kernel threads provide parallelism.
    1073 Like coroutines and for the same design reasons, the selected approach for user threads is to use language support by introducing a new kind of aggregate (structure) and a \CFA trait:
    1074 \begin{cquote}
    1075 \begin{tabular}{@{}c@{\hspace{2\parindentlnth}}c@{}}
    1076 \begin{cfa}
    1077 thread myThread {
    1078         // communication variables
    1079 };
    1080 
    1081 
    1082 \end{cfa}
    1083 &
    1084 \begin{cfa}
    1085 trait is_thread( dtype T ) {
    1086       void main( T & this );
    1087       thread_desc * get_thread( T & this );
    1088       void ^?{}( T & `mutex` this );
    1089 };
    1090 \end{cfa}
    1091 \end{tabular}
    1092 \end{cquote}
    1093 (The qualifier @mutex@ for the destructor parameter is discussed in Section~\ref{s:Monitors}.)
    1094 Like a coroutine, the statically-typed @main@ function is the starting point (first stack frame) of a user thread.
    1095 The difference is that a coroutine borrows a thread from its caller, so the first thread resuming a coroutine creates an instance of @main@;
    1096 whereas, a user thread receives its own thread from the runtime system, which starts in @main@ as some point after the thread constructor is run.\footnote{
    1097 The \lstinline@main@ function is already a special routine in C (where the program begins), so it is a natural extension of the semantics to use overloading to declare mains for different coroutines/threads (the normal main being the main of the initial thread).}
    1098 No return value or additional parameters are necessary for this function, because the task type allows an arbitrary number of interface functions with corresponding arbitrary typed input/output values.
    1099 
    1100 \begin{comment} % put in appendix with coroutine version ???
     1148
     1149Obviously, for this thread implementation to be useful it must run some user code.
     1150Several other threading interfaces use a function-pointer representation as the interface of threads (for example \Csharp~\cite{Csharp} and Scala~\cite{Scala}).
     1151However, this proposal considers that statically tying a @main@ routine to a thread supersedes this approach.
     1152Since the @main@ routine is already a special routine in \CFA (where the program begins), it is a natural extension of the semantics to use overloading to declare mains for different threads (the normal main being the main of the initial thread).
    11011153As such the @main@ routine of a thread can be defined as
    11021154\begin{cfa}
     
    11371189}
    11381190\end{cfa}
     1191
    11391192A consequence of the strongly typed approach to main is that memory layout of parameters and return values to/from a thread are now explicitly specified in the \textbf{api}.
    1140 \end{comment}
    1141 
    1142 For user threads to be useful, it must be possible to start and stop the underlying thread, and wait for it to complete execution.
    1143 While using an API such as @fork@ and @join@ is relatively common, such an interface is awkward and unnecessary.
    1144 A simple approach is to use allocation/deallocation principles, and have threads implicitly @fork@ after construction and @join@ before destruction.
    1145 \begin{cfa}
    1146 thread World {};
    1147 void main( World & this ) {
     1193
     1194Of course, for threads to be useful, it must be possible to start and stop threads and wait for them to complete execution.
     1195While using an \textbf{api} such as @fork@ and @join@ is relatively common in the literature, such an interface is unnecessary.
     1196Indeed, the simplest approach is to use \textbf{raii} principles and have threads @fork@ after the constructor has completed and @join@ before the destructor runs.
     1197\begin{cfa}
     1198thread World;
     1199
     1200void main(World & this) {
    11481201        sout | "World!" | endl;
    11491202}
    1150 int main() {
    1151         World w`[10]`;                                                  $\C{// implicit forks after creation}$
    1152         sout | "Hello " | endl;                                 $\C{// "Hello " and 10 "World!" printed concurrently}$
    1153 }                                                                                       $\C{// implicit joins before destruction}$
    1154 \end{cfa}
    1155 This semantics ensures a thread is started and stopped exactly once, eliminating some programming error, and scales to multiple threads for basic (termination) synchronization.
    1156 This tree-structure (lattice) create/delete from C block-structure is generalized by using dynamic allocation, so threads can outlive the scope in which they are created, much like dynamically allocating memory lets objects outlive the scope in which they are created.
    1157 \begin{cfa}
    1158 int main() {
    1159         MyThread * heapLived;
     1203
     1204void main() {
     1205        World w;
     1206        // Thread forks here
     1207
     1208        // Printing "Hello " and "World!" are run concurrently
     1209        sout | "Hello " | endl;
     1210
     1211        // Implicit join at end of scope
     1212}
     1213\end{cfa}
     1214
     1215This semantic has several advantages over explicit semantics: a thread is always started and stopped exactly once, users cannot make any programming errors, and it naturally scales to multiple threads meaning basic synchronization is very simple.
     1216
     1217\begin{cfa}
     1218thread MyThread {
     1219        //...
     1220};
     1221
     1222// main
     1223void main(MyThread& this) {
     1224        //...
     1225}
     1226
     1227void foo() {
     1228        MyThread thrds[10];
     1229        // Start 10 threads at the beginning of the scope
     1230
     1231        DoStuff();
     1232
     1233        // Wait for the 10 threads to finish
     1234}
     1235\end{cfa}
     1236
     1237However, one of the drawbacks of this approach is that threads always form a tree where nodes must always outlive their children, \ie they are always destroyed in the opposite order of construction because of C scoping rules.
     1238This restriction is relaxed by using dynamic allocation, so threads can outlive the scope in which they are created, much like dynamically allocating memory lets objects outlive the scope in which they are created.
     1239
     1240\begin{cfa}
     1241thread MyThread {
     1242        //...
     1243};
     1244
     1245void main(MyThread& this) {
     1246        //...
     1247}
     1248
     1249void foo() {
     1250        MyThread* long_lived;
    11601251        {
    1161                 MyThread blockLived;                            $\C{// fork block-based thread}$
    1162                 heapLived = `new`( MyThread );          $\C{// fork heap-based thread}$
    1163                 ...
    1164         }                                                                               $\C{// join block-based thread}$
    1165         ...
    1166         `delete`( heapLived );                                  $\C{// join heap-based thread}$
    1167 }
    1168 \end{cfa}
    1169 The heap-based approach allows arbitrary thread-creation topologies, with respect to fork/join-style concurrency.
    1170 
    1171 Figure~\ref{s:ConcurrentMatrixSummation} shows concurrently adding the rows of a matrix and then totalling the subtotals sequential, after all the row threads have terminated.
    1172 The program uses heap-based threads because each thread needs different constructor values.
    1173 (Python provides a simple iteration mechanism to initialize array elements to different values allowing stack allocation.)
    1174 The allocation/deallocation pattern appears unusual because allocated objects are immediately deleted without any intervening code.
    1175 However, for threads, the deletion provides implicit synchronization, which is the intervening code.
    1176 While the subtotals are added in linear order rather than completion order, which slight inhibits concurrency, the computation is restricted by the critical-path thread (\ie the thread that takes the longest), and so any inhibited concurrency is very small as totalling the subtotals is trivial.
    1177 
    1178 \begin{figure}
    1179 \begin{cfa}
    1180 thread Adder {
    1181     int * row, cols, & subtotal;                        $\C{// communication}$
    1182 };
    1183 void ?{}( Adder & adder, int row[], int cols, int & subtotal ) {
    1184     adder.[ row, cols, &subtotal ] = [ row, cols, &subtotal ];
    1185 }
    1186 void main( Adder & adder ) with( adder ) {
    1187     subtotal = 0;
    1188     for ( int c = 0; c < cols; c += 1 ) {
    1189                 subtotal += row[c];
    1190     }
    1191 }
    1192 int main() {
    1193     const int rows = 10, cols = 1000;
    1194     int matrix[rows][cols], subtotals[rows], total = 0;
    1195     // read matrix
    1196     Adder * adders[rows];
    1197     for ( int r = 0; r < rows; r += 1 ) {       $\C{// start threads to sum rows}$
    1198                 adders[r] = new( matrix[r], cols, &subtotals[r] );
    1199     }
    1200     for ( int r = 0; r < rows; r += 1 ) {       $\C{// wait for threads to finish}$
    1201                 delete( adders[r] );                            $\C{// termination join}$
    1202                 total += subtotals[r];                          $\C{// total subtotal}$
    1203     }
    1204     sout | total | endl;
    1205 }
    1206 \end{cfa}
    1207 \caption{Concurrent Matrix Summation}
    1208 \label{s:ConcurrentMatrixSummation}
    1209 \end{figure}
    1210 
    1211 
    1212 \section{Synchronization / Mutual Exclusion}
    1213 
    1214 Uncontrolled non-deterministic execution is meaningless.
    1215 To reestablish meaningful execution requires mechanisms to reintroduce determinism (control non-determinism), called synchronization and mutual exclusion, where synchronization is a timing relationship among threads and mutual exclusion is an access-control mechanism on data shared by threads.
    1216 Since many deterministic challenges appear with the use of mutable shared state, some languages/libraries disallow it (Erlang~\cite{Erlang}, Haskell~\cite{Haskell}, Akka~\cite{Akka} (Scala)).
    1217 In these paradigms, interaction among concurrent objects is performed by stateless message-passing~\cite{Thoth,Harmony,V-Kernel} or other paradigms closely relate to networking concepts (\eg channels~\cite{CSP,Go}).
    1218 However, in call/return-based languages, these approaches force a clear distinction (\ie introduce a new programming paradigm) between non-concurrent and concurrent computation (\ie function call versus message passing).
    1219 This distinction means a programmers needs to learn two sets of design patterns.
     1252                // Start a thread at the beginning of the scope
     1253                MyThread short_lived;
     1254
     1255                // create another thread that will outlive the thread in this scope
     1256                long_lived = new MyThread;
     1257
     1258                DoStuff();
     1259
     1260                // Wait for the thread short_lived to finish
     1261        }
     1262        DoMoreStuff();
     1263
     1264        // Now wait for the long_lived to finish
     1265        delete long_lived;
     1266}
     1267\end{cfa}
     1268
     1269
     1270% ======================================================================
     1271% ======================================================================
     1272\section{Concurrency}
     1273% ======================================================================
     1274% ======================================================================
     1275Several tools can be used to solve concurrency challenges.
     1276Since many of these challenges appear with the use of mutable shared state, some languages and libraries simply disallow mutable shared state (Erlang~\cite{Erlang}, Haskell~\cite{Haskell}, Akka (Scala)~\cite{Akka}).
     1277In these paradigms, interaction among concurrent objects relies on message passing~\cite{Thoth,Harmony,V-Kernel} or other paradigms closely relate to networking concepts (channels~\cite{CSP,Go} for example).
     1278However, in languages that use routine calls as their core abstraction mechanism, these approaches force a clear distinction between concurrent and non-concurrent paradigms (\ie message passing versus routine calls).
     1279This distinction in turn means that, in order to be effective, programmers need to learn two sets of design patterns.
    12201280While this distinction can be hidden away in library code, effective use of the library still has to take both paradigms into account.
    1221 In contrast, approaches based on statefull models more closely resemble the standard call/return programming-model, resulting in a single programming paradigm.
    1222 
    1223 At the lowest level, concurrent control is implemented as atomic operations, upon which different kinds of locks mechanism are constructed, \eg semaphores~\cite{Dijkstra68b} and path expressions~\cite{Campbell74}.
    1224 However, for productivity it is always desirable to use the highest-level construct that provides the necessary efficiency~\cite{Hochstein05}.
    1225 A newer approach is transactional memory~\cite{Herlihy93}.
    1226 While this approach is pursued in hardware~\cite{Nakaike15} and system languages, like \CC~\cite{Cpp-Transactions}, the performance and feature set is still too restrictive to be the main concurrency paradigm for system languages, which is why it was rejected as the core paradigm for concurrency in \CFA.
    1227 
    1228 One of the most natural, elegant, and efficient mechanisms for synchronization and mutual exclusion for shared-memory systems is the \emph{monitor}.
     1281
     1282Approaches based on shared memory are more closely related to non-concurrent paradigms since they often rely on basic constructs like routine calls and shared objects.
     1283At the lowest level, concurrent paradigms are implemented as atomic operations and locks.
     1284Many such mechanisms have been proposed, including semaphores~\cite{Dijkstra68b} and path expressions~\cite{Campbell74}.
     1285However, for productivity reasons it is desirable to have a higher-level construct be the core concurrency paradigm~\cite{Hochstein05}.
     1286
     1287An approach that is worth mentioning because it is gaining in popularity is transactional memory~\cite{Herlihy93}.
     1288While this approach is even pursued by system languages like \CC~\cite{Cpp-Transactions}, the performance and feature set is currently too restrictive to be the main concurrency paradigm for system languages, which is why it was rejected as the core paradigm for concurrency in \CFA.
     1289
     1290One of the most natural, elegant, and efficient mechanisms for synchronization and communication, especially for shared-memory systems, is the \emph{monitor}.
    12291291Monitors were first proposed by Brinch Hansen~\cite{Hansen73} and later described and extended by C.A.R.~Hoare~\cite{Hoare74}.
    1230 Many programming languages -- \eg Concurrent Pascal~\cite{ConcurrentPascal}, Mesa~\cite{Mesa}, Modula~\cite{Modula-2}, Turing~\cite{Turing:old}, Modula-3~\cite{Modula-3}, NeWS~\cite{NeWS}, Emerald~\cite{Emerald}, \uC~\cite{Buhr92a} and Java~\cite{Java} -- provide monitors as explicit language constructs.
     1292Many programming languages---\eg Concurrent Pascal~\cite{ConcurrentPascal}, Mesa~\cite{Mesa}, Modula~\cite{Modula-2}, Turing~\cite{Turing:old}, Modula-3~\cite{Modula-3}, NeWS~\cite{NeWS}, Emerald~\cite{Emerald}, \uC~\cite{Buhr92a} and Java~\cite{Java}---provide monitors as explicit language constructs.
    12311293In addition, operating-system kernels and device drivers have a monitor-like structure, although they often use lower-level primitives such as semaphores or locks to simulate monitors.
    1232 For these reasons, this project proposes monitors as the core concurrency construct, upon which even higher-level approaches can be easily constructed..
    1233 
    1234 
    1235 \subsection{Mutual Exclusion}
    1236 
    1237 A group of instructions manipulating a specific instance of shared data that must be performed atomically is called an (individual) \newterm{critical-section}~\cite{Dijkstra65}.
    1238 A generalization is a \newterm{group critical-section}~\cite{Joung00}, where multiple tasks with the same session may use the resource simultaneously, but different sessions may not use the resource simultaneously.
    1239 The readers/writer problem~\cite{Courtois71} is an instance of a group critical-section, where readers have the same session and all writers have a unique session.
    1240 \newterm{Mutual exclusion} enforces the correction number of threads are using a critical section at the same time.
    1241 
     1294For these reasons, this project proposes monitors as the core concurrency construct.
     1295
     1296
     1297\subsection{Basics}
     1298
     1299Non-determinism requires concurrent systems to offer support for mutual-exclusion and synchronization.
     1300Mutual-exclusion is the concept that only a fixed number of threads can access a critical section at any given time, where a critical section is a group of instructions on an associated portion of data that requires the restricted access.
     1301On the other hand, synchronization enforces relative ordering of execution and synchronization tools provide numerous mechanisms to establish timing relationships among threads.
     1302
     1303
     1304\subsubsection{Mutual-Exclusion}
     1305
     1306As mentioned above, mutual-exclusion is the guarantee that only a fix number of threads can enter a critical section at once.
    12421307However, many solutions exist for mutual exclusion, which vary in terms of performance, flexibility and ease of use.
    1243 Methods range from low-level locks, which are fast and flexible but require significant attention for correctness, to higher-level concurrency techniques, which sacrifice some performance to improve ease of use.
    1244 Ease of use comes by either guaranteeing some problems cannot occur (\eg deadlock free), or by offering a more explicit coupling between shared data and critical section.
     1308Methods range from low-level locks, which are fast and flexible but require significant attention to be correct, to  higher-level concurrency techniques, which sacrifice some performance in order to improve ease of use.
     1309Ease of use comes by either guaranteeing some problems cannot occur (\eg being deadlock free) or by offering a more explicit coupling between data and corresponding critical section.
    12451310For example, the \CC @std::atomic<T>@ offers an easy way to express mutual-exclusion on a restricted set of operations (\eg reading/writing large types atomically).
    1246 However, a significant challenge with (low-level) locks is composability because it takes careful organization for multiple locks to be used while preventing deadlock.
    1247 Easing composability is another feature higher-level mutual-exclusion mechanisms offer.
    1248 
    1249 
    1250 \subsection{Synchronization}
    1251 
    1252 Synchronization enforces relative ordering of execution, and synchronization tools provide numerous mechanisms to establish these timing relationships.
    1253 Low-level synchronization primitives offer good performance and flexibility at the cost of ease of use.
    1254 Higher-level mechanisms often simplify usage by adding better coupling between synchronization and data (\eg message passing), or offering a simpler solution to otherwise involved challenges, \eg barrier lock.
     1311Another challenge with low-level locks is composability.
     1312Locks have restricted composability because it takes careful organizing for multiple locks to be used while preventing deadlocks.
     1313Easing composability is another feature higher-level mutual-exclusion mechanisms often offer.
     1314
     1315
     1316\subsubsection{Synchronization}
     1317
     1318As with mutual-exclusion, low-level synchronization primitives often offer good performance and good flexibility at the cost of ease of use.
     1319Again, higher-level mechanisms often simplify usage by adding either better coupling between synchronization and data (\eg message passing) or offering a simpler solution to otherwise involved challenges.
    12551320As mentioned above, synchronization can be expressed as guaranteeing that event \textit{X} always happens before \textit{Y}.
    1256 Often synchronization is used to order access to a critical section, \eg ensuring the next kind of thread to enter a critical section is a reader thread
    1257 If a writer thread is scheduled for next access, but another reader thread acquires the critical section first, the reader has \newterm{barged}.
    1258 Barging can result in staleness/freshness problems, where a reader barges ahead of a write and reads temporally stale data, or a writer barges ahead of another writer overwriting data with a fresh value preventing the previous value from having an opportunity to be read.
     1321Most of the time, synchronization happens within a critical section, where threads must acquire mutual-exclusion in a certain order.
     1322However, it may also be desirable to guarantee that event \textit{Z} does not occur between \textit{X} and \textit{Y}.
     1323Not satisfying this property is called \textbf{barging}.
     1324For example, where event \textit{X} tries to effect event \textit{Y} but another thread acquires the critical section and emits \textit{Z} before \textit{Y}.
     1325The classic example is the thread that finishes using a resource and unblocks a thread waiting to use the resource, but the unblocked thread must compete to acquire the resource.
    12591326Preventing or detecting barging is an involved challenge with low-level locks, which can be made much easier by higher-level constructs.
    1260 This challenge is often split into two different approaches, barging avoidance and barging prevention.
    1261 Algorithms that allow a barger but divert it until later are avoiding the barger, while algorithms that preclude a barger from entering during synchronization in the critical section prevent the barger completely.
    1262 baton-pass locks~\cite{Andrews89} between threads instead of releasing the locks are said to be using barging prevention.
    1263 
    1264 
     1327This challenge is often split into two different methods, barging avoidance and barging prevention.
     1328Algorithms that use flag variables to detect barging threads are said to be using barging avoidance, while algorithms that baton-pass locks~\cite{Andrews89} between threads instead of releasing the locks are said to be using barging prevention.
     1329
     1330
     1331% ======================================================================
     1332% ======================================================================
    12651333\section{Monitors}
    1266 \label{s:Monitors}
    1267 
     1334% ======================================================================
     1335% ======================================================================
    12681336A \textbf{monitor} is a set of routines that ensure mutual-exclusion when accessing shared state.
    12691337More precisely, a monitor is a programming technique that associates mutual-exclusion to routine scopes, as opposed to mutex locks, where mutual-exclusion is defined by lock/release calls independently of any scoping of the calling routine.
     
    24332501Given these building blocks, it is possible to reproduce all three of the popular paradigms.
    24342502Indeed, \textbf{uthread} is the default paradigm in \CFA.
    2435 However, disabling \textbf{preemption} on a cluster means threads effectively become fibers.
     2503However, disabling \textbf{preemption} on the \textbf{cfacluster} means \textbf{cfathread} effectively become \textbf{fiber}.
    24362504Since several \textbf{cfacluster} with different scheduling policy can coexist in the same application, this allows \textbf{fiber} and \textbf{uthread} to coexist in the runtime of an application.
    24372505Finally, it is possible to build executors for thread pools from \textbf{uthread} or \textbf{fiber}, which includes specialized jobs like actors~\cite{Actors}.
Note: See TracChangeset for help on using the changeset viewer.