Reverse Diff

Paper.tex [251454a0:48b9b36]

doc/papers/concurrency/Paper.tex

-              r251454a0
+              r48b9b36
 %\DeclareTextCommandDefault{\textunderscore}{\leavevmode\makebox[1.2ex][c]{\rule{1ex}{0.1ex}}}
 \renewcommand{\textunderscore}{\leavevmode\makebox[1.2ex][c]{\rule{1ex}{0.075ex}}}
+%\def\myCHarFont{\fontencoding{T1}\selectfont}%
+% \def\{{\ttfamily\upshape\myCHarFont \char`\}}}%
 \renewcommand*{\thefootnote}{\Alph{footnote}} % hack because fnsymbol does not work
 …
 The coroutine main's stack holds the state for the next generation, @f1@ and @f2@, and the code has the three suspend points, representing the three states in the Fibonacci formula, to context switch back to the caller's resume.
 The interface function, @next@, takes a Fibonacci instance and context switches to it using @resume@;
 on restart, the Fibonacci field, @fn@, contains the next value in the sequence, which is returned.
+on return, the Fibonacci field, @fn@, contains the next value in the sequence, which is returned.
 The first @resume@ is special because it cocalls the coroutine at its coroutine main and allocates the stack;
 when the coroutine main returns, its stack is deallocated.
 Hence, @Fib@ is an object at creation, transitions to a coroutine on its first resume, and transitions back to an object when the coroutine main finishes.
 Figure~\ref{f:Coroutine1State} shows the coroutine version of the C version in Figure~\ref{f:ExternalState}.
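For comparison, a generator that keeps its state outside the function can be sketched in plain C as follows. This is only an illustrative sketch and not the code of the referenced figure: the names @struct Fib@, @fib_init@, and @fib_next@ are assumptions made for this example. Because C has no suspend point, the caller must hold the state object and every call recomputes from it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical external state for the generator: the current value
   and the one after it, retained by the caller between calls. */
struct Fib {
    int fn;   /* value returned by the next call */
    int fn1;  /* value returned by the call after that */
};

/* Start the sequence at 0, 1. */
void fib_init(struct Fib *f) {
    assert(f != NULL);
    f->fn = 0;
    f->fn1 = 1;
}

/* Return the next Fibonacci number and advance the external state. */
int fib_next(struct Fib *f) {
    assert(f != NULL);
    int fn = f->fn;        /* value to hand back */
    f->fn = f->fn1;        /* shift the window forward */
    f->fn1 = fn + f->fn1;  /* old fn + old fn1 */
    return fn;
}
```

Each call to @fib_next@ plays the role of @next@ in the coroutine version, but the "execution location" is encoded entirely in the two fields rather than retained on a coroutine stack.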
 Coroutine generators are called \newterm{output coroutines} because values are only returned.
 Figure~\ref{f:CFAFmt} shows an \newterm{input coroutine}, @Format@, for restructuring text into groups of characters of fixed-size blocks.
+Coroutine generators are called \newterm{output coroutines} because values are returned by the coroutine.
+Figure~\ref{f:CFAFmt} shows an \newterm{input coroutine}, @Format@, for restructuring text into groups of character blocks of fixed size.
 For example, the input on the left is reformatted into the output on the right.
 \begin{quote}
 …
 \end{tabular}
 \end{quote}
 The example takes advantage of resuming a coroutine in the constructor to prime the loops so the first character sent for formatting appears inside the nested loops.
+The example takes advantage of resuming coroutines in the constructor to prime the coroutine loops so the first character sent for formatting appears inside the nested loops.
 The destructor provides a newline if formatted text ends with a full line.
 Figure~\ref{f:CFmt} shows the C equivalent formatter, where the loops of the coroutine are flattened (linearized) and rechecked on each call because execution location is not retained between calls.
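The flattening can be sketched in self-contained C as below. This is a hedged illustration, not the figure's code: the character is passed as a parameter rather than stored in a field, output is collected in a hypothetical @out@ buffer so it can be inspected, and end-of-input handling is left out. The group size (5 blocks) and block size (4 characters) follow the loop bounds shown in the coroutine version.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical explicit-state formatter; zero-initialize before use. */
struct Format {
    int g, b;        /* group and block counters retained between calls */
    char out[256];   /* formatted text collected here (assumption: small inputs) */
    size_t len;      /* number of characters currently in out */
};

/* Append a string to the buffer; text that does not fit is silently dropped. */
static void emit(struct Format *fmt, const char *s) {
    size_t n = strlen(s);
    if (fmt->len + n < sizeof fmt->out) {
        memcpy(fmt->out + fmt->len, s, n + 1);
        fmt->len += n;
    }
}

/* One call per input character: the nested loops of the coroutine become
   counter checks that are re-evaluated on every call. */
void format(struct Format *fmt, int ch) {
    assert(fmt != NULL);
    char c[2] = { (char)ch, '\0' };
    emit(fmt, c);
    fmt->b += 1;
    if (fmt->b == 4) {   /* block of 4 characters complete */
        emit(fmt, "  "); /* block separator */
        fmt->b = 0;
        fmt->g += 1;
    }
    if (fmt->g == 5) {   /* group of 5 blocks complete */
        emit(fmt, "\n"); /* group separator */
        fmt->g = 0;
    }
}
```

The two @if@ statements are exactly the rechecking cost mentioned above: the coroutine resumes inside its inner loop, whereas this version must rediscover its position from @g@ and @b@ on every call.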
 …
 void main( Format & fmt ) with( fmt ) {
         for ( ;; ) {
                 for ( g = 0; g < 5; g += 1 ) {      // group
+                for ( g = 0; g < 5; g += 1 ) {  // group
                         for ( b = 0; b < 4; b += 1 ) { // block
                                 `suspend();`
 …
 };
 void format( struct Format * fmt ) {
         if ( fmt->ch != -1 ) {      // not EOF ?
+        if ( fmt->ch != -1 ) { // not EOF
                 printf( "%c", fmt->ch );
                 fmt->b += 1;
 …
+                }
                 if ( fmt->g == 5 ) {  // group
                         printf( "\n" );     // separator
+                        printf( "\n" );      // separator
                         fmt->g = 0;
+                }
 …
 The previous examples are \newterm{asymmetric (semi) coroutine}s because one coroutine always calls a resuming function for another coroutine, and the resumed coroutine always suspends back to its last resumer, similar to call/return for normal functions.
 However, there is no stack growth because @resume@/@suspend@ context switch to existing stack-frames rather than create new ones.
 \newterm{Symmetric (full) coroutine}s have a coroutine call a resuming function for another coroutine, which eventually forms a resuming-call cycle.
+However, there is no stack growth because @resume@/@suspend@ context switch to an existing stack frame rather than create a new one.
+\newterm{Symmetric (full) coroutine}s have a coroutine call a resuming function for another coroutine, which eventually forms a cycle.
 (The trivial cycle is a coroutine resuming itself.)
 This control flow is similar to recursion for normal routines, but again there is no stack growth from the context switch.
 …
 The @start@ function communicates both the number of elements to be produced and the consumer into the producer's coroutine structure.
 Then the @resume@ to @prod@ creates @prod@'s stack with a frame for @prod@'s coroutine main at the top, and context switches to it.
 @prod@'s coroutine main starts, creates local variables that are retained between coroutine activations, and executes $N$ iterations, each generating two random values, calling the consumer to deliver the values, and printing the status returned from the consumer.
+@prod@'s coroutine main starts, creates local variables that are retained between coroutine activations, and executes $N$ iterations, each generating two random values, calling the consumer to deliver the values, and printing the status returned from the consumer.
 The producer call to @delivery@ transfers values into the consumer's communication variables, resumes the consumer, and returns the consumer status.
 For the first resume, @cons@'s stack is initialized, creating local variables retained between subsequent activations of the coroutine.
 The consumer iterates until the @done@ flag is set, prints, increments status, and calls back to the producer via @payment@, and on return from @payment@, prints the receipt from the producer and increments @money@ (inflation).
 The call from the consumer to the @payment@ introduces the cycle between producer and consumer.
+The consumer iterates until the @done@ flag is set, prints, increments status, and calls back to the producer's @payment@ member, and on return prints the receipt from the producer and increments the money for the next payment.
+The call from the consumer to the producer's @payment@ member introduces the cycle between producer and consumer.
 When @payment@ is called, the consumer copies values into the producer's communication variable and a resume is executed.
 The context switch restarts the producer at the point where it was last context switched, so it continues in @delivery@ after the resume.
 @delivery@ returns the status value in @prod@'s coroutine main, where the status is printed.
+The context switch restarts the producer at the point where it was last context switched and it continues in member @delivery@ after the resume.
+The @delivery@ member returns the status value in @prod@'s @main@ member, where the status is printed.
 The loop then repeats calling @delivery@, where each call resumes the consumer coroutine.
 The context switch to the consumer continues in @payment@.
 The consumer increments and returns the receipt to the call in @cons@'s coroutine main.
+The consumer increments and returns the receipt to the call in @cons@'s @main@ member.
 The loop then repeats calling @payment@, where each call resumes the producer coroutine.
 …
 The context switch restarts @cons@ in @payment@ and it returns with the last receipt.
 The consumer terminates its loops because @done@ is true, its @main@ terminates, so @cons@ transitions from a coroutine back to an object, and @prod@ reactivates after the resume in @stop@.
 @stop@ returns and @prod@'s coroutine main terminates.
+The @stop@ member returns and @prod@'s @main@ member terminates.
 The program main restarts after the resume in @start@.
+@start@ returns and the program main terminates.
+\subsection{Coroutine Implementation}
+A significant implementation challenge for coroutines (and threads, see section \ref{threads}) is adding extra fields and executing code after/before the coroutine constructor/destructor and coroutine main to create/initialize/de-initialize/destroy extra fields and the stack.
+There are several solutions to this problem and the chosen option forced the \CFA coroutine design.
+Object-oriented inheritance provides extra fields and code in a restricted context, but it requires programmers to explicitly perform the inheritance:
+\begin{cfa}
+struct mycoroutine $\textbf{\textsf{inherits}}$ baseCoroutine { ... }
+\end{cfa}
+and the programming language (and possibly its tool set, \eg debugger) may need to understand @baseCoroutine@ because of the stack.
+Furthermore, the execution of constructors/destructors is in the wrong order for certain operations, \eg for threads:
+if the thread is implicitly started, it must start \emph{after} all constructors, because the thread relies on a completely initialized object, but the inherited constructor runs \emph{before} the derived.
+An alternative is composition:
+\begin{cfa}
+struct mycoroutine {
+        ... // declarations
+        baseCoroutine dummy; // composition, last declaration
+}
+\end{cfa}
+which also requires an explicit declaration that must be the last one to ensure correct initialization order.
+However, there is nothing preventing wrong placement or multiple declarations.
+The @start@ member returns and the program main terminates.
+\subsubsection{Construction}
+One important design challenge for implementing coroutines and threads (shown in section \ref{threads}) is that the runtime system needs to run code after the user-constructor runs to connect the fully constructed object into the system.
+In the case of coroutines, this challenge is simpler since there is no non-determinism from preemption or scheduling.
+However, the underlying challenge remains the same for coroutines and threads.
+The runtime system needs to create the coroutine's stack and, more importantly, prepare it for the first resumption.
+The timing of the creation is non-trivial since users expect both to have fully constructed objects once execution enters the coroutine main and to be able to resume the coroutine from the constructor.
+There are several solutions to this problem but the chosen option effectively forces the design of the coroutine.
+Furthermore, \CFA faces an extra challenge as polymorphic routines create invisible thunks when cast to non-polymorphic routines and these thunks have function scope.
+For example, the following code, while looking benign, can run into undefined behaviour because of thunks:
+\begin{cfa}
+// async: Runs function asynchronously on another thread
+forall(otype T)
+extern void async(void (*func)(T*), T* obj);
+forall(otype T)
+void noop(T*) {}
+void bar() {
+        int a;
+        async(noop, &a); // start thread running noop with argument a
+}
+\end{cfa}
+The generated C code\footnote{Code trimmed down for brevity} creates a local thunk to hold type information:
+\begin{cfa}
+extern void async(/* omitted */, void (*func)(void*), void* obj);
+void noop(/* omitted */, void* obj){}
+void bar(){
+        int a;
+        void _thunk0(int* _p0){
+                /* omitted */
+                noop(/* omitted */, _p0);
+        }
+        /* omitted */
+        async(/* omitted */, ((void (*)(void*))(&_thunk0)), (&a));
+}
+\end{cfa}
+The problem in this example is a storage-management issue: the function pointer @_thunk0@ is only valid until the end of the block, which limits the viable solutions because storing the function pointer for too long causes undefined behaviour, \ie the stack-based thunk is destroyed before it can be used.
+This challenge is an extension of challenges that come with second-class routines.
+Indeed, GCC nested routines also have the limitation that a nested routine cannot be passed outside of its declaration scope.
+The case of coroutines and threads is simply an extension of this problem to multiple call stacks.
+\subsubsection{Alternative: Composition}
+One solution to this challenge is to use composition/containment, where coroutine fields are added to manage the coroutine.
+\begin{cfa}
+struct Fibonacci {
+        int fn; // used for communication
+        coroutine c; // composition
+};
+void FibMain(void*) {
+        //...
+}
+void ?{}(Fibonacci& this) {
+        this.fn = 0;
+        // Call constructor to initialize coroutine
+        (this.c){FibMain};
+}
+\end{cfa}
+The downside of this approach is that users need to correctly construct the coroutine handle before using it.
+Like any other objects, the user must carefully choose construction order to prevent usage of objects not yet constructed.
+However, in the case of coroutines, users must also pass to the coroutine information about the coroutine main, like in the previous example.
+This opens the door for user errors and requires extra runtime storage to pass at runtime information that can be known statically.
+\subsubsection{Alternative: Reserved keyword}
+The next alternative is to use language support to annotate coroutines as follows:
+\begin{cfa}
+coroutine Fibonacci {
+        int fn; // used for communication
+};
+\end{cfa}
+The @coroutine@ keyword means the compiler can find and inject code where needed.
+The downside of this approach is that it makes coroutine a special case in the language.
+Users wanting to extend coroutines or build their own for various reasons can only do so in ways offered by the language.
+Furthermore, implementing coroutines without language support also displays the power of the programming language used.
+While this is ultimately the option used for idiomatic \CFA code, coroutines and threads can still be constructed by users without using the language support.
+The reserved keywords are only present to improve ease of use for the common cases.
+\subsubsection{Alternative: Lambda Objects}
 For coroutines as for threads, many implementations are based on routine pointers or function objects~\cite{Butenhof97, C++14, MS:VisualC++, BoostCoroutines15}.
 For example, Boost implements coroutines in terms of four functor object-types:
+For example, Boost implements coroutines in terms of four functor object types:
 \begin{cfa}
 asymmetric_coroutine<>::pull_type
 …
 symmetric_coroutine<>::yield_type
 \end{cfa}
+Similarly, the canonical threading paradigm is often based on function pointers, \eg @pthread@~\cite{pthreads}, \Csharp~\cite{Csharp}, Go~\cite{Go}, and Scala~\cite{Scala}.
+However, the generic thread-handle (identifier) is limited (few operations), unless it is wrapped in a custom type.
+\begin{cfa}
+void mycor( coroutine_t cid, void * arg ) {
+        int * value = (int *)arg;                               $\C{// type unsafe, pointer-size only}$
+Often, the canonical threading paradigm in languages is based on function pointers, @pthread@ being one of the most well-known examples.
+The main problem of this approach is that the thread usage is limited to a generic handle that must otherwise be wrapped in a custom type.
+Since the custom type is simple to write in \CFA and solves several issues, added support for routine/lambda based coroutines adds very little.
+A variation of this would be to use a simple function pointer in the same way @pthread@ does for threads:
+\begin{cfa}
+void foo( coroutine_t cid, void* arg ) {
+        int* value = (int*)arg;
         // Coroutine body
+}
 int main() {
+        int input = 0, output;
+        coroutine_t cid = coroutine_create( &mycor, (void *)&input ); $\C{// type unsafe, pointer-size only}$
+        coroutine_resume( cid, (void *)input, (void **)&output ); $\C{// type unsafe, pointer-size only}$
+}
+\end{cfa}
+Since the custom type is simple to write in \CFA and solves several issues, added support for routine/lambda-based coroutines adds very little.
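The idea of wrapping a generic handle in a custom type can be sketched in plain C as follows. This is a hedged illustration only: @generic_handle@, @generic_run@, and @Counter@ are hypothetical names introduced here, and an ordinary function call stands in for a resume, so no context switch or separate stack is involved.

```c
#include <assert.h>
#include <stddef.h>

/* Generic entry point in the style of a pthread start routine:
   the argument is an untyped pointer. */
typedef void (*generic_main)(void *arg);

/* Hypothetical generic handle: it can only be run, nothing more. */
struct generic_handle {
    generic_main fn;
    void *arg;
};

/* Stand-in for a resume: a plain call through the stored pointer. */
void generic_run(struct generic_handle *h) {
    assert(h != NULL && h->fn != NULL);
    h->fn(h->arg);
}

/* Custom type that bundles the handle with typed communication state. */
struct Counter {
    int value;
    struct generic_handle handle;
};

static void counter_main(void *arg) {
    struct Counter *c = arg;  /* the only unchecked conversion lives here */
    c->value += 1;
}

/* Typed interface: callers never see a void pointer. */
void counter_init(struct Counter *c, int start) {
    assert(c != NULL);
    c->value = start;
    c->handle.fn = counter_main;
    c->handle.arg = c;
}

int counter_step(struct Counter *c) {
    assert(c != NULL);
    generic_run(&c->handle);
    return c->value;
}
```

The wrapper confines the type-unsafe cast to one place, but the function pointer and argument must still be stored and passed at run time, even though both are known statically.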
+The selected approach is to use language support by introducing a new kind of aggregate (structure):
+\begin{cfa}
+coroutine Fibonacci {
+        int fn; // communication variables
+        int value = 0;
+        coroutine_t cid = coroutine_create( &foo, (void*)&value );
+        coroutine_resume( &cid );
+}
+\end{cfa}
+This semantics is more common for thread interfaces but coroutines work equally well.
+As discussed in section \ref{threads}, this approach is superseded by static approaches in terms of expressivity.
+\subsubsection{Alternative: Trait-Based Coroutines}
+Finally, the underlying approach, which is the one closest to \CFA idioms, is to use trait-based lazy coroutines.
+This approach defines a coroutine as anything that satisfies the trait @is_coroutine@ (as defined below) and is used as a coroutine.
+\begin{cfa}
+trait is_coroutine(dtype T) {
+      void main(T& this);
+      coroutine_desc* get_coroutine(T& this);
 };
+\end{cfa}
+The @coroutine@ keyword means the compiler (and tool set) can find and inject code where needed.
+The downside of this approach is that it makes coroutine a special case in the language.
+Users wanting to extend coroutines or build their own for various reasons can only do so in ways offered by the language.
+Furthermore, implementing coroutines without language support also displays the power of a programming language.
+While this is ultimately the option used for idiomatic \CFA code, coroutines and threads can still be constructed without using the language support.
+The reserved keyword eases use for the common cases.
+Part of the mechanism to generalize coroutines is using a \CFA trait, which defines a coroutine as anything satisfying the trait @is_coroutine@, and this trait is used to restrict coroutine-manipulation functions:
+\begin{cfa}
+trait is_coroutine( dtype T ) {
+      void main( T & this );
+      coroutine_desc * get_coroutine( T & this );
+forall( dtype T | is_coroutine(T) ) void suspend(T&);
+forall( dtype T | is_coroutine(T) ) void resume (T&);
+\end{cfa}
+This ensures that an object is not a coroutine until @resume@ is called on the object.
+Correspondingly, any object that is passed to @resume@ is a coroutine since it must satisfy the @is_coroutine@ trait to compile.
+The advantage of this approach is that users can easily create different types of coroutines; for example, changing the memory layout of a coroutine is trivial when implementing the @get_coroutine@ routine.
+The \CFA keyword @coroutine@ simply has the effect of implementing the getter and forward declarations required for users to implement the main routine.
+\begin{center}
+\begin{tabular}{c c c}
+\begin{cfa}[tabsize=3]
+coroutine MyCoroutine {
+        int someValue;
 };
+forall( dtype T | is_coroutine(T) ) void get_coroutine( T & );
+forall( dtype T | is_coroutine(T) ) void suspend( T & );
+forall( dtype T | is_coroutine(T) ) void resume( T & );
+\end{cfa}
+This definition ensures there is a statically-typed @main@ function that is the starting point (first stack frame) of a coroutine.
+No return value or additional parameters are necessary for this function, because the coroutine type allows an arbitrary number of interface functions with corresponding arbitrary typed input/output values.
+As well, any object passed to @suspend@ and @resume@ is a coroutine since it must satisfy the @is_coroutine@ trait to compile.
+The advantage of this approach is that users can easily create different types of coroutines; for example, changing the memory layout of a coroutine is trivial when implementing the @get_coroutine@ routine.
+The \CFA keyword @coroutine@ implicitly implements the getter and forward declarations required for implementing the coroutine main:
+\begin{cquote}
+\begin{tabular}{@{}ccc@{}}
+\begin{cfa}
+coroutine MyCor {
+        int value;
+\end{cfa} & == & \begin{cfa}[tabsize=3]
+struct MyCoroutine {
+        int someValue;
+        coroutine_desc __cor;
 };
+\end{cfa}
+& {\Large $\Rightarrow$} &
+\begin{tabular}{@{}ccc@{}}
+\begin{cfa}
+struct MyCor {
+        int value;
+        coroutine_desc cor;
+static inline
+coroutine_desc* get_coroutine(
+        struct MyCoroutine& this
+) {
+        return &this.__cor;
+}
+void main(struct MyCoroutine* this);
+\end{cfa}
+\end{tabular}
+\end{center}
+The combination of these two approaches allows users new to coroutining and concurrency to have an easy and concise specification, while more advanced users have tighter control on memory layout and initialization.
+\subsection{Thread Interface}\label{threads}
+The basic building blocks of multithreading in \CFA are \textbf{cfathread}.
+Both user and kernel threads are supported, where user threads are the concurrency mechanism and kernel threads are the parallel mechanism.
+User threads offer a flexible and lightweight interface.
+A thread can be declared using a struct declaration @thread@ as follows:
+\begin{cfa}
+thread foo {};
+\end{cfa}
+As for coroutines, the keyword is a thin wrapper around a \CFA trait:
+\begin{cfa}
+trait is_thread(dtype T) {
+      void ^?{}(T & mutex this);
+      void main(T & this);
+      thread_desc* get_thread(T & this);
 };
 \end{cfa}
+&
+\begin{cfa}
+static inline coroutine_desc *
+get_coroutine( MyCor & this ) {
+        return &this.cor;
+}
+\end{cfa}
+&
+\begin{cfa}
+void main( MyCor * this );
+\end{cfa}
+\end{tabular}
+\end{tabular}
+\end{cquote}
+The combination of these two approaches allows an easy and concise specification to coroutining (and concurrency) for normal users, while more advanced users have tighter control on memory layout and initialization.
+\subsection{Thread Interface}
+\label{threads}
+Both user and kernel threads are supported, where user threads provide concurrency and kernel threads provide parallelism.
+Like coroutines and for the same design reasons, the selected approach for user threads is to use language support by introducing a new kind of aggregate (structure) and a \CFA trait:
+\begin{cquote}
+\begin{tabular}{@{}c@{\hspace{2\parindentlnth}}c@{}}
+\begin{cfa}
+thread myThread {
+        // communication variables
+};
+\end{cfa}
+&
+\begin{cfa}
+trait is_thread( dtype T ) {
+      void main( T & this );
+      thread_desc * get_thread( T & this );
+      void ^?{}( T & `mutex` this );
+};
+\end{cfa}
+\end{tabular}
+\end{cquote}
+(The qualifier @mutex@ for the destructor parameter is discussed in Section~\ref{s:Monitors}.)
+Like a coroutine, the statically-typed @main@ function is the starting point (first stack frame) of a user thread.
+The difference is that a coroutine borrows a thread from its caller, so the first thread resuming a coroutine creates an instance of @main@;
+whereas, a user thread receives its own thread from the runtime system, which starts in @main@ at some point after the thread constructor is run.\footnote{
+The \lstinline@main@ function is already a special routine in C (where the program begins), so it is a natural extension of the semantics to use overloading to declare mains for different coroutines/threads (the normal main being the main of the initial thread).}
+No return value or additional parameters are necessary for this function, because the task type allows an arbitrary number of interface functions with corresponding arbitrary typed input/output values.
+\begin{comment} % put in appendix with coroutine version ???
+Obviously, for this thread implementation to be useful it must run some user code.
+Several other threading interfaces use a function-pointer representation as the interface of threads (for example \Csharp~\cite{Csharp} and Scala~\cite{Scala}).
+However, this proposal considers that statically tying a @main@ routine to a thread supersedes this approach.
+Since the @main@ routine is already a special routine in \CFA (where the program begins), it is a natural extension of the semantics to use overloading to declare mains for different threads (the normal main being the main of the initial thread).
 As such the @main@ routine of a thread can be defined as
 \begin{cfa}
 …
+}
 \end{cfa}
 A consequence of the strongly typed approach to main is that memory layout of parameters and return values to/from a thread are now explicitly specified in the \textbf{api}.
+\end{comment}
 For user threads to be useful, it must be possible to start and stop the underlying thread, and wait for it to complete execution.
 While using an API such as @fork@ and @join@ is relatively common, such an interface is awkward and unnecessary.
+A simple approach is to use allocation/deallocation principles, and have threads implicitly @fork@ after construction and @join@ before destruction.
+\begin{cfa}
+thread World {};
 void main( World & this ) {
+Of course, for threads to be useful, it must be possible to start and stop threads and wait for them to complete execution.
+While using an \textbf{api} such as @fork@ and @join@ is relatively common in the literature, such an interface is unnecessary.
+Indeed, the simplest approach is to use \textbf{raii} principles and have threads @fork@ after the constructor has completed and @join@ before the destructor runs.
+\begin{cfa}
+thread World;
+void main(World & this) {
         sout | "World!" | endl;
+}
+int main() {
+        World w`[10]`;                                                  $\C{// implicit forks after creation}$
+        sout | "Hello " | endl;                                 $\C{// "Hello " and 10 "World!" printed concurrently}$
+}                                                                                       $\C{// implicit joins before destruction}$
+\end{cfa}
+This semantics ensures a thread is started and stopped exactly once, eliminating some programming errors, and scales to multiple threads for basic (termination) synchronization.
+This tree-structure (lattice) create/delete from C block-structure is generalized by using dynamic allocation, so threads can outlive the scope in which they are created, much like dynamically allocating memory lets objects outlive the scope in which they are created.
+\begin{cfa}
+int main() {
+        MyThread * heapLived;
+void main() {
+        World w;
+        // Thread forks here
+        // Printing "Hello " and "World!" are run concurrently
+        sout | "Hello " | endl;
+        // Implicit join at end of scope
+}
+\end{cfa}
+This semantics has several advantages over explicit semantics: a thread is always started and stopped exactly once, users cannot make any programming errors, and it naturally scales to multiple threads, meaning basic synchronization is very simple.
+\begin{cfa}
+thread MyThread {
+        //...
+};
+// main
+void main(MyThread& this) {
+        //...
+}
+void foo() {
+        MyThread thrds[10];
+        // Start 10 threads at the beginning of the scope
+        DoStuff();
+        // Wait for the 10 threads to finish
+}
+\end{cfa}
+However, one of the drawbacks of this approach is that threads always form a tree where nodes must always outlive their children, \ie they are always destroyed in the opposite order of construction because of C scoping rules.
+This restriction is relaxed by using dynamic allocation, so threads can outlive the scope in which they are created, much like dynamically allocating memory lets objects outlive the scope in which they are created.
+\begin{cfa}
+thread MyThread {
+        //...
+};
+void main(MyThread& this) {
+        //...
+}
+void foo() {
+        MyThread* long_lived;
+        {
+                MyThread blockLived;                            $\C{// fork block-based thread}$
+                heapLived = `new`( MyThread );          $\C{// fork heap-based thread}$
+                ...
+        }                                                                               $\C{// join block-based thread}$
+        ...
+        `delete`( heapLived );                                  $\C{// join heap-based thread}$
+}
+\end{cfa}
+The heap-based approach allows arbitrary thread-creation topologies, with respect to fork/join-style concurrency.
+Figure~\ref{s:ConcurrentMatrixSummation} shows concurrently adding the rows of a matrix and then totalling the subtotals sequentially, after all the row threads have terminated.
+The program uses heap-based threads because each thread needs different constructor values.
+(Python provides a simple iteration mechanism to initialize array elements to different values allowing stack allocation.)
+The allocation/deallocation pattern appears unusual because allocated objects are immediately deleted without any intervening code.
+However, for threads, the deletion provides implicit synchronization, which is the intervening code.
+While the subtotals are added in linear order rather than completion order, which slightly inhibits concurrency, the computation is restricted by the critical-path thread (\ie the thread that takes the longest), and so any inhibited concurrency is very small as totalling the subtotals is trivial.
+\begin{figure}
+\begin{cfa}
+thread Adder {
+    int * row, cols, & subtotal;                        $\C{// communication}$
+};
+void ?{}( Adder & adder, int row[], int cols, int & subtotal ) {
+    adder.[ row, cols, &subtotal ] = [ row, cols, &subtotal ];
+}
+void main( Adder & adder ) with( adder ) {
+    subtotal = 0;
+    for ( int c = 0; c < cols; c += 1 ) {
+                subtotal += row[c];
+    }
+}
+int main() {
+    const int rows = 10, cols = 1000;
+    int matrix[rows][cols], subtotals[rows], total = 0;
+    // read matrix
+    Adder * adders[rows];
+    for ( int r = 0; r < rows; r += 1 ) {       $\C{// start threads to sum rows}$
+                adders[r] = new( matrix[r], cols, &subtotals[r] );
+    }
+    for ( int r = 0; r < rows; r += 1 ) {       $\C{// wait for threads to finish}$
+                delete( adders[r] );                            $\C{// termination join}$
+                total += subtotals[r];                          $\C{// total subtotal}$
+    }
+    sout | total | endl;
+}
+\end{cfa}
+\caption{Concurrent Matrix Summation}
+\label{s:ConcurrentMatrixSummation}
+\end{figure}
+\section{Synchronization / Mutual Exclusion}
+Uncontrolled non-deterministic execution is meaningless.
+To reestablish meaningful execution requires mechanisms to reintroduce determinism (control non-determinism), called synchronization and mutual exclusion, where synchronization is a timing relationship among threads and mutual exclusion is an access-control mechanism on data shared by threads.
+Since many deterministic challenges appear with the use of mutable shared state, some languages/libraries disallow it (Erlang~\cite{Erlang}, Haskell~\cite{Haskell}, Akka~\cite{Akka} (Scala)).
+In these paradigms, interaction among concurrent objects is performed by stateless message-passing~\cite{Thoth,Harmony,V-Kernel} or other paradigms closely related to networking concepts (\eg channels~\cite{CSP,Go}).
+However, in call/return-based languages, these approaches force a clear distinction (\ie introduce a new programming paradigm) between non-concurrent and concurrent computation (\ie function call versus message passing).
+This distinction means a programmer needs to learn two sets of design patterns.
+                // Start a thread at the beginning of the scope
+                MyThread short_lived;
+                // create another thread that will outlive the thread in this scope
+                long_lived = new MyThread;
+                DoStuff();
+                // Wait for the thread short_lived to finish
+        }
+        DoMoreStuff();
+        // Now wait for the long_lived to finish
+        delete long_lived;
+}
+\end{cfa}
+% ======================================================================
+% ======================================================================
+\section{Concurrency}
+% ======================================================================
+% ======================================================================
+Several tools can be used to solve concurrency challenges.
+Since many of these challenges appear with the use of mutable shared state, some languages and libraries simply disallow mutable shared state (Erlang~\cite{Erlang}, Haskell~\cite{Haskell}, Akka (Scala)~\cite{Akka}).
+In these paradigms, interaction among concurrent objects relies on message passing~\cite{Thoth,Harmony,V-Kernel} or other paradigms closely related to networking concepts (channels~\cite{CSP,Go} for example).
+However, in languages that use routine calls as their core abstraction mechanism, these approaches force a clear distinction between concurrent and non-concurrent paradigms (\ie message passing versus routine calls).
+This distinction in turn means that, in order to be effective, programmers need to learn two sets of design patterns.
 While this distinction can be hidden away in library code, effective use of the library still has to take both paradigms into account.
+In contrast, approaches based on stateful models more closely resemble the standard call/return programming-model, resulting in a single programming paradigm.
+At the lowest level, concurrent control is implemented as atomic operations, upon which different kinds of locking mechanisms are constructed, \eg semaphores~\cite{Dijkstra68b} and path expressions~\cite{Campbell74}.
+However, for productivity it is always desirable to use the highest-level construct that provides the necessary efficiency~\cite{Hochstein05}.
+A newer approach is transactional memory~\cite{Herlihy93}.
+While this approach is pursued in hardware~\cite{Nakaike15} and system languages, like \CC~\cite{Cpp-Transactions}, the performance and feature set are still too restrictive to be the main concurrency paradigm for system languages, which is why it was rejected as the core paradigm for concurrency in \CFA.
+One of the most natural, elegant, and efficient mechanisms for synchronization and mutual exclusion for shared-memory systems is the \emph{monitor}.
+Approaches based on shared memory are more closely related to non-concurrent paradigms since they often rely on basic constructs like routine calls and shared objects.
+At the lowest level, concurrent paradigms are implemented as atomic operations and locks.
+Many such mechanisms have been proposed, including semaphores~\cite{Dijkstra68b} and path expressions~\cite{Campbell74}.
+However, for productivity reasons it is desirable to have a higher-level construct be the core concurrency paradigm~\cite{Hochstein05}.
+An approach that is worth mentioning because it is gaining in popularity is transactional memory~\cite{Herlihy93}.
+While this approach is even pursued by system languages like \CC~\cite{Cpp-Transactions}, the performance and feature set are currently too restrictive to be the main concurrency paradigm for system languages, which is why it was rejected as the core paradigm for concurrency in \CFA.
+One of the most natural, elegant, and efficient mechanisms for synchronization and communication, especially for shared-memory systems, is the \emph{monitor}.
 Monitors were first proposed by Brinch Hansen~\cite{Hansen73} and later described and extended by C.A.R.~Hoare~\cite{Hoare74}.
 Many programming languages -- \eg Concurrent Pascal~\cite{ConcurrentPascal}, Mesa~\cite{Mesa}, Modula~\cite{Modula-2}, Turing~\cite{Turing:old}, Modula-3~\cite{Modula-3}, NeWS~\cite{NeWS}, Emerald~\cite{Emerald}, \uC~\cite{Buhr92a} and Java~\cite{Java} -- provide monitors as explicit language constructs.
+Many programming languages---\eg Concurrent Pascal~\cite{ConcurrentPascal}, Mesa~\cite{Mesa}, Modula~\cite{Modula-2}, Turing~\cite{Turing:old}, Modula-3~\cite{Modula-3}, NeWS~\cite{NeWS}, Emerald~\cite{Emerald}, \uC~\cite{Buhr92a} and Java~\cite{Java}---provide monitors as explicit language constructs.
 In addition, operating-system kernels and device drivers have a monitor-like structure, although they often use lower-level primitives such as semaphores or locks to simulate monitors.
+For these reasons, this project proposes monitors as the core concurrency construct, upon which even higher-level approaches can be easily constructed.
+\subsection{Mutual Exclusion}
+A group of instructions manipulating a specific instance of shared data that must be performed atomically is called an (individual) \newterm{critical-section}~\cite{Dijkstra65}.
+A generalization is a \newterm{group critical-section}~\cite{Joung00}, where multiple tasks with the same session may use the resource simultaneously, but different sessions may not use the resource simultaneously.
+The readers/writer problem~\cite{Courtois71} is an instance of a group critical-section, where readers have the same session and all writers have a unique session.
+\newterm{Mutual exclusion} enforces that the correct number of threads is using a critical section at the same time.
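The readers/writer case can be sketched in \CC (C++17), used here only because it is directly runnable; it is not the \CFA mechanism. The class name @SharedValue@ and its members are hypothetical. Readers form one session and may overlap, while each writer is alone in the critical section.

```cpp
#include <mutex>
#include <shared_mutex>

// Hypothetical shared value treated as a group critical section:
// all readers share one session, each writer has a session of its own.
class SharedValue {
    mutable std::shared_mutex mtx_;
    int value_ = 0;
public:
    int read() const {
        // Shared ownership: any number of readers may hold the lock together.
        std::shared_lock<std::shared_mutex> lock(mtx_);
        return value_;
    }
    void write(int v) {
        // Exclusive ownership: excludes all readers and all other writers.
        std::unique_lock<std::shared_mutex> lock(mtx_);
        value_ = v;
    }
};
```

Note that @std::shared_mutex@ does not specify whether waiting readers or waiting writers are admitted first, so this sketch only illustrates the session structure, not a scheduling policy.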
+For these reasons, this project proposes monitors as the core concurrency construct.
+\subsection{Basics}
+Non-determinism requires concurrent systems to offer support for mutual-exclusion and synchronization.
+Mutual-exclusion is the concept that only a fixed number of threads can access a critical section at any given time, where a critical section is a group of instructions on an associated portion of data that requires the restricted access.
+On the other hand, synchronization enforces relative ordering of execution and synchronization tools provide numerous mechanisms to establish timing relationships among threads.
+\subsubsection{Mutual-Exclusion}
+As mentioned above, mutual-exclusion is the guarantee that only a fixed number of threads can enter a critical section at once.
 However, many solutions exist for mutual exclusion, which vary in terms of performance, flexibility and ease of use.
 Methods range from low-level locks, which are fast and flexible but require significant attention for correctness, to higher-level concurrency techniques, which sacrifice some performance to improve ease of use.
 Ease of use comes by either guaranteeing some problems cannot occur (\eg deadlock free), or by offering a more explicit coupling between shared data and critical section.
+Methods range from low-level locks, which are fast and flexible but require significant attention to be correct, to higher-level concurrency techniques, which sacrifice some performance in order to improve ease of use.
+Ease of use comes by either guaranteeing some problems cannot occur (\eg being deadlock free) or by offering a more explicit coupling between data and corresponding critical section.
 For example, the \CC @std::atomic<T>@ offers an easy way to express mutual-exclusion on a restricted set of operations (\eg reading/writing large types atomically).
 However, a significant challenge with (low-level) locks is composability because it takes careful organization for multiple locks to be used while preventing deadlock.
 Easing composability is another feature higher-level mutual-exclusion mechanisms offer.
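The composability problem with multiple locks can be illustrated with a small \CC (C++17) sketch; the names @Account@ and @transfer@ are hypothetical and the example is not taken from \CFA. If two threads each took the two locks one at a time in opposite orders, they could deadlock; @std::scoped_lock@ acquires both with a deadlock-avoidance algorithm, so the caller need not agree on a global lock order.

```cpp
#include <cassert>
#include <mutex>

// Hypothetical account: the lock and the data it protects are kept together.
struct Account {
    std::mutex mtx;
    long balance = 0;
};

// Move `amount` between two distinct accounts while holding both locks.
void transfer(Account& from, Account& to, long amount) {
    // Locking the same mutex twice in one scoped_lock is undefined behaviour.
    assert(&from != &to);
    // Acquires both mutexes without imposing a lock order on callers.
    std::scoped_lock lock(from.mtx, to.mtx);
    from.balance -= amount;
    to.balance += amount;
}
```

With this form, concurrent calls such as @transfer(a, b, 1)@ and @transfer(b, a, 1)@ do not deadlock on the two account locks, which is the kind of guarantee higher-level mechanisms aim to provide by construction.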
+\subsection{Synchronization}
+Synchronization enforces relative ordering of execution, and synchronization tools provide numerous mechanisms to establish these timing relationships.
 Low-level synchronization primitives offer good performance and flexibility at the cost of ease of use.
 Higher-level mechanisms often simplify usage by adding better coupling between synchronization and data (\eg message passing), or offering a simpler solution to otherwise involved challenges, \eg barrier lock.
+Another challenge with low-level locks is composability.
+Locks have restricted composability because it takes careful organizing for multiple locks to be used while preventing deadlocks.
+Easing composability is another feature higher-level mutual-exclusion mechanisms often offer.
+\subsubsection{Synchronization}
+As with mutual-exclusion, low-level synchronization primitives often offer good performance and good flexibility at the cost of ease of use.
+Again, higher-level mechanisms often simplify usage by adding either better coupling between synchronization and data (\eg message passing) or offering a simpler solution to otherwise involved challenges.
 As mentioned above, synchronization can be expressed as guaranteeing that event \textit{X} always happens before \textit{Y}.
+Often synchronization is used to order access to a critical section, \eg ensuring the next kind of thread to enter a critical section is a reader thread.
+If a writer thread is scheduled for next access, but another reader thread acquires the critical section first, the reader has \newterm{barged}.
+Barging can result in staleness/freshness problems, where a reader barges ahead of a write and reads temporally stale data, or a writer barges ahead of another writer overwriting data with a fresh value preventing the previous value from having an opportunity to be read.
+Most of the time, synchronization happens within a critical section, where threads must acquire mutual-exclusion in a certain order.
+However, it may also be desirable to guarantee that event \textit{Z} does not occur between \textit{X} and \textit{Y}.
+Not satisfying this property is called \textbf{barging}.
+For example, event \textit{X} tries to effect event \textit{Y}, but another thread acquires the critical section and emits \textit{Z} before \textit{Y}.
+The classic example is the thread that finishes using a resource and unblocks a thread waiting to use the resource, but the unblocked thread must compete to acquire the resource.
 Preventing or detecting barging is an involved challenge with low-level locks, which can be made much easier by higher-level constructs.
+This challenge is often split into two different approaches, barging avoidance and barging prevention.
+Algorithms that allow a barger but divert it until later are avoiding the barger, while algorithms that preclude a barger from entering during synchronization in the critical section prevent the barger completely.
+Algorithms that baton-pass locks~\cite{Andrews89} between threads instead of releasing the locks are said to be using barging prevention.
+This challenge is often split into two different methods, barging avoidance and barging prevention.
+Algorithms that use flag variables to detect barging threads are said to be using barging avoidance, while algorithms that baton-pass locks~\cite{Andrews89} between threads instead of releasing the locks are said to be using barging prevention.
+% ======================================================================
+% ======================================================================
 \section{Monitors}
+\label{s:Monitors}
+% ======================================================================
+% ======================================================================
 A \textbf{monitor} is a set of routines that ensure mutual-exclusion when accessing shared state.
 More precisely, a monitor is a programming technique that associates mutual-exclusion to routine scopes, as opposed to mutex locks, where mutual-exclusion is defined by lock/release calls independently of any scoping of the calling routine.
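The contrast between scope-based and call-based mutual exclusion can be approximated in \CC, which has no monitor construct. In this sketch (the class name @CounterMonitor@ is hypothetical, and this is a simulation rather than the \CFA monitor), every public member function holds the object's lock for exactly the duration of its scope, so callers never issue lock/release calls themselves.

```cpp
#include <mutex>

// Hypothetical monitor-style counter: mutual exclusion is tied to the
// scope of each member function rather than to explicit lock/release calls.
class CounterMonitor {
    std::mutex mtx_;
    long count_ = 0;
public:
    void increment() {
        std::lock_guard<std::mutex> guard(mtx_);  // acquired on entry
        ++count_;
    }                                             // released on return
    long get() {
        std::lock_guard<std::mutex> guard(mtx_);
        return count_;
    }
};
```

The discipline here is only a convention: nothing stops a new member function from forgetting the guard, which is the gap a language-level monitor closes.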
 …
 Given these building blocks, it is possible to reproduce all three of the popular paradigms.
 Indeed, \textbf{uthread} is the default paradigm in \CFA.
 However, disabling \textbf{preemption} on a cluster means threads effectively become fibers.
+However, disabling \textbf{preemption} on the \textbf{cfacluster} means \textbf{cfathread} effectively become \textbf{fiber}.
 Since several \textbf{cfacluster} with different scheduling policy can coexist in the same application, this allows \textbf{fiber} and \textbf{uthread} to coexist in the runtime of an application.
 Finally, it is possible to build executors for thread pools from \textbf{uthread} or \textbf{fiber}, which includes specialized jobs like actors~\cite{Actors}.
