Timestamp:
May 24, 2019, 10:19:41 AM (6 years ago)
Author:
Thierry Delisle <tdelisle@…>
Branches:
ADT, arm-eh, ast-experimental, cleanup-dtors, enum, forall-pointer-decay, jacob/cs343-translation, jenkins-sandbox, master, new-ast, new-ast-unique-expr, pthread-emulation, qualifiedEnum
Children:
d908563
Parents:
6a9d4b4 (diff), 292642a (diff)
Note: this is a merge changeset, the changes displayed below correspond to the merge itself.
Use the (diff) links above to see all the changes relative to each parent.
Message:

Merge branch 'master' into cleanup-dtors

File:
1 edited

Legend:

Unmodified
Added
Removed
  • doc/papers/concurrency/Paper.tex

    r6a9d4b4 r933f32f  
    215215{}
    216216\lstnewenvironment{Go}[1][]
    217 {\lstset{#1}}
     217{\lstset{language=go,moredelim=**[is][\protect\color{red}]{`}{`},#1}\lstset{#1}}
     218{}
     219\lstnewenvironment{python}[1][]
     220{\lstset{language=python,moredelim=**[is][\protect\color{red}]{`}{`},#1}\lstset{#1}}
    218221{}
    219222
     
    228231}
    229232
    230 \title{\texorpdfstring{Concurrency in \protect\CFA}{Concurrency in Cforall}}
     233\title{\texorpdfstring{Advanced Control-flow and Concurrency in \protect\CFA}{Advanced Control-flow in Cforall}}
    231234
    232235\author[1]{Thierry Delisle}
     
    238241\corres{*Peter A. Buhr, Cheriton School of Computer Science, University of Waterloo, 200 University Avenue West, Waterloo, ON, N2L 3G1, Canada. \email{pabuhr{\char`\@}uwaterloo.ca}}
    239242
    240 \fundingInfo{Natural Sciences and Engineering Research Council of Canada}
     243% \fundingInfo{Natural Sciences and Engineering Research Council of Canada}
    241244
    242245\abstract[Summary]{
    243 \CFA is a modern, polymorphic, \emph{non-object-oriented} extension of the C programming language.
    244 This paper discusses the design of the concurrency and parallelism features in \CFA, and its concurrent runtime-system.
    245 These features are created from scratch as ISO C lacks concurrency, relying largely on the pthreads library for concurrency.
    246 Coroutines and lightweight (user) threads are introduced into \CFA;
    247 as well, monitors are added as a high-level mechanism for mutual exclusion and synchronization.
    248 A unique contribution of this work is allowing multiple monitors to be safely acquired \emph{simultaneously}.
    249 All features respect the expectations of C programmers, while being fully integrate with the \CFA polymorphic type-system and other language features.
     246\CFA is a polymorphic, non-object-oriented, concurrent, backwards-compatible extension of the C programming language.
     247This paper discusses the design philosophy and implementation of its advanced control-flow and concurrent/parallel features, along with the supporting runtime.
     248These features are created from scratch as ISO C has only low-level and/or unimplemented concurrency, so C programmers continue to rely on library features like C pthreads.
     249\CFA introduces modern language-level control-flow mechanisms, like coroutines, user-level threading, and monitors for mutual exclusion and synchronization.
      250Library extensions for executors, futures, and actors are built on these basic mechanisms.
     251The runtime provides significant programmer simplification and safety by eliminating spurious wakeup and reducing monitor barging.
     252The runtime also ensures multiple monitors can be safely acquired \emph{simultaneously} (deadlock free), and this feature is fully integrated with all monitor synchronization mechanisms.
     253All language features integrate with the \CFA polymorphic type-system and exception handling, while respecting the expectations and style of C programmers.
    250254Experimental results show comparable performance of the new features with similar mechanisms in other concurrent programming-languages.
    251255}%
    252256
    253 \keywords{concurrency, parallelism, coroutines, threads, monitors, runtime, C, Cforall}
     257\keywords{coroutines, concurrency, parallelism, threads, monitors, runtime, C, \CFA (Cforall)}
    254258
    255259
     
    262266\section{Introduction}
    263267
     268This paper discusses the design philosophy and implementation of advanced language-level control-flow and concurrent/parallel features in \CFA~\cite{Moss18} and its runtime.
     269\CFA is a modern, polymorphic, non-object-oriented\footnote{
     270\CFA has features often associated with object-oriented programming languages, such as constructors, destructors, virtuals and simple inheritance.
     271However, functions \emph{cannot} be nested in structures, so there is no lexical binding between a structure and set of functions (member/method) implemented by an implicit \lstinline@this@ (receiver) parameter.},
     272backwards-compatible extension of the C programming language.
     273Within the \CFA framework, new control-flow features are created from scratch.
     274ISO \Celeven defines only a subset of the \CFA extensions, where the overlapping features are concurrency~\cite[\S~7.26]{C11}.
     275However, \Celeven concurrency is largely wrappers for a subset of the pthreads library~\cite{Butenhof97,Pthreads}.
     276Furthermore, \Celeven and pthreads concurrency is simple, based on thread fork/join in a function and a few locks, which is low-level and error prone;
     277no high-level language concurrency features are defined.
      278Interestingly, almost a decade after publication of the \Celeven standard, neither gcc-8, clang-8, nor msvc-19 (most recent versions) supports the \Celeven include @threads.h@, indicating little interest in the \Celeven concurrency approach.
     279Finally, while the \Celeven standard does not state a threading model, the historical association with pthreads suggests implementations would adopt kernel-level threading (1:1)~\cite{ThreadModel}.
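As a concrete illustration of this low-level style (a sketch only; the thread bodies and counts are illustrative, not part of \CFA), the fork/join-plus-lock pattern that \Celeven concurrency largely wraps looks like:
\begin{cfa}
#include <pthread.h>
#include <stdio.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
int counter = 0;						// shared state guarded by lock
void * worker( void * arg ) {			// thread body
	for ( int i = 0; i < 1000; i += 1 ) {
		pthread_mutex_lock( &lock );	// explicit mutual exclusion
		counter += 1;
		pthread_mutex_unlock( &lock );
	}
	return NULL;
}
int main() {
	pthread_t t1, t2;
	pthread_create( &t1, NULL, worker, NULL );	// fork
	pthread_create( &t2, NULL, worker, NULL );
	pthread_join( t1, NULL );			// join
	pthread_join( t2, NULL );
	printf( "%d\n", counter );			// 2000
}
\end{cfa}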
     280
     281In contrast, there has been a renewed interest during the past decade in user-level (M:N, green) threading in old and new programming languages.
     282As multi-core hardware became available in the 1980/90s, both user and kernel threading were examined.
     283Kernel threading was chosen, largely because of its simplicity and fit with the simpler operating systems and hardware architectures at the time, which gave it a performance advantage~\cite{Drepper03}.
     284Libraries like pthreads were developed for C, and the Solaris operating-system switched from user (JDK 1.1~\cite{JDK1.1}) to kernel threads.
     285As a result, languages like Java, Scala~\cite{Scala}, Objective-C~\cite{obj-c-book}, \CCeleven~\cite{C11}, and C\#~\cite{Csharp} adopt the 1:1 kernel-threading model, with a variety of presentation mechanisms.
     286From 2000 onwards, languages like Go~\cite{Go}, Erlang~\cite{Erlang}, Haskell~\cite{Haskell}, D~\cite{D}, and \uC~\cite{uC++,uC++book} have championed the M:N user-threading model, and many user-threading libraries have appeared~\cite{Qthreads,MPC,BoostThreads}, including putting green threads back into Java~\cite{Quasar}.
      287The main argument for user-level threading is that user threads are lighter weight than kernel threads (locking and context switching do not cross the kernel boundary), so there is less restriction on programming styles that encourage large numbers of threads performing smaller work-units to facilitate load balancing by the runtime~\cite{Verch12}.
     288As well, user-threading facilitates a simpler concurrency approach using thread objects that leverage sequential patterns versus events with call-backs~\cite{vonBehren03}.
     289Finally, performant user-threading implementations (both time and space) are largely competitive with direct kernel-threading implementations, while achieving the programming advantages of high concurrency levels and safety.
     290
     291A further effort over the past two decades is the development of language memory-models to deal with the conflict between language features and compiler/hardware optimizations, i.e., some language features are unsafe in the presence of aggressive sequential optimizations~\cite{Buhr95a,Boehm05}.
     292The consequence is that a language must provide sufficient tools to program around safety issues, as inline and library code is all sequential to the compiler.
     293One solution is low-level qualifiers and functions (e.g., @volatile@ and atomics) allowing \emph{programmers} to explicitly write safe (race-free~\cite{Boehm12}) programs.
     294A safer solution is high-level language constructs so the \emph{compiler} knows the optimization boundaries, and hence, provides implicit safety.
      295This problem is best known with respect to concurrency, but applies to other complex control-flow, like exceptions\footnote{
     296\CFA exception handling will be presented in a separate paper.
     297The key feature that dovetails with this paper is non-local exceptions allowing exceptions to be raised across stacks, with synchronous exceptions raised among coroutines and asynchronous exceptions raised among threads, similar to that in \uC~\cite[\S~5]{uC++}
     298} and coroutines.
      299Finally, solutions in the language allow matching constructs with the language paradigm, i.e., imperative and functional languages have different presentations of the same concept.
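For example (a sketch, not \CFA code), the low-level \Celeven approach marks the shared access explicitly with an atomic type, so the programmer rather than the compiler establishes the optimization boundary:
\begin{cfa}
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

atomic_int hits;						// explicitly shared; accesses are race free
void * worker( void * arg ) {
	for ( int i = 0; i < 1000; i += 1 )
		atomic_fetch_add( &hits, 1 );	// safe increment without a lock
	return NULL;
}
int main() {
	pthread_t t1, t2;
	pthread_create( &t1, NULL, worker, NULL );
	pthread_create( &t2, NULL, worker, NULL );
	pthread_join( t1, NULL );  pthread_join( t2, NULL );
	printf( "%d\n", atomic_load( &hits ) );	// always 2000
}
\end{cfa}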
     300
     301Finally, it is important for a language to provide safety over performance \emph{as the default}, allowing careful reduction of safety for performance when necessary.
      302Two concurrency violations of this philosophy are \emph{spurious wakeup} and \emph{barging}, i.e., random wakeup~\cite[\S~8]{Buhr05a} and signalling-as-hints~\cite[\S~8]{Buhr05a}, where one begets the other.
      303If you believe spurious wakeup is a foundational concurrency property, then unblocking (signalling) a thread is always a hint.
      304If you \emph{do not} believe spurious wakeup is foundational, then signalling-as-hints is a performance decision.
      305Most importantly, removing spurious wakeup and signalling-as-hints makes concurrent programming significantly safer because it removes local non-determinism.
      306Clawing back performance, where local non-determinism is unimportant, should be an option, not the default.
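The cost of signalling-as-hints is visible in the standard pthreads idiom (a sketch with illustrative names): every wait must re-check its predicate in a loop because a wakeup guarantees nothing.
\begin{cfa}
#include <pthread.h>
#include <stdbool.h>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
bool ready = false;

void wait_ready( void ) {
	pthread_mutex_lock( &m );
	while ( ! ready )					// loop: a wakeup is only a hint
		pthread_cond_wait( &cond, &m );	// spurious wakeup or barging possible
	pthread_mutex_unlock( &m );
}
void make_ready( void ) {
	pthread_mutex_lock( &m );
	ready = true;
	pthread_cond_signal( &cond );		// hint: waiter must re-test ready
	pthread_mutex_unlock( &m );
}
\end{cfa}
Without spurious wakeup and barging, the \lstinline@while@ can safely become an \lstinline@if@, because the awakened thread knows the condition it waited for still holds.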
     307
     308\begin{comment}
     309For example, it is possible to provide exceptions, coroutines, monitors, and tasks as specialized types in an object-oriented language, integrating these constructs to allow leveraging the type-system (static type-checking) and all other object-oriented capabilities~\cite{uC++}.
     310It is also possible to leverage call/return for blocking communication via new control structures, versus switching to alternative communication paradigms, like channels or message passing.
     311As well, user threading is often a complementary feature, allowing light-weight threading to match with low-cost objects, while hiding the application/kernel boundary.
      312User threading also allows layering of implicit concurrency models (no explicit thread creation), such as executors, data-flow, and actors, into a single language, so programmers can choose the model that best fits an algorithm.\footnote{
      313All implicit concurrency models have explicit threading in their implementation, and hence, can be built from explicit threading;
     314however, the reverse is seldom true, i.e., given implicit concurrency, e.g., actors, it is virtually impossible to create explicit concurrency, e.g., blocking thread objects.}
     315Finally, with extended language features and user-level threading it is possible to discretely fold locking and non-blocking I/O multiplexing into the language's I/O libraries, so threading implicitly dovetails with the I/O subsystem.
     316\CFA embraces language extensions and user-level threading to provide advanced control-flow (exception handling\footnote{
     317\CFA exception handling will be presented in a separate paper.
     318The key feature that dovetails with this paper is non-local exceptions allowing exceptions to be raised across stacks, with synchronous exceptions raised among coroutines and asynchronous exceptions raised among threads, similar to that in \uC~\cite[\S~5]{uC++}
     319} and coroutines) and concurrency.
     320
      321Most augmented traditional (Fortran 18~\cite{Fortran18}, Cobol 14~\cite{Cobol14}, Ada 12~\cite{Ada12}, Java 11~\cite{Java11}) and new languages (Go~\cite{Go}, Rust~\cite{Rust}, and D~\cite{D}), except \CC, diverge from C with different syntax and semantics, only interoperate indirectly with C, and, for those with managed memory, are not systems languages.
     322As a result, there is a significant learning curve to move to these languages, and C legacy-code must be rewritten.
      323While \CC, like \CFA, takes an evolutionary approach to extend C, \CC's constantly growing complex and interdependent feature-set (e.g., objects, inheritance, templates, etc.) means idiomatic \CC code is difficult to use from C, and C programmers must expend significant effort learning \CC.
     324Hence, rewriting and retraining costs for these languages, even \CC, are prohibitive for companies with a large C software-base.
     325\CFA with its orthogonal feature-set, its high-performance runtime, and direct access to all existing C libraries circumvents these problems.
     326\end{comment}
     327
     328\CFA embraces user-level threading, language extensions for advanced control-flow, and safety as the default.
     329We present comparative examples so the reader can judge if the \CFA control-flow extensions are better and safer than those in or proposed for \Celeven, \CC and other concurrent, imperative programming languages, and perform experiments to show the \CFA runtime is competitive with other similar mechanisms.
     330The main contributions of this work are:
     331\begin{itemize}
     332\item
     333expressive language-level coroutines and user-level threading, which respect the expectations of C programmers.
     334\item
     335monitor synchronization without barging.
     336\item
     337safely acquiring multiple monitors \emph{simultaneously} (deadlock free), while seamlessly integrating this capability with all monitor synchronization mechanisms.
     338\item
     339providing statically type-safe interfaces that integrate with the \CFA polymorphic type-system and other language features.
     340\item
     341library extensions for executors, futures, and actors built on the basic mechanisms.
     342\item
     343a runtime system with no spurious wakeup.
     344\item
     345experimental results showing comparable performance of the new features with similar mechanisms in other concurrent programming-languages.
     346\end{itemize}
     347
     348\begin{comment}
    264349This paper provides a minimal concurrency \newterm{Application Program Interface} (API) that is simple, efficient and can be used to build other concurrency features.
    265350While the simplest concurrency system is a thread and a lock, this low-level approach is hard to master.
     
    281366The proposed concurrency API is implemented in a dialect of C, called \CFA (pronounced C-for-all).
    282367The paper discusses how the language features are added to the \CFA translator with respect to parsing, semantics, and type checking, and the corresponding high-performance runtime-library to implement the concurrent features.
    283 
    284 
     368\end{comment}
     369
     370
     371\begin{comment}
    285372\section{\CFA Overview}
    286373
     
    551638\end{cfa}
    552639where the return type supplies the type/size of the allocation, which is impossible in most type systems.
    553 
    554 
    555 \section{Concurrency}
    556 \label{s:Concurrency}
    557 
    558 At its core, concurrency is based on multiple call-stacks and scheduling threads executing on these stacks.
    559 Multiple call stacks (or contexts) and a single thread of execution, called \newterm{coroutining}~\cite{Conway63,Marlin80}, does \emph{not} imply concurrency~\cite[\S~2]{Buhr05a}.
    560 In coroutining, the single thread is self-scheduling across the stacks, so execution is deterministic, \ie the execution path from input to output is fixed and predictable.
    561 A \newterm{stackless} coroutine executes on the caller's stack~\cite{Python} but this approach is restrictive, \eg preventing modularization and supporting only iterator/generator-style programming;
    562 a \newterm{stackful} coroutine executes on its own stack, allowing full generality.
    563 Only stackful coroutines are a stepping stone to concurrency.
    564 
    565 The transition to concurrency, even for execution with a single thread and multiple stacks, occurs when coroutines also context switch to a \newterm{scheduling oracle}, introducing non-determinism from the coroutine perspective~\cite[\S~3]{Buhr05a}.
    566 Therefore, a minimal concurrency system is possible using coroutines (see Section \ref{coroutine}) in conjunction with a scheduler to decide where to context switch next.
    567 The resulting execution system now follows a cooperative threading-model, called \newterm{non-preemptive scheduling}.
    568 
    569 Because the scheduler is special, it can either be a stackless or stackful coroutine.
    570 For stackless, the scheduler performs scheduling on the stack of the current coroutine and switches directly to the next coroutine, so there is one context switch.
    571 For stackful, the current coroutine switches to the scheduler, which performs scheduling, and it then switches to the next coroutine, so there are two context switches.
    572 A stackful scheduler is often used for simplicity and security.
    573 
    574 Regardless of the approach used, a subset of concurrency related challenges start to appear.
    575 For the complete set of concurrency challenges to occur, the missing feature is \newterm{preemption}, where context switching occurs randomly between any two instructions, often based on a timer interrupt, called \newterm{preemptive scheduling}.
    576 While a scheduler introduces uncertainty in the order of execution, preemption introduces uncertainty about where context switches occur.
    577 Interestingly, uncertainty is necessary for the runtime (operating) system to give the illusion of parallelism on a single processor and increase performance on multiple processors.
    578 The reason is that only the runtime has complete knowledge about resources and how to best utilized them.
    579 However, the introduction of unrestricted non-determinism results in the need for \newterm{mutual exclusion} and \newterm{synchronization} to restrict non-determinism for correctness;
    580 otherwise, it is impossible to write meaningful programs.
    581 Optimal performance in concurrent applications is often obtained by having as much non-determinism as correctness allows.
    582 
    583 An important missing feature in C is threading\footnote{While the C11 standard defines a \protect\lstinline@threads.h@ header, it is minimal and defined as optional.
    584 As such, library support for threading is far from widespread.
    585 At the time of writing the paper, neither \protect\lstinline@gcc@ nor \protect\lstinline@clang@ support \protect\lstinline@threads.h@ in their standard libraries.}.
    586 In modern programming languages, a lack of threading is unacceptable~\cite{Sutter05, Sutter05b}, and therefore existing and new programming languages must have tools for writing efficient concurrent programs to take advantage of parallelism.
    587 As an extension of C, \CFA needs to express these concepts in a way that is as natural as possible to programmers familiar with imperative languages.
    588 Furthermore, because C is a system-level language, programmers expect to choose precisely which features they need and which cost they are willing to pay.
    589 Hence, concurrent programs should be written using high-level mechanisms, and only step down to lower-level mechanisms when performance bottlenecks are encountered.
    590 
    591 
    592 \subsection{Coroutines: A Stepping Stone}\label{coroutine}
    593 
    594 While the focus of this discussion is concurrency and parallelism, it is important to address coroutines, which are a significant building block of a concurrency system (but not concurrent among themselves).
     640\end{comment}
     641
     642
     643\section{Coroutines: Stepping Stone}
     644\label{coroutine}
     645
    595646Coroutines are generalized routines allowing execution to be temporarily suspended and later resumed.
    596647Hence, unlike a normal routine, a coroutine may not terminate when it returns to its caller, allowing it to be restarted with the values and execution location present at the point of suspension.
     
    616667\centering
    617668\newbox\myboxA
     669% \begin{lrbox}{\myboxA}
     670% \begin{cfa}[aboveskip=0pt,belowskip=0pt]
     671% `int fn1, fn2, state = 1;`   // single global variables
     672% int fib() {
     673%       int fn;
     674%       `switch ( state )` {  // explicit execution state
     675%         case 1: fn = 0;  fn1 = fn;  state = 2;  break;
     676%         case 2: fn = 1;  fn2 = fn1;  fn1 = fn;  state = 3;  break;
     677%         case 3: fn = fn1 + fn2;  fn2 = fn1;  fn1 = fn;  break;
     678%       }
     679%       return fn;
     680% }
     681% int main() {
     682%
     683%       for ( int i = 0; i < 10; i += 1 ) {
     684%               printf( "%d\n", fib() );
     685%       }
     686% }
     687% \end{cfa}
     688% \end{lrbox}
    618689\begin{lrbox}{\myboxA}
    619690\begin{cfa}[aboveskip=0pt,belowskip=0pt]
    620 `int f1, f2, state = 1;`   // single global variables
    621 int fib() {
    622         int fn;
    623         `switch ( state )` {  // explicit execution state
    624           case 1: fn = 0;  f1 = fn;  state = 2;  break;
    625           case 2: fn = 1;  f2 = f1;  f1 = fn;  state = 3;  break;
    626           case 3: fn = f1 + f2;  f2 = f1;  f1 = fn;  break;
    627         }
    628         return fn;
    629 }
     691#define FIB_INIT { 0, 1 }
     692typedef struct { int fn1, fn; } Fib;
     693int fib( Fib * f ) {
     694
     695        int ret = f->fn1;
     696        f->fn1 = f->fn;
     697        f->fn = ret + f->fn;
     698        return ret;
     699}
     700
     701
     702
    630703int main() {
    631 
     704        Fib f1 = FIB_INIT, f2 = FIB_INIT;
    632705        for ( int i = 0; i < 10; i += 1 ) {
    633                 printf( "%d\n", fib() );
     706                printf( "%d %d\n",
     707                                fib( &f1 ), fib( &f2 ) );
    634708        }
    635709}
     
    640714\begin{lrbox}{\myboxB}
    641715\begin{cfa}[aboveskip=0pt,belowskip=0pt]
    642 #define FIB_INIT `{ 0, 1 }`
    643 typedef struct { int f2, f1; } Fib;
    644 int fib( Fib * f ) {
    645 
    646         int ret = f->f2;
    647         int fn = f->f1 + f->f2;
    648         f->f2 = f->f1; f->f1 = fn;
    649 
    650         return ret;
    651 }
    652 int main() {
    653         Fib f1 = FIB_INIT, f2 = FIB_INIT;
    654         for ( int i = 0; i < 10; i += 1 ) {
    655                 printf( "%d %d\n", fib( &f1 ), fib( &f2 ) );
     716`coroutine` Fib { int fn1; };
     717void main( Fib & fib ) with( fib ) {
     718        int fn;
     719        [fn1, fn] = [0, 1];
     720        for () {
     721                `suspend();`
     722                [fn1, fn] = [fn, fn1 + fn];
    656723        }
    657724}
    658 \end{cfa}
    659 \end{lrbox}
    660 
    661 \subfloat[3 States: global variables]{\label{f:GlobalVariables}\usebox\myboxA}
    662 \qquad
    663 \subfloat[1 State: external variables]{\label{f:ExternalState}\usebox\myboxB}
    664 \caption{C Fibonacci Implementations}
    665 \label{f:C-fibonacci}
    666 
    667 \bigskip
    668 
    669 \newbox\myboxA
    670 \begin{lrbox}{\myboxA}
    671 \begin{cfa}[aboveskip=0pt,belowskip=0pt]
    672 `coroutine` Fib { int fn; };
    673 void main( Fib & fib ) with( fib ) {
    674         int f1, f2;
    675         fn = 0;  f1 = fn;  `suspend()`;
    676         fn = 1;  f2 = f1;  f1 = fn;  `suspend()`;
    677         for ( ;; ) {
    678                 fn = f1 + f2;  f2 = f1;  f1 = fn;  `suspend()`;
    679         }
    680 }
    681 int next( Fib & fib ) with( fib ) {
    682         `resume( fib );`
    683         return fn;
     725int ?()( Fib & fib ) with( fib ) {
     726        `resume( fib );`  return fn1;
    684727}
    685728int main() {
    686729        Fib f1, f2;
    687         for ( int i = 1; i <= 10; i += 1 ) {
    688                 sout | next( f1 ) | next( f2 );
    689         }
    690 }
      730        for ( 10 )
     731                sout | f1() | f2();
     732}
     733
     734
    691735\end{cfa}
    692736\end{lrbox}
    693 \newbox\myboxB
    694 \begin{lrbox}{\myboxB}
    695 \begin{cfa}[aboveskip=0pt,belowskip=0pt]
    696 `coroutine` Fib { int ret; };
    697 void main( Fib & f ) with( fib ) {
    698         int fn, f1 = 1, f2 = 0;
    699         for ( ;; ) {
    700                 ret = f2;
    701 
    702                 fn = f1 + f2;  f2 = f1;  f1 = fn; `suspend();`
    703         }
    704 }
    705 int next( Fib & fib ) with( fib ) {
    706         `resume( fib );`
    707         return ret;
    708 }
    709 
    710 
    711 
    712 
    713 
    714 
    715 \end{cfa}
     737
     738\newbox\myboxC
     739\begin{lrbox}{\myboxC}
     740\begin{python}[aboveskip=0pt,belowskip=0pt]
     741
     742def Fib():
     743
     744    fn1, fn = 0, 1
     745    while True:
     746        `yield fn1`
     747        fn1, fn = fn, fn1 + fn
     748
     749
      750# next prewritten
     751
     752
     753f1 = Fib()
     754f2 = Fib()
     755for i in range( 10 ):
     756        print( next( f1 ), next( f2 ) )
     757
     758
     759
     760\end{python}
    716761\end{lrbox}
    717 \subfloat[3 States, internal variables]{\label{f:Coroutine3States}\usebox\myboxA}
    718 \qquad\qquad
    719 \subfloat[1 State, internal variables]{\label{f:Coroutine1State}\usebox\myboxB}
    720 \caption{\CFA Coroutine Fibonacci Implementations}
    721 \label{f:cfa-fibonacci}
     762
     763\subfloat[C]{\label{f:GlobalVariables}\usebox\myboxA}
     764\hspace{3pt}
     765\vrule
     766\hspace{3pt}
     767\subfloat[\CFA]{\label{f:ExternalState}\usebox\myboxB}
     768\hspace{3pt}
     769\vrule
     770\hspace{3pt}
     771\subfloat[Python]{\label{f:ExternalState}\usebox\myboxC}
     772\caption{Fibonacci Generator}
     773\label{f:C-fibonacci}
     774
     775% \bigskip
     776%
     777% \newbox\myboxA
     778% \begin{lrbox}{\myboxA}
     779% \begin{cfa}[aboveskip=0pt,belowskip=0pt]
     780% `coroutine` Fib { int fn; };
     781% void main( Fib & fib ) with( fib ) {
     782%       fn = 0;  int fn1 = fn; `suspend()`;
     783%       fn = 1;  int fn2 = fn1;  fn1 = fn; `suspend()`;
     784%       for () {
     785%               fn = fn1 + fn2; fn2 = fn1; fn1 = fn; `suspend()`; }
     786% }
     787% int next( Fib & fib ) with( fib ) { `resume( fib );` return fn; }
     788% int main() {
     789%       Fib f1, f2;
     790%       for ( 10 )
     791%               sout | next( f1 ) | next( f2 );
     792% }
     793% \end{cfa}
     794% \end{lrbox}
     795% \newbox\myboxB
     796% \begin{lrbox}{\myboxB}
     797% \begin{python}[aboveskip=0pt,belowskip=0pt]
     798%
     799% def Fibonacci():
     800%       fn = 0; fn1 = fn; `yield fn`  # suspend
     801%       fn = 1; fn2 = fn1; fn1 = fn; `yield fn`
     802%       while True:
     803%               fn = fn1 + fn2; fn2 = fn1; fn1 = fn; `yield fn`
     804%
     805%
     806% f1 = Fibonacci()
     807% f2 = Fibonacci()
     808% for i in range( 10 ):
     809%       print( `next( f1 )`, `next( f2 )` ) # resume
     810%
     811% \end{python}
     812% \end{lrbox}
     813% \subfloat[\CFA]{\label{f:Coroutine3States}\usebox\myboxA}
     814% \qquad
     815% \subfloat[Python]{\label{f:Coroutine1State}\usebox\myboxB}
     816% \caption{Fibonacci input coroutine, 3 states, internal variables}
     817% \label{f:cfa-fibonacci}
    722818\end{figure}
    723819
     
    759855\begin{lrbox}{\myboxA}
    760856\begin{cfa}[aboveskip=0pt,belowskip=0pt]
    761 `coroutine` Format {
    762         char ch;   // used for communication
    763         int g, b;  // global because used in destructor
     857`coroutine` Fmt {
     858        char ch;   // communication variables
     859        int g, b;   // needed in destructor
    764860};
    765 void main( Format & fmt ) with( fmt ) {
    766         for ( ;; ) {
    767                 for ( g = 0; g < 5; g += 1 ) {      // group
    768                         for ( b = 0; b < 4; b += 1 ) { // block
     861void main( Fmt & fmt ) with( fmt ) {
     862        for () {
     863                for ( g = 0; g < 5; g += 1 ) { // groups
     864                        for ( b = 0; b < 4; b += 1 ) { // blocks
    769865                                `suspend();`
    770                                 sout | ch;              // separator
    771                         }
    772                         sout | "  ";               // separator
    773                 }
    774                 sout | nl;
    775         }
    776 }
    777 void ?{}( Format & fmt ) { `resume( fmt );` }
    778 void ^?{}( Format & fmt ) with( fmt ) {
    779         if ( g != 0 || b != 0 ) sout | nl;
    780 }
    781 void format( Format & fmt ) {
    782         `resume( fmt );`
    783 }
     866                                sout | ch; } // print character
     867                        sout | "  "; } // block separator
     868                sout | nl; }  // group separator
     869}
     870void ?{}( Fmt & fmt ) { `resume( fmt );` } // prime
     871void ^?{}( Fmt & fmt ) with( fmt ) { // destructor
     872        if ( g != 0 || b != 0 ) // special case
     873                sout | nl; }
     874void send( Fmt & fmt, char c ) { fmt.ch = c; `resume( fmt )`; }
    784875int main() {
    785         Format fmt;
    786         eof: for ( ;; ) {
    787                 sin | fmt.ch;
    788           if ( eof( sin ) ) break eof;
    789                 format( fmt );
    790         }
     876        Fmt fmt;
     877        sout | nlOff;   // turn off auto newline
     878        for ( 41 )
     879                send( fmt, 'a' );
    791880}
    792881\end{cfa}
     
    795884\newbox\myboxB
    796885\begin{lrbox}{\myboxB}
    797 \begin{cfa}[aboveskip=0pt,belowskip=0pt]
    798 struct Format {
    799         char ch;
    800         int g, b;
    801 };
    802 void format( struct Format * fmt ) {
    803         if ( fmt->ch != -1 ) {      // not EOF ?
    804                 printf( "%c", fmt->ch );
    805                 fmt->b += 1;
    806                 if ( fmt->b == 4 ) {  // block
    807                         printf( "  " );      // separator
    808                         fmt->b = 0;
    809                         fmt->g += 1;
    810                 }
    811                 if ( fmt->g == 5 ) {  // group
    812                         printf( "\n" );     // separator
    813                         fmt->g = 0;
    814                 }
    815         } else {
    816                 if ( fmt->g != 0 || fmt->b != 0 ) printf( "\n" );
    817         }
    818 }
    819 int main() {
    820         struct Format fmt = { 0, 0, 0 };
    821         for ( ;; ) {
    822                 scanf( "%c", &fmt.ch );
    823           if ( feof( stdin ) ) break;
    824                 format( &fmt );
    825         }
    826         fmt.ch = -1;
    827         format( &fmt );
    828 }
    829 \end{cfa}
     886\begin{python}[aboveskip=0pt,belowskip=0pt]
     887
     888
     889
     890def Fmt():
     891        try:
     892                while True:
     893                        for g in range( 5 ):
     894                                for b in range( 4 ):
     895
     896                                        print( `(yield)`, end='' )
     897                                print( '  ', end='' )
     898                        print()
     899
     900
     901        except GeneratorExit:
      902                if g != 0 or b != 0:
     903                        print()
     904
     905
     906fmt = Fmt()
     907`next( fmt )`                    # prime
     908for i in range( 41 ):
     909        `fmt.send( 'a' );`      # send to yield
     910
     911\end{python}
    830912\end{lrbox}
    831 \subfloat[\CFA Coroutine]{\label{f:CFAFmt}\usebox\myboxA}
     913\subfloat[\CFA]{\label{f:CFAFmt}\usebox\myboxA}
    832914\qquad
    833 \subfloat[C Linearized]{\label{f:CFmt}\usebox\myboxB}
    834 \caption{Formatting text into lines of 5 blocks of 4 characters.}
     915\subfloat[Python]{\label{f:CFmt}\usebox\myboxB}
     916\caption{Output formatting text}
    835917\label{f:fmt-line}
    836918\end{figure}
     
    853935void main( Prod & prod ) with( prod ) {
    854936        // 1st resume starts here
    855         for ( int i = 0; i < N; i += 1 ) {
     937        for ( i; N ) {
    856938                int p1 = random( 100 ), p2 = random( 100 );
    857939                sout | p1 | " " | p2;
     
    869951}
    870952void start( Prod & prod, int N, Cons &c ) {
    871         &prod.c = &c;
     953        &prod.c = &c; // reassignable reference
    872954        prod.[N, receipt] = [N, 0];
    873955        `resume( prod );`
     
    884966        Prod & p;
    885967        int p1, p2, status;
    886         _Bool done;
     968        bool done;
    887969};
    888970void ?{}( Cons & cons, Prod & p ) {
    889         &cons.p = &p;
     971        &cons.p = &p; // reassignable reference
    890972        cons.[status, done ] = [0, false];
    891973}
     
    9451027@start@ returns and the program main terminates.
    9461028
      1029One \emph{killer} application for coroutines is device drivers, which at one time caused 70\%-85\% of failures in Windows/Linux~\cite{Swift05}.
      1030Many device drivers are finite state-machines parsing a protocol, e.g.:
     1031\begin{tabbing}
     1032\ldots STX \= \ldots message \ldots \= ESC \= ETX \= \ldots message \ldots  \= ETX \= 2-byte crc \= \ldots      \kill
     1033\ldots STX \> \ldots message \ldots \> ESC \> ETX \> \ldots message \ldots  \> ETX \> 2-byte crc \> \ldots
     1034\end{tabbing}
     1035where a network message begins with the control character STX and ends with an ETX, followed by a 2-byte cyclic-redundancy check.
     1036Control characters may appear in a message if preceded by an ESC.
      1037Because FSMs can be complex and occur frequently in important domains, direct support of the coroutine is crucial in a systems programming language.
     1038
     1039\begin{figure}
     1040\begin{cfa}
     1041enum Status { CONT, MSG, ESTX, ELNTH, ECRC };
     1042`coroutine` Driver {
     1043        Status status;
     1044        char * msg, byte;
     1045};
     1046void ?{}( Driver & d, char * m ) { d.msg = m; }         $\C[3.0in]{// constructor}$
     1047Status next( Driver & d, char b ) with( d ) {           $\C{// 'with' opens scope}$
     1048        byte = b; `resume( d );` return status;
     1049}
     1050void main( Driver & d ) with( d ) {
     1051        enum { STX = '\002', ESC = '\033', ETX = '\003', MaxMsg = 64 };
     1052        unsigned short int crc;                                                 $\C{// error checking}$
     1053  msg: for () {                                                                         $\C{// parse message}$
     1054                status = CONT;
     1055                unsigned int lnth = 0, sum = 0;
     1056                while ( byte != STX ) `suspend();`
     1057          emsg: for () {
     1058                        `suspend();`                                                    $\C{// process byte}$
     1059                        choose ( byte ) {                                               $\C{// switch with default break}$
     1060                          case STX:
     1061                                status = ESTX; `suspend();` continue msg;
     1062                          case ETX:
     1063                                break emsg;
     1064                          case ESC:
     1065                                suspend();
     1066                        } // choose
     1067                        if ( lnth >= MaxMsg ) {                                 $\C{// buffer full ?}$
     1068                                status = ELNTH; `suspend();` continue msg; }
     1069                        msg[lnth++] = byte;
     1070                        sum += byte;
     1071                } // for
     1072                msg[lnth] = '\0';                                                       $\C{// terminate string}\CRT$
     1073                `suspend();`
     1074                crc = (unsigned char)byte << 8; // prevent sign extension for signed char
     1075                `suspend();`
     1076                status = (crc | (unsigned char)byte) == sum ? MSG : ECRC;
     1077                `suspend();`
     1078        } // for
     1079}
     1080\end{cfa}
     1081\caption{Device driver for simple communication protocol}
     1082\end{figure}
     1083
    9471084
    9481085\subsection{Coroutine Implementation}
     
    10601197\end{cquote}
    10611198The combination of these two approaches allows an easy and concise specification to coroutining (and concurrency) for normal users, while more advanced users have tighter control on memory layout and initialization.
     1199
     1200
     1201\section{Concurrency}
     1202\label{s:Concurrency}
     1203
     1204At its core, concurrency is based on multiple call-stacks and scheduling threads executing on these stacks.
     1205Multiple call stacks (or contexts) and a single thread of execution, called \newterm{coroutining}~\cite{Conway63,Marlin80}, does \emph{not} imply concurrency~\cite[\S~2]{Buhr05a}.
     1206In coroutining, the single thread is self-scheduling across the stacks, so execution is deterministic, \ie the execution path from input to output is fixed and predictable.
     1207A \newterm{stackless} coroutine executes on the caller's stack~\cite{Python} but this approach is restrictive, \eg preventing modularization and supporting only iterator/generator-style programming;
     1208a \newterm{stackful} coroutine executes on its own stack, allowing full generality.
     1209Only stackful coroutines are a stepping stone to concurrency.
     1210
     1211The transition to concurrency, even for execution with a single thread and multiple stacks, occurs when coroutines also context switch to a \newterm{scheduling oracle}, introducing non-determinism from the coroutine perspective~\cite[\S~3]{Buhr05a}.
     1212Therefore, a minimal concurrency system is possible using coroutines (see Section \ref{coroutine}) in conjunction with a scheduler to decide where to context switch next.
     1213The resulting execution system now follows a cooperative threading-model, called \newterm{non-preemptive scheduling}.
     1214
     1215Because the scheduler is special, it can either be a stackless or stackful coroutine.
     1216For stackless, the scheduler performs scheduling on the stack of the current coroutine and switches directly to the next coroutine, so there is one context switch.
     1217For stackful, the current coroutine switches to the scheduler, which performs scheduling, and it then switches to the next coroutine, so there are two context switches.
     1218A stackful scheduler is often used for simplicity and security.
     1219
     1220Regardless of the approach used, a subset of concurrency related challenges start to appear.
     1221For the complete set of concurrency challenges to occur, the missing feature is \newterm{preemption}, where context switching occurs randomly between any two instructions, often based on a timer interrupt, called \newterm{preemptive scheduling}.
     1222While a scheduler introduces uncertainty in the order of execution, preemption introduces uncertainty about where context switches occur.
     1223Interestingly, uncertainty is necessary for the runtime (operating) system to give the illusion of parallelism on a single processor and increase performance on multiple processors.
      1224The reason is that only the runtime has complete knowledge about resources and how to best utilize them.
     1225However, the introduction of unrestricted non-determinism results in the need for \newterm{mutual exclusion} and \newterm{synchronization} to restrict non-determinism for correctness;
     1226otherwise, it is impossible to write meaningful programs.
     1227Optimal performance in concurrent applications is often obtained by having as much non-determinism as correctness allows.
     1228
     1229An important missing feature in C is threading\footnote{While the C11 standard defines a \protect\lstinline@threads.h@ header, it is minimal and defined as optional.
     1230As such, library support for threading is far from widespread.
     1231At the time of writing the paper, neither \protect\lstinline@gcc@ nor \protect\lstinline@clang@ support \protect\lstinline@threads.h@ in their standard libraries.}.
     1232In modern programming languages, a lack of threading is unacceptable~\cite{Sutter05, Sutter05b}, and therefore existing and new programming languages must have tools for writing efficient concurrent programs to take advantage of parallelism.
     1233As an extension of C, \CFA needs to express these concepts in a way that is as natural as possible to programmers familiar with imperative languages.
     1234Furthermore, because C is a system-level language, programmers expect to choose precisely which features they need and which cost they are willing to pay.
     1235Hence, concurrent programs should be written using high-level mechanisms, and only step down to lower-level mechanisms when performance bottlenecks are encountered.
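For completeness, the optional \Celeven interface amounts to the following (a sketch using only standard \lstinline@threads.h@ names; as noted above, common compiler libraries may not ship this header):
\begin{cfa}
#include <threads.h>					// optional C11 header, often unavailable
#include <stdio.h>

int worker( void * arg ) {				// C11 thread bodies return int
	printf( "hello from %s\n", (char *)arg );
	return 0;
}
int main() {
	thrd_t t;
	if ( thrd_create( &t, worker, "worker" ) != thrd_success ) return 1;
	int res;
	thrd_join( t, &res );				// join and collect the return code
}
\end{cfa}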
    10621236
    10631237