Timestamp:
May 24, 2019, 10:19:41 AM (6 years ago)
Author:
Thierry Delisle <tdelisle@…>
Branches:
ADT, arm-eh, ast-experimental, cleanup-dtors, enum, forall-pointer-decay, jacob/cs343-translation, jenkins-sandbox, master, new-ast, new-ast-unique-expr, pthread-emulation, qualifiedEnum
Children:
d908563
Parents:
6a9d4b4 (diff), 292642a (diff)
Note: this is a merge changeset, the changes displayed below correspond to the merge itself.
Use the (diff) links above to see all the changes relative to each parent.
Message:

Merge branch 'master' into cleanup-dtors

File:
1 edited

Legend:

Unmodified
Added
Removed
  • doc/papers/concurrency/Paper.tex

    r6a9d4b4 r933f32f  
    215215{}
    216216\lstnewenvironment{Go}[1][]
    217 {\lstset{#1}}
     217{\lstset{language=go,moredelim=**[is][\protect\color{red}]{`}{`},#1}\lstset{#1}}
     218{}
     219\lstnewenvironment{python}[1][]
     220{\lstset{language=python,moredelim=**[is][\protect\color{red}]{`}{`},#1}\lstset{#1}}
    218221{}
    219222
     
    228231}
    229232
    230 \title{\texorpdfstring{Concurrency in \protect\CFA}{Concurrency in Cforall}}
     233\title{\texorpdfstring{Advanced Control-flow and Concurrency in \protect\CFA}{Advanced Control-flow in Cforall}}
    231234
    232235\author[1]{Thierry Delisle}
     
    238241\corres{*Peter A. Buhr, Cheriton School of Computer Science, University of Waterloo, 200 University Avenue West, Waterloo, ON, N2L 3G1, Canada. \email{pabuhr{\char`\@}uwaterloo.ca}}
    239242
    240 \fundingInfo{Natural Sciences and Engineering Research Council of Canada}
     243% \fundingInfo{Natural Sciences and Engineering Research Council of Canada}
    241244
    242245\abstract[Summary]{
    243 \CFA is a modern, polymorphic, \emph{non-object-oriented} extension of the C programming language.
    244 This paper discusses the design of the concurrency and parallelism features in \CFA, and its concurrent runtime-system.
    245 These features are created from scratch as ISO C lacks concurrency, relying largely on the pthreads library for concurrency.
    246 Coroutines and lightweight (user) threads are introduced into \CFA;
    247 as well, monitors are added as a high-level mechanism for mutual exclusion and synchronization.
    248 A unique contribution of this work is allowing multiple monitors to be safely acquired \emph{simultaneously}.
    249 All features respect the expectations of C programmers, while being fully integrate with the \CFA polymorphic type-system and other language features.
     246\CFA is a polymorphic, non-object-oriented, concurrent, backwards-compatible extension of the C programming language.
     247This paper discusses the design philosophy and implementation of its advanced control-flow and concurrent/parallel features, along with the supporting runtime.
     248These features are created from scratch as ISO C has only low-level and/or unimplemented concurrency, so C programmers continue to rely on library features like C pthreads.
     249\CFA introduces modern language-level control-flow mechanisms, like coroutines, user-level threading, and monitors for mutual exclusion and synchronization.
      250Library extensions for executors, futures, and actors are built on these basic mechanisms.
     251The runtime provides significant programmer simplification and safety by eliminating spurious wakeup and reducing monitor barging.
     252The runtime also ensures multiple monitors can be safely acquired \emph{simultaneously} (deadlock free), and this feature is fully integrated with all monitor synchronization mechanisms.
     253All language features integrate with the \CFA polymorphic type-system and exception handling, while respecting the expectations and style of C programmers.
    250254Experimental results show comparable performance of the new features with similar mechanisms in other concurrent programming-languages.
    251255}%
    252256
    253 \keywords{concurrency, parallelism, coroutines, threads, monitors, runtime, C, Cforall}
     257\keywords{coroutines, concurrency, parallelism, threads, monitors, runtime, C, \CFA (Cforall)}
    254258
    255259
     
    262266\section{Introduction}
    263267
     268This paper discusses the design philosophy and implementation of advanced language-level control-flow and concurrent/parallel features in \CFA~\cite{Moss18} and its runtime.
     269\CFA is a modern, polymorphic, non-object-oriented\footnote{
     270\CFA has features often associated with object-oriented programming languages, such as constructors, destructors, virtuals and simple inheritance.
     271However, functions \emph{cannot} be nested in structures, so there is no lexical binding between a structure and set of functions (member/method) implemented by an implicit \lstinline@this@ (receiver) parameter.},
     272backwards-compatible extension of the C programming language.
     273Within the \CFA framework, new control-flow features are created from scratch.
     274ISO \Celeven defines only a subset of the \CFA extensions, where the overlapping features are concurrency~\cite[\S~7.26]{C11}.
     275However, \Celeven concurrency is largely wrappers for a subset of the pthreads library~\cite{Butenhof97,Pthreads}.
     276Furthermore, \Celeven and pthreads concurrency is simple, based on thread fork/join in a function and a few locks, which is low-level and error prone;
     277no high-level language concurrency features are defined.
      278Interestingly, almost a decade after publication of the \Celeven standard, neither gcc-8, clang-8, nor msvc-19 (most recent versions) supports the \Celeven include @threads.h@, indicating little interest in the \Celeven concurrency approach.
     279Finally, while the \Celeven standard does not state a threading model, the historical association with pthreads suggests implementations would adopt kernel-level threading (1:1)~\cite{ThreadModel}.
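As a concrete illustration of this low-level style (a sketch only; the thread bodies and counts are illustrative, not part of \CFA), the fork/join-plus-lock pattern that \Celeven concurrency largely wraps looks like:
\begin{cfa}
#include <pthread.h>
#include <stdio.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
int counter = 0;						// shared state guarded by lock
void * worker( void * arg ) {			// thread body
	for ( int i = 0; i < 1000; i += 1 ) {
		pthread_mutex_lock( &lock );	// explicit mutual exclusion
		counter += 1;
		pthread_mutex_unlock( &lock );
	}
	return NULL;
}
int main() {
	pthread_t t1, t2;
	pthread_create( &t1, NULL, worker, NULL );	// fork
	pthread_create( &t2, NULL, worker, NULL );
	pthread_join( t1, NULL );			// join
	pthread_join( t2, NULL );
	printf( "%d\n", counter );			// 2000
}
\end{cfa}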
     280
     281In contrast, there has been a renewed interest during the past decade in user-level (M:N, green) threading in old and new programming languages.
     282As multi-core hardware became available in the 1980/90s, both user and kernel threading were examined.
     283Kernel threading was chosen, largely because of its simplicity and fit with the simpler operating systems and hardware architectures at the time, which gave it a performance advantage~\cite{Drepper03}.
     284Libraries like pthreads were developed for C, and the Solaris operating-system switched from user (JDK 1.1~\cite{JDK1.1}) to kernel threads.
     285As a result, languages like Java, Scala~\cite{Scala}, Objective-C~\cite{obj-c-book}, \CCeleven~\cite{C11}, and C\#~\cite{Csharp} adopt the 1:1 kernel-threading model, with a variety of presentation mechanisms.
     286From 2000 onwards, languages like Go~\cite{Go}, Erlang~\cite{Erlang}, Haskell~\cite{Haskell}, D~\cite{D}, and \uC~\cite{uC++,uC++book} have championed the M:N user-threading model, and many user-threading libraries have appeared~\cite{Qthreads,MPC,BoostThreads}, including putting green threads back into Java~\cite{Quasar}.
      287The main argument for user-level threading is that user threads are lighter weight than kernel threads (locking and context switching do not cross the kernel boundary), so there is less restriction on programming styles that encourage large numbers of threads performing smaller work-units to facilitate load balancing by the runtime~\cite{Verch12}.
     288As well, user-threading facilitates a simpler concurrency approach using thread objects that leverage sequential patterns versus events with call-backs~\cite{vonBehren03}.
     289Finally, performant user-threading implementations (both time and space) are largely competitive with direct kernel-threading implementations, while achieving the programming advantages of high concurrency levels and safety.
     290
     291A further effort over the past two decades is the development of language memory-models to deal with the conflict between language features and compiler/hardware optimizations, i.e., some language features are unsafe in the presence of aggressive sequential optimizations~\cite{Buhr95a,Boehm05}.
     292The consequence is that a language must provide sufficient tools to program around safety issues, as inline and library code is all sequential to the compiler.
     293One solution is low-level qualifiers and functions (e.g., @volatile@ and atomics) allowing \emph{programmers} to explicitly write safe (race-free~\cite{Boehm12}) programs.
     294A safer solution is high-level language constructs so the \emph{compiler} knows the optimization boundaries, and hence, provides implicit safety.
      295This problem is best known with respect to concurrency, but applies to other complex control-flow, like exceptions\footnote{
     296\CFA exception handling will be presented in a separate paper.
     297The key feature that dovetails with this paper is non-local exceptions allowing exceptions to be raised across stacks, with synchronous exceptions raised among coroutines and asynchronous exceptions raised among threads, similar to that in \uC~\cite[\S~5]{uC++}
     298} and coroutines.
      299Finally, solutions in the language allow matching constructs with the language paradigm, i.e., imperative and functional languages have different presentations of the same concept.
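For example (a sketch, not \CFA code), the low-level \Celeven approach marks the shared access explicitly with an atomic type, so the programmer rather than the compiler establishes the optimization boundary:
\begin{cfa}
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

atomic_int hits;						// explicitly shared; accesses are race free
void * worker( void * arg ) {
	for ( int i = 0; i < 1000; i += 1 )
		atomic_fetch_add( &hits, 1 );	// safe increment without a lock
	return NULL;
}
int main() {
	pthread_t t1, t2;
	pthread_create( &t1, NULL, worker, NULL );
	pthread_create( &t2, NULL, worker, NULL );
	pthread_join( t1, NULL );  pthread_join( t2, NULL );
	printf( "%d\n", atomic_load( &hits ) );	// always 2000
}
\end{cfa}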
     300
     301Finally, it is important for a language to provide safety over performance \emph{as the default}, allowing careful reduction of safety for performance when necessary.
      302Two concurrency violations of this philosophy are \emph{spurious wakeup} and \emph{barging}, i.e., random wakeup~\cite[\S~8]{Buhr05a} and signalling-as-hints~\cite[\S~8]{Buhr05a}, where one begets the other.
      303If you believe spurious wakeup is a foundational concurrency property, then unblocking (signalling) a thread is always a hint.
      304If you \emph{do not} believe spurious wakeup is foundational, then signalling-as-hints is a performance decision.
      305Most importantly, removing spurious wakeup and signalling-as-hints makes concurrent programming significantly safer because it removes local non-determinism.
      306Clawing back performance, where local non-determinism is unimportant, should be an option, not the default.
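The cost of signalling-as-hints is visible in the standard pthreads idiom (a sketch with illustrative names): every wait must re-check its predicate in a loop because a wakeup guarantees nothing.
\begin{cfa}
#include <pthread.h>
#include <stdbool.h>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
bool ready = false;

void wait_ready( void ) {
	pthread_mutex_lock( &m );
	while ( ! ready )					// loop: a wakeup is only a hint
		pthread_cond_wait( &cond, &m );	// spurious wakeup or barging possible
	pthread_mutex_unlock( &m );
}
void make_ready( void ) {
	pthread_mutex_lock( &m );
	ready = true;
	pthread_cond_signal( &cond );		// hint: waiter must re-test ready
	pthread_mutex_unlock( &m );
}
\end{cfa}
Without spurious wakeup and barging, the \lstinline@while@ can safely become an \lstinline@if@, because the awakened thread knows the condition it waited for still holds.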
     307
     308\begin{comment}
     309For example, it is possible to provide exceptions, coroutines, monitors, and tasks as specialized types in an object-oriented language, integrating these constructs to allow leveraging the type-system (static type-checking) and all other object-oriented capabilities~\cite{uC++}.
     310It is also possible to leverage call/return for blocking communication via new control structures, versus switching to alternative communication paradigms, like channels or message passing.
     311As well, user threading is often a complementary feature, allowing light-weight threading to match with low-cost objects, while hiding the application/kernel boundary.
      312User threading also allows layering of implicit concurrency models (no explicit thread creation), such as executors, data-flow, and actors, into a single language, so programmers can choose the model that best fits an algorithm.\footnote{
      313All implicit concurrency models have explicit threading in their implementation, and hence, can be built from explicit threading;
     314however, the reverse is seldom true, i.e., given implicit concurrency, e.g., actors, it is virtually impossible to create explicit concurrency, e.g., blocking thread objects.}
     315Finally, with extended language features and user-level threading it is possible to discretely fold locking and non-blocking I/O multiplexing into the language's I/O libraries, so threading implicitly dovetails with the I/O subsystem.
     316\CFA embraces language extensions and user-level threading to provide advanced control-flow (exception handling\footnote{
     317\CFA exception handling will be presented in a separate paper.
     318The key feature that dovetails with this paper is non-local exceptions allowing exceptions to be raised across stacks, with synchronous exceptions raised among coroutines and asynchronous exceptions raised among threads, similar to that in \uC~\cite[\S~5]{uC++}
     319} and coroutines) and concurrency.
     320
      321Most augmented traditional (Fortran 18~\cite{Fortran18}, Cobol 14~\cite{Cobol14}, Ada 12~\cite{Ada12}, Java 11~\cite{Java11}) and new languages (Go~\cite{Go}, Rust~\cite{Rust}, and D~\cite{D}), except \CC, diverge from C with different syntax and semantics, only interoperate indirectly with C, and, for those with managed memory, are not systems languages.
     322As a result, there is a significant learning curve to move to these languages, and C legacy-code must be rewritten.
      323While \CC, like \CFA, takes an evolutionary approach to extend C, \CC's constantly growing complex and interdependent feature-set (e.g., objects, inheritance, templates, etc.) means idiomatic \CC code is difficult to use from C, and C programmers must expend significant effort learning \CC.
     324Hence, rewriting and retraining costs for these languages, even \CC, are prohibitive for companies with a large C software-base.
     325\CFA with its orthogonal feature-set, its high-performance runtime, and direct access to all existing C libraries circumvents these problems.
     326\end{comment}
     327
     328\CFA embraces user-level threading, language extensions for advanced control-flow, and safety as the default.
     329We present comparative examples so the reader can judge if the \CFA control-flow extensions are better and safer than those in or proposed for \Celeven, \CC and other concurrent, imperative programming languages, and perform experiments to show the \CFA runtime is competitive with other similar mechanisms.
     330The main contributions of this work are:
     331\begin{itemize}
     332\item
     333expressive language-level coroutines and user-level threading, which respect the expectations of C programmers.
     334\item
     335monitor synchronization without barging.
     336\item
     337safely acquiring multiple monitors \emph{simultaneously} (deadlock free), while seamlessly integrating this capability with all monitor synchronization mechanisms.
     338\item
     339providing statically type-safe interfaces that integrate with the \CFA polymorphic type-system and other language features.
     340\item
     341library extensions for executors, futures, and actors built on the basic mechanisms.
     342\item
     343a runtime system with no spurious wakeup.
     344\item
     345experimental results showing comparable performance of the new features with similar mechanisms in other concurrent programming-languages.
     346\end{itemize}
     347
     348\begin{comment}
    264349This paper provides a minimal concurrency \newterm{Application Program Interface} (API) that is simple, efficient and can be used to build other concurrency features.
    265350While the simplest concurrency system is a thread and a lock, this low-level approach is hard to master.
     
    281366The proposed concurrency API is implemented in a dialect of C, called \CFA (pronounced C-for-all).
    282367The paper discusses how the language features are added to the \CFA translator with respect to parsing, semantics, and type checking, and the corresponding high-performance runtime-library to implement the concurrent features.
    283 
    284 
     368\end{comment}
     369
     370
     371\begin{comment}
    285372\section{\CFA Overview}
    286373
     
    551638\end{cfa}
    552639where the return type supplies the type/size of the allocation, which is impossible in most type systems.
    553 
    554 
    555 \section{Concurrency}
    556 \label{s:Concurrency}
    557 
    558 At its core, concurrency is based on multiple call-stacks and scheduling threads executing on these stacks.
    559 Multiple call stacks (or contexts) and a single thread of execution, called \newterm{coroutining}~\cite{Conway63,Marlin80}, does \emph{not} imply concurrency~\cite[\S~2]{Buhr05a}.
    560 In coroutining, the single thread is self-scheduling across the stacks, so execution is deterministic, \ie the execution path from input to output is fixed and predictable.
    561 A \newterm{stackless} coroutine executes on the caller's stack~\cite{Python} but this approach is restrictive, \eg preventing modularization and supporting only iterator/generator-style programming;
    562 a \newterm{stackful} coroutine executes on its own stack, allowing full generality.
    563 Only stackful coroutines are a stepping stone to concurrency.
    564 
    565 The transition to concurrency, even for execution with a single thread and multiple stacks, occurs when coroutines also context switch to a \newterm{scheduling oracle}, introducing non-determinism from the coroutine perspective~\cite[\S~3]{Buhr05a}.
    566 Therefore, a minimal concurrency system is possible using coroutines (see Section \ref{coroutine}) in conjunction with a scheduler to decide where to context switch next.
    567 The resulting execution system now follows a cooperative threading-model, called \newterm{non-preemptive scheduling}.
    568 
    569 Because the scheduler is special, it can either be a stackless or stackful coroutine.
    570 For stackless, the scheduler performs scheduling on the stack of the current coroutine and switches directly to the next coroutine, so there is one context switch.
    571 For stackful, the current coroutine switches to the scheduler, which performs scheduling, and it then switches to the next coroutine, so there are two context switches.
    572 A stackful scheduler is often used for simplicity and security.
    573 
    574 Regardless of the approach used, a subset of concurrency related challenges start to appear.
    575 For the complete set of concurrency challenges to occur, the missing feature is \newterm{preemption}, where context switching occurs randomly between any two instructions, often based on a timer interrupt, called \newterm{preemptive scheduling}.
    576 While a scheduler introduces uncertainty in the order of execution, preemption introduces uncertainty about where context switches occur.
    577 Interestingly, uncertainty is necessary for the runtime (operating) system to give the illusion of parallelism on a single processor and increase performance on multiple processors.
    578 The reason is that only the runtime has complete knowledge about resources and how to best utilized them.
    579 However, the introduction of unrestricted non-determinism results in the need for \newterm{mutual exclusion} and \newterm{synchronization} to restrict non-determinism for correctness;
    580 otherwise, it is impossible to write meaningful programs.
    581 Optimal performance in concurrent applications is often obtained by having as much non-determinism as correctness allows.
    582 
    583 An important missing feature in C is threading\footnote{While the C11 standard defines a \protect\lstinline@threads.h@ header, it is minimal and defined as optional.
    584 As such, library support for threading is far from widespread.
    585 At the time of writing the paper, neither \protect\lstinline@gcc@ nor \protect\lstinline@clang@ support \protect\lstinline@threads.h@ in their standard libraries.}.
    586 In modern programming languages, a lack of threading is unacceptable~\cite{Sutter05, Sutter05b}, and therefore existing and new programming languages must have tools for writing efficient concurrent programs to take advantage of parallelism.
    587 As an extension of C, \CFA needs to express these concepts in a way that is as natural as possible to programmers familiar with imperative languages.
    588 Furthermore, because C is a system-level language, programmers expect to choose precisely which features they need and which cost they are willing to pay.
    589 Hence, concurrent programs should be written using high-level mechanisms, and only step down to lower-level mechanisms when performance bottlenecks are encountered.
    590 
    591 
    592 \subsection{Coroutines: A Stepping Stone}\label{coroutine}
    593 
    594 While the focus of this discussion is concurrency and parallelism, it is important to address coroutines, which are a significant building block of a concurrency system (but not concurrent among themselves).
     640\end{comment}
     641
     642
     643\section{Coroutines: Stepping Stone}
     644\label{coroutine}
     645
    595646Coroutines are generalized routines allowing execution to be temporarily suspended and later resumed.
    596647Hence, unlike a normal routine, a coroutine may not terminate when it returns to its caller, allowing it to be restarted with the values and execution location present at the point of suspension.
     
    616667\centering
    617668\newbox\myboxA
     669% \begin{lrbox}{\myboxA}
     670% \begin{cfa}[aboveskip=0pt,belowskip=0pt]
     671% `int fn1, fn2, state = 1;`   // single global variables
     672% int fib() {
     673%       int fn;
     674%       `switch ( state )` {  // explicit execution state
     675%         case 1: fn = 0;  fn1 = fn;  state = 2;  break;
     676%         case 2: fn = 1;  fn2 = fn1;  fn1 = fn;  state = 3;  break;
     677%         case 3: fn = fn1 + fn2;  fn2 = fn1;  fn1 = fn;  break;
     678%       }
     679%       return fn;
     680% }
     681% int main() {
     682%
     683%       for ( int i = 0; i < 10; i += 1 ) {
     684%               printf( "%d\n", fib() );
     685%       }
     686% }
     687% \end{cfa}
     688% \end{lrbox}
    618689\begin{lrbox}{\myboxA}
    619690\begin{cfa}[aboveskip=0pt,belowskip=0pt]
    620 `int f1, f2, state = 1;`   // single global variables
    621 int fib() {
    622         int fn;
    623         `switch ( state )` {  // explicit execution state
    624           case 1: fn = 0;  f1 = fn;  state = 2;  break;
    625           case 2: fn = 1;  f2 = f1;  f1 = fn;  state = 3;  break;
    626           case 3: fn = f1 + f2;  f2 = f1;  f1 = fn;  break;
    627         }
    628         return fn;
    629 }
     691#define FIB_INIT { 0, 1 }
     692typedef struct { int fn1, fn; } Fib;
     693int fib( Fib * f ) {
     694
     695        int ret = f->fn1;
     696        f->fn1 = f->fn;
     697        f->fn = ret + f->fn;
     698        return ret;
     699}
     700
     701
     702
    630703int main() {
    631 
     704        Fib f1 = FIB_INIT, f2 = FIB_INIT;
    632705        for ( int i = 0; i < 10; i += 1 ) {
    633                 printf( "%d\n", fib() );
     706                printf( "%d %d\n",
     707                                fib( &f1 ), fib( &f2 ) );
    634708        }
    635709}
     
    640714\begin{lrbox}{\myboxB}
    641715\begin{cfa}[aboveskip=0pt,belowskip=0pt]
    642 #define FIB_INIT `{ 0, 1 }`
    643 typedef struct { int f2, f1; } Fib;
    644 int fib( Fib * f ) {
    645 
    646         int ret = f->f2;
    647         int fn = f->f1 + f->f2;
    648         f->f2 = f->f1; f->f1 = fn;
    649 
    650         return ret;
    651 }
    652 int main() {
    653         Fib f1 = FIB_INIT, f2 = FIB_INIT;
    654         for ( int i = 0; i < 10; i += 1 ) {
    655                 printf( "%d %d\n", fib( &f1 ), fib( &f2 ) );
     716`coroutine` Fib { int fn1; };
     717void main( Fib & fib ) with( fib ) {
     718        int fn;
     719        [fn1, fn] = [0, 1];
     720        for () {
     721                `suspend();`
     722                [fn1, fn] = [fn, fn1 + fn];
    656723        }
    657724}
    658 \end{cfa}
    659 \end{lrbox}
    660 
    661 \subfloat[3 States: global variables]{\label{f:GlobalVariables}\usebox\myboxA}
    662 \qquad
    663 \subfloat[1 State: external variables]{\label{f:ExternalState}\usebox\myboxB}
    664 \caption{C Fibonacci Implementations}
    665 \label{f:C-fibonacci}
    666 
    667 \bigskip
    668 
    669 \newbox\myboxA
    670 \begin{lrbox}{\myboxA}
    671 \begin{cfa}[aboveskip=0pt,belowskip=0pt]
    672 `coroutine` Fib { int fn; };
    673 void main( Fib & fib ) with( fib ) {
    674         int f1, f2;
    675         fn = 0;  f1 = fn;  `suspend()`;
    676         fn = 1;  f2 = f1;  f1 = fn;  `suspend()`;
    677         for ( ;; ) {
    678                 fn = f1 + f2;  f2 = f1;  f1 = fn;  `suspend()`;
    679         }
    680 }
    681 int next( Fib & fib ) with( fib ) {
    682         `resume( fib );`
    683         return fn;
     725int ?()( Fib & fib ) with( fib ) {
     726        `resume( fib );`  return fn1;
    684727}
    685728int main() {
    686729        Fib f1, f2;
    687         for ( int i = 1; i <= 10; i += 1 ) {
    688                 sout | next( f1 ) | next( f2 );
    689         }
    690 }
      730        for ( 10 )
     731                sout | f1() | f2();
     732}
     733
     734
    691735\end{cfa}
    692736\end{lrbox}
    693 \newbox\myboxB
    694 \begin{lrbox}{\myboxB}
    695 \begin{cfa}[aboveskip=0pt,belowskip=0pt]
    696 `coroutine` Fib { int ret; };
    697 void main( Fib & f ) with( fib ) {
    698         int fn, f1 = 1, f2 = 0;
    699         for ( ;; ) {
    700                 ret = f2;
    701 
    702                 fn = f1 + f2;  f2 = f1;  f1 = fn; `suspend();`
    703         }
    704 }
    705 int next( Fib & fib ) with( fib ) {
    706         `resume( fib );`
    707         return ret;
    708 }
    709 
    710 
    711 
    712 
    713 
    714 
    715 \end{cfa}
     737
     738\newbox\myboxC
     739\begin{lrbox}{\myboxC}
     740\begin{python}[aboveskip=0pt,belowskip=0pt]
     741
     742def Fib():
     743
     744    fn1, fn = 0, 1
     745    while True:
     746        `yield fn1`
     747        fn1, fn = fn, fn1 + fn
     748
     749
      750# next prewritten
     751
     752
     753f1 = Fib()
     754f2 = Fib()
     755for i in range( 10 ):
     756        print( next( f1 ), next( f2 ) )
     757
     758
     759
     760\end{python}
    716761\end{lrbox}
    717 \subfloat[3 States, internal variables]{\label{f:Coroutine3States}\usebox\myboxA}
    718 \qquad\qquad
    719 \subfloat[1 State, internal variables]{\label{f:Coroutine1State}\usebox\myboxB}
    720 \caption{\CFA Coroutine Fibonacci Implementations}
    721 \label{f:cfa-fibonacci}
     762
     763\subfloat[C]{\label{f:GlobalVariables}\usebox\myboxA}
     764\hspace{3pt}
     765\vrule
     766\hspace{3pt}
     767\subfloat[\CFA]{\label{f:ExternalState}\usebox\myboxB}
     768\hspace{3pt}
     769\vrule
     770\hspace{3pt}
     771\subfloat[Python]{\label{f:ExternalState}\usebox\myboxC}
     772\caption{Fibonacci Generator}
     773\label{f:C-fibonacci}
     774
     775% \bigskip
     776%
     777% \newbox\myboxA
     778% \begin{lrbox}{\myboxA}
     779% \begin{cfa}[aboveskip=0pt,belowskip=0pt]
     780% `coroutine` Fib { int fn; };
     781% void main( Fib & fib ) with( fib ) {
     782%       fn = 0;  int fn1 = fn; `suspend()`;
     783%       fn = 1;  int fn2 = fn1;  fn1 = fn; `suspend()`;
     784%       for () {
     785%               fn = fn1 + fn2; fn2 = fn1; fn1 = fn; `suspend()`; }
     786% }
     787% int next( Fib & fib ) with( fib ) { `resume( fib );` return fn; }
     788% int main() {
     789%       Fib f1, f2;
     790%       for ( 10 )
     791%               sout | next( f1 ) | next( f2 );
     792% }
     793% \end{cfa}
     794% \end{lrbox}
     795% \newbox\myboxB
     796% \begin{lrbox}{\myboxB}
     797% \begin{python}[aboveskip=0pt,belowskip=0pt]
     798%
     799% def Fibonacci():
     800%       fn = 0; fn1 = fn; `yield fn`  # suspend
     801%       fn = 1; fn2 = fn1; fn1 = fn; `yield fn`
     802%       while True:
     803%               fn = fn1 + fn2; fn2 = fn1; fn1 = fn; `yield fn`
     804%
     805%
     806% f1 = Fibonacci()
     807% f2 = Fibonacci()
     808% for i in range( 10 ):
     809%       print( `next( f1 )`, `next( f2 )` ) # resume
     810%
     811% \end{python}
     812% \end{lrbox}
     813% \subfloat[\CFA]{\label{f:Coroutine3States}\usebox\myboxA}
     814% \qquad
     815% \subfloat[Python]{\label{f:Coroutine1State}\usebox\myboxB}
     816% \caption{Fibonacci input coroutine, 3 states, internal variables}
     817% \label{f:cfa-fibonacci}
    722818\end{figure}
    723819
     
    759855\begin{lrbox}{\myboxA}
    760856\begin{cfa}[aboveskip=0pt,belowskip=0pt]
    761 `coroutine` Format {
    762         char ch;   // used for communication
    763         int g, b;  // global because used in destructor
     857`coroutine` Fmt {
     858        char ch;   // communication variables
     859        int g, b;   // needed in destructor
    764860};
    765 void main( Format & fmt ) with( fmt ) {
    766         for ( ;; ) {
    767                 for ( g = 0; g < 5; g += 1 ) {      // group
    768                         for ( b = 0; b < 4; b += 1 ) { // block
     861void main( Fmt & fmt ) with( fmt ) {
     862        for () {
     863                for ( g = 0; g < 5; g += 1 ) { // groups
     864                        for ( b = 0; b < 4; b += 1 ) { // blocks
    769865                                `suspend();`
    770                                 sout | ch;              // separator
    771                         }
    772                         sout | "  ";               // separator
    773                 }
    774                 sout | nl;
    775         }
    776 }
    777 void ?{}( Format & fmt ) { `resume( fmt );` }
    778 void ^?{}( Format & fmt ) with( fmt ) {
    779         if ( g != 0 || b != 0 ) sout | nl;
    780 }
    781 void format( Format & fmt ) {
    782         `resume( fmt );`
    783 }
     866                                sout | ch; } // print character
     867                        sout | "  "; } // block separator
     868                sout | nl; }  // group separator
     869}
     870void ?{}( Fmt & fmt ) { `resume( fmt );` } // prime
     871void ^?{}( Fmt & fmt ) with( fmt ) { // destructor
     872        if ( g != 0 || b != 0 ) // special case
     873                sout | nl; }
     874void send( Fmt & fmt, char c ) { fmt.ch = c; `resume( fmt )`; }
    784875int main() {
    785         Format fmt;
    786         eof: for ( ;; ) {
    787                 sin | fmt.ch;
    788           if ( eof( sin ) ) break eof;
    789                 format( fmt );
    790         }
     876        Fmt fmt;
     877        sout | nlOff;   // turn off auto newline
     878        for ( 41 )
     879                send( fmt, 'a' );
    791880}
    792881\end{cfa}
     
    795884\newbox\myboxB
    796885\begin{lrbox}{\myboxB}
    797 \begin{cfa}[aboveskip=0pt,belowskip=0pt]
    798 struct Format {
    799         char ch;
    800         int g, b;
    801 };
    802 void format( struct Format * fmt ) {
    803         if ( fmt->ch != -1 ) {      // not EOF ?
    804                 printf( "%c", fmt->ch );
    805                 fmt->b += 1;
    806                 if ( fmt->b == 4 ) {  // block
    807                         printf( "  " );      // separator
    808                         fmt->b = 0;
    809                         fmt->g += 1;
    810                 }
    811                 if ( fmt->g == 5 ) {  // group
    812                         printf( "\n" );     // separator
    813                         fmt->g = 0;
    814                 }
    815         } else {
    816                 if ( fmt->g != 0 || fmt->b != 0 ) printf( "\n" );
    817         }
    818 }
    819 int main() {
    820         struct Format fmt = { 0, 0, 0 };
    821         for ( ;; ) {
    822                 scanf( "%c", &fmt.ch );
    823           if ( feof( stdin ) ) break;
    824                 format( &fmt );
    825         }
    826         fmt.ch = -1;
    827         format( &fmt );
    828 }
    829 \end{cfa}
     886\begin{python}[aboveskip=0pt,belowskip=0pt]
     887
     888
     889
     890def Fmt():
     891        try:
     892                while True:
     893                        for g in range( 5 ):
     894                                for b in range( 4 ):
     895
     896                                        print( `(yield)`, end='' )
     897                                print( '  ', end='' )
     898                        print()
     899
     900
     901        except GeneratorExit:
      902                if g != 0 or b != 0:
     903                        print()
     904
     905
     906fmt = Fmt()
     907`next( fmt )`                    # prime
     908for i in range( 41 ):
     909        `fmt.send( 'a' );`      # send to yield
     910
     911\end{python}
    830912\end{lrbox}
    831 \subfloat[\CFA Coroutine]{\label{f:CFAFmt}\usebox\myboxA}
     913\subfloat[\CFA]{\label{f:CFAFmt}\usebox\myboxA}
    832914\qquad
    833 \subfloat[C Linearized]{\label{f:CFmt}\usebox\myboxB}
    834 \caption{Formatting text into lines of 5 blocks of 4 characters.}
     915\subfloat[Python]{\label{f:CFmt}\usebox\myboxB}
     916\caption{Output formatting text}
    835917\label{f:fmt-line}
    836918\end{figure}
     
    853935void main( Prod & prod ) with( prod ) {
    854936        // 1st resume starts here
    855         for ( int i = 0; i < N; i += 1 ) {
     937        for ( i; N ) {
    856938                int p1 = random( 100 ), p2 = random( 100 );
    857939                sout | p1 | " " | p2;
     
    869951}
    870952void start( Prod & prod, int N, Cons &c ) {
    871         &prod.c = &c;
     953        &prod.c = &c; // reassignable reference
    872954        prod.[N, receipt] = [N, 0];
    873955        `resume( prod );`
     
    884966        Prod & p;
    885967        int p1, p2, status;
    886         _Bool done;
     968        bool done;
    887969};
    888970void ?{}( Cons & cons, Prod & p ) {
    889         &cons.p = &p;
     971        &cons.p = &p; // reassignable reference
    890972        cons.[status, done ] = [0, false];
    891973}
     
    9451027@start@ returns and the program main terminates.
    9461028
      1029One \emph{killer} application for coroutines is device drivers, which at one time caused 70\%-85\% of failures in Windows/Linux~\cite{Swift05}.
      1030Many device drivers are finite state-machines parsing a protocol, e.g.:
     1031\begin{tabbing}
     1032\ldots STX \= \ldots message \ldots \= ESC \= ETX \= \ldots message \ldots  \= ETX \= 2-byte crc \= \ldots      \kill
     1033\ldots STX \> \ldots message \ldots \> ESC \> ETX \> \ldots message \ldots  \> ETX \> 2-byte crc \> \ldots
     1034\end{tabbing}
     1035where a network message begins with the control character STX and ends with an ETX, followed by a 2-byte cyclic-redundancy check.
     1036Control characters may appear in a message if preceded by an ESC.
      1037Because FSMs can be complex and occur frequently in important domains, direct support of the coroutine is crucial in a systems programming language.
     1038
     1039\begin{figure}
     1040\begin{cfa}
     1041enum Status { CONT, MSG, ESTX, ELNTH, ECRC };
     1042`coroutine` Driver {
     1043        Status status;
     1044        char * msg, byte;
     1045};
     1046void ?{}( Driver & d, char * m ) { d.msg = m; }         $\C[3.0in]{// constructor}$
     1047Status next( Driver & d, char b ) with( d ) {           $\C{// 'with' opens scope}$
     1048        byte = b; `resume( d );` return status;
     1049}
     1050void main( Driver & d ) with( d ) {
     1051        enum { STX = '\002', ESC = '\033', ETX = '\003', MaxMsg = 64 };
     1052        unsigned short int crc;                                                 $\C{// error checking}$
     1053  msg: for () {                                                                         $\C{// parse message}$
     1054                status = CONT;
     1055                unsigned int lnth = 0, sum = 0;
     1056                while ( byte != STX ) `suspend();`
     1057          emsg: for () {
     1058                        `suspend();`                                                    $\C{// process byte}$
     1059                        choose ( byte ) {                                               $\C{// switch with default break}$
     1060                          case STX:
     1061                                status = ESTX; `suspend();` continue msg;
     1062                          case ETX:
     1063                                break emsg;
     1064                          case ESC:
     1065                                suspend();
     1066                        } // choose
     1067                        if ( lnth >= MaxMsg ) {                                 $\C{// buffer full ?}$
     1068                                status = ELNTH; `suspend();` continue msg; }
     1069                        msg[lnth++] = byte;
     1070                        sum += byte;
     1071                } // for
     1072                msg[lnth] = '\0';                                                       $\C{// terminate string}\CRT$
     1073                `suspend();`
     1074                crc = (unsigned char)byte << 8; // prevent sign extension for signed char
     1075                `suspend();`
     1076                status = (crc | (unsigned char)byte) == sum ? MSG : ECRC;
     1077                `suspend();`
     1078        } // for
     1079}
     1080\end{cfa}
     1081\caption{Device driver for simple communication protocol}
     1082\end{figure}
     1083
    9471084
    9481085\subsection{Coroutine Implementation}
     
    10601197\end{cquote}
    10611198The combination of these two approaches allows an easy and concise specification to coroutining (and concurrency) for normal users, while more advanced users have tighter control on memory layout and initialization.
     1199
     1200
     1201\section{Concurrency}
     1202\label{s:Concurrency}
     1203
     1204At its core, concurrency is based on multiple call-stacks and scheduling threads executing on these stacks.
     1205Multiple call stacks (or contexts) and a single thread of execution, called \newterm{coroutining}~\cite{Conway63,Marlin80}, does \emph{not} imply concurrency~\cite[\S~2]{Buhr05a}.
     1206In coroutining, the single thread is self-scheduling across the stacks, so execution is deterministic, \ie the execution path from input to output is fixed and predictable.
     1207A \newterm{stackless} coroutine executes on the caller's stack~\cite{Python} but this approach is restrictive, \eg preventing modularization and supporting only iterator/generator-style programming;
     1208a \newterm{stackful} coroutine executes on its own stack, allowing full generality.
     1209Only stackful coroutines are a stepping stone to concurrency.
     1210
     1211The transition to concurrency, even for execution with a single thread and multiple stacks, occurs when coroutines also context switch to a \newterm{scheduling oracle}, introducing non-determinism from the coroutine perspective~\cite[\S~3]{Buhr05a}.
     1212Therefore, a minimal concurrency system is possible using coroutines (see Section \ref{coroutine}) in conjunction with a scheduler to decide where to context switch next.
     1213The resulting execution system now follows a cooperative threading-model, called \newterm{non-preemptive scheduling}.
     1214
     1215Because the scheduler is special, it can either be a stackless or stackful coroutine.
     1216For stackless, the scheduler performs scheduling on the stack of the current coroutine and switches directly to the next coroutine, so there is one context switch.
     1217For stackful, the current coroutine switches to the scheduler, which performs scheduling, and it then switches to the next coroutine, so there are two context switches.
     1218A stackful scheduler is often used for simplicity and security.
     1219
     1220Regardless of the approach used, a subset of concurrency related challenges start to appear.
     1221For the complete set of concurrency challenges to occur, the missing feature is \newterm{preemption}, where context switching occurs randomly between any two instructions, often based on a timer interrupt, called \newterm{preemptive scheduling}.
     1222While a scheduler introduces uncertainty in the order of execution, preemption introduces uncertainty about where context switches occur.
     1223Interestingly, uncertainty is necessary for the runtime (operating) system to give the illusion of parallelism on a single processor and increase performance on multiple processors.
      1224The reason is that only the runtime has complete knowledge about resources and how to best utilize them.
     1225However, the introduction of unrestricted non-determinism results in the need for \newterm{mutual exclusion} and \newterm{synchronization} to restrict non-determinism for correctness;
     1226otherwise, it is impossible to write meaningful programs.
     1227Optimal performance in concurrent applications is often obtained by having as much non-determinism as correctness allows.
     1228
     1229An important missing feature in C is threading\footnote{While the C11 standard defines a \protect\lstinline@threads.h@ header, it is minimal and defined as optional.
     1230As such, library support for threading is far from widespread.
     1231At the time of writing the paper, neither \protect\lstinline@gcc@ nor \protect\lstinline@clang@ support \protect\lstinline@threads.h@ in their standard libraries.}.
     1232In modern programming languages, a lack of threading is unacceptable~\cite{Sutter05, Sutter05b}, and therefore existing and new programming languages must have tools for writing efficient concurrent programs to take advantage of parallelism.
     1233As an extension of C, \CFA needs to express these concepts in a way that is as natural as possible to programmers familiar with imperative languages.
     1234Furthermore, because C is a system-level language, programmers expect to choose precisely which features they need and which cost they are willing to pay.
     1235Hence, concurrent programs should be written using high-level mechanisms, and only step down to lower-level mechanisms when performance bottlenecks are encountered.
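For completeness, the optional \Celeven interface amounts to the following (a sketch using only standard \lstinline@threads.h@ names; as noted above, common compiler libraries may not ship this header):
\begin{cfa}
#include <threads.h>					// optional C11 header, often unavailable
#include <stdio.h>

int worker( void * arg ) {				// C11 thread bodies return int
	printf( "hello from %s\n", (char *)arg );
	return 0;
}
int main() {
	thrd_t t;
	if ( thrd_create( &t, worker, "worker" ) != thrd_success ) return 1;
	int res;
	thrd_join( t, &res );				// join and collect the return code
}
\end{cfa}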
    10621236
    10631237