# Changeset 0161ddf for doc/papers/concurrency/Paper.tex

Timestamp:
Jun 4, 2019, 6:31:30 PM (2 years ago)
Branches:
arm-eh, jacob/cs343-translation, jenkins-sandbox, master, new-ast, new-ast-unique-expr
Children:
3c573e9
Parents:
7261e3c
Message:

rewrite introduction, add generator section, update coroutine section

File:
1 edited

auto, _Bool, catch, catchResume, choose, _Complex, __complex, __complex__, __const, __const__, coroutine, disable, dtype, enable, exception, __extension__, fallthrough, fallthru, finally, __float80, float80, __float128, float128, forall, ftype, generator, _Generic, _Imaginary, __imag, __imag__, inline, __inline, __inline__, __int128, int128, __label__, monitor, mutex, _Noreturn, one_t, or, otype, restrict, __restrict, __restrict__, __signed, __signed__, _Static_assert, suspend, thread, _Thread_local, throw, throwResume, timeout, trait, try, ttype, typeof, __typeof, __typeof__, virtual, __volatile, __volatile__, waitfor, when, with, zero_t},
}

\newbox\myboxA
\newbox\myboxB
\newbox\myboxC
\newbox\myboxD

\title{\texorpdfstring{Advanced Control-flow and Concurrency in \protect\CFA}{Advanced Control-flow in Cforall}}

This paper discusses the design philosophy and implementation of the \CFA advanced control-flow and concurrent/parallel features, along with the supporting runtime.
These features are created from scratch as ISO C has only low-level and/or unimplemented concurrency, so C programmers continue to rely on library features like pthreads.
\CFA introduces modern language-level control-flow mechanisms, like generators, coroutines, user-level threading, and monitors for mutual exclusion and synchronization.
Library extensions for executors, futures, and actors are built on these basic mechanisms.
The runtime provides significant programmer simplification and safety by eliminating spurious wakeup and optional monitor barging.
The runtime also ensures multiple monitors can be safely acquired \emph{simultaneously} (deadlock free), and this feature is fully integrated with all monitor synchronization mechanisms.
All control-flow features integrate with the \CFA polymorphic type-system and exception handling, while respecting the expectations and style of C programmers.
Experimental results show comparable performance of the new features with similar mechanisms in other concurrent programming-languages.
}%

\keywords{generator, coroutine, concurrency, parallelism, thread, monitor, runtime, C, \CFA (Cforall)}

As a result, languages like Java, Scala~\cite{Scala}, Objective-C~\cite{obj-c-book}, \CCeleven~\cite{C11}, and C\#~\cite{Csharp} adopt the 1:1 kernel-threading model, with a variety of presentation mechanisms.
From 2000 onwards, languages like Go~\cite{Go}, Erlang~\cite{Erlang}, Haskell~\cite{Haskell}, D~\cite{D}, and \uC~\cite{uC++,uC++book} have championed the M:N user-threading model, and many user-threading libraries have appeared~\cite{Qthreads,MPC,BoostThreads}, including putting green threads back into Java~\cite{Quasar}.
The main argument for user-level threading is that user threads are lighter weight than kernel threads (locking and context switching do not cross the kernel boundary), so there is less restriction on programming styles that encourage large numbers of threads performing medium work-units to facilitate load balancing by the runtime~\cite{Verch12}.
As well, user-threading facilitates a simpler concurrency approach using thread objects that leverage sequential patterns versus events with call-backs~\cite{vonBehren03}.
Finally, performant user-threading implementations (both time and space) are largely competitive with direct kernel-threading implementations, while achieving the programming advantages of high concurrency levels and safety.

It is also important for a language to provide safety over performance \emph{as the default}, allowing careful reduction of safety for performance when necessary.
Two concurrency violations of this philosophy are \emph{spurious wakeup} and \emph{barging}, i.e., random wakeup~\cite[\S~8]{Buhr05a} and signals-as-hints~\cite[\S~8]{Buhr05a}, where one is a consequence of the other, i.e., once there is spurious wakeup, signals-as-hints follows.
However, spurious wakeup is \emph{not} a foundational concurrency property~\cite[\S~8]{Buhr05a}; it is a performance design choice.
Similarly, signals-as-hints is also a performance decision.
We argue removing spurious wakeup and signals-as-hints makes concurrent programming significantly safer because it removes local non-determinism and matches programmer expectation.
(Experience by the authors teaching concurrency is that students are highly confused by these semantics.)
Clawing back performance, when local non-determinism is unimportant, should be an option not the default.

\begin{comment}
\CFA embraces user-level threading, language extensions for advanced control-flow, and safety as the default.
We present comparative examples so the reader can judge if the \CFA control-flow extensions are better and safer than those in or proposed for \Celeven, \CC, and other concurrent, imperative programming-languages, and perform experiments to show the \CFA runtime is competitive with other similar mechanisms.
The main contributions of this work are:
\begin{itemize}
\item
language-level generators, coroutines and user-level threading, which respect the expectations of C programmers.
\item
monitor synchronization without barging, and the ability to safely acquire multiple monitors \emph{simultaneously} (deadlock free), while seamlessly integrating this capability with all monitor synchronization mechanisms.
\item
providing statically type-safe interfaces that integrate with the \CFA polymorphic type-system and other language features.
\item
a runtime system with no spurious wakeup.
\item
a dynamic partitioning mechanism to segregate the execution environment for specialized requirements.
\item
a non-blocking I/O library.
\item
experimental results showing comparable performance of the new features with similar mechanisms in other programming languages.
\end{itemize}

\section{Stateful Function}

The generator/coroutine provides a stateful function, which is an old idea~\cite{Conway63,Marlin80} that is new again~\cite{C++20Coroutine19}.
A stateful function allows execution to be temporarily suspended and later resumed, e.g., plugin, device driver, finite-state machine.
Hence, a stateful function may not end when it returns to its caller, allowing it to be restarted with the data and execution location present at the point of suspension.
This capability is accomplished by retaining a data/execution \emph{closure} between invocations.
If the closure is fixed size, we call it a \emph{generator} (or \emph{stackless}), and its control flow is restricted, e.g., suspending outside the generator is prohibited.
If the closure is variable size, we call it a \emph{coroutine} (or \emph{stackful}), and as the name implies, it is often implemented with a separate stack with no programming restrictions.
Hence, refactoring a stackless coroutine may require changing it to stackful.
A foundational property of all \emph{stateful functions} is that resume/suspend \emph{do not} cause incremental stack growth, i.e., resume/suspend operations are remembered through the closure not the stack.
As well, activating a stateful function is \emph{asymmetric} or \emph{symmetric}, identified by resume/suspend (no cycles) and resume/resume (cycles).
A fixed closure activated by modified call/return is faster than a variable closure activated by context switching.
Additionally, any storage management for the closure (especially in unmanaged languages, i.e., no garbage collection) must also be factored into design and performance.
Therefore, selecting between stackless and stackful semantics is a tradeoff between programming requirements and performance, where stackless is faster and stackful is more general.
Note, creation cost is amortized across usage, so activation cost is usually the dominant factor.

\begin{figure}
\centering
\begin{lrbox}{\myboxA}
\begin{cfa}[aboveskip=0pt,belowskip=0pt]
typedef struct { int fn1, fn; } Fib;
#define FibCtor { 1, 0 }
int fib( Fib * f ) {
	int fn = f->fn; f->fn = f->fn1;
	f->fn1 = f->fn + fn;
	return fn;
}
int main() {
	Fib f1 = FibCtor, f2 = FibCtor;
	for ( int i = 0; i < 10; i += 1 )
		printf( "%d %d\n", fib( &f1 ), fib( &f2 ) );
}
\end{cfa}
\end{lrbox}

\begin{lrbox}{\myboxB}
\begin{cfa}[aboveskip=0pt,belowskip=0pt]
generator Fib { int fn1, fn; };
void main(Fib & fib) with(fib) {
	[fn1, fn] = [1, 0];
	for () {
		suspend;
		[fn1, fn] = [fn, fn + fn1];
	}
}
int main() {
	Fib f1, f2;
	for ( 10 )
		sout | resume( f1 ).fn | resume( f2 ).fn;
}
\end{cfa}
\end{lrbox}

\begin{lrbox}{\myboxC}
\begin{cfa}[aboveskip=0pt,belowskip=0pt]
typedef struct { int fn1, fn;  void * next; } Fib;
#define FibCtor { 1, 0, NULL }
Fib * comain( Fib * f ) {
	if ( f->next ) goto *f->next;
	f->next = &&s1;
	for ( ;; ) {
		return f;
	  s1:; int fn = f->fn + f->fn1;
		f->fn1 = f->fn; f->fn = fn;
	}
}
int main() {
	Fib f1 = FibCtor, f2 = FibCtor;
	for ( int i = 0; i < 10; i += 1 )
		printf( "%d %d\n", comain( &f1 )->fn, comain( &f2 )->fn );
}
\end{cfa}
\end{lrbox}

\subfloat[C asymmetric generator]{\label{f:CFibonacci}\usebox\myboxA}
\hspace{3pt}
\vrule
\hspace{3pt}
\subfloat[\CFA asymmetric generator]{\label{f:CFAFibonacciGen}\usebox\myboxB}
\hspace{3pt}
\vrule
\hspace{3pt}
\subfloat[C generator implementation]{\label{f:CFibonacciSim}\usebox\myboxC}
\caption{Fibonacci (output) Asymmetric Generator}
\label{f:FibonacciAsymmetricGenerator}

\bigskip

\begin{lrbox}{\myboxA}
\begin{cfa}[aboveskip=0pt,belowskip=0pt]
generator Fmt { char ch; int g, b; };
void ?{}( Fmt & fmt ) { resume(fmt); } // constructor
void ^?{}( Fmt & f ) with(f) { $\C[1.75in]{// destructor}$
	if ( g != 0 || b != 0 ) sout | nl; }
void main( Fmt & f ) with(f) {
	for () { $\C{// until destructor call}$
		for ( ; g < 5; g += 1 ) { $\C{// groups}$
			for ( ; b < 4; b += 1 ) { $\C{// blocks}$
				suspend; $\C{// wait for character}$
				while ( ch == '\n' ) suspend; // ignore newline
				sout | ch;
			}
			sout | " ";  // block spacer
		}
		sout | nl; // group newline
	}
}
int main() {
	Fmt fmt; $\C{// fmt constructor called}$
	for () {
		sin | fmt.ch; $\C{// read into generator}$
		if ( eof( sin ) ) break;
		resume( fmt );
	}
} $\C{// fmt destructor called}\CRT$
\end{cfa}
\end{lrbox}

\begin{lrbox}{\myboxB}
\begin{cfa}[aboveskip=0pt,belowskip=0pt]
typedef struct {
	void * next;
	char ch;
	int g, b;
} Fmt;
void comain( Fmt * f ) {
	if ( f->next ) goto *f->next;
	f->next = &&s1;
	for ( ;; ) {
		for ( f->g = 0; f->g < 5; f->g += 1 ) {
			for ( f->b = 0; f->b < 4; f->b += 1 ) {
				return;
			  s1:; while ( f->ch == '\n' ) return;
				printf( "%c", f->ch );
			}
			printf( " " );
		}
		printf( "\n" );
	}
}
int main() {
	Fmt fmt = { NULL };
	comain( &fmt ); // prime
	for ( ;; ) {
		scanf( "%c", &fmt.ch );
		if ( feof( stdin ) ) break;
		comain( &fmt );
	}
	if ( fmt.g != 0 || fmt.b != 0 ) printf( "\n" );
}
\end{cfa}
\end{lrbox}

\subfloat[\CFA asymmetric generator]{\label{f:CFAFormatGen}\usebox\myboxA}
\hspace{3pt}
\vrule
\hspace{3pt}
\subfloat[C generator simulation]{\label{f:CFormatSim}\usebox\myboxB}
\hspace{3pt}
\caption{Formatter (input) Asymmetric Generator}
\label{f:FormatterAsymmetricGenerator}
\end{figure}

\subsection{Generator}

Stackless generators have the potential to be very small and fast, \ie as small and fast as function call/return for both creation and execution.
The \CFA goal is to achieve this performance target, possibly at the cost of some semantic complexity.
This goal is accomplished through a series of different kinds of generators and their implementation.

Figure~\ref{f:FibonacciAsymmetricGenerator} shows an unbounded asymmetric generator for an infinite sequence of Fibonacci numbers written in C and \CFA, with a simple C implementation for the \CFA version.
This kind of generator is an \emph{output generator}, producing a new result on each resumption.
To compute Fibonacci, the previous two values in the sequence are retained to generate the next value, \ie @fn1@ and @fn@, plus the execution location where control restarts when the generator is resumed, \ie top or middle.
An additional requirement is the ability to create an arbitrary number of generators (of any kind), \ie retaining state in global variables is insufficient;
hence, state is retained in a closure between calls.
Figure~\ref{f:CFibonacci} shows the C approach of manually creating the closure in structure @Fib@, and multiple instances of this closure provide multiple Fibonacci generators.
The C version only has the middle execution state because the top execution state becomes declaration initialization.
Figure~\ref{f:CFAFibonacciGen} shows the \CFA approach, which also has a manual closure, but replaces the structure with a special \CFA @generator@ type.
This generator type is then connected to a function named @main@ that takes as its only parameter a reference to the generator type, called a \emph{generator main}.
The generator main contains @suspend@ statements that suspend execution without ending the generator, versus a @return@, which ends it.
For the Fibonacci generator-main,\footnote{
The \CFA \lstinline|with| opens an aggregate scope making its fields directly accessible, like Pascal \lstinline|with|, but using parallel semantics.
Multiple aggregates may be opened.}
the top initialization state appears at the start and the middle execution state is denoted by statement @suspend@.
Any local variables in @main@ \emph{are not retained} between calls;
hence, local variables are only for temporary computations \emph{between} suspends.
All retained state \emph{must} appear in the generator's type.
As well, generator code containing a @suspend@ cannot be refactored into a helper function called by the generator, because @suspend@ is implemented via @return@, so a return from the helper function goes back to the current generator not the resumer.
The generator is started by calling function @resume@ with a generator instance, which begins execution at the top of the generator main, and subsequent @resume@ calls restart the generator at its point of last suspension.
Resuming an ended (returned) generator is undefined.
Function @resume@ returns its argument generator so it can be cascaded in an expression, in this case to print the next Fibonacci value @fn@ computed in the generator instance.
Figure~\ref{f:CFibonacciSim} shows that the C implementation of the \CFA generator needs only one additional field, @next@, to handle retention of execution state.
The computed @goto@ at the start of the generator main, which branches after the previous suspend, adds very little cost to the resume call.
Finally, an explicit generator type provides both design and performance benefits, such as multiple type-safe interface functions taking and returning arbitrary types.
\begin{cfa}
int ?()( Fib & fib ) with( fib ) { return resume( fib ).fn; } // function-call interface
int ?()( Fib & fib, int N ) with( fib ) { for ( N - 1 ) fib(); return fib(); } // use simple interface
double ?()( Fib & fib ) with( fib ) { return (int)fib() / 3.14159; } // cast prevents recursive call
sout | (int)f1() | (double)f1() | f2( 2 ); // simple interface, cast selects call based on return type, step 2 values
\end{cfa}
Now, the generator can be a separately-compiled opaque-type only accessed through its interface functions.
For contrast, Figure~\ref{f:PythonFibonacci} shows the equivalent Python Fibonacci generator, which does not use a generator type, and hence only has a single interface, but an implicit closure.

Having to manually create the generator closure by moving local-state variables into the generator type is an additional programmer burden and possible point of error.
(This restriction is removed by the coroutine, see Section~\ref{s:Coroutine}.)
Our concern is that static analysis to discriminate local state from temporary variables is complex and beyond the current scope of the \CFA project.
As well, supporting variable-size local-state, like variable-length arrays, requires dynamic allocation of the local state, which significantly increases the cost of generator creation/destruction, and is a show-stopper for embedded programming.
But more importantly, the size of the generator type is tied to the local state in the generator main, which precludes separate compilation of the generator main, \ie a generator must be inlined or local state must be dynamically allocated.
Our current experience is that most generator problems have simple data state, including local state, but complex execution state.
As well, C programmers are not afraid of this kind of semantic programming requirement, if it results in very small, fast generators.
Figure~\ref{f:CFAFormatGen} shows an asymmetric \newterm{input generator}, @Fmt@, for restructuring text into groups of characters of fixed-size blocks, \ie the input on the left is reformatted into the output on the right, where newlines are ignored.
\begin{center}
\tt
\begin{tabular}{@{}l|l@{}}
\multicolumn{1}{c|}{\textbf{\textrm{input}}} & \multicolumn{1}{c}{\textbf{\textrm{output}}} \\
\begin{tabular}[t]{@{}ll@{}}
abcdefghijklmnopqrstuvwxyz \\
abcdefghijklmnopqrstuvwxyz
\end{tabular}
&
\begin{tabular}[t]{@{}lllll@{}}
abcd	& efgh	& ijkl	& mnop	& qrst	\\
uvwx	& yzab	& cdef	& ghij	& klmn	\\
opqr	& stuv	& wxyz	&		&
\end{tabular}
\end{tabular}
\end{center}
The example takes advantage of resuming a generator in the constructor to prime the loops so the first character sent for formatting appears inside the nested loops.
The destructor provides a newline, if formatted text ends with a full line.
Figure~\ref{f:CFormatSim} shows the C implementation of the \CFA input generator with one additional field and the computed @goto@.
For contrast, Figure~\ref{f:PythonFormatter} shows the equivalent Python format generator with the same properties as the Fibonacci generator.

Figure~\ref{f:DeviceDriverGen} shows a \emph{killer} asymmetric generator, a device-driver, because device drivers caused 70\%-85\% of failures in Windows/Linux~\cite{Swift05}.
Device drivers follow the pattern of simple data state but complex execution state, \ie a finite state-machine (FSM) parsing a protocol.
For example, the following protocol:
\begin{center}
\ldots\, STX \ldots\, message \ldots\, ESC ETX \ldots\, message \ldots\, ETX 2-byte crc \ldots
\end{center}
is a network message beginning with the control character STX, ending with an ETX, and followed by a 2-byte cyclic-redundancy check.
Control characters may appear in a message if preceded by an ESC.
When a message byte arrives, it triggers an interrupt, and the operating system services the interrupt by calling the device driver with the byte read from a hardware register.
The device driver returns a status code of its current state, and when a complete message is obtained, the operating system knows the message is in the message buffer.
Hence, the device driver is an input/output generator.

Note, the cost of creating and resuming the device-driver generator, @Driver@, is virtually identical to call/return, so performance in an operating-system kernel is excellent.
As well, the data state is small, where variables @byte@ and @msg@ are communication variables for passing in message bytes and returning the message, and variables @lnth@, @crc@, and @sum@ are local variables that must be retained between calls and are hoisted into the generator type.
Manually detecting and hoisting local-state variables is easy when the number is small.
Finally, the execution state is large, with one @resume@ and seven @suspend@s.
Hence, the key benefits of the generator are correctness, safety, and maintenance because the execution states of the FSM are transcribed directly into the programming language rather than using a table-driven approach.
Because FSMs can be complex and occur frequently in important domains, direct support of the generator is crucial in a systems programming-language.
\begin{figure}
\centering
\newbox\myboxA
\begin{lrbox}{\myboxA}
\begin{python}[aboveskip=0pt,belowskip=0pt]
def Fib():
	fn1, fn = 0, 1
	while True:
		yield fn1
		fn1, fn = fn, fn1 + fn
f1 = Fib()
f2 = Fib()
for i in range( 10 ):
	print( next( f1 ), next( f2 ) )
\end{python}
\end{lrbox}

\newbox\myboxB
\begin{lrbox}{\myboxB}
\begin{python}[aboveskip=0pt,belowskip=0pt]
def Fmt():
	try:
		while True:
			for g in range( 5 ):
				for b in range( 4 ):
					print( (yield), end='' )
				print( '  ', end='' )
			print()
	except GeneratorExit:
		if g != 0 or b != 0:
			print()
fmt = Fmt()
next( fmt )			# prime, next prewritten
for i in range( 41 ):
	fmt.send( 'a' )	# send to yield
\end{python}
\end{lrbox}

\subfloat[Fibonacci]{\label{f:PythonFibonacci}\usebox\myboxA}
\hspace{3pt}
\vrule
\hspace{3pt}
\subfloat[Formatter]{\label{f:PythonFormatter}\usebox\myboxB}
\caption{Python Generator}
\label{f:PythonGenerator}

\bigskip

\begin{tabular}{@{}l|l@{}}
\begin{cfa}[aboveskip=0pt,belowskip=0pt]
enum Status { CONT, MSG, ESTX, ELNTH, ECRC };
generator Driver {
	Status status;
	unsigned char byte, * msg; // communication
	unsigned int lnth, sum;    // local state
	unsigned short int crc;
};
void ?{}( Driver & d, char * m ) { d.msg = m; }
Status next( Driver & d, char b ) with( d ) {
	byte = b; resume( d ); return status;
}
void main( Driver & d ) with( d ) {
	enum { STX = '\002', ESC = '\033',
			ETX = '\003', MaxMsg = 64 };
  msg: for () { // parse message
		status = CONT;
		lnth = 0; sum = 0;
		while ( byte != STX ) suspend;
	  emsg: for () {
			suspend; // process byte
\end{cfa}
&
\begin{cfa}[aboveskip=0pt,belowskip=0pt]
			choose ( byte ) { // switch with implicit break
			  case STX:
				status = ESTX; suspend; continue msg;
			  case ETX:
				break emsg;
			  case ESC:
				suspend;
			}
			if ( lnth >= MaxMsg ) { // buffer full ?
				status = ELNTH; suspend; continue msg; }
			msg[lnth++] = byte;
			sum += byte;
		}
		msg[lnth] = '\0'; // terminate string
		suspend;
		crc = byte << 8;
		suspend;
		status = (crc | byte) == sum ?
			MSG : ECRC;
		suspend;
	}
}
\end{cfa}
\end{tabular}
\caption{Device-driver generator for communication protocol}
\label{f:DeviceDriverGen}
\end{figure}

Figure~\ref{f:CFAPingPongGen} shows a symmetric generator, where the generator resumes another generator, forming a resume/resume cycle.
(The trivial cycle is a generator resuming itself.)
This control flow is similar to recursion for functions, but without stack growth.
The steps for symmetric control-flow are creating, executing, and terminating the cycle.
Constructing the cycle must deal with definition-before-use to close the cycle, \ie, the first generator must know about the last generator, which is not within scope.
(This issue occurs for any cyclic data-structure.)
% The example creates all the generators and then assigns the partners that form the cycle.
% Alternatively, the constructor can assign the partners as they are declared, except the first, and the first-generator partner is set after the last generator declaration to close the cycle.
Once the cycle is formed, the program main resumes one of the generators, and the generators can then traverse an arbitrary cycle using @resume@ to activate partner generator(s).
Terminating the cycle is accomplished by @suspend@ or @return@, both of which go back to the stack frame that started the cycle (program main in the example).
The starting stack-frame is below the last active generator because the resume/resume cycle does not grow the stack.
Also, since local variables are not retained in the generator function, it does not contain any objects with destructors that must be called, so the cost is the same as a function return.
Destructor cost occurs when the generator instance is deallocated, which is easily controlled by the programmer.

Figure~\ref{f:CPingPongSim} shows the implementation of the symmetric generator, where the complexity is the @resume@, which needs an extension to the calling convention to perform a forward rather than backward jump.
This jump starts at the top of the next generator main to re-execute the normal calling convention to make space on the stack for its local variables.
However, before the jump, the caller must reset its stack (and any registers) equivalent to a @return@, but subsequently jump forward not backward.
This semantics is basically a tail-call optimization, which compilers already perform.
Hence, assembly code is manually inserted in the example to undo the generator's entry code before the direct jump.
This assembly code is fragile as it depends on what entry code is generated, specifically if there are local variables and the level of optimization.
To provide this new calling convention requires a mechanism built into the compiler, which is beyond the scope of \CFA at this time.
Nevertheless, it is possible to hand generate any symmetric generators for proof of concept and performance testing.
A compiler could also eliminate other artifacts in the generator simulation to further increase performance.
\begin{figure}
\centering
\begin{lrbox}{\myboxA}
\begin{cfa}[aboveskip=0pt,belowskip=0pt]
generator PingPong {
	const char * name;
	PingPong & partner; // rebindable reference
	int N, i;
};
void ?{}(PingPong & pp, char * nm, int N) with(pp) {
	name = nm;  partner = 0p;  pp.N = N;  i = 0;
}
void main( PingPong & pp ) with(pp) {
	for ( ; i < N; i += 1 ) {
		sout | name | i;
		resume( partner );
	}
}
int main() {
	enum { N = 5 };
	PingPong ping = {"ping", N}, pong = {"pong", N};
	&ping.partner = &pong;  &pong.partner = &ping;
	resume( ping );
}
\end{cfa}
\end{lrbox}

\begin{lrbox}{\myboxB}
\begin{cfa}[escapechar={},aboveskip=0pt,belowskip=0pt]
typedef struct PingPong {
	const char * name;
	struct PingPong * partner;
	int N, i;
	void * next;
} PingPong;
#define PPCtor(name, N) { name, NULL, N, 0, NULL }
void comain( PingPong * pp ) {
	if ( pp->next ) goto *pp->next;
	pp->next = &&cycle;
	for ( ; pp->i < pp->N; pp->i += 1 ) {
		printf( "%s %d\n", pp->name, pp->i );
		asm( "mov  %0,%%rdi" : "=m" (pp->partner) );
		asm( "mov  %rdi,%rax" );
		asm( "popq %rbx" );
		asm( "jmp  comain" );
	  cycle: ;
	}
}
\end{cfa}
\end{lrbox}

\subfloat[\CFA symmetric generator]{\label{f:CFAPingPongGen}\usebox\myboxA}
\hspace{3pt}
\vrule
\hspace{3pt}
\subfloat[C generator simulation]{\label{f:CPingPongSim}\usebox\myboxB}
\hspace{3pt}
\caption{Ping-Pong Symmetric Generator}
\label{f:PingPongSymmetricGenerator}
\end{figure}

Finally, part of this generator work was inspired by the recent \CCtwenty generator proposal~\cite{C++20Coroutine19} (which they call coroutines).
Our work provides the same high-performance asymmetric-generators as \CCtwenty, and extends their work with symmetric generators.
An additional \CCtwenty generator feature allows @suspend@ and @resume@ to be followed by a restricted compound-statement that is executed after the current generator has reset its stack but before calling the next generator, specified with the following \CFA syntax.
\begin{cfa}
... suspend{ ... };
... resume( C ){ ... } ...
\end{cfa}
Since the current generator's stack is released before calling the compound statement, the compound statement can only reference variables in the generator's type.
This feature is useful when a generator is used in a concurrent context to ensure it is stopped before releasing a lock in the compound statement, which might immediately allow another thread to resume the generator.
Hence, this mechanism provides a general and safe handoff of the generator among competing threads.

\subsection{Coroutine}
\label{s:Coroutine}

Stackful coroutines extend generator semantics, \ie there is an implicit closure and @suspend@ may appear in a helper function called from the coroutine main.
A coroutine is specified by replacing @generator@ with @coroutine@ for the type.
This generality results in higher cost for creation, due to dynamic stack allocation, execution, due to context switching among stacks, and terminating, due to possible stack unwinding and dynamic stack deallocation.
How coroutines differ from generators is shown through a series of examples.

First, the previous generator examples are converted to their coroutine counterparts, allowing local-state variables to be moved from the generator type into the coroutine main.
\begin{description}
\item[Fibonacci]
Move the declaration of @fn1@ to the start of coroutine main.
\begin{cfa}[xleftmargin=0pt]
void main( Fib & fib ) with(fib) {
	int fn1;
\end{cfa}
\item[Formatter]
Move the declaration of @g@ and @b@ to the for loops in the coroutine main.
\begin{cfa}[xleftmargin=0pt]
for ( g; 5 ) {
	for ( b; 4 ) {
\end{cfa}
\item[Device Driver]
Move the declaration of @lnth@ and @sum@ to their points of initialization.
\begin{cfa}[xleftmargin=0pt]
status = CONT;
unsigned int lnth = 0, sum = 0;
...
unsigned short int crc = byte << 8;
\end{cfa}
\item[PingPong]
Move the declaration of @i@ to the for loop in the coroutine main.
\begin{cfa}[xleftmargin=0pt]
void main( PingPong & pp ) with(pp) {
	for ( i; N ) {
\end{cfa}
\end{description}
It is also possible to refactor code containing local-state and @suspend@ statements into a helper routine, like the computation of the CRC for the device driver.
\begin{cfa}
unsigned int Crc() {
	suspend;
	unsigned short int crc = byte << 8;
	suspend;
	status = (crc | byte) == sum ? MSG : ECRC;
	return crc;
}
\end{cfa}
A call to this function is placed at the end of the driver's coroutine-main.
For complex finite-state machines, refactoring is part of normal program abstraction, especially when code is used in multiple places.
Again, this complexity is usually associated with execution state rather than data state.

\begin{comment}
This paper provides a minimal concurrency \newterm{Application Program Interface} (API) that is simple, efficient and can be used to build other concurrency features.
While the simplest concurrency system is a thread and a lock, this low-level approach is hard to master.
An easier approach for programmers is to support higher-level constructs as the basis of concurrency.
Indeed, for highly-productive concurrent-programming, high-level approaches are much more popular~\cite{Hochstein05}.
Examples of high-level approaches are jobs (thread pool)~\cite{TBB}, implicit threading~\cite{OpenMP}, monitors~\cite{Java}, channels~\cite{CSP,Go}, and message passing~\cite{Erlang,MPI}.

The following terminology is used.
A \newterm{thread} is a fundamental unit of execution that runs a sequence of code and requires a stack to maintain state.
Multiple simultaneous threads give rise to \newterm{concurrency}, which requires locking to ensure access to shared data and safe communication.
\newterm{Locking}, and by extension \newterm{locks}, are defined as a mechanism to prevent progress of threads to provide safety.
\newterm{Parallelism} is running multiple threads simultaneously.
Parallelism implies \emph{actual} simultaneous execution, where concurrency only requires \emph{apparent} simultaneous execution. As such, parallelism only affects performance, which is observed through differences in space and/or time at runtime. Hence, there are two problems to be solved: concurrency and parallelism. While these two concepts are often combined, they are distinct, requiring different tools~\cite[\S~2]{Buhr05a}. Concurrency tools handle mutual exclusion and synchronization, while parallelism tools handle performance, cost, and resource utilization. The proposed concurrency API is implemented in a dialect of C, called \CFA (pronounced C-for-all). The paper discusses how the language features are added to the \CFA translator with respect to parsing, semantics, and type checking, and the corresponding high-performance runtime-library to implement the concurrent features. \end{comment} \begin{comment} \section{\CFA Overview} The following is a quick introduction to the \CFA language, specifically tailored to the features needed to support concurrency. Extended versions and explanation of the following code examples are available at the \CFA website~\cite{Cforall} or in Moss~\etal~\cite{Moss18}. \CFA is a non-object-oriented extension of ISO-C, and hence, supports all C paradigms. Like C, the building blocks of \CFA are structures and routines. Virtually all of the code generated by the \CFA translator respects C memory layouts and calling conventions. While \CFA is not object oriented, lacking the concept of a receiver (\eg @this@) and nominal inheritance-relationships, C has a notion of objects: region of data storage in the execution environment, the contents of which can represent values''~\cite[3.15]{C11}. While some object-oriented features appear in \CFA, they are independent capabilities, allowing \CFA to adopt them while maintaining a procedural paradigm. 
\subsection{References} \CFA provides multi-level rebindable references, as an alternative to pointers, which significantly reduces syntactic noise. \begin{cfa} int x = 1, y = 2, z = 3; int * p1 = &x, ** p2 = &p1,  *** p3 = &p2,      $\C{// pointers to x}$ & r1 = x,   && r2 = r1,   &&& r3 = r2;        $\C{// references to x}$ int * p4 = &z, & r4 = z; *p1 = 3; **p2 = 3; ***p3 = 3;       // change x r1 =  3;     r2 = 3;      r3 = 3;        // change x: implicit dereferences *r1, **r2, ***r3 **p3 = &y; *p3 = &p4;                // change p1, p2 &r3 = &y; &&r3 = &&r4;             // change r1, r2: cancel implicit dereferences (&*)**r3, (&(&*)*)*r3, &(&*)r4 \end{cfa} A reference is a handle to an object, like a pointer, but is automatically dereferenced the specified number of levels. Referencing (address-of @&@) a reference variable cancels one of the implicit dereferences, until there are no more implicit references, after which normal expression behaviour applies. \subsection{\texorpdfstring{\protect\lstinline{with} Statement}{with Statement}} \label{s:WithStatement} Heterogeneous data is aggregated into a structure/union. To reduce syntactic noise, \CFA provides a @with@ statement (see Pascal~\cite[\S~4.F]{Pascal}) to elide aggregate field-qualification by opening a scope containing the field identifiers. \begin{cquote} \vspace*{-\baselineskip}%??? \lstDeleteShortInline@% \begin{cfa} struct S { char c; int i; double d; }; struct T { double m, n; }; // multiple aggregate parameters \end{cfa} \begin{tabular}{@{}l@{\hspace{2\parindentlnth}}|@{\hspace{2\parindentlnth}}l@{}} \begin{cfa} void f( S & s, T & t ) { s.c; s.i; s.d; t.m; t.n; } \end{cfa} & \begin{cfa} void f( S & s, T & t ) with ( s, t ) { c; i; d;                // no qualification m; n; } \end{cfa} \end{tabular} \lstMakeShortInline@% \end{cquote} Object-oriented programming languages only provide implicit qualification for the receiver. 
In detail, the @with@-statement syntax is: \begin{cfa} $\emph{with-statement}$: 'with' '(' $\emph{expression-list}$ ')' $\emph{compound-statement}$ \end{cfa} and may appear as the body of a routine or nested within a routine body. Each expression in the expression-list provides a type and object. The type must be an aggregate type. (Enumerations are already opened.) The object is the implicit qualifier for the open structure-fields. All expressions in the expression list are opened in parallel within the compound statement, which is different from Pascal, which nests the openings from left to right. \subsection{Overloading} \CFA maximizes the ability to reuse names via overloading to aggressively address the naming problem. Both variables and routines may be overloaded, where selection is based on number and types of returns and arguments (as in Ada~\cite{Ada}). \newpage \vspace*{-2\baselineskip}%??? \begin{cquote} \begin{cfa} // selection based on type \end{cfa} \lstDeleteShortInline@% \begin{tabular}{@{}l@{\hspace{2\parindentlnth}}|@{\hspace{2\parindentlnth}}l@{}} \begin{cfa} const short int MIN = -32768; const int MIN = -2147483648; const long int MIN = -9223372036854775808L; \end{cfa} & \begin{cfa} short int si = MIN; int i = MIN; long int li = MIN; \end{cfa} \end{tabular} \begin{cfa} // selection based on type and number of parameters \end{cfa} \begin{tabular}{@{}l@{\hspace{2.7\parindentlnth}}|@{\hspace{2\parindentlnth}}l@{}} \begin{cfa} void f( void ); void f( char ); void f( int, double ); \end{cfa} & \begin{cfa} f(); f( 'a' ); f( 3, 5.2 ); \end{cfa} \end{tabular} \begin{cfa} // selection based on type and number of returns \end{cfa} \begin{tabular}{@{}l@{\hspace{2\parindentlnth}}|@{\hspace{2\parindentlnth}}l@{}} \begin{cfa} char f( int ); double f( int ); [char, double] f( int ); \end{cfa} & \begin{cfa} char c = f( 3 ); double d = f( 3 ); [d, c] = f( 3 ); \end{cfa} \end{tabular} \lstMakeShortInline@% \end{cquote} Overloading is important for \CFA 
concurrency since the runtime system relies on creating different types to represent concurrency objects. Therefore, overloading eliminates long prefixes and other naming conventions to prevent name clashes. As seen in Section~\ref{s:Concurrency}, routine @main@ is heavily overloaded. As another example, variable overloading is useful in the parallel semantics of the @with@ statement for fields with the same name: \begin{cfa} struct S { int i; int j; double m; } s; struct T { int i; int k; int m; } t; with ( s, t ) { j + k;                                                                  $\C{// unambiguous, s.j + t.k}$ m = 5.0;                                                                $\C{// unambiguous, s.m = 5.0}$ m = 1;                                                                  $\C{// unambiguous, t.m = 1}$ int a = m;                                                              $\C{// unambiguous, a = t.m }$ double b = m;                                                   $\C{// unambiguous, b = s.m}$ int c = s.i + t.i;                                  $\C{// unambiguous, qualification}$ (double)m;                                                              $\C{// unambiguous, cast s.m}$ } \end{cfa} For parallel semantics, both @s.i@ and @t.i@ are visible with the same type, so only @i@ is ambiguous without qualification. \subsection{Operators} Overloading also extends to operators. Operator-overloading syntax creates a routine name with an operator symbol and question marks for the operands: \begin{cquote} \lstDeleteShortInline@% \begin{tabular}{@{}ll@{\hspace{\parindentlnth}}|@{\hspace{\parindentlnth}}l@{}} \begin{cfa} int ++?(int op); int ?++(int op); int ?+?(int op1, int op2); int ?<=?(int op1, int op2); int ?=? 
(int & op1, int op2); int ?+=?(int & op1, int op2); \end{cfa} & \begin{cfa} // unary prefix increment // unary postfix increment // binary plus // binary less than // binary assignment // binary plus-assignment \end{cfa} & \begin{cfa} struct S { int i, j; }; S ?+?( S op1, S op2) { // add two structures return (S){op1.i + op2.i, op1.j + op2.j}; } S s1 = {1, 2}, s2 = {2, 3}, s3; s3 = s1 + s2;         // compute sum: s3 == {2, 5} \end{cfa} \end{tabular} \lstMakeShortInline@% \end{cquote} \subsection{Constructors / Destructors} Object lifetime is a challenge in non-managed programming languages. \CFA responds with \CC-like constructors and destructors, using a different operator-overloading syntax. \begin{cfa} struct VLA { int len, * data; };                        $\C{// variable length array of integers}$ void ?{}( VLA & vla ) with ( vla ) { len = 10;  data = alloc( len ); }  $\C{// default constructor}$ void ?{}( VLA & vla, int size, char fill ) with ( vla ) { len = size;  data = alloc( len, fill ); } // initialization void ?{}( VLA & vla, VLA other ) { vla.len = other.len;  vla.data = other.data; } $\C{// copy, shallow}$ void ^?{}( VLA & vla ) with ( vla ) { free( data ); } $\C{// destructor}$ { VLA  x,            y = { 20, 0x01 },     z = y; $\C{// z points to y}$ // $\LstCommentStyle{\color{red}\ \ \ x\{\};\ \ \ \ \ \ \ \ \ y\{ 20, 0x01 \};\ \ \ \ \ \ \ \ \ \ z\{ z, y \};\ \ \ \ \ \ \ implicit calls}$ ^x{};                                                                   $\C{// deallocate x}$ x{};                                                                    $\C{// reallocate x}$ z{ 5, 0xff };                                                   $\C{// reallocate z, not pointing to y}$ ^y{};                                                                   $\C{// deallocate y}$ y{ x };                                                                 $\C{// reallocate y, points to x}$ x{};                                                                    $\C{// 
reallocate x, not pointing to y}$ }       //  $\LstCommentStyle{\color{red}\^{}z\{\};\ \ \^{}y\{\};\ \ \^{}x\{\};\ \ \ implicit calls}$ \end{cfa} Like \CC, construction is implicit on allocation (stack/heap) and destruction is implicit on deallocation. The object and all their fields are constructed/destructed. \CFA also provides @new@ and @delete@ as library routines, which behave like @malloc@ and @free@, in addition to constructing and destructing objects: \begin{cfa} { ... struct S s = {10}; ...                              $\C{// allocation, call constructor}$ }                                                                                       $\C{// deallocation, call destructor}$ struct S * s = new();                                           $\C{// allocation, call constructor}$ ... delete( s );                                                            $\C{// deallocation, call destructor}$ \end{cfa} \CFA concurrency uses object lifetime as a means of mutual exclusion and/or synchronization. \subsection{Parametric Polymorphism} \label{s:ParametricPolymorphism} The signature feature of \CFA is parametric-polymorphic routines~\cite{Cforall} with routines generalized using a @forall@ clause (giving the language its name), which allow separately compiled routines to support generic usage over multiple types. For example, the following sum routine works for any type that supports construction from 0 and addition: \begin{cfa} forall( otype T | { void ?{}( T *, zero_t ); T ?+?( T, T ); } ) // constraint type, 0 and + T sum( T a[$\,$], size_t size ) { T total = { 0 };                                    $\C{// initialize by 0 constructor}$ for ( size_t i = 0; i < size; i += 1 ) total = total + a[i];                         $\C{// select appropriate +}$ return total; } S sa[5]; int i = sum( sa, 5 );                                           $\C{// use S's 0 construction and +}$ \end{cfa} Type variables can be @otype@ or @dtype@. 
@otype@ refers to a \emph{complete type}, \ie, a type with size, alignment, default constructor, copy constructor, destructor, and assignment operator. @dtype@ refers to an \emph{incomplete type}, \eg, void or a forward-declared type. The builtin types @zero_t@ and @one_t@ overload constant 0 and 1 for a new types, where both 0 and 1 have special meaning in C. \CFA provides \newterm{traits} to name a group of type assertions, where the trait name allows specifying the same set of assertions in multiple locations, preventing repetition mistakes at each routine declaration: \begin{cfa} trait sumable( otype T ) { void ?{}( T &, zero_t );                              $\C{// 0 literal constructor}$ T ?+?( T, T );                                                $\C{// assortment of additions}$ T ?+=?( T &, T ); T ++?( T & ); T ?++( T & ); }; forall( otype T | sumable( T ) )                      $\C{// use trait}$ T sum( T a[$\,$], size_t size ); \end{cfa} Using the return type for overload discrimination, it is possible to write a type-safe @alloc@ based on the C @malloc@: \begin{cfa} forall( dtype T | sized(T) ) T * alloc( void ) { return (T *)malloc( sizeof(T) ); } int * ip = alloc();                                                     $\C{// select type and size from left-hand side}$ double * dp = alloc(); struct S {...} * sp = alloc(); \end{cfa} where the return type supplies the type/size of the allocation, which is impossible in most type systems. \end{comment} \section{Coroutines: Stepping Stone} \label{coroutine} Coroutines are generalized routines allowing execution to be temporarily suspended and later resumed. Hence, unlike a normal routine, a coroutine may not terminate when it returns to its caller, allowing it to be restarted with the values and execution location present at the point of suspension. This capability is accomplished via the coroutine's stack, where suspend/resume context switch among stacks. 
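For readers more familiar with stackless generators, the suspend/resume pattern can be sketched with a Python generator, whose frame (an analogue of the coroutine stack, though only one frame deep) retains the values and execution location between activations; this is a comparison sketch, not \CFA code:
\begin{python}
def fib():
    fn1, fn = 0, 1                  # state retained in the generator frame
    while True:
        yield fn1                   # suspend, returning a value to the resumer
        fn1, fn = fn, fn1 + fn      # execution restarts here on the next resume

f1, f2 = fib(), fib()               # multiple independent instances
print( [next( f1 ) for _ in range( 10 )] )  # next() plays the role of resume
\end{python}
As in the \CFA versions, each instance suspends and resumes independently, and no state escapes into global variables.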
Because the design challenges of threading are also present in coroutines, the coroutine design effort is relevant;
further, this effort can be easily exposed to programmers, giving them a useful new programming paradigm, because a coroutine handles the class of problems that need to retain state between calls, \eg plugins, device drivers, and finite-state machines.
Therefore, the two fundamental features of the core \CFA coroutine-API are independent call-stacks and @suspend@/@resume@ operations.

For example, a problem made easier with coroutines is unbounded generators, \eg generating an infinite sequence of Fibonacci numbers
\begin{displaymath}
\mathsf{fib}(n) = \left \{
\begin{array}{ll}
0                                       & n = 0         \\
1                                       & n = 1         \\
\mathsf{fib}(n-1) + \mathsf{fib}(n-2)   & n \ge 2       \\
\end{array}
\right.
\end{displaymath}
where Figure~\ref{f:C-fibonacci} shows conventional approaches for writing a Fibonacci generator in C.
Figure~\ref{f:GlobalVariables} illustrates the following problems:
unique, unencapsulated global variables are necessary to retain state between calls;
only one Fibonacci generator can exist at a time;
and execution state must be explicitly retained via explicit state variables.
Figure~\ref{f:ExternalState} addresses these issues:
unencapsulated program global variables become encapsulated structure variables, unique global variables are replaced by multiple Fibonacci objects, and explicit execution state is removed by precomputing the first two Fibonacci numbers and returning $\mathsf{fib}(n-2)$.

Figure~\ref{f:Coroutine3States} creates a @coroutine@ type, @coroutine Fib { int fn; }@, which provides communication, @fn@, for the \newterm{coroutine main}, @main@, which runs on the coroutine stack, and possibly multiple interface routines, \eg @next@.
Like the structure in Figure~\ref{f:ExternalState}, the coroutine type allows multiple instances, where instances of this type are passed to the (overloaded) coroutine main.
The coroutine main's stack holds the state for the next generation, @f1@ and @f2@, and the code represents the three states in the Fibonacci formula via three suspend points that context switch back to the caller's @resume@.
The interface routine @next@ takes a Fibonacci instance and context switches to it using @resume@;
on restart, the Fibonacci field, @fn@, contains the next value in the sequence, which is returned.
The first @resume@ is special because it allocates the coroutine stack and cocalls its coroutine main on that stack;
when the coroutine main returns, its stack is deallocated.
Hence, @Fib@ is an object at creation, transitions to a coroutine on its first resume, and transitions back to an object when the coroutine main finishes.
Figure~\ref{f:Coroutine1State} shows the coroutine version of the C version in Figure~\ref{f:ExternalState}.
Coroutine generators are called \newterm{output coroutines} because values are only returned.
\begin{figure} \begin{lrbox}{\myboxA} \begin{cfa}[aboveskip=0pt,belowskip=0pt] #define FIB_INIT { 0, 1 } #define FibCtor { 0, 1 } typedef struct { int fn1, fn; } Fib; int fib( Fib * f ) { int main() { Fib f1 = FIB_INIT, f2 = FIB_INIT; Fib f1 = FibCtor, f2 = FibCtor; for ( int i = 0; i < 10; i += 1 ) { printf( "%d %d\n", [fn1, fn] = [0, 1]; for () { suspend(); suspend; [fn1, fn] = [fn, fn1 + fn]; } } int ?()( Fib & fib ) with( fib ) { resume( fib );  return fn1; return resume( fib ).fn1; } int main() { \caption{Fibonacci Generator} \label{f:C-fibonacci} % \bigskip % % \newbox\myboxA % \begin{lrbox}{\myboxA} % \begin{cfa}[aboveskip=0pt,belowskip=0pt] % coroutine Fib { int fn; }; % void main( Fib & fib ) with( fib ) { %       fn = 0;  int fn1 = fn; suspend(); %       fn = 1;  int fn2 = fn1;  fn1 = fn; suspend(); %       for () { %               fn = fn1 + fn2; fn2 = fn1; fn1 = fn; suspend(); } % } % int next( Fib & fib ) with( fib ) { resume( fib ); return fn; } % int main() { %       Fib f1, f2; %       for ( 10 ) %               sout | next( f1 ) | next( f2 ); % } % \end{cfa} % \end{lrbox} % \newbox\myboxB % \begin{lrbox}{\myboxB} % \begin{python}[aboveskip=0pt,belowskip=0pt] % % def Fibonacci(): %       fn = 0; fn1 = fn; yield fn  # suspend %       fn = 1; fn2 = fn1; fn1 = fn; yield fn %       while True: %               fn = fn1 + fn2; fn2 = fn1; fn1 = fn; yield fn % % % f1 = Fibonacci() % f2 = Fibonacci() % for i in range( 10 ): %       print( next( f1 ), next( f2 ) ) # resume % % \end{python} % \end{lrbox} % \subfloat[\CFA]{\label{f:Coroutine3States}\usebox\myboxA} % \qquad % \subfloat[Python]{\label{f:Coroutine1State}\usebox\myboxB} % \caption{Fibonacci input coroutine, 3 states, internal variables} % \label{f:cfa-fibonacci} \end{figure} Using a coroutine, it is possible to express the Fibonacci formula directly without any of the C problems. 
Figure~\ref{f:CFAFmt} shows an \newterm{input coroutine}, @Format@, for restructuring text into groups of characters of fixed-size blocks.
For example, the input on the left is reformatted into the output on the right.
\begin{quote} \tt \begin{tabular}{@{}l|l@{}} \multicolumn{1}{c|}{\textbf{\textrm{input}}} & \multicolumn{1}{c}{\textbf{\textrm{output}}} \\ abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz & \begin{tabular}[t]{@{}lllll@{}} abcd    & efgh  & ijkl  & mnop  & qrst  \\ uvwx    & yzab  & cdef  & ghij  & klmn  \\ opqr    & stuv  & wxyz  &               & \end{tabular} \end{tabular} \end{quote} The example takes advantage of resuming a coroutine in the constructor to prime the loops so the first character sent for formatting appears inside the nested loops. The destructor provides a newline, if formatted text ends with a full line. Figure~\ref{f:CFmt} shows the C equivalent formatter, where the loops of the coroutine are flattened (linearized) and rechecked on each call because execution location is not retained between calls. (Linearized code is the bane of device drivers.) \begin{figure} \centering \bigskip \newbox\myboxA \begin{lrbox}{\myboxA} \begin{cfa}[aboveskip=0pt,belowskip=0pt] coroutine Fmt { char ch;   // communication variables int g, b;   // needed in destructor }; void main( Fmt & fmt ) with( fmt ) { coroutine Fib { int fn; }; void main( Fib & fib ) with( fib ) { fn = 0;  int fn1 = fn; suspend; fn = 1;  int fn2 = fn1;  fn1 = fn; suspend; for () { for ( g = 0; g < 5; g += 1 ) { // groups for ( b = 0; b < 4; b += 1 ) { // blocks suspend(); sout | ch; } // print character sout | "  "; } // block separator sout | nl; }  // group separator } void ?{}( Fmt & fmt ) { resume( fmt ); } // prime void ^?{}( Fmt & fmt ) with( fmt ) { // destructor if ( g != 0 || b != 0 ) // special case sout | nl; } void send( Fmt & fmt, char c ) { fmt.ch = c; resume( fmt ); } fn = fn1 + fn2; fn2 = fn1; fn1 = fn; suspend; } } int next( Fib & fib ) with( fib ) { resume( fib ); return fn; } int main() { Fmt fmt; sout | nlOff;   // turn off auto newline for ( 41 ) send( fmt, 'a' ); Fib f1, f2; for ( 10 ) sout | next( f1 ) | next( f2 ); } \end{cfa} \end{lrbox} \newbox\myboxB 
\begin{lrbox}{\myboxB} \begin{python}[aboveskip=0pt,belowskip=0pt] def Fmt(): try: while True: for g in range( 5 ): for b in range( 4 ): print( (yield), end='' ) print( '  ', end='' ) print() except GeneratorExit: if g != 0 | b != 0: print() fmt = Fmt() next( fmt )                    # prime for i in range( 41 ): fmt.send( 'a' );      # send to yield def Fibonacci(): fn = 0; fn1 = fn; yield fn  # suspend fn = 1; fn2 = fn1; fn1 = fn; yield fn while True: fn = fn1 + fn2; fn2 = fn1; fn1 = fn; yield fn f1 = Fibonacci() f2 = Fibonacci() for i in range( 10 ): print( next( f1 ), next( f2 ) ) # resume \end{python} \end{lrbox} \subfloat[\CFA]{\label{f:CFAFmt}\usebox\myboxA} \subfloat[\CFA]{\label{f:Coroutine3States}\usebox\myboxA} \qquad \subfloat[Python]{\label{f:CFmt}\usebox\myboxB} \caption{Output formatting text} \label{f:fmt-line} \subfloat[Python]{\label{f:Coroutine1State}\usebox\myboxB} \caption{Fibonacci input coroutine, 3 states, internal variables} \label{f:cfa-fibonacci} \end{figure} The previous examples are \newterm{asymmetric (semi) coroutine}s because one coroutine always calls a resuming routine for another coroutine, and the resumed coroutine always suspends back to its last resumer, similar to call/return for normal routines. However, @resume@ and @suspend@ context switch among existing stack-frames, rather than create new ones so there is no stack growth. \newterm{Symmetric (full) coroutine}s have a coroutine call to a resuming routine for another coroutine, and its coroutine main calls another resuming routine, which eventually forms a resuming-call cycle. (The trivial cycle is a coroutine resuming itself.) This control flow is similar to recursion for normal routines, but again there is no stack growth from the context switch. 
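The symmetric resuming-call cycle can be approximated in Python, which has only asymmetric generators, by a simple trampoline that repeatedly resumes whichever generator the current one names as its successor; this is a comparison sketch with illustrative names, not \CFA semantics (the trampoline, not the generators, performs the transfer):
\begin{python}
def trampoline( start ):
    g = start
    while g is not None:        # each activation yields the next coroutine to resume
        g = next( g )

def player( name, n, partner, log ):
    for i in range( n ):
        log.append( f"{name} {i}" )
        yield partner[0]        # "resume" the partner (symmetric transfer)
    yield None                  # end of cycle: stop the trampoline

log = []
pi, po = [], []                 # cells for late binding of the mutual references
ping = player( "ping", 3, po, log )
pong = player( "pong", 3, pi, log )
pi.append( ping );  po.append( pong )
trampoline( ping )
print( log )
\end{python}
Because the trampoline drives every transfer, the generator frames never nest, mirroring the fixed-size stack property of a symmetric coroutine cycle.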
\end{comment} \begin{figure} \begin{cfa} coroutine Prod { Cons & c; Cons & c;                       // communication int N, money, receipt; }; } void start( Prod & prod, int N, Cons &c ) { &prod.c = &c; // reassignable reference &prod.c = &c; prod.[N, receipt] = [N, 0]; resume( prod ); \begin{cfa} coroutine Cons { Prod & p; Prod & p;                       // communication int p1, p2, status; bool done; cons.[status, done ] = [0, false]; } void ^?{}( Cons & cons ) {} void main( Cons & cons ) with( cons ) { // 1st resume starts here resume( cons ); } \end{cfa} \end{tabular} \caption{Producer / consumer: resume-resume cycle, bi-directional communication} \label{f:ProdCons} \medskip \begin{center} \input{FullProdConsStack.pstex_t} \end{center} \vspace*{-10pt} \caption{Producer / consumer runtime stacks} \label{f:ProdConsRuntimeStacks} \medskip \begin{center} \input{FullCoroutinePhases.pstex_t} \end{center} \vspace*{-10pt} \caption{Ping / Pong coroutine steps} \label{f:PingPongFullCoroutineSteps} \end{figure} Figure~\ref{f:ProdCons} shows a producer/consumer symmetric-coroutine performing bi-directional communication. Since the solution involves a full-coroutining cycle, the program main creates one coroutine in isolation, passes this coroutine to its partner, and closes the cycle at the call to @start@. The @start@ routine communicates both the number of elements to be produced and the consumer into the producer's coroutine-structure. Then the @resume@ to @prod@ creates @prod@'s stack with a frame for @prod@'s coroutine main at the top, and context switches to it. @prod@'s coroutine main starts, creates local variables that are retained between coroutine activations, and executes $N$ iterations, each generating two random values, calling the consumer to deliver the values, and printing the status returned from the consumer. 
Figure~\ref{f:ProdCons} shows the ping-pong example in Figure~\ref{f:CFAPingPongGen} extended into a producer/consumer symmetric-coroutine performing bidirectional communication.
This example is illustrative because both producer and consumer have two interface functions with @resume@s that suspend execution in these interface (helper) functions.
The program main creates the producer coroutine, passes it to the consumer coroutine in its initialization, and closes the cycle at the call to @start@ along with the number of items to be produced.
The first @resume@ of @prod@ creates @prod@'s stack with a frame for @prod@'s coroutine main at the top, and context switches to it.
@prod@'s coroutine main starts, creates local-state variables that are retained between coroutine activations, and executes $N$ iterations, each generating two random values, calling the consumer to deliver the values, and printing the status returned from the consumer.

The producer call to @delivery@ transfers values into the consumer's communication variables, resumes the consumer, and returns the consumer status.
On the first resume, @cons@'s stack is created and initialized, holding local-state variables retained between subsequent activations of the coroutine.
The consumer iterates until the @done@ flag is set, printing the values delivered by the producer, incrementing its status, and calling back to the producer via @payment@;
on return from @payment@, it prints the receipt from the producer and increments @money@ (inflation).
The call from the consumer to @payment@ introduces the cycle between producer and consumer.
When @payment@ is called, the consumer copies values into the producer's communication variable and a resume is executed.
The context switch restarts the producer at the point where it last context switched, so it continues in @delivery@ after the resume.
@delivery@ returns the status value in @prod@'s coroutine main, where the status is printed.
The loop then repeats calling @delivery@, where each call resumes the consumer coroutine.
The consumer increments and returns the receipt to the call in @cons@'s coroutine main.
The loop then repeats calling @payment@, where each call resumes the producer coroutine.

After iterating $N$ times, the producer calls @stop@.
The @done@ flag is set to stop the consumer's execution and a resume is executed.
The context switch restarts @cons@ in @payment@ and it returns with the last receipt.
The consumer terminates its loops because @done@ is true, its @main@ terminates, so @cons@ transitions from a coroutine back to an object, and @prod@ reactivates after the resume in @stop@.
@stop@ returns and @prod@'s coroutine main terminates.
The program main restarts after the resume in @start@.
@start@ returns and the program main terminates.

One \emph{killer} application for coroutines is device drivers, which at one time caused 70\%-85\% of failures in Windows/Linux~\cite{Swift05}.
Many device drivers are finite-state machines parsing a protocol, \eg:
\begin{tabbing}
\ldots STX \= \ldots message \ldots \= ESC \= ETX \= \ldots message \ldots  \= ETX \= 2-byte crc \= \ldots      \kill
\ldots STX \> \ldots message \ldots \> ESC \> ETX \> \ldots message \ldots  \> ETX \> 2-byte crc \> \ldots
\end{tabbing}
where a network message begins with the control character STX and ends with an ETX, followed by a 2-byte cyclic-redundancy check.
Control characters may appear in a message if preceded by an ESC.
Because FSMs can be complex and occur frequently in important domains, direct support of the coroutine is crucial in a systems programming language.
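The same FSM structure can be sketched as a Python generator that is primed with @next@ and then fed bytes with @send@, so the parse position is retained implicitly between bytes; the framing constants and the simple additive checksum here are illustrative assumptions, not the exact driver protocol:
\begin{python}
STX, ESC, ETX = 0x02, 0x1B, 0x03

def driver( deliver ):
    while True:
        byte = yield                    # suspend awaiting the next byte
        if byte != STX: continue        # scan for start of message
        msg, cksum = bytearray(), 0
        while True:
            byte = yield
            if byte == ETX: break
            if byte == ESC:             # escaped control character: take next byte literally
                byte = yield
            msg.append( byte )
            cksum = (cksum + byte) & 0xFFFF
        crc = ((yield) << 8) | (yield)  # two trailing check bytes, high then low
        deliver( bytes( msg ), crc == cksum )

msgs = []
d = driver( lambda m, ok: msgs.append( (m, ok) ) )
next( d )                               # prime: run to the first suspend
payload = b"hi"
frame = bytes( [STX] ) + payload + bytes( [ETX] ) + sum( payload ).to_bytes( 2, "big" )
for b in frame: d.send( b )
print( msgs )
\end{python}
As in the \CFA driver, the nested loops and escape handling are written directly, rather than flattened into a switch over explicit state variables.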
\begin{figure} \begin{cfa} enum Status { CONT, MSG, ESTX, ELNTH, ECRC }; coroutine Driver { Status status; char * msg, byte; }; void ?{}( Driver & d, char * m ) { d.msg = m; }         $\C[3.0in]{// constructor}$ Status next( Driver & d, char b ) with( d ) {           $\C{// 'with' opens scope}$ byte = b; resume( d ); return status; } void main( Driver & d ) with( d ) { enum { STX = '\002', ESC = '\033', ETX = '\003', MaxMsg = 64 }; unsigned short int crc;                                                 $\C{// error checking}$ msg: for () {                                                                         $\C{// parse message}$ status = CONT; unsigned int lnth = 0, sum = 0; while ( byte != STX ) suspend(); emsg: for () { suspend();                                                    $\C{// process byte}$ choose ( byte ) {                                               $\C{// switch with default break}$ case STX: status = ESTX; suspend(); continue msg; case ETX: break emsg; case ESC: suspend(); } // choose if ( lnth >= MaxMsg ) {                                 $\C{// buffer full ?}$ status = ELNTH; suspend(); continue msg; } msg[lnth++] = byte; sum += byte; } // for msg[lnth] = '\0';                                                       $\C{// terminate string}\CRT$ suspend(); crc = (unsigned char)byte << 8; // prevent sign extension for signed char suspend(); status = (crc | (unsigned char)byte) == sum ? MSG : ECRC; suspend(); } // for } \end{cfa} \caption{Device driver for simple communication protocol} \end{figure} Figure~\ref{f:ProdConsRuntimeStacks} shows the runtime stacks of the program main, and the coroutine mains for @prod@ and @cons@ during the cycling. Terminating a coroutine cycle is more complex than a generator cycle, because it requires context switching to the program main's \emph{stack} to shutdown the program, whereas generators started by the program main run on its stack. 
Furthermore, each deallocated coroutine must guarantee all destructors are run for objects allocated in the coroutine type \emph{and} for objects allocated on the coroutine's stack at the point of suspension, which can be arbitrarily deep.
When a coroutine's main ends, its stack is already unwound so any stack-allocated objects with destructors have been finalized.
The na\"{i}ve semantics for coroutine-cycle termination is to context switch to the last resumer, like executing a @suspend@/@return@ in a generator.
However, for coroutines, the last resumer is \emph{not} implicitly below the current stack frame, as it is for generators, because each coroutine's stack is independent.
Unfortunately, it is impossible to determine statically if a coroutine is in a cycle and unrealistic to check dynamically (graph-cycle problem).
Hence, a compromise solution is necessary that works for asymmetric (acyclic) and symmetric (cyclic) coroutines.

Our solution for coroutine termination works well for the most common asymmetric and symmetric coroutine usage-patterns.
For asymmetric coroutines, it is common for the first resumer (starter) coroutine to be the only resumer.
All previous generators converted to coroutines have this property.
For symmetric coroutines, it is common for the cycle creator to persist for the life-time of the cycle.
Hence, the starter coroutine is remembered on the first resume and ending the coroutine resumes the starter.
Figure~\ref{f:ProdConsRuntimeStacks} shows this semantic by the dashed lines from the end of the coroutine mains:
@prod@ starts @cons@ so @cons@ resumes @prod@ at the end, and the program main starts @prod@ so @prod@ resumes the program main at the end.
For other scenarios, it is always possible to devise a solution with additional programming effort.

The producer/consumer example does not illustrate the full power of the starter semantics because @cons@ always ends first.
Assume generator @PingPong@ is converted to a coroutine.
Figure~\ref{f:PingPongFullCoroutineSteps} shows the creation, starter, and cyclic execution steps of the coroutine version.
The program main creates (declares) coroutine instances @ping@ and @pong@.
Next, the program main resumes @ping@, making it @ping@'s starter, and @ping@'s main resumes @pong@'s main, making it @pong@'s starter.
Execution forms a cycle when @pong@ resumes @ping@, and cycles $N$ times.
By adjusting $N$ for either @ping@/@pong@, it is possible to have either one finish first, instead of @pong@ always ending first.
If @pong@ ends first, it resumes its starter @ping@ in its coroutine main, then @ping@ ends and resumes its starter, the program main, in function @start@.
If @ping@ ends first, it resumes its starter, the program main, in function @start@.
Regardless of the cycle complexity, the starter stack always leads back to the program main, but the stack can be entered at an arbitrary point.
Once back at the program main, coroutines @ping@ and @pong@ are deallocated.
For generators, deallocation runs the destructors for all objects in the generator type.
For coroutines, deallocation deals with objects in the coroutine type and must also run the destructors for any objects pending on the coroutine's stack for any unterminated coroutine.
Hence, if a coroutine's destructor detects the coroutine is not ended, it implicitly raises a cancellation exception (uncatchable exception) at the coroutine and resumes it, so the cancellation exception can propagate to the root of the coroutine's stack, destroying all local variables on the stack.
Thus, the \CFA semantics for generators and coroutines ensure both can be safely deallocated at any time, regardless of their current state, like any other aggregate object.
Explicitly raising normal exceptions at another coroutine can replace flag variables, like @stop@, \eg @prod@ raises a @stop@ exception at @cons@ after it finishes generating values and resumes @cons@, which catches the @stop@ exception to terminate its loop.
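Assuming the generator fields carry over unchanged, the coroutine conversion of @PingPong@ can be sketched as follows; the partner pointer and per-instance count @N@ are as in the generator version, and the output statement is illustrative:
\begin{cfa}
coroutine PingPong {
	const char * name;
	PingPong * partner;					$\C{// set after both instances are declared}$
	unsigned int N;
};
void main( PingPong & pp ) with( pp ) {
	for ( ; N > 0; N -= 1 ) {
		sout | name | N;
		resume( *partner );				$\C{// cycle to partner's coroutine main}$
	} // for
}										$\C{// main ends $\Rightarrow$ resume starter}$
\end{cfa}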
Finally, there is an interesting effect for @suspend@ with symmetric coroutines.
A coroutine must retain its last resumer to suspend back, because the resumer is on a different stack.
These reverse pointers allow @suspend@ to cycle \emph{backwards}, which may be useful in certain cases.
However, there is an anomaly if a coroutine resumes itself, because it overwrites its last resumer with itself, losing the ability to resume the last external resumer.
To prevent losing this information, a self-resume does not overwrite the last resumer.

A significant implementation challenge for coroutines (and threads, see Section~\ref{threads}) is adding extra fields and executing code after/before the coroutine constructor/destructor and coroutine main to create/initialize/de-initialize/destroy the extra fields and the stack.
There are several solutions to this problem, and the chosen option directed the \CFA coroutine design.

For object-oriented languages, inheritance can be used to provide the extra fields and code, but it requires explicitly writing the inheritance:
\begin{cfa}[morekeywords={class,inherits}]
class mycoroutine inherits baseCoroutine { ... }
\end{cfa}
In addition, the programming language, and possibly its tool set, \eg debugger or @valgrind@, need to understand @baseCoroutine@ because of the special stack property of coroutines.
Furthermore, the execution of constructors/destructors is in the wrong order for certain operations.
For example, for threads, if the thread is implicitly started, it must start \emph{after} all constructors, because the thread relies on a completely initialized object, but the inherited constructor runs \emph{before} the derived.
}
\end{cfa}
which also requires an explicit declaration that must be last to ensure correct initialization order.
However, there is nothing preventing wrong placement or multiple declarations of @baseCoroutine@.

For coroutines, as for threads, many implementations are based on function pointers or function objects~\cite{Butenhof97, C++14, MS:VisualC++, BoostCoroutines15}.
For example, Boost implements coroutines in terms of four functor object-types:
\begin{cfa}
symmetric_coroutine<>::yield_type
\end{cfa}

Figure~\ref{f:CoroutineMemoryLayout} shows different memory-layout options for a coroutine (where a task is similar).
The coroutine handle is the @coroutine@ instance containing programmer-specified global/communication variables.
The coroutine descriptor contains all implicit declarations needed by the runtime, \eg @suspend@/@resume@, and can be part of the coroutine handle or separate.\footnote{
We are examining variable-sized structures (VLS), where fields can be variable-sized structures or arrays.
Once allocated, a VLS is fixed sized.}
The coroutine stack can appear in a number of locations and forms, fixed or variable sized.
Hence, the coroutine's stack could be a VLS on the allocating stack, provided the allocating stack is large enough.
For stack allocation, allocation/deallocation only requires a few arithmetic operations to compute the size and adjust the stack pointer, modulo any constructor costs.
For heap allocation, allocation/deallocation is an expensive heap operation (where the heap can be a shared resource), modulo any constructor costs.
For heap-allocated stacks, it is also possible to use a split (segmented) stack calling-convention, available with gcc and clang, so the stack is variable sized.
Currently, \CFA supports stack/heap-allocated descriptors but only heap-allocated stacks;
split-stack allocation is under development.
In \CFA debug-mode, a fixed-sized stack is terminated with a write-only page, which catches most stack overflows.

\begin{figure}
\centering
\input{corlayout.pstex_t}
\caption{Coroutine memory layout}
\label{f:CoroutineMemoryLayout}
\end{figure}

Similarly, the canonical threading paradigm is often based on function pointers, \eg @pthreads@~\cite{Butenhof97}, \Csharp~\cite{Csharp}, Go~\cite{Go}, and Scala~\cite{Scala}.
However, the generic thread-handle (identifier) is limited (few operations), unless it is wrapped in a custom type, as in the pthreads approach:
\begin{cfa}
void mycor( coroutine_t cid, void * arg ) {
}
\end{cfa}
Since the custom type is simple to write in \CFA and solves several issues, adding support for function/lambda-based coroutines adds very little.
Note, the type @coroutine_t@ must be an abstract handle to the coroutine, because the coroutine descriptor and its stack are non-copyable.
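The debug-mode guard page mentioned above can be sketched with POSIX primitives: the page at the low end of a fixed-size stack is access-protected, so a stack overflow faults immediately rather than silently corrupting adjacent memory.
This is a sketch under assumed names (@STACK_SIZE@, a 4~KB page), not the \CFA runtime code:
\begin{cfa}
#include <sys/mman.h>
enum { PAGE = 4096 };
char * base = mmap( NULL, STACK_SIZE + PAGE, PROT_READ | PROT_WRITE,
					MAP_PRIVATE | MAP_ANON, -1, 0 );
mprotect( base, PAGE, PROT_NONE );		$\C{// guard page at stack end (stack grows down)}$
// hand [base + PAGE, base + PAGE + STACK_SIZE) to the coroutine as its stack
\end{cfa}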