# Changeset d7a02ae

Ignore:
Timestamp:
Jun 13, 2019, 8:27:28 AM
Branches:
arm-eh, enum, forall-pointer-decay, jacob/cs343-translation, jenkins-sandbox, master, new-ast, new-ast-unique-expr, pthread-emulation, qualifiedEnum
Children:
d60780c
Parents:
6625727
Message:

first complete draft of new concurrency paper

Location:
doc/papers/concurrency
Files:
5 edited
3 moved

## doc/papers/concurrency/Paper.tex

 r6625727 _Thread_local, throw, throwResume, timeout, trait, try, ttype, typeof, __typeof, __typeof__, virtual, __volatile, __volatile__, waitfor, when, with, zero_t},
	moredirectives={defined,include_next},
	% replace/adjust listing characters that look bad in sanserif
	literate={-}{\makebox[1ex][c]{\raisebox{0.4ex}{\rule{0.8ex}{0.1ex}}}}1 {^}{\raisebox{0.6ex}{$\scriptstyle\land\,$}}1
		{~}{\raisebox{0.3ex}{$\scriptstyle\sim\,$}}1 % {}{\ttfamily\upshape\hspace*{-0.1ex}}1
		{<}{\textrm{\textless}}1 {>}{\textrm{\textgreater}}1
		{<-}{$\leftarrow$}2 {=>}{$\Rightarrow$}2
		{->}{\makebox[1ex][c]{\raisebox{0.5ex}{\rule{0.8ex}{0.075ex}}}\kern-0.2ex{\textrm{\textgreater}}}2,
}
aboveskip=4pt, % spacing above/below code block
belowskip=3pt,
moredelim=**[is][\color{red}]{}{},
}% lstset
}
% Go programming language: https://github.com/julienc91/listings-golang/blob/master/listings-golang.sty
\lstdefinelanguage{Golang}{
	morekeywords=[1]{package,import,func,type,struct,return,defer,panic,recover,select,var,const,iota,},
	morekeywords=[2]{string,uint,uint8,uint16,uint32,uint64,int,int8,int16,int32,int64,bool,float32,float64,complex64,complex128,byte,rune,uintptr,error,interface},
	morekeywords=[3]{map,slice,make,new,nil,len,cap,copy,close,true,false,delete,append,real,imag,complex,chan,},
	morekeywords=[4]{for,break,continue,range,goto,switch,case,fallthrough,if,else,default,},
	morekeywords=[5]{Println,Printf,Error,},
	sensitive=true,
	morecomment=[l]{//}, morecomment=[s]{/*}{*/},
	morestring=[b]', morestring=[b]", morestring=[s]{}{},
	% replace/adjust listing characters that look bad in sanserif
	literate={-}{\makebox[1ex][c]{\raisebox{0.4ex}{\rule{0.8ex}{0.1ex}}}}1 {^}{\raisebox{0.6ex}{$\scriptstyle\land\,$}}1
		{~}{\raisebox{0.3ex}{$\scriptstyle\sim\,$}}1 % {}{\ttfamily\upshape\hspace*{-0.1ex}}1
		{<}{\textrm{\textless}}1 {>}{\textrm{\textgreater}}1
		{<-}{\makebox[2ex][c]{\textrm{\textless}\raisebox{0.5ex}{\rule{0.8ex}{0.075ex}}}}2,
}
\lstnewenvironment{cfa}[1][] {\lstset{#1}} {}
\lstnewenvironment{Go}[1][] {\lstset{language=Golang,moredelim=**[is][\protect\color{red}]{}{},#1}\lstset{#1}} {}
\lstnewenvironment{python}[1][]

These features are created from scratch as ISO C has only low-level and/or unimplemented concurrency, so C programmers continue to rely on library features like C pthreads. \CFA introduces modern language-level control-flow mechanisms, like generators, coroutines, user-level threading, and monitors for mutual exclusion and synchronization.
% Library extensions for executors, futures, and actors are built on these basic mechanisms.
The runtime provides significant programmer simplification and safety by eliminating spurious wakeup and optional monitor barging. The runtime also ensures multiple monitors can be safely acquired \emph{simultaneously} (deadlock free), and this feature is fully integrated with all monitor synchronization mechanisms. However, functions \emph{cannot} be nested in structures, so there is no lexical binding between a structure and the set of functions (member/method) implemented by an implicit \lstinline@this@ (receiver) parameter.}, backwards-compatible extension of the C programming language.
In many ways, \CFA is to C as Scala~\cite{Scala} is to Java, providing a \emph{research vehicle} for new typing and control-flow capabilities on top of a highly popular programming language, allowing immediate dissemination. Within the \CFA framework, new control-flow features are created from scratch. ISO \Celeven defines only a subset of the \CFA extensions, where the overlapping feature is concurrency~\cite[\S~7.26]{C11}.

Kernel threading was chosen, largely because of its simplicity and fit with the simpler operating systems and hardware architectures at the time, which gave it a performance advantage~\cite{Drepper03}. Libraries like pthreads were developed for C, and the Solaris operating-system switched from user (JDK 1.1~\cite{JDK1.1}) to kernel threads. As a result, languages like Java, Scala, Objective-C~\cite{obj-c-book}, \CCeleven~\cite{C11}, and C\#~\cite{Csharp} adopt the 1:1 kernel-threading model, with a variety of presentation mechanisms. From 2000 onwards, languages like Go~\cite{Go}, Erlang~\cite{Erlang}, Haskell~\cite{Haskell}, D~\cite{D}, and \uC~\cite{uC++,uC++book} have championed the M:N user-threading model, and many user-threading libraries have appeared~\cite{Qthreads,MPC,BoostThreads}, including putting green threads back into Java~\cite{Quasar}. The main argument for user-level threading is that user threads are lighter weight than kernel threads (locking and context switching do not cross the kernel boundary), so there is less restriction on programming styles that encourage large numbers of threads performing medium-sized work-units to facilitate load balancing by the runtime~\cite{Verch12}.

\item providing statically type-safe interfaces that integrate with the \CFA polymorphic type-system and other language features.
% \item
% library extensions for executors, futures, and actors built on the basic mechanisms.
\item
a runtime system with no spurious wakeup.
\item
a dynamic partitioning mechanism to segregate the execution environment for specialized requirements.
% \item
% a non-blocking I/O library
\item
experimental results showing comparable performance of the new features with similar mechanisms in other programming languages.

Therefore, selecting between stackless and stackful semantics is a tradeoff between programming requirements and performance, where stackless is faster and stackful is more general. Note, creation cost is amortized across usage, so activation cost is usually the dominant factor.

\begin{figure}
\hspace{3pt}
\subfloat[C generator implementation]{\label{f:CFibonacciSim}\usebox\myboxC}
\caption{Fibonacci (output) asymmetric generator}
\label{f:FibonacciAsymmetricGenerator}

\subfloat[C generator simulation]{\label{f:CFormatSim}\usebox\myboxB}
\hspace{3pt}
\caption{Formatter (input) asymmetric generator}
\label{f:FormatterAsymmetricGenerator}
\end{figure}

For generators, coroutines, and threads, many designs are based on function objects or pointers~\cite{Butenhof97, C++14, MS:VisualC++, BoostCoroutines15}. For example, Python presents generators as a function object:
\begin{python}
def Gen():
	... yield val ...
gen = Gen()
for i in range( 10 ):
	print( next( gen ) )
\end{python}
Boost presents coroutines in terms of four functor object-types:
\begin{cfa}
asymmetric_coroutine<>::pull_type
asymmetric_coroutine<>::push_type
symmetric_coroutine<>::call_type
symmetric_coroutine<>::yield_type
\end{cfa}
and many languages present threading using function pointers, @pthreads@~\cite{Butenhof97}, \Csharp~\cite{Csharp}, Go~\cite{Go}, and Scala~\cite{Scala}, \eg pthreads:
\begin{cfa}
void * rtn( void * arg ) { ... }
int i = 3, rc;
pthread_t t; $\C{// thread id}$
rc = pthread_create( &t, NULL, rtn, (void *)i ); $\C{// create and initialize task, type-unsafe input parameter}$
\end{cfa}
\CFA's preferred presentation model for generators/coroutines/threads is a hybrid of objects and functions, with an object-oriented flavour. Essentially, the generator/coroutine/thread function is semantically coupled with a generator/coroutine/thread custom type. The custom type solves several issues, while accessing the underlying mechanisms used by the custom types is still allowed.

\subsection{Generator}

Stackless generators have the potential to be very small and fast, \ie as small and fast as function call/return for both creation and execution. The \CFA goal is to achieve this performance target, possibly at the cost of some semantic complexity. A series of different kinds of generators and their implementation demonstrate how this goal is accomplished.
Figure~\ref{f:FibonacciAsymmetricGenerator} shows an unbounded asymmetric generator for an infinite sequence of Fibonacci numbers written in C and \CFA, with a simple C implementation for the \CFA version. Figure~\ref{f:CFibonacci} shows the C approach of manually creating the closure in structure @Fib@, and multiple instances of this closure provide multiple Fibonacci generators. The C version only has the middle execution state because the top execution state becomes declaration initialization. Figure~\ref{f:CFAFibonacciGen} shows the \CFA approach, which also has a manual closure, but replaces the structure with a custom \CFA @generator@ type. This generator type is then connected to a function that \emph{must be named \lstinline|main|},\footnote{ The name \lstinline|main| has special meaning in C, specifically the function where a program starts execution. Hence, overloading this name for other starting points (generator/coroutine/thread) is a logical extension.} which takes as its only parameter a reference to the generator type, called a \emph{generator main}. The generator main contains @suspend@ statements that suspend execution without ending the generator versus @return@. For the Fibonacci generator-main,\footnote{ For contrast, Figure~\ref{f:PythonFibonacci} shows the equivalent Python Fibonacci generator, which does not use a generator type, and hence only has a single interface, but an implicit closure.}
Having to manually create the generator closure by moving local-state variables into the generator type is an additional programmer burden and possible point of error. (This restriction is removed by the coroutine, see Section~\ref{s:Coroutine}.) This requirement follows from the generality of variable-size local-state, \eg local state with a variable-length array requires dynamic allocation because the array size is unknown at compile time. However, dynamic allocation significantly increases the cost of generator creation/destruction and is a show-stopper for embedded real-time programming. But more importantly, the size of the generator type is tied to the local state in the generator main, which precludes separate compilation of the generator main, \ie a generator must be inlined or local state must be dynamically allocated. With respect to safety, we believe static analysis can discriminate local state from temporary variables in a generator, \ie variable usage spanning @suspend@, and generate a compile-time error. Finally, our current experience is that most generator problems have simple data state, including local state, but complex execution state, so the burden of creating the generator type is small. As well, C programmers are not afraid of this kind of semantic programming requirement, if it results in very small, fast generators.
\begin{python}[aboveskip=0pt,belowskip=0pt]
def Fib():
	fn1, fn = 0, 1
	while True:
		yield fn1
		fn1, fn = fn, fn1 + fn
f1 = Fib()
f2 = Fib()
\end{python}
\hspace{3pt}
\subfloat[Formatter]{\label{f:PythonFormatter}\usebox\myboxB}
\caption{Python generator}
\label{f:PythonGenerator}

generator PingPong {
	const char * name;
	PingPong & partner;	// rebindable reference
	int N, i;			// local state
};
void ?{}( PingPong & pp, char * nm, int N ) with(pp) { name = nm;  partner = 0p;  pp.N = N;  i = 0; }
void main( PingPong & pp ) with(pp) {
	for ( ; i < N; i += 1 ) {
int main() {
	enum { N = 5 };
	PingPong ping = {"ping", N}, pong = {"pong", N};
	&ping.partner = &pong;  &pong.partner = &ping;
	resume( ping );

typedef struct PingPong {
	const char * name;
	int N, i;
	struct PingPong * partner;
	void * next;
} PingPong;
#define PPCtor(name, N) { name, N, 0, NULL, NULL }
void comain( PingPong * pp ) {
	if ( pp->next ) goto *pp->next;
\subfloat[C generator simulation]{\label{f:CPingPongSim}\usebox\myboxB}
\hspace{3pt}
\caption{Ping-Pong symmetric generator}
\label{f:PingPongSymmetricGenerator}
\end{figure}

A coroutine is specified by replacing @generator@ with @coroutine@ for the type. This generality results in higher cost for creation, due to dynamic stack allocation, execution, due to context switching among stacks, and terminating, due to possible stack unwinding and dynamic stack deallocation. A series of different kinds of coroutines and their implementation demonstrate how coroutines extend generators.
First, the previous generator examples are converted to their coroutine counterparts, allowing local-state variables to be moved from the generator type into the coroutine main.
\end{cfa}
\end{description}
It is also possible to refactor code containing local-state and @suspend@ statements into a helper function, like the computation of the CRC for the device driver.
\begin{cfa}
unsigned int Crc() {
\begin{comment}
Figure~\ref{f:Coroutine3States} creates a @coroutine@ type, @coroutine Fib { int fn; }@, which provides communication, @fn@, for the \newterm{coroutine main}, @main@, which runs on the coroutine stack, and possibly multiple interface functions, \eg @next@. Like the structure in Figure~\ref{f:ExternalState}, the coroutine type allows multiple instances, where instances of this type are passed to the (overloaded) coroutine main. The coroutine main's stack holds the state for the next generation, @f1@ and @f2@, and the code represents the three states in the Fibonacci formula via the three suspend points, to context switch back to the caller's @resume@. The interface function @next@ takes a Fibonacci instance and context switches to it using @resume@; on restart, the Fibonacci field, @fn@, contains the next value in the sequence, which is returned.
The first @resume@ is special because it allocates the coroutine stack and cocalls its coroutine main on that stack;
def Fib():
	fn1, fn = 0, 1
	while True:
		yield fn1
		fn1, fn = fn, fn1 + fn
\hspace{3pt}
\subfloat[Python]{\label{f:ExternalState}\usebox\myboxC}
\caption{Fibonacci generator}
\label{f:C-fibonacci}
\end{figure}

\caption{Producer / consumer: resume-resume cycle, bi-directional communication}
\label{f:ProdCons}
\end{figure}

The loop then repeats calling @payment@, where each call resumes the producer coroutine. Figure~\ref{f:ProdConsRuntimeStacks} shows the runtime stacks of the program main, and the coroutine mains for @prod@ and @cons@ during the cycling.
\begin{figure}
\begin{center}
\input{FullProdConsStack.pstex_t}
\end{center}
\vspace*{-10pt}
\caption{Producer / consumer runtime stacks}
\label{f:ProdConsRuntimeStacks}
\medskip
\begin{center}
\input{FullCoroutinePhases.pstex_t}
\end{center}
\vspace*{-10pt}
\caption{Ping / Pong coroutine steps}
\label{f:PingPongFullCoroutineSteps}
\end{figure}
Terminating a coroutine cycle is more complex than a generator cycle, because it requires context switching to the program main's \emph{stack} to shutdown the program, whereas generators started by the program main run on its stack.
\subsection{(Generator) Coroutine Implementation}

A significant implementation challenge for generators/coroutines (and threads, see Section~\ref{s:threads}) is adding extra fields to the custom types and related functions, \eg inserting code after/before the coroutine constructor/destructor and @main@ to create/initialize/de-initialize/destroy any extra fields, \eg stack. There are several solutions to this problem, which follow from the object-oriented-flavoured choice to adopt custom types.

For object-oriented languages, inheritance is used to provide extra fields and code via explicit inheritance:
\begin{cfa}[morekeywords={class,inherits}]
class myCoroutine inherits baseCoroutine { ... }
\end{cfa}
The problem is that the programming language and its tool chain, \eg debugger, @valgrind@, need to understand @baseCoroutine@ because it confers special properties, so type @baseCoroutine@ becomes a de facto keyword and all types inheriting from it are implicitly custom types.
As well, some special properties are not handled by existing language semantics, \eg the execution of constructors/destructors is in the wrong order to implicitly start threads, because the thread must start \emph{after} all constructors as it relies on a completely initialized object, but the inherited constructor runs \emph{before} the derived. Alternatives, such as explicitly starting threads as in Java, are repetitive and forgetting to call start is a common source of errors.

An alternative is composition:
\begin{cfa}
struct myCoroutine {
	... // declaration/communication variables
	baseCoroutine dummy; // composition, last declaration
}
\end{cfa}
which requires an explicit declaration that must be last to ensure correct initialization order. However, there is nothing preventing wrong placement or multiple declarations.

\CFA custom types make any special properties explicit to the language and its tool chain, \eg the language code-generator knows where to inject code and when it is unsafe to perform certain optimizations, and IDEs using simple parsing can find and manipulate types with special properties. The downside of this approach is that it makes custom types a special case in the language. Users wanting to extend custom types or build their own can only do so in ways offered by the language. Furthermore, implementing custom types without language support also displays the power of a programming language.
\CFA blends the two approaches, providing custom types for idiomatic \CFA code, while extending and building new custom types is still possible, similar to the Java approach with builtin and library concurrency. Part of the mechanism to generalize custom types is the \CFA trait, \eg the definition for custom-type @coroutine@ is anything satisfying the trait @is_coroutine@, and this trait both enforces and restricts the coroutine-interface functions.
\begin{cfa}
trait is_coroutine( dtype T ) {
	void main( T & );
	coroutine_desc * get_coroutine( T & );
};
forall( dtype T | is_coroutine(T) ) void $suspend$( T & );
forall( dtype T | is_coroutine(T) ) void resume( T & );
\end{cfa}
Note, copying generators/coroutines/threads is not meaningful. For example, a coroutine retains its last resumer and suspends back to it; having a copy also suspend back to the same resumer is undefined semantics. Furthermore, two coroutines cannot logically execute on the same stack. A deep coroutine copy, which copies the stack, is also meaningless in an unmanaged language (no garbage collection), like C, because the stack may contain pointers to objects within it that require updating for the copy. Now, the @dtype@ property provides no \emph{implicit} copying operations and the @is_coroutine@ trait provides no \emph{explicit} copying operations, so all coroutines must be passed by reference (pointer). The function definitions ensure there is a statically-typed @main@ function that is the starting point (first stack frame) of a coroutine, and a mechanism to get (read) the currently executing coroutine handle. The @main@ function has no return value or additional parameters because the coroutine type allows an arbitrary number of interface functions with corresponding arbitrarily typed input/output values versus fixed ones.
The advantage of this approach is that users can easily create different types of coroutines, \eg changing the memory layout of a coroutine is trivial when implementing the @get_coroutine@ function, and possibly redefining @suspend@ and @resume@. The \CFA custom-type @coroutine@ implicitly implements the getter and forward declarations required for implementing the coroutine main:
\begin{cquote}
\begin{tabular}{@{}ccc@{}}
\begin{cfa}
coroutine MyCor {
	int value;
};
\end{cfa}
&
{\Large $\Rightarrow$}
&
\begin{tabular}{@{}ccc@{}}
\begin{cfa}
struct MyCor {
	int value;
	coroutine_desc cor;
};
\end{cfa}
&
\begin{cfa}
static inline coroutine_desc * get_coroutine( MyCor & this ) { return &this.cor; }
\end{cfa}
&
\begin{cfa}
void main( MyCor * this );
\end{cfa}
\end{tabular}
\end{tabular}
\end{cquote}
The combination of custom types and the fundamental @trait@ description of these types allows a concise specification for programmers and tools, while more advanced programmers can have tighter control over memory layout and initialization.

Figure~\ref{f:CoroutineMemoryLayout} shows different memory-layout options for a coroutine (where a task is similar). The coroutine handle is the @coroutine@ instance containing programmer-specified global/communication variables across interface functions. The coroutine descriptor contains all implicit declarations needed by the runtime, \eg @suspend@/@resume@, and can be part of the coroutine handle or separate. The coroutine stack can appear in a number of locations and forms, fixed or variable sized. Hence, the coroutine's stack could be a VLS\footnote{ We are examining variable-sized structures (VLS), where fields can be variable-sized structures or arrays. Once allocated, a VLS is fixed sized.} on the allocating stack, provided the allocating stack is large enough. For a VLS stack allocation, allocation/deallocation is an inexpensive adjustment of the stack pointer, modulo any stack constructor costs (\eg initial frame setup). For heap stack allocation, allocation/deallocation is an expensive heap operation (where the heap can be a shared resource), modulo any stack constructor costs. With heap stack allocation, it is also possible to use a split (segmented) stack calling-convention, available with gcc and clang, so the stack is variable sized. Currently, \CFA supports stack/heap allocated descriptors but only fixed-sized heap allocated stacks. In \CFA debug-mode, the fixed-sized stack is terminated with a write-only page, which catches most stack overflows. Experience teaching concurrency with \uC~\cite{CS343} shows fixed-sized stacks are rarely an issue for students. Split-stack allocation is under development but requires recompilation of existing code, which may be impossible.
\begin{figure}
\end{figure}

Similarly, the canonical threading paradigm is often based on function pointers, \eg @pthreads@~\cite{Butenhof97}, \Csharp~\cite{Csharp}, Go~\cite{Go}, and Scala~\cite{Scala}. However, the generic thread-handle (identifier) is limited (few operations), unless it is wrapped in a custom type, as in the pthreads approach.
\begin{cfa}
void mycor( coroutine_t cid, void * arg ) {
	int * value = (int *)arg; $\C{// type unsafe, pointer-size only}$
	// coroutine body
}
int main() {
	int input = 0, output;
	coroutine_t cid = coroutine_create( &mycor, (void *)&input ); $\C{// type unsafe, pointer-size only}$
	coroutine_resume( cid, (void *)input, (void **)&output ); $\C{// type unsafe, pointer-size only}$
}
\end{cfa}
Since the custom type is simple to write in \CFA and solves several issues, added support for function/lambda-based coroutines adds very little. Note, the type @coroutine_t@ must be an abstract handle to the coroutine, because the coroutine descriptor and its stack are non-copyable. Copying the coroutine descriptor results in copies being out of date with the current state of the stack. Correspondingly, copying the stack results in copies being out of date with the coroutine descriptor, and pointers in the stack being out of date to data on the stack. (There is no mechanism in C to find all stack-specific pointers and update them as part of a copy.) While custom types are ultimately the option used for idiomatic \CFA code, coroutines and threads can still be constructed without language support; the reserved keyword simply eases use for the common case.

\section{Concurrency}
\label{s:Concurrency}

A \newterm{stackless} coroutine executes on the caller's stack~\cite{Python}, but this approach is restrictive, \eg preventing modularization and supporting only iterator/generator-style programming; a \newterm{stackful} coroutine executes on its own stack, allowing full generality.
Therefore, a minimal concurrency system is possible using coroutines (see Section \ref{coroutine}) in conjunction with a scheduler to decide where to context switch next.
Concurrency is nondeterministic scheduling of independent sequential execution-paths (threads), where each thread has its own stack.
A single thread with multiple call stacks, \newterm{coroutining}~\cite{Conway63,Marlin80}, does \emph{not} imply concurrency~\cite[\S~2]{Buhr05a}.
In coroutining, coroutines self-schedule the thread across stacks so execution is deterministic.
(It is \emph{impossible} to generate a concurrency error when coroutining.)
However, coroutines are a stepping stone towards concurrency.

The transition to concurrency, even for a single thread with multiple stacks, occurs when coroutines context switch to a \newterm{scheduling coroutine}, introducing non-determinism from the coroutine perspective~\cite[\S~3]{Buhr05a}.
Therefore, a minimal concurrency system requires coroutines \emph{in conjunction with a nondeterministic scheduler}.
The resulting execution system now follows a cooperative threading-model, called \newterm{non-preemptive scheduling}.
Because the scheduler is special, it can be either a stackless or stackful coroutine.
Adding \newterm{preemption} introduces non-cooperative scheduling, where context switching occurs randomly between any two instructions, often based on a timer interrupt, called \newterm{preemptive scheduling}.
While a scheduler introduces uncertain execution among explicit context switches, preemption introduces uncertainty by adding implicit context switches.
Uncertainty gives the illusion of parallelism on a single processor and provides a mechanism to access and increase performance on multiple processors.
The reason is that the scheduler/runtime has complete knowledge about resources and how best to utilize them.
However, the introduction of unrestricted nondeterminism results in the need for \newterm{mutual exclusion} and \newterm{synchronization}, which restrict nondeterminism for correctness;
otherwise, it is impossible to write meaningful concurrent programs.
Optimal concurrent performance is often obtained by having as much nondeterminism as mutual exclusion and synchronization correctness allow.

A scheduler can be either stackless or stackful.
For stackless, the scheduler performs scheduling on the stack of the current coroutine and switches directly to the next coroutine, so there is one context switch.
For stackful, the current coroutine switches to the scheduler, which performs scheduling, and it then switches to the next coroutine, so there are two context switches.
A stackful scheduler is often used for simplicity and security.

Regardless of the approach used, a subset of concurrency-related challenges start to appear.
For the complete set of concurrency challenges to occur, the missing feature is \newterm{preemption}, where context switching occurs randomly between any two instructions, often based on a timer interrupt, called \newterm{preemptive scheduling}.
While a scheduler introduces uncertainty in the order of execution, preemption introduces uncertainty about where context switches occur.
Interestingly, uncertainty is necessary for the runtime (operating) system to give the illusion of parallelism on a single processor and increase performance on multiple processors.
The reason is that only the runtime has complete knowledge about resources and how best to utilize them.
However, the introduction of unrestricted non-determinism results in the need for \newterm{mutual exclusion} and \newterm{synchronization} to restrict non-determinism for correctness;
otherwise, it is impossible to write meaningful programs.
Optimal performance in concurrent applications is often obtained by having as much non-determinism as correctness allows.
An important missing feature in C is threading\footnote{While the C11 standard defines a \protect\lstinline@threads.h@ header, it is minimal and defined as optional. As such, library support for threading is far from widespread. At the time of writing the paper, neither \protect\lstinline@gcc@ nor \protect\lstinline@clang@ support \protect\lstinline@threads.h@ in their standard libraries.}. In modern programming languages, a lack of threading is unacceptable~\cite{Sutter05, Sutter05b}, and therefore existing and new programming languages must have tools for writing efficient concurrent programs to take advantage of parallelism. As an extension of C, \CFA needs to express these concepts in a way that is as natural as possible to programmers familiar with imperative languages. Furthermore, because C is a system-level language, programmers expect to choose precisely which features they need and which cost they are willing to pay. Hence, concurrent programs should be written using high-level mechanisms, and only step down to lower-level mechanisms when performance bottlenecks are encountered. \subsection{Thread Interface} \label{threads} Both user and kernel threads are supported, where user threads provide concurrency and kernel threads provide parallelism. Like coroutines and for the same design reasons, the selected approach for user threads is to use language support by introducing a new kind of aggregate (structure) and a \CFA trait: \subsection{Thread} \label{s:threads} Threading needs the ability to start a thread and wait for its completion. A common API for this ability is @fork@ and @join@. 
\begin{cquote}
\begin{tabular}{@{}lll@{}}
\multicolumn{1}{c}{\textbf{Java}} & \multicolumn{1}{c}{\textbf{\Celeven}} & \multicolumn{1}{c}{\textbf{pthreads}} \\
\begin{cfa}[aboveskip=0pt,belowskip=0pt]
class MyTask extends Thread {...}
mytask t = new MyTask(...);
t.start(); // start
// concurrency
t.join(); // wait
\end{cfa}
&
\begin{cfa}[aboveskip=0pt,belowskip=0pt]
class MyTask { ... } // functor
MyTask mytask;
thread t( mytask, ... ); // start
// concurrency
t.join(); // wait
\end{cfa}
&
\begin{cfa}[aboveskip=0pt,belowskip=0pt]
void * rtn( void * arg ) {...}
pthread_t t;  int i = 3;
pthread_create( &t, NULL, rtn, (void *)i ); // start
// concurrency
pthread_join( t, NULL ); // wait
\end{cfa}
\end{tabular}
\end{cquote}
(The qualifier @mutex@ for the destructor parameter is discussed in Section~\ref{s:Monitor}.)
Like a coroutine, the statically-typed @main@ routine is the starting point (first stack frame) of a user thread.
The difference is that a coroutine borrows a thread from its caller, so the first thread resuming a coroutine creates an instance of @main@;
whereas, a user thread receives its own thread from the runtime system, which starts in @main@ at some point after the thread constructor is run.\footnote{
The \lstinline@main@ routine is already a special routine in C, \ie where the program's initial thread begins, so it is a natural extension of this semantics to use overloading to declare \lstinline@main@s for user coroutines and threads.}
No return value or additional parameters are necessary for this routine because the task type allows an arbitrary number of interface routines with corresponding arbitrary typed input/output values.

\begin{comment} % put in appendix with coroutine version ???
As such the @main@ routine of a thread can be defined as
\begin{cfa}
thread foo {};
void main(foo & this) {
	sout | "Hello World!";
}
\end{cfa}
In this example, threads of type @foo@ start execution in the @void main(foo &)@ routine, which prints @"Hello World!".@ While this paper encourages this approach to enforce strongly typed programming, users may prefer to use the routine-based thread semantics for the sake of simplicity.
With the static semantics it is trivial to write a thread type that takes a routine pointer as a parameter and executes it on its stack asynchronously.
\begin{cfa}
typedef void (*voidRtn)(int);
thread RtnRunner {
	voidRtn func;
	int arg;
};
void ?{}(RtnRunner & this, voidRtn inRtn, int arg) {
	this.func = inRtn;
	this.arg  = arg;
}
void main(RtnRunner & this) {
	// thread starts here and runs the routine
	this.func( this.arg );
}
void hello(/*unused*/ int) {
	sout | "Hello World!";
}
\end{cfa}
\CFA has a simpler approach using a custom @thread@ type and leveraging declaration semantics (allocation/deallocation), where threads implicitly @fork@ after construction and @join@ before destruction.
\begin{cfa}
thread MyTask {};
void main( MyTask & this ) { ... }
int main() {
	RtnRunner f = {hello, 42};
	return 0;
}
\end{cfa}
A consequence of the strongly typed approach to main is that memory layout of parameters and return values to/from a thread are now explicitly specified in the \textbf{API}.
\end{comment}

For user threads to be useful, it must be possible to start and stop the underlying thread, and wait for it to complete execution.
While using an API such as @fork@ and @join@ is relatively common, such an interface is awkward and unnecessary.
A simple approach is to use allocation/deallocation principles, and have threads implicitly @fork@ after construction and @join@ before destruction.
\begin{cfa}
thread World {};
void main( World & this ) {
	sout | "World!";
}
MyTask team[10]; $\C[2.5in]{// allocate stack-based threads, implicit start after construction}$
// concurrency
} $\C{// deallocate stack-based threads, implicit joins before destruction}$
\end{cfa}
This semantic ensures a thread is started and stopped exactly once, eliminating some programming errors, and scales to multiple threads for basic (termination) synchronization.
For block allocation to arbitrary depth, including recursion, threads are created/destroyed in a lattice structure (tree with top and bottom).
Arbitrary topologies are possible using dynamic allocation, allowing threads to outlive their declaration scope, identical to normal dynamic allocation.
\begin{cfa}
MyTask * factory( int N ) { ... return anew( N ); } $\C{// allocate heap-based threads, implicit start after construction}$
int main() {
	World w[10];                                  $\C{// implicit forks after creation}$
	sout | "Hello ";                              $\C{// "Hello " and 10 "World!" printed concurrently}$
}                                                 $\C{// implicit joins before destruction}$
\end{cfa}
This semantic ensures a thread is started and stopped exactly once, eliminating some programming errors, and scales to multiple threads for basic (termination) synchronization.
This tree-structured (lattice) create/delete from C block-structure is generalized by using dynamic allocation, so threads can outlive the scope in which they are created, much like dynamically allocated memory lets objects outlive the scope in which they are created.
\begin{cfa}
int main() {
	MyThread * heapLive;
	{
		MyThread blockLive;                       $\C{// fork block-based thread}$
		heapLive = new( MyThread );               $\C{// fork heap-based thread}$
		...
}                                                                               $\C{// join block-based thread}$ ... delete( heapLive );                                   $\C{// join heap-based thread}$ } \end{cfa} The heap-based approach allows arbitrary thread-creation topologies, with respect to fork/join-style concurrency. MyTask * team = factory( 10 ); // concurrency delete( team ); $\C{// deallocate heap-based threads, implicit joins before destruction}\CRT$ } \end{cfa} Figure~\ref{s:ConcurrentMatrixSummation} shows concurrently adding the rows of a matrix and then totalling the subtotals sequentially, after all the row threads have terminated. The allocation/deallocation pattern appears unusual because allocated objects are immediately deallocated without any intervening code. However, for threads, the deletion provides implicit synchronization, which is the intervening code. While the subtotals are added in linear order rather than completion order, which slightly inhibits concurrency, the computation is restricted by the critical-path thread (\ie the thread that takes the longest), and so any inhibited concurrency is very small as totalling the subtotals is trivial. % While the subtotals are added in linear order rather than completion order, which slightly inhibits concurrency, the computation is restricted by the critical-path thread (\ie the thread that takes the longest), and so any inhibited concurrency is very small as totalling the subtotals is trivial. 
\begin{figure} thread Adder { int * row, cols, & subtotal; } $\C{// communication variables}$ void ?{}( Adder & adder, int row[], int cols, int & subtotal ) { adder.[ row, cols, &subtotal ] = [ row, cols, &subtotal ]; adder.[ row, cols, &subtotal ] = [ row, cols, &subtotal ]; } void main( Adder & adder ) with( adder ) { subtotal = 0; for ( int c = 0; c < cols; c += 1 ) { subtotal += row[c]; } subtotal = 0; for ( c; cols ) { subtotal += row[c]; } } int main() { const int rows = 10, cols = 1000; int matrix[rows][cols], subtotals[rows], total = 0; // read matrix Adder * adders[rows]; for ( int r = 0; r < rows; r += 1 ) {       $\C{// start threads to sum rows}$ const int rows = 10, cols = 1000; int matrix[rows][cols], subtotals[rows], total = 0; // read matrix Adder * adders[rows]; for ( r; rows; ) { $\C{// start threads to sum rows}$ adders[r] = new( matrix[r], cols, &subtotals[r] ); } for ( int r = 0; r < rows; r += 1 ) {       $\C{// wait for threads to finish}$ delete( adders[r] );                          $\C{// termination join}$ total += subtotals[r];                          $\C{// total subtotal}$ } sout | total; } \end{cfa} \caption{Concurrent Matrix Summation} } for ( r; rows ) { $\C{// wait for threads to finish}$ delete( adders[r] ); $\C{// termination join}$ total += subtotals[r]; $\C{// total subtotal}$ } sout | total; } \end{cfa} \caption{Concurrent matrix summation} \label{s:ConcurrentMatrixSummation} \end{figure} \subsection{Thread Implementation} Threads in \CFA are user level run by runtime kernel threads (see Section~\ref{s:Parallelism}), where user threads provide concurrency and kernel threads provide parallelism. Like coroutines, and for the same design reasons, \CFA provides a custom @thread@ type and a @trait@ to enforce and restrict the task-interface functions. \begin{cquote} \begin{tabular}{@{}c@{\hspace{3\parindentlnth}}c@{}} \begin{cfa} thread myThread { ... 
	// declaration/communication variables
};
\end{cfa}
&
\begin{cfa}
trait is_thread( dtype T ) {
	void main( T & );
	thread_desc * get_thread( T & );
	void ^?{}( T & mutex );
};
\end{cfa}
\end{tabular}
\end{cquote}
Like coroutines, the @dtype@ property prevents \emph{implicit} copy operations and the @is_thread@ trait provides no \emph{explicit} copy operations, so threads must be passed by reference (pointer).
Similarly, the function definitions ensure there is a statically-typed @main@ function that is the thread starting point (first stack frame), a mechanism to get (read) the currently executing thread handle, and a special destructor to prevent deallocation while the thread is executing.
(The qualifier @mutex@ for the destructor parameter is discussed in Section~\ref{s:Monitor}.)
The difference between a coroutine and a thread is that a coroutine borrows a thread from its caller, so the first thread resuming a coroutine creates the coroutine's stack and starts running the coroutine main on that stack; whereas, a thread is scheduled for execution in @main@ immediately after its constructor is run.
No return value or additional parameters are necessary for this function because the @thread@ type allows an arbitrary number of interface functions with corresponding arbitrary typed input/output values.

\begin{comment} % put in appendix with coroutine version ???
As such the @main@ function of a thread can be defined as
\begin{cfa}
thread foo {};
void main(foo & this) {
	sout | "Hello World!";
}
\end{cfa}
In this example, threads of type @foo@ start execution in the @void main(foo &)@ function, which prints @"Hello World!".@ While this paper encourages this approach to enforce strongly typed programming, users may prefer to use the function-based thread semantics for the sake of simplicity.
With the static semantics it is trivial to write a thread type that takes a function pointer as a parameter and executes it on its stack asynchronously.
\begin{cfa}
typedef void (*voidRtn)(int);
thread RtnRunner {
	voidRtn func;
	int arg;
};
void ?{}(RtnRunner & this, voidRtn inRtn, int arg) {
	this.func = inRtn;
	this.arg  = arg;
}
void main(RtnRunner & this) {
	// thread starts here and runs the function
	this.func( this.arg );
}
void hello(/*unused*/ int) {
	sout | "Hello World!";
}
int main() {
	RtnRunner f = {hello, 42};
	return 0;
}
\end{cfa}
A consequence of the strongly typed approach to main is that memory layout of parameters and return values to/from a thread are now explicitly specified in the \textbf{API}.
\end{comment}

\section{Mutual Exclusion / Synchronization}

Uncontrolled non-deterministic execution is meaningless.
To reestablish meaningful execution requires mechanisms to reintroduce determinism, \ie restrict non-determinism, called mutual exclusion and synchronization, where mutual exclusion is an access-control mechanism on data shared by threads, and synchronization is a timing relationship among threads~\cite[\S~4]{Buhr05a}.
Since many deterministic challenges appear with the use of mutable shared state, some languages/libraries disallow it, \eg Erlang~\cite{Erlang}, Haskell~\cite{Haskell}, Akka~\cite{Akka} (Scala).
In these paradigms, interaction among concurrent objects is performed by stateless message-passing~\cite{Thoth,Harmony,V-Kernel} or other paradigms closely related to networking concepts, \eg channels~\cite{CSP,Go}.
However, in call/return-based languages, these approaches force a clear distinction, \ie introduce a new programming paradigm between regular and concurrent computation, \eg routine call versus message passing.
Hence, a programmer must learn and manipulate two sets of design patterns.
Uncontrolled nondeterministic execution produces meaningless results.
To produce meaningful execution requires clawing back some determinism using mutual exclusion and synchronization, where mutual exclusion provides access-control for threads using shared data, and synchronization is a timing relationship among threads~\cite[\S~4]{Buhr05a}. Some concurrent systems eliminate mutable shared-state by switching to stateless communication like message passing~\cite{Thoth,Harmony,V-Kernel,MPI} (Erlang, MPI), channels~\cite{CSP} (CSP,Go), actors~\cite{Akka} (Akka, Scala), or functional techniques (Haskell). However, in call/return-based languages, these approaches force a clear distinction, \ie introduce a new programming paradigm between regular and concurrent computation, \eg function call versus message passing. Hence, a programmer must learn and manipulate two sets of design/programming patterns. While this distinction can be hidden away in library code, effective use of the library still has to take both paradigms into account. In contrast, approaches based on stateful models more closely resemble the standard call/return programming-model, resulting in a single programming paradigm. A group of instructions manipulating a specific instance of shared data that must be performed atomically is called an (individual) \newterm{critical-section}~\cite{Dijkstra65}. The generalization is called a \newterm{group critical-section}~\cite{Joung00}, where multiple tasks with the same session may use the resource simultaneously, but different sessions may not use the resource simultaneously. The readers/writer problem~\cite{Courtois71} is an instance of a group critical-section, where readers have the same session and all writers have a unique session. \newterm{Mutual exclusion} enforces that the correct kind and number of threads are using a critical section. The readers/writer problem~\cite{Courtois71} is an instance of a group critical-section, where readers share a session but writers have a unique session. 
\newterm{Mutual exclusion} enforces that the correct kind and number of threads use a critical section.
However, many solutions exist for mutual exclusion, which vary in terms of performance, flexibility and ease of use.
Synchronization enforces relative ordering of execution, and synchronization tools provide numerous mechanisms to establish these timing relationships.
Low-level synchronization primitives offer good performance and flexibility at the cost of ease of use;
higher-level mechanisms often simplify usage by adding better coupling between synchronization and data, \eg message passing, or offering a simpler solution to otherwise involved challenges, \eg barrier lock.
Often synchronization is used to order access to a critical section, \eg ensuring a reader thread is the next kind of thread to enter a critical section.
If a writer thread is scheduled for next access, but another reader thread acquires the critical section first, that reader has \newterm{barged}.
higher-level mechanisms often simplify usage by adding better coupling between synchronization and data, \eg receive-specific versus receive-any thread in message passing, or offering specialized solutions, \eg barrier lock.
Often synchronization is used to order access to a critical section, \eg ensuring a waiting writer thread enters the critical section before a calling reader thread.
If the calling reader is scheduled before the waiting writer, the reader has \newterm{barged}.
Barging can result in staleness/freshness problems, where a reader barges ahead of a writer and reads temporally stale data, or a writer barges ahead of another writer overwriting data with a fresh value, preventing the previous value from ever being read (lost computation).
Preventing or detecting barging is an involved challenge with low-level locks, which can be made much easier by higher-level constructs.
This challenge is often split into two different approaches: barging avoidance and barging prevention.
Algorithms that allow a barger, but divert it until later using current synchronization state (flags), are avoiding the barger;
algorithms that preclude a barger from entering during synchronization in the critical section prevent barging completely.
Techniques like baton-passing locks~\cite{Andrews89} between threads instead of unconditionally releasing locks are an example of barging prevention.
Preventing or detecting barging is an involved challenge with low-level locks, which is made easier through higher-level constructs.
This challenge is often split into two different approaches: barging avoidance and prevention.
Algorithms that unconditionally release a lock for competing threads to acquire use barging avoidance during synchronization to force a barging thread to wait;
algorithms that conditionally hold locks during synchronization, \eg baton-passing~\cite{Andrews89}, prevent barging completely.

\label{s:Monitor}

A \textbf{monitor} is a set of routines that ensure mutual exclusion when accessing shared state.
More precisely, a monitor is a programming technique that binds mutual exclusion to routine scope, as opposed to locks, where mutual-exclusion is defined by acquire/release calls, independent of lexical context (analogous to block and heap storage allocation).
The strong association with the call/return paradigm eases programmability, readability and maintainability, at a slight cost in flexibility and efficiency.

Note, like coroutines/threads, both locks and monitors require an abstract handle to reference them, because at their core, both mechanisms are manipulating non-copyable shared-state.
Copying a lock is insecure because it is possible to copy an open lock and then use the open copy when the original lock is closed to simultaneously access the shared data.
Copying a monitor is secure because both the lock and shared data are copies, but copying the shared data is meaningless because it no longer represents a unique entity.
As for coroutines/tasks, the @dtype@ property provides no implicit copying operations and the @is_monitor@ trait provides no explicit copying operations, so all locks/monitors must be passed by reference (pointer). A \textbf{monitor} is a set of functions that ensure mutual exclusion when accessing shared state. More precisely, a monitor is a programming technique that binds mutual exclusion to function scope, as opposed to locks, where mutual-exclusion is defined by acquire/release calls, independent of lexical context (analogous to block and heap storage allocation). The strong association with the call/return paradigm eases programming, comprehension, and maintenance, at a slight cost in flexibility and efficiency. \CFA uses a custom @monitor@ type and leverages declaration semantics (deallocation) to protect active or waiting threads in a monitor. The following is a \CFA monitor implementation of an atomic counter. \newpage \begin{cfa}[morekeywords=nomutex] monitor Aint { int cnt; }; $\C[4.25in]{// atomic integer counter}$ int ++?( Aint & mutex$$$_{opt}$$$ this ) with( this ) { return ++cnt; } $\C{// increment}$ int ?=?( Aint & mutex$$$_{opt}$$$ lhs, int rhs ) with( lhs ) { cnt = rhs; } $\C{// conversions}\CRT$ int ?=?( int & lhs, Aint & mutex$$$_{opt}$$$ rhs ) with( rhs ) { lhs = cnt; } \end{cfa} % The @Aint@ constructor, @?{}@, uses the \lstinline[morekeywords=nomutex]@nomutex@ qualifier indicating mutual exclusion is unnecessary during construction because an object is inaccessible (private) until after it is initialized. % (While a constructor may publish its address into a global variable, doing so generates a race-condition.) The prefix increment operation, @++?@, is normally @mutex@, indicating mutual exclusion is necessary during function execution, to protect the incrementing from race conditions, unless there is an atomic increment instruction for the implementation type. 
The assignment operators provide bi-directional conversion between an atomic and normal integer without accessing field @cnt@;
these operations only need @mutex@, if reading/writing the implementation type is not atomic.
The atomic counter is used without any explicit mutual-exclusion and provides thread-safe semantics, which is similar to the \CC template @std::atomic@.
\begin{cfa}
int i = 0, j = 0, k = 5;
Aint x = { 0 }, y = { 0 }, z = { 5 };           $\C{// no mutex required}$
++x; ++y; ++z;                                  $\C{// safe increment by multiple threads}$
x = 2; y = i; z = k;                            $\C{// conversions}$
i = x; j = y; k = z;
\end{cfa}

\CFA monitors have \newterm{multi-acquire} semantics so the thread in the monitor may acquire it multiple times without deadlock, allowing recursion and calling other interface functions.
\begin{cfa}
monitor M { ... } m;
void foo( M & mutex m ) { ... }                 $\C{// acquire mutual exclusion}$
void bar( M & mutex m ) {                       $\C{// acquire mutual exclusion}$
	... bar( m ); ... foo( m ); ...             $\C{// reacquire mutual exclusion}$
}
\end{cfa}
\CFA monitors also ensure the monitor lock is always released regardless of how an acquiring function ends (normal or exceptional), and returning a shared variable is safe via copying before the lock is released.
Similar safety is offered by explicit mechanisms like \CC RAII;
monitor implicit safety ensures no programmer error.
Furthermore, RAII mechanisms cannot handle complex synchronization within a monitor, where the monitor lock may not be released on function exit but passed to an unblocking thread;
RAII is purely a mutual-exclusion mechanism (see Section~\ref{s:Scheduling}).

\subsection{Monitor Implementation}

For the same design reasons, \CFA provides a custom @monitor@ type and a @trait@ to enforce and restrict the monitor-interface functions.
\begin{cquote}
\begin{tabular}{@{}c@{\hspace{3\parindentlnth}}c@{}}
\begin{cfa}
monitor M {
	...
	// shared data
};
\end{cfa}
&
\begin{cfa}
trait is_monitor( dtype T ) {
	monitor_desc * get_monitor( T & );
	void ^?{}( T & mutex );
};
\end{cfa}
\end{tabular}
\end{cquote}
Like before, the @dtype@ property prevents \emph{implicit} copy operations and the @is_monitor@ trait provides no \emph{explicit} copy operations, so monitors must be passed by reference (pointer).
% Copying a lock is insecure because it is possible to copy an open lock and then use the open copy when the original lock is closed to simultaneously access the shared data.
% Copying a monitor is secure because both the lock and shared data are copies, but copying the shared data is meaningless because it no longer represents a unique entity.
Similarly, the function definitions ensure there is a mechanism to get (read) the currently executing monitor handle, and a special destructor to prevent deallocation while a thread is using the shared data.
The custom monitor type also inserts any locking vehicles needed to implement the monitor locking semantics.

\label{s:MutexAcquisition}

While correctness implies a monitor's mutual exclusion is acquired and released, there are implementation options about when and where the locking/unlocking occurs.
While the monitor lock provides mutual exclusion for shared data, there are implementation options for when and where the locking/unlocking occurs.
(Much of this discussion also applies to basic locks.)
For example, a monitor may need to be passed through multiple helper routines before it becomes necessary to acquire the monitor's mutual exclusion.
\begin{cfa}[morekeywords=nomutex]
monitor Aint { int cnt; };                      $\C{// atomic integer counter}$
void ?{}( Aint & nomutex this ) with( this ) { cnt = 0; } $\C{// constructor}$
int ?=?( Aint & mutex$$$_{opt}$$$ lhs, int rhs ) with( lhs ) { cnt = rhs; } $\C{// conversions}$
void ?{}( int & this, Aint & mutex$$$_{opt}$$$ v ) { this = v.cnt; }
int ?=?( int & lhs, Aint & mutex$$$_{opt}$$$ rhs ) with( rhs ) { lhs = cnt; }
int ++?( Aint & mutex$$$_{opt}$$$ this ) with( this ) { return ++cnt; } $\C{// increment}$
\end{cfa}
The @Aint@ constructor, @?{}@, uses the \lstinline[morekeywords=nomutex]@nomutex@ qualifier indicating mutual exclusion is unnecessary during construction because an object is inaccessible (private) until after it is initialized.
(While a constructor may publish its address into a global variable, doing so generates a race-condition.)
The conversion operators for initializing and assigning with a normal integer only need @mutex@, if reading/writing the implementation type is not atomic.
Finally, the prefix increment operator, @++?@, is normally @mutex@ to protect the incrementing from race conditions, unless there is an atomic increment instruction for the implementation type.
The atomic counter is used without any explicit mutual-exclusion and provides thread-safe semantics, which is similar to the \CC template @std::atomic@.
\begin{cfa}
Aint x, y, z;
++x; ++y; ++z;                                  $\C{// safe increment by multiple threads}$
x = 2; y = 2; z = 2;                            $\C{// conversions}$
int i = x, j = y, k = z;
i = x; j = y; k = z;
\end{cfa}
For maximum usability, monitors have \newterm{multi-acquire} semantics allowing a thread to acquire it multiple times without deadlock.
\begin{cfa}
monitor M { ... } m;
void foo( M & mutex m ) {
	...
}                         $\C{// acquire mutual exclusion}$
void bar( M & mutex m ) {                                       $\C{// acquire mutual exclusion}$
	... foo( m ); ...                                             $\C{// reacquire mutual exclusion}$
}
bar( m );                                                                     $\C{// nested monitor call}$
\end{cfa}

The benefit of mandatory monitor qualifiers is self-documentation, but requiring both @mutex@ and \lstinline[morekeywords=nomutex]@nomutex@ for all monitor parameters is redundant.

The next semantic decision is establishing which parameter \emph{types} may be qualified with @mutex@.
The following has monitor parameter types that are composed of multiple objects.
\begin{cfa}
monitor M { ... }
int f1( M & mutex m ); $\C{// single parameter object}$
int f2( M * mutex m ); $\C{// single or multiple parameter object}$
int f3( M * mutex m[$\,$] ); $\C{// multiple parameter object}$
int f4( stack( M * ) & mutex m ); $\C{// multiple parameters object}$
\end{cfa}
Function @f1@ has a single parameter object, while @f2@'s indirection could be a single or multi-element array, where static array size is often unknown in C.
Function @f3@ has a multiple object matrix, and @f4@ a multiple object data structure.
While shown shortly, multiple object acquisition is possible, but the number of objects must be statically known.
Therefore, \CFA only acquires one monitor per parameter with at most one level of indirection, excluding pointers as it is impossible to statically determine the size.
\begin{cfa}
int f1( M & mutex m );                                          $\C{// allowed: recommended case}$
int f2( M * mutex m );                                          $\C{// disallowed: could be an array}$
int f3( M mutex m[$\,$] );                                      $\C{// disallowed: array length unknown}$
int f4( M ** mutex m );                                         $\C{// disallowed: could be an array}$
int f5( M * mutex m[$\,$] );                            $\C{// disallowed: array length unknown}$
\end{cfa}
% Note, not all array routines have distinct types: @f2@ and @f3@ have the same type, as do @f4@ and @f5@.
% However, even if the code generation could tell the difference, the extra information is still not sufficient to extend meaningfully the monitor call semantic.

For object-oriented monitors, \eg Java, calling a mutex member \emph{implicitly} acquires mutual exclusion of the receiver object, @rec.foo(...)@.
\CFA has no receiver, and hence, the explicit @mutex@ qualifier is used to specify which objects acquire mutual exclusion.
A positive consequence of this design decision is the ability to support multi-monitor functions,\footnote{
While object-oriented monitors can be extended with a mutex qualifier for multiple-monitor members, no prior example of this feature could be found.}
called \newterm{bulk acquire} (see Section~\ref{s:Implementation} for implementation details).
\begin{cfa}
int f( M & mutex x, M & mutex y );              $\C{// multiple monitor parameter of any type}$
M m1, m2;
f( m1, m2 );
\end{cfa}
In practice, writing multi-locking functions that do not deadlock is tricky, so having language support for this feature is a significant asset for \CFA.
\CFA guarantees acquisition order is consistent across calls to @mutex@ functions using the same monitors as arguments, so acquiring multiple monitors is safe from deadlock.
Figure~\ref{f:BankTransfer} shows a trivial solution to the bank-account transfer problem using \CFA monitors with implicit locking and \CC with explicit locking. A \CFA programmer only has to manage when to acquire mutual exclusion; a \CC programmer must select the correct lock and acquisition mechanism from a panoply of locking options. Making good choices for common cases in \CFA simplifies the programming experience and enhances safety. \begin{figure} \centering \begin{lrbox}{\myboxA} \begin{cfa}[aboveskip=0pt,belowskip=0pt] monitor BankAccount { int balance; } b1 = { 0 }, b2 = { 0 }; void deposit( BankAccount & mutex b, int deposit ) with(b) { balance += deposit; } void transfer( BankAccount & mutex my, BankAccount & mutex your, int me2you ) { deposit( my, -me2you ); // debit deposit( your, me2you ); // credit } thread Person { BankAccount & b1, & b2; }; void main( Person & person ) with(person) { for ( 10_000_000 ) { if ( random() % 3 ) deposit( b1, 3 ); if ( random() % 3 ) transfer( b1, b2, 7 ); } } int main() { Person p1 = { b1, b2 }, p2 = { b2, b1 }; } // wait for threads to complete \end{cfa} \end{lrbox} \begin{lrbox}{\myboxB} \begin{cfa}[aboveskip=0pt,belowskip=0pt] struct BankAccount { recursive_mutex m; int balance = 0; } b1, b2; void deposit( BankAccount & b, int deposit ) { scoped_lock lock( b.m ); b.balance += deposit; } void transfer( BankAccount & my, BankAccount & your, int me2you ) { scoped_lock lock( my.m, your.m ); deposit( my, -me2you ); // debit deposit( your, me2you ); // credit } void person( BankAccount & b1, BankAccount & b2 ) { for ( int i = 0; i < 10$'$000$'$000; i += 1 ) { if ( random() % 3 ) deposit( b1, 3 ); if ( random() % 3 ) transfer( b1, b2, 7 ); } } int main() { thread p1(person, ref(b1), ref(b2)), p2(person, ref(b2), ref(b1)); p1.join(); p2.join(); } \end{cfa} \end{lrbox} \subfloat[\CFA]{\label{f:CFABank}\usebox\myboxA} \hspace{3pt} \vrule \hspace{3pt} \subfloat[\CC]{\label{f:C++Bank}\usebox\myboxB} \hspace{3pt} \caption{Bank 
transfer problem}
\label{f:BankTransfer}
\end{figure}

Users can force the acquiring order, by using @mutex@/\lstinline[morekeywords=nomutex]@nomutex@.
\begin{cfa}
void foo( M & mutex m1, M & mutex m2 );         $\C{// acquire m1 and m2}$
void bar( M & mutex m1, M & /* nomutex */ m2 ) { $\C{// acquire m1}$
	... foo( m1, m2 ); ...                                  $\C{// acquire m2}$
}
void baz( M & /* nomutex */ m1, M & mutex m2 ) { $\C{// acquire m2}$
	... foo( m1, m2 ); ...                                  $\C{// acquire m1}$
}
\end{cfa}
The bulk-acquire semantics allow @bar@ or @baz@ to acquire a monitor lock and reacquire it in @foo@.
In the calls to @bar@ and @baz@, the monitors are acquired in opposite order, resulting in deadlock.
However, this case is the simplest instance of the \emph{nested-monitor problem}~\cite{Lister77}, where monitors are acquired in sequence versus bulk.
Detecting the nested-monitor problem requires dynamic tracking of monitor calls, and dealing with it requires rollback semantics~\cite{Dice10}.
\CFA does not deal with this fundamental problem.

\label{mutex-stmt}

The monitor call-semantics associate all locking semantics with functions.
Like Java, \CFA offers an alternative @mutex@ statement to reduce refactoring and naming.
\begin{cquote}
\begin{tabular}{@{}l@{\hspace{3\parindentlnth}}l@{}}
\multicolumn{1}{c}{\textbf{routine call}} & \multicolumn{1}{c}{\lstinline@mutex@ \textbf{statement}} \\
\begin{cfa}
monitor M { ... };
\end{cfa}
\end{tabular}
\end{cquote}

\subsection{Scheduling}
\label{s:Scheduling}

While monitor mutual-exclusion provides safe access to shared data, the monitor data may indicate that a thread accessing it cannot proceed, \eg a bounded buffer may be full/empty so producer/consumer threads must block.
Leaving the monitor and trying again (busy waiting) is impractical for high-level programming.
Monitors eliminate busy waiting by providing synchronization to schedule threads needing access to the shared data, where threads block versus spinning.
Synchronization is generally achieved with internal~\cite{Hoare74} or external~\cite[\S~2.9.2]{uC++} scheduling, where \newterm{scheduling} defines which thread acquires the critical section next.
\newterm{Internal scheduling} is characterized by each thread entering the monitor and making an individual decision about proceeding or blocking, while \newterm{external scheduling} is characterized by an entering thread making a decision about proceeding for itself and on behalf of other threads attempting entry.

Finally, \CFA monitors do not allow calling threads to barge ahead of signalled threads, which simplifies synchronization among threads in the monitor and increases correctness.
If barging is allowed, synchronization between a signaller and signallee is difficult, often requiring additional flags and multiple unblock/block cycles.
In fact, signals-as-hints is completely opposite from that proposed by Hoare in the seminal paper on monitors:
\begin{cquote}
However, we decree that a signal operation be followed immediately by resumption of a waiting program, without possibility of an intervening procedure call from yet a third program.
It is only in this way that a waiting program has an absolute guarantee that it can acquire the resource just released by the signalling program without any danger that a third program will interpose a monitor entry and seize the resource instead.~\cite[p.~550]{Hoare74}
\end{cquote}
Furthermore, \CFA concurrency has no spurious wakeup~\cite[\S~9]{Buhr05a}, which eliminates an implicit form of barging.
Hence, a \CFA @wait@ statement is not enclosed in a @while@ loop that can cause thread starvation.

Figure~\ref{f:BBInt} shows a \CFA generic bounded-buffer with internal scheduling, where producers/consumers enter the monitor, see the buffer is full/empty, and block on an appropriate condition lock, @full@/@empty@.
The @wait@ function atomically blocks the calling thread and implicitly releases the monitor lock(s) for all monitors in the function's parameter list.
The appropriate condition lock is signalled to unblock an opposite kind of thread after an element is inserted/removed from the buffer.
Signalling is unconditional, because signalling an empty condition lock does nothing.
\end{lrbox}

\subfloat[Internal scheduling]{\label{f:BBInt}\usebox\myboxA}
%\qquad
\subfloat[External scheduling]{\label{f:BBExt}\usebox\myboxB}
\caption{Generic bounded-buffer}
\label{f:GenericBoundedBuffer}
\end{figure}

Figure~\ref{f:BBExt} shows a \CFA generic bounded-buffer with external scheduling, where producers/consumers detecting a full/\-empty buffer block and prevent more producers/consumers from entering the monitor until there is a free/empty slot in the buffer.
External scheduling is controlled by the @waitfor@ statement, which atomically blocks the calling thread, releases the monitor lock, and restricts the function calls that can next acquire mutual exclusion.
If the buffer is full, only calls to @remove@ can acquire the buffer, and if the buffer is empty, only calls to @insert@ can acquire the buffer.
Threads making calls to functions that are currently excluded, block outside of (external to) the monitor on a calling queue, versus blocking on condition queues inside of (internal to) the monitor.
External scheduling allows users to wait for events from other threads without concern of unrelated events occurring.
The mechanism can be done in terms of control flow, \eg Ada @accept@ or \uC @_Accept@, or in terms of data, \eg Go @select@ on channels.
While both mechanisms have strengths and weaknesses, this project uses a control-flow mechanism to stay consistent with other language semantics.
Two challenges specific to \CFA for external scheduling are loose object-definitions (see Section~\ref{s:LooseObjectDefinitions}) and multiple-monitor functions (see Section~\ref{s:Multi-MonitorScheduling}).

For internal scheduling, non-blocking signalling (as in the producer/consumer example) is used when the signaller is providing the cooperation for a waiting thread;
For signal scheduling, the @exchange@ condition is necessary to block the thread finding the match, while the matcher unblocks to take the opposite number, post its phone number, and unblock the partner.
For signal-block scheduling, the implicit urgent-queue replaces the explicit @exchange@-condition and @signal_block@ puts the finding thread on the urgent condition and unblocks the matcher.
The dating service is an example of a monitor that cannot be written using external scheduling because it requires knowledge of calling parameters to make scheduling decisions, and parameters of waiting threads are unavailable;
as well, an arriving thread may not find a partner and must wait, which requires a condition variable, and condition variables imply internal scheduling.
Furthermore, barging corrupts the dating service during an exchange because a barger may also match and change the phone numbers, invalidating the previous exchange phone number.
Putting loops around the @wait@s does not correct the problem;
the solution must be restructured to account for barging.

\begin{figure}
\qquad
\subfloat[\lstinline@signal_block@]{\label{f:DatingSignalBlock}\usebox\myboxB}
\caption{Dating service}
\label{f:DatingService}
\end{figure}

To override the implicit multi-monitor wait, specific mutex parameter(s) can be specified, \eg @wait( e, m1 )@.
Wait statically verifies the released monitors are the acquired mutex-parameters so unconditional release is safe.
While \CC supports bulk locking, @wait@ only accepts a single lock for a condition variable, so bulk locking with condition variables is asymmetric.
Finally, a signaller,
\begin{cfa}
Similarly, for @waitfor( rtn )@, the default semantics is to atomically block the acceptor and release all acquired mutex types in the parameter list, \ie @waitfor( rtn, m1, m2 )@.
To override the implicit multi-monitor wait, specific mutex parameter(s) can be specified, \eg @waitfor( rtn, m1 )@.
@waitfor@ statically verifies the released monitors are the same as the acquired mutex-parameters of the given function or function pointer.
To statically verify the released monitors match with the accepted function's mutex parameters, the function (pointer) prototype must be accessible.
% When an overloaded function appears in an @waitfor@ statement, calls to any function with that name are accepted.
% The rationale is that members with the same name should perform a similar function, and therefore, all should be eligible to accept a call.
Overloaded functions can be disambiguated using a cast:
\begin{cfa}
void rtn( M & mutex m );
... signal( e ); ...
\end{cfa}
The @wait@ only releases @m1@ so the signalling thread cannot acquire @m1@ and @m2@ to enter @bar@ and @signal@ the condition.
While deadlock can occur with multiple/nesting acquisition, this is a consequence of locks, and by extension monitors, not being perfectly composable.
% Supporting barging prevention as well as extending internal scheduling to multiple monitors is the main source of complexity in the design and implementation of \CFA concurrency.

\begin{cquote}
\subfloat[Signalling Thread]{\label{f:SignallingThread}\usebox\myboxA}
\hspace{3\parindentlnth}
\subfloat[Waiting Thread (W1)]{\label{f:WaitingThread}\usebox\myboxB}
\hspace{2\parindentlnth}

In an object-oriented programming-language, a class includes an exhaustive list of operations.
However, new members can be added via static inheritance or dynamic members, \eg JavaScript~\cite{JavaScript}.
Similarly, monitor functions can be added at any time in \CFA, making it less clear for programmers and more difficult to implement.
\begin{cfa}
monitor M { ... };
void f( M & mutex m );
void g( M & mutex m ) { waitfor( f ); } $\C{// clear which f}$
void f( M & mutex m, int ); $\C{// different f}$
void h( M & mutex m ) { waitfor( f ); } $\C{// unclear which f}$
\end{cfa}
Hence, the \CFA code for entering a monitor looks like:
\begin{cfa}
if ( $\textrm{\textit{monitor is free}}$ ) $\C{// enter}$
else if ( $\textrm{\textit{already own monitor}}$ ) $\C{// continue}$
else if ( $\textrm{\textit{monitor accepts me}}$ ) $\C{// enter}$
else $\C{// block}$
\end{cfa}
For the first two conditions, it is easy to implement a check that can evaluate the condition in a few instructions.
{\resizebox{0.45\textwidth}{!}{\input{ext_monitor.pstex_t}}}
}% subfloat
\caption{Monitor implementation}
\label{f:MonitorImplementation}
\end{figure}
For a fixed (small) number of mutex functions (\eg 128), the accept check reduces to a bitmask of allowed callers, which can be checked with a single instruction.
This approach requires a unique dense ordering of functions with a small upper-bound and the ordering must be consistent across translation units.
For object-oriented languages these constraints are common, but \CFA mutex functions can be added in any scope and are only visible in certain translation units, precluding program-wide dense-ordering among mutex functions.

Figure~\ref{fig:BulkMonitor} shows the \CFA monitor implementation.
The mutex function called is associated with each thread on the entry queue, while a list of acceptable functions is kept separately.
The accepted list is a variable-sized array of accepted function pointers, so the single instruction bitmask comparison is replaced by dereferencing a pointer followed by a (usually short) linear search.

External scheduling, like internal scheduling, becomes significantly more complex for multi-monitor semantics.
Even in the simplest case, new semantics need to be established.
\newpage
\begin{cfa}
monitor M { ...
};
void f( M & mutex m1 );
void g( M & mutex m1, M & mutex m2 ) { waitfor( f ); } $\C{// pass m1 or m2 to f?}$
\end{cfa}
The solution is for the programmer to disambiguate:
\begin{cfa}
waitfor( f, m2 ); $\C{// wait for call to f with argument m2}$
\end{cfa}
Both locks are acquired by function @g@, so when function @f@ is called, the lock for monitor @m2@ is passed from @g@ to @f@, while @g@ still holds lock @m1@.
This behaviour can be extended to the multi-monitor @waitfor@ statement.
\begin{cfa}
monitor M { ... };
void f( M & mutex m1, M & mutex m2 );
void g( M & mutex m1, M & mutex m2 ) { waitfor( f, m1, m2 ); } $\C{// wait for call to f with arguments m1 and m2}$
\end{cfa}
Again, the set of monitors passed to the @waitfor@ statement must be entirely contained in the set of monitors already acquired by the accepting function.
Also, the order of the monitors in a @waitfor@ statement is unimportant.
\begin{figure}
\centering
\begin{lrbox}{\myboxA}
\begin{cfa}[aboveskip=0pt,belowskip=0pt]
monitor M1 {} m11, m12;
monitor M2 {} m2;
f( m12, m2 ); // cannot fulfil
\end{cfa}
\end{lrbox}
\begin{lrbox}{\myboxB}
\begin{cfa}[aboveskip=0pt,belowskip=0pt]
monitor M1 {} m11, m12;
monitor M2 {} m2;
f( m12, m2 ); // cannot fulfil
\end{cfa}
\end{lrbox}
\subfloat[Internal scheduling]{\label{f:InternalScheduling}\usebox\myboxA}
\hspace{3pt}
\vrule
\hspace{3pt}
\subfloat[External scheduling]{\label{f:ExternalScheduling}\usebox\myboxB}
\caption{Unmatched \protect\lstinline@mutex@ sets}
\label{f:UnmatchedMutexSets}
\end{figure}

\subsection{Extended \protect\lstinline@waitfor@}

Figure~\ref{f:ExtendedWaitfor} shows the extended form of the @waitfor@ statement to conditionally accept one of a group of mutex functions, with a specific action to be performed \emph{after} the mutex function finishes.
For a @waitfor@ clause to be executed, its @when@ must be true and an outstanding call to its corresponding member(s) must exist.
The \emph{conditional-expression} of a @when@ may call a function, but the function must not block or context switch.
If there are multiple acceptable mutex calls, selection occurs top-to-bottom (prioritized) in the @waitfor@ clauses, whereas some programming languages with similar mechanisms accept nondeterministically for this case, \eg Go \lstinline[morekeywords=select]@select@.
If some accept guards are true and there are no outstanding calls to these members, the acceptor is accept-blocked until a call to one of these members is made.
If all the accept guards are false, the statement does nothing, unless there is a terminating @else@ clause with a true guard, which is executed instead.

\begin{figure}
\centering
\begin{cfa}
when ( $\emph{conditional-expression}$ )      $\C{// optional guard}$
	else if ( C2 ) waitfor( mem2 );
or when ( C2 ) waitfor( mem2 );
\end{cfa}
The left example only accepts @mem1@ if @C1@ is true or only @mem2@ if @C2@ is true.
The right example accepts either @mem1@ or @mem2@ if @C1@ and @C2@ are true.

\subsection{\protect\lstinline@mutex@ Threads}

Threads in \CFA can also be monitors to allow \emph{direct communication} among threads, \ie threads can have mutex functions that are called by other threads.
Hence, all monitor features are available when using threads.
Figure~\ref{f:DirectCommunication} shows a comparison of direct call communication in \CFA with direct channel communication in Go.
(Ada provides a similar mechanism to the \CFA direct communication.)
The program main in both programs communicates directly with the other thread versus indirect communication where two threads interact through a passive monitor.
Both direct and indirect thread communication are valuable tools in structuring concurrent programs.
\begin{figure} \centering \begin{lrbox}{\myboxA} \begin{cfa}[aboveskip=0pt,belowskip=0pt] struct Msg { int i, j; }; thread Gortn { int i;  float f;  Msg m; }; void mem1( Gortn & mutex gortn, int i ) { gortn.i = i; } void mem2( Gortn & mutex gortn, float f ) { gortn.f = f; } void mem3( Gortn & mutex gortn, Msg m ) { gortn.m = m; } void ^?{}( Gortn & mutex ) {} void main( Gortn & gortn ) with( gortn ) {  // thread starts for () { waitfor( mem1, gortn ) sout | i;  // wait for calls or waitfor( mem2, gortn ) sout | f; or waitfor( mem3, gortn ) sout | m.i | m.j; or waitfor( ^?{}, gortn ) break; } } int main() { Gortn gortn; $\C[2.0in]{// start thread}$ mem1( gortn, 0 ); $\C{// different calls}\CRT$ mem2( gortn, 2.5 ); mem3( gortn, (Msg){1, 2} ); } // wait for completion \end{cfa} \end{lrbox} \begin{lrbox}{\myboxB} \begin{Go}[aboveskip=0pt,belowskip=0pt] func main() { type Msg struct{ i, j int } ch1 := make( chan int ) ch2 := make( chan float32 ) ch3 := make( chan Msg ) hand := make( chan string ) shake := make( chan string ) gortn := func() { $\C[1.5in]{// thread starts}$ var i int;  var f float32;  var m Msg L: for { select { $\C{// wait for messages}$ case i = <- ch1: fmt.Println( i ) case f = <- ch2: fmt.Println( f ) case m = <- ch3: fmt.Println( m ) case <- hand: break L $\C{// sentinel}$ } } shake <- "SHAKE" $\C{// completion}$ } go gortn() $\C{// start thread}$ ch1 <- 0 $\C{// different messages}$ ch2 <- 2.5 ch3 <- Msg{1, 2} hand <- "HAND" $\C{// sentinel value}$ <- shake $\C{// wait for completion}\CRT$ } \end{Go} \end{lrbox} \subfloat[\CFA]{\label{f:CFAwaitfor}\usebox\myboxA} \hspace{3pt} \vrule \hspace{3pt} \subfloat[Go]{\label{f:Gochannel}\usebox\myboxB} \caption{Direct communication} \label{f:DirectCommunication} \end{figure} \begin{comment} The following shows an example of two threads directly calling each other and accepting calls from each other in a cycle. 
\begin{cfa} thread Ping {} pi; thread Pong {} po; void ping( Ping & mutex ) {} void pong( Pong & mutex ) {} int main() {} \end{cfa} \vspace{-0.8\baselineskip} \begin{tabular}{@{}l@{\hspace{3\parindentlnth}}l@{}} \begin{cfa} thread Ping {} pi; void ping( Ping & mutex ) {} void main( Ping & pi ) { for ( int i = 0; i < 10; i += 1 ) { for ( 10 ) { waitfor( ping, pi ); pong( po ); } } int main() {} \end{cfa} & \begin{cfa} thread Pong {} po; void pong( Pong & mutex ) {} void main( Pong & po ) { for ( int i = 0; i < 10; i += 1 ) { for ( 10 ) { ping( pi ); waitfor( pong, po ); } } \end{cfa} \end{tabular} % \end{figure} Note, the ping/pong threads are globally declared, @pi@/@po@, and hence, start (and possibly complete) before the program main starts. \end{comment} \subsection{Execution Properties} Table~\ref{t:ObjectPropertyComposition} shows how the \CFA high-level constructs cover 3 fundamental execution properties: thread, stateful function, and mutual exclusion. Case 1 is a basic object, with none of the new execution properties. Case 2 allows @mutex@ calls to Case 1 to protect shared data. Case 3 allows stateful functions to suspend/resume but restricts operations because the state is stackless. Case 4 allows @mutex@ calls to Case 3 to protect shared data. Cases 5 and 6 are the same as 3 and 4 without restriction because the state is stackful. Cases 7 and 8 are rejected because a thread cannot execute without a stackful state in a preemptive environment when context switching from the signal handler. Cases 9 and 10 have a stackful thread without and with @mutex@ calls. For situations where threads do not require direct communication, case 9 provides faster creation/destruction by eliminating @mutex@ setup. 
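The commented ping/pong cycle above can be approximated in Go, with a pair of unbuffered channels standing in for the accepted @ping@/@pong@ calls. This is an illustrative sketch (the @pingPong@ helper and its call count are not from the paper):

```go
package main

import "fmt"

// pingPong approximates the CFA ping/pong cycle: two goroutines
// alternately "call" each other N times; the unbuffered channels stand
// in for the accepted mutex calls ping()/pong().
func pingPong(N int) int {
	ping := make(chan struct{})
	pong := make(chan struct{})
	done := make(chan int)
	go func() { // Ping thread: accept ping, then call pong
		for i := 0; i < N; i++ {
			<-ping
			pong <- struct{}{}
		}
	}()
	go func() { // Pong thread: call ping, then accept pong
		calls := 0
		for i := 0; i < N; i++ {
			ping <- struct{}{}
			<-pong
			calls += 2
		}
		done <- calls
	}()
	return <-done // total calls exchanged
}

func main() {
	fmt.Println(pingPong(10)) // 10 rounds = 20 calls
}
```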
\begin{table}[b] \caption{Object property composition} \centering \label{t:ObjectPropertyComposition} \renewcommand{\arraystretch}{1.25} %\setlength{\tabcolsep}{5pt} \begin{tabular}{c|c|l|l} \multicolumn{2}{c|}{object properties} & \multicolumn{2}{c}{mutual exclusion} \\ \hline thread  & stateful                              & \multicolumn{1}{c|}{No} & \multicolumn{1}{c}{Yes} \\ \hline \hline No              & No                                    & \textbf{1}\ \ \ aggregate type                & \textbf{2}\ \ \ @monitor@ aggregate type \\ \hline No              & Yes (stackless)               & \textbf{3}\ \ \ @generator@                   & \textbf{4}\ \ \ @monitor@ generator \\ \hline No              & Yes (stackful)                & \textbf{5}\ \ \ @coroutine@                   & \textbf{6}\ \ \ @monitor@ @coroutine@ \\ \hline Yes             & No / Yes (stackless)  & \textbf{7}\ \ \ {\color{red}rejected} & \textbf{8}\ \ \ {\color{red}rejected} \\ \hline Yes             & Yes (stackful)                & \textbf{9}\ \ \ @thread@                              & \textbf{10}\ \ @monitor@ @thread@ \\ \end{tabular} \end{table} For completeness and efficiency, \CFA provides a standard set of low-level locks: recursive mutex, condition, semaphore, barrier, \etc, and atomic instructions: @fetchAssign@, @fetchAdd@, @testSet@, @compareSet@, \etc. However, we strongly advocate using high-level concurrency mechanisms whenever possible. \section{Parallelism} \label{s:Parallelism} Historically, computer performance was about processor speeds. However, with heat dissipation being a direct consequence of speed increase, parallelism is the new source for increased performance~\cite{Sutter05, Sutter05b}. Now, high-performance applications must care about parallelism, which requires concurrency. Therefore, high-performance applications must care about parallelism, which requires concurrency. 
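The atomic instructions listed above (@fetchAssign@, @fetchAdd@, @compareSet@) have direct analogues in most standard libraries; for instance, a sketch using Go's @sync/atomic@ (the @atomicDemo@ helper is illustrative):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Analogues of the low-level atomic instructions: fetchAdd, fetchAssign
// (swap), and compareSet (compare-and-swap), via Go's sync/atomic.
func atomicDemo() (int32, int32, bool) {
	var v int32 = 10
	old := atomic.AddInt32(&v, 5) - 5           // fetchAdd: prior value (10)
	prev := atomic.SwapInt32(&v, 42)            // fetchAssign: set, return prior (15)
	ok := atomic.CompareAndSwapInt32(&v, 42, 7) // compareSet: succeeds (true)
	return old, prev, ok
}

func main() {
	old, prev, ok := atomicDemo()
	fmt.Println(old, prev, ok) // prints "10 15 true"
}
```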
The lowest-level approach of parallelism is to use \newterm{kernel threads} in combination with semantics like @fork@, @join@, \etc. However, kernel threads are better as an implementation tool because of complexity and higher cost. \subsection{User Threads with Preemption} \subsection{User Threads} A direct improvement on kernel threads is user threads, \eg Erlang~\cite{Erlang} and \uC~\cite{uC++book}. This approach provides an interface that matches the language paradigms, more control over concurrency by the language runtime, and an abstract (and portable) interface to the underlying kernel threads across operating systems. This approach provides an interface that matches the language paradigms, gives more control over concurrency by the language runtime, and an abstract (and portable) interface to the underlying kernel threads across operating systems. In many cases, user threads can be used on a much larger scale (100,000 threads). Like kernel threads, user threads support preemption, which maximizes nondeterminism, but introduces the same concurrency errors: race, livelock, starvation, and deadlock. Like kernel threads, user threads support preemption, which maximizes nondeterminism, but increases the potential for concurrency errors: race, livelock, starvation, and deadlock. \CFA adopts user-threads as they represent the truest realization of concurrency and can build any other concurrency approach, \eg thread pools and actors~\cite{Actors}. \subsection{User Threads without Preemption (Fiber)} \label{s:fibers} A variant of user thread is \newterm{fibers}, which removes preemption, \eg Go~\cite{Go} @goroutine@s. A variant of user thread is \newterm{fibres}, which removes preemption, \eg Go~\cite{Go} @goroutine@s. Like functional programming, which removes mutation and its associated problems, removing preemption from concurrency reduces nondeterminism, making race and deadlock errors more difficult to generate. 
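The claim that user threads operate at a much larger scale (100,000 threads) is easy to demonstrate with goroutines; a minimal sketch (the @spawn@ helper is illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

// spawn creates n lightweight user threads (goroutines) that each
// increment a shared counter; n = 100000 is routine for user threads
// but impractical for kernel threads.
func spawn(n int) int {
	var mu sync.Mutex
	var wg sync.WaitGroup
	count := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock() // mutual exclusion on the shared counter
			count++
			mu.Unlock()
		}()
	}
	wg.Wait()
	return count
}

func main() {
	fmt.Println(spawn(100000)) // prints "100000"
}
```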
However, preemption is necessary for concurrency that relies on spinning, so there is a class of problems that cannot be programmed without preemption. However, preemption is necessary for concurrency that relies on spinning, so there is a class of problems that cannot be programmed with fibres. \subsection{Thread Pools} In contrast to direct threading is indirect \newterm{thread pools}, where small jobs (work units) are inserted into a work pool for execution. In contrast to direct threading is indirect \newterm{thread pools}, \eg Java @executor@, where small jobs (work units) are inserted into a work pool for execution. If the jobs are dependent, \ie interact, there is an implicit/explicit dependency graph that ties them together. While removing direct concurrency, and hence the amount of context switching, thread pools significantly limit the interaction that can occur among jobs. \centering \input{RunTimeStructure} \caption{\CFA Runtime Structure} \caption{\CFA Runtime structure} \label{f:RunTimeStructure} \end{figure} Preemption occurs on virtual processors rather than user threads, via operating-system interrupts. Thus virtual processors execute user threads, where preemption frequency applies to a virtual processor, so preemption occurs randomly across the executed user threads. Turning off preemption transforms user threads into fibers. Turning off preemption transforms user threads into fibres. \section{Implementation} \label{s:Implementation} Currently, \CFA has fixed-sized stacks, where the stack size can be set at coroutine/thread creation but with no subsequent growth. Schemes exist for dynamic stack-growth, such as stack copying and chained stacks. However, stack copying requires pointer adjustment to items on the stack, which is impossible without some form of garbage collection. As well, chained stacks require all modules be recompiled to use this feature, which breaks backward compatibility with existing C libraries. 
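The thread-pool pattern described above can be sketched as a fixed set of workers draining a job channel (illustrative; the @poolSum@ helper and its squaring "job" are hypothetical):

```go
package main

import (
	"fmt"
	"sync"
)

// poolSum runs a fixed pool of workers draining a job channel -- the
// indirect style described above: jobs are small work units, and the
// pool bounds the number of executing threads.
func poolSum(jobs []int, workers int) int {
	work := make(chan int)
	results := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := range work {
				results <- j * j // the "job": square the work unit
			}
		}()
	}
	go func() { // feeder: insert jobs, then shut the pool down
		for _, j := range jobs {
			work <- j
		}
		close(work)
		wg.Wait()
		close(results)
	}()
	sum := 0
	for r := range results {
		sum += r
	}
	return sum
}

func main() {
	fmt.Println(poolSum([]int{1, 2, 3, 4}, 2)) // 1+4+9+16 = 30
}
```

Note the jobs here are independent; as the text observes, dependent jobs would need an explicit dependency graph layered on top.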
In the long term, it is likely C libraries will migrate to stack chaining to support concurrency, at only a minimal cost to sequential programs. Nevertheless, experience teaching \uC~\cite{CS343} shows fixed-sized stacks are rarely an issue in most concurrent programs. A primary implementation challenge is avoiding contention from dynamically allocating memory because of bulk acquire, \eg the internal-scheduling design is (almost) free of allocations. This array persists for the entire duration of the mutual exclusion and is used extensively for synchronization operations. To improve performance and simplicity, context switching occurs inside a routine call, so only callee-saved registers are copied onto the stack and then the stack register is switched; To improve performance and simplicity, context switching occurs inside a function call, so only callee-saved registers are copied onto the stack and then the stack register is switched; the corresponding registers are then restored for the other context. Note, the instruction pointer is untouched since the context switch is always inside the same routine. Unlike coroutines, threads do not context switch among each other; they context switch to the cluster scheduler. This method is a 2-step context-switch and provides a clear distinction between user and kernel code, where scheduling and other system operations happen. The alternative 1-step context-switch uses the \emph{from} thread's stack to schedule and then context-switches directly to the \emph{to} thread's stack. Experimental results (not presented) show the performance of these two approaches is virtually equivalent, because both approaches are dominated by locking to prevent a race condition. Note, the instruction pointer is untouched since the context switch is always inside the same function. 
Experimental results (not presented) for a stackless or stackful scheduler (1 versus 2 context switches) (see Section~\ref{s:Concurrency}) show the performance is virtually equivalent, because both approaches are dominated by locking to prevent a race condition. All kernel threads (@pthreads@) are created with a stack. However, on current UNIX systems: \begin{quote} \begin{cquote} A process-directed signal may be delivered to any one of the threads that does not currently have the signal blocked. If more than one of the threads has the signal unblocked, then the kernel chooses an arbitrary thread to which to deliver the signal. SIGNAL(7) - Linux Programmer's Manual \end{quote} \end{cquote} Hence, the timer-expiry signal, which is generated \emph{externally} by the UNIX kernel to the UNIX process, is delivered to any of its UNIX subprocesses (kernel threads). To ensure each virtual processor receives its own preemption signals, a discrete-event simulation is run on a special virtual processor, and only it sets and receives timer events. \label{results} To verify the implementation of the \CFA runtime, a series of microbenchmarks are performed comparing \CFA with other widely used programming languages with concurrency. Table~\ref{t:machine} shows the specifications of the computer used to run the benchmarks, and the versions of the software used in the comparison. To verify the implementation of the \CFA runtime, a series of microbenchmarks are performed comparing \CFA with Java OpenJDK-9, Go 1.9.2 and \uC 7.0.0. The benchmark computer is an AMD Opteron\texttrademark\ 6380 NUMA 64-core, 8 socket, 2.5 GHz processor, running Ubuntu 16.04.3 LTS and \uC and \CFA are compiled with gcc 6.3. \begin{comment} \begin{table} \centering \end{tabular} \end{table} All benchmarks are run using the following harness: \end{comment} All benchmarks are run using the following harness. 
\newpage \begin{cfa} unsigned int N = 10_000_000; #define BENCH( run ) Time before = getTimeNsec(); run; Duration result = (getTimeNsec() - before) / N; #define BENCH( run ) Time before = getTimeNsec();  run; Duration result = (getTimeNsec() - before) / N; \end{cfa} The method used to get time is @clock_gettime( CLOCK_REALTIME )@. \paragraph{Context-Switching} In procedural programming, the cost of a routine call is important as modularization (refactoring) increases. (In many cases, a compiler inlines routine calls to eliminate this cost.) In procedural programming, the cost of a function call is important as modularization (refactoring) increases. (In many cases, a compiler inlines function calls to eliminate this cost.) Similarly, when modularization extends to coroutines/tasks, the time for a context switch becomes a relevant factor. The coroutine context-switch is 2-step using resume/suspend, \ie from resumer to suspender and from suspender to resumer. The thread context switch is 2-step using yield, \ie enter and return from the runtime kernel. Figure~\ref{f:ctx-switch} shows the code for coroutines/threads with all results in Table~\ref{tab:ctx-switch}. The coroutine test is from resumer to suspender and from suspender to resumer, which is two context switches. The thread test is using yield to enter and return from the runtime kernel, which is two context switches. The difference in performance between coroutine and thread context-switch is the cost of scheduling for threads, whereas coroutines are self-scheduling. \begin{figure} Figure~\ref{f:ctx-switch} only shows the \CFA code for coroutines/threads (other systems are similar) with all results in Table~\ref{tab:ctx-switch}. 
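For reference, an equivalent harness in Go might time @N@ iterations and report the per-iteration cost; a sketch (the @benchNs@ helper is hypothetical), shown here driving a yield loop similar in spirit to the thread context-switch test:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// benchNs mirrors the BENCH macro above: time N iterations of run and
// return the average per-iteration cost.
func benchNs(N int, run func()) time.Duration {
	before := time.Now()
	for i := 0; i < N; i++ {
		run()
	}
	return time.Since(before) / time.Duration(N)
}

func main() {
	// e.g. yield the processor N times, analogous to the thread test
	result := benchNs(1000000, runtime.Gosched)
	fmt.Println(result, "per switch")
}
```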
\begin{multicols}{2} \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{}{}} \newbox\myboxA \begin{lrbox}{\myboxA} \begin{cfa}[aboveskip=0pt,belowskip=0pt] coroutine C {} c; void main( C & ) { for ( ;; ) { @suspend();@ } } int main() { // coroutine test BENCH( for ( N ) { @resume( c );@ } ) sout | resultns; } int main() { // task test BENCH( for ( N ) { @yield();@ } ) sout | resultns; } \end{cfa} \captionof{figure}{\CFA context-switch benchmark} \label{f:ctx-switch} \columnbreak \vspace*{-16pt} \captionof{table}{Context switch comparison (nanoseconds)} \label{tab:ctx-switch} \begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}} \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\ Kernel Thread   & 333.5 & 332.96        & 4.1   \\ \CFA Coroutine  & 49    & 48.68         & 0.47  \\ \CFA Thread             & 105   & 105.57        & 1.37  \\ \uC Coroutine   & 44    & 44            & 0             \\ \uC Thread              & 100   & 99.29         & 0.96  \\ Goroutine               & 145   & 147.25        & 4.15  \\ Java Thread             & 373.5 & 375.14        & 8.72 \end{tabular} \end{multicols} \paragraph{Mutual-Exclusion} Mutual exclusion is measured by entering/leaving a critical section. For monitors, entering and leaving a monitor function is measured. To put the results in context, the cost of entering a non-inline function and the cost of acquiring and releasing a @pthread_mutex@ lock is also measured. Figure~\ref{f:mutex} shows the code for \CFA with all results in Table~\ref{tab:mutex}. Note, the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects. 
\begin{multicols}{2} \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{}{}} \begin{cfa} monitor M {} m1/*, m2, m3, m4*/; void __attribute__((noinline)) do_call( M & mutex m/*, m2, m3, m4*/ ) {} int main() { BENCH( for ( size_t i = 0; i < N; i += 1 ) { @resume( c );@ } ) sout | resultns; } \end{cfa} \end{lrbox} \newbox\myboxB \begin{lrbox}{\myboxB} \begin{cfa}[aboveskip=0pt,belowskip=0pt] int main() { BENCH( for ( size_t i = 0; i < N; i += 1 ) { @yield();@ } ) sout | resultns; } \end{cfa} \end{lrbox} \subfloat[Coroutine]{\usebox\myboxA} \quad \subfloat[Thread]{\usebox\myboxB} \captionof{figure}{\CFA context-switch benchmark} \label{f:ctx-switch} \centering \captionof{table}{Context switch comparison (nanoseconds)} \label{tab:ctx-switch} \bigskip \begin{tabular}{|r|*{3}{D{.}{.}{3.2}|}} \cline{2-4} \multicolumn{1}{c|}{} & \multicolumn{1}{c|}{Median} &\multicolumn{1}{c|}{Average} & \multicolumn{1}{c|}{Std Dev} \\ \hline Kernel Thread   & 333.5 & 332.96        & 4.1 \\ \CFA Coroutine  & 49            & 48.68 & 0.47    \\ \CFA Thread             & 105           & 105.57        & 1.37 \\ \uC Coroutine   & 44            & 44            & 0 \\ \uC Thread              & 100           & 99.29 & 0.96 \\ Goroutine               & 145           & 147.25        & 4.15 \\ Java Thread             & 373.5 & 375.14        & 8.72 \\ \hline \end{tabular} \bigskip \bigskip \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{}{}} \begin{cfa} monitor M { ... 
} m1/*, m2, m3, m4*/; void __attribute__((noinline)) do_call( M & mutex m/*, m2, m3, m4*/ ) {} int main() { BENCH( for( size_t i = 0; i < N; i += 1 ) { @do_call( m1/*, m2, m3, m4*/ );@ } ) for( N ) @do_call( m1/*, m2, m3, m4*/ );@ ) sout | resultns; } \label{f:mutex} \centering \columnbreak \vspace*{-16pt} \captionof{table}{Mutex comparison (nanoseconds)} \label{tab:mutex} \bigskip \begin{tabular}{|r|*{3}{D{.}{.}{3.2}|}} \cline{2-4} \multicolumn{1}{c|}{} & \multicolumn{1}{c|}{Median} &\multicolumn{1}{c|}{Average} & \multicolumn{1}{c|}{Std Dev} \\ \hline C routine                                       & 2             & 2             & 0    \\ FetchAdd + FetchSub                     & 26            & 26            & 0    \\ Pthreads Mutex Lock                     & 31            & 31.71 & 0.97 \\ \uC @monitor@ member routine            & 31            & 31            & 0    \\ \CFA @mutex@ routine, 1 argument        & 46            & 46.68 & 0.93  \\ \CFA @mutex@ routine, 2 argument        & 84            & 85.36 & 1.99 \\ \CFA @mutex@ routine, 4 argument        & 158           & 161           & 4.22 \\ Java synchronized routine               & 27.5  & 29.79 & 2.93  \\ \hline \begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}} \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\ C function                                              & 2                     & 2             & 0             \\ FetchAdd + FetchSub                             & 26            & 26    & 0             \\ Pthreads Mutex Lock                             & 31            & 31.71 & 0.97  \\ \uC @monitor@ member rtn.               & 31            & 31    & 0             \\ \CFA @mutex@ function, 1 arg.   & 46            & 46.68 & 0.93  \\ \CFA @mutex@ function, 2 arg.   & 84            & 85.36 & 1.99  \\ \CFA @mutex@ function, 4 arg.   
& 158           & 161   & 4.22  \\ Java synchronized function              & 27.5          & 29.79 & 2.93 \end{tabular} \end{figure} \paragraph{Mutual-Exclusion} Mutual exclusion is measured by entering/leaving a critical section. For monitors, entering and leaving a monitor routine is measured. Figure~\ref{f:mutex} shows the code for \CFA with all results in Table~\ref{tab:mutex}. To put the results in context, the cost of entering a non-inline routine and the cost of acquiring and releasing a @pthread_mutex@ lock is also measured. Note, the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects. \end{multicols} Java scheduling is significantly greater because the benchmark explicitly creates multiple threads in order to prevent the JIT from making the program sequential, \ie removing all locking. \begin{figure} \begin{multicols}{2} \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{}{}} \begin{cfa} volatile int go = 0; condition c; monitor M { ... 
} m; void __attribute__((noinline)) do_call( M & mutex a1 ) { signal( c ); } monitor M { condition c; } m; void __attribute__((noinline)) do_call( M & mutex a1 ) { signal( c ); } thread T {}; void main( T & this ) { while ( go == 0 ) { yield(); }  // wait for other thread to start while ( go == 0 ) { yield(); } while ( go == 1 ) { @do_call( m );@ } } int  __attribute__((noinline)) do_wait( M & mutex m ) { int  __attribute__((noinline)) do_wait( M & mutex m ) with(m) { go = 1; // continue other thread BENCH( for ( size_t i = 0; i < N; i += 1 ) { @wait( c );@ } ); BENCH( for ( N ) { @wait( c );@ } ); go = 0; // stop other thread sout | resultns; \label{f:int-sched} \centering \columnbreak \vspace*{-16pt} \captionof{table}{Internal-scheduling comparison (nanoseconds)} \label{tab:int-sched} \bigskip \begin{tabular}{|r|*{3}{D{.}{.}{5.2}|}} \cline{2-4} \multicolumn{1}{c|}{} & \multicolumn{1}{c|}{Median} &\multicolumn{1}{c|}{Average} & \multicolumn{1}{c|}{Std Dev} \\ \hline Pthreads Condition Variable             & 6005  & 5681.43       & 835.45 \\ \uC @signal@                                    & 324           & 325.54        & 3.02   \\ \CFA @signal@, 1 @monitor@              & 368.5         & 370.61        & 4.77   \\ \CFA @signal@, 2 @monitor@              & 467           & 470.5 & 6.79   \\ \CFA @signal@, 4 @monitor@              & 700.5         & 702.46        & 7.23  \\ Java @notify@                                   & 15471 & 172511        & 5689 \\ \hline \begin{tabular}{@{}r*{3}{D{.}{.}{5.2}}@{}} \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\ Pthreads Cond. 
Variable         & 6005          & 5681.43       & 835.45        \\ \uC @signal@                            & 324           & 325.54        & 3.02          \\ \CFA @signal@, 1 @monitor@      & 368.5         & 370.61        & 4.77          \\ \CFA @signal@, 2 @monitor@      & 467           & 470.5         & 6.79          \\ \CFA @signal@, 4 @monitor@      & 700.5         & 702.46        & 7.23          \\ Java @notify@                           & 15471         & 172511        & 5689 \end{tabular} \end{figure} \end{multicols} Note, the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects. \begin{figure} \begin{multicols}{2} \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{}{}} \vspace*{-16pt} \begin{cfa} volatile int go = 0; monitor M { ... } m; monitor M {} m; thread T {}; void __attribute__((noinline)) do_call( M & mutex ) {} void __attribute__((noinline)) do_call( M & mutex ) {} void main( T & ) { while ( go == 0 ) { yield(); }  // wait for other thread to start while ( go == 0 ) { yield(); } while ( go == 1 ) { @do_call( m );@ } } int __attribute__((noinline)) do_wait( M & mutex m ) { int __attribute__((noinline)) do_wait( M & mutex m ) { go = 1; // continue other thread BENCH( for ( size_t i = 0; i < N; i += 1 ) { @waitfor( do_call, m );@ } ) BENCH( for ( N ) { @waitfor( do_call, m );@ } ) go = 0; // stop other thread sout | resultns; \label{f:ext-sched} \centering \columnbreak \vspace*{-16pt} \captionof{table}{External-scheduling comparison (nanoseconds)} \label{tab:ext-sched} \bigskip \begin{tabular}{|r|*{3}{D{.}{.}{3.2}|}} \cline{2-4} \multicolumn{1}{c|}{} & \multicolumn{1}{c|}{Median} &\multicolumn{1}{c|}{Average} & \multicolumn{1}{c|}{Std Dev} \\ \hline \uC @_Accept@                           & 358           & 359.11        & 2.53  \\ \CFA @waitfor@, 1 @monitor@     & 359           & 360.93        & 4.07  \\ \CFA @waitfor@, 2 @monitor@     & 450           & 449.39        & 
6.62  \\ \CFA @waitfor@, 4 @monitor@     & 652           & 655.64        & 7.73 \\ \hline \begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}} \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\ \uC @_Accept@                           & 358           & 359.11        & 2.53          \\ \CFA @waitfor@, 1 @monitor@     & 359           & 360.93        & 4.07          \\ \CFA @waitfor@, 2 @monitor@     & 450           & 449.39        & 6.62          \\ \CFA @waitfor@, 4 @monitor@     & 652           & 655.64        & 7.73 \end{tabular} \bigskip \medskip \end{multicols} \paragraph{Object Creation} Object creation is measured by creating/deleting the specific kind of concurrent object. Figure~\ref{f:creation} shows the code for \CFA, with results in Table~\ref{tab:creation}. The only note here is that the call stacks of \CFA coroutines are lazily created, therefore without priming the coroutine to force stack creation, the creation cost is artificially low. 
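The creation benchmark can be approximated in Go by timing goroutine creation plus completion; a sketch (the @createCost@ helper is hypothetical, and absolute numbers will differ from Table~\ref{tab:creation}):

```go
package main

import (
	"fmt"
	"time"
)

// createCost times creation/completion of n goroutines, analogous to
// the object-creation benchmark; the done channel observes completion
// so each goroutine's full lifetime is included (cf. eager vs lazy
// stack creation noted above).
func createCost(n int) time.Duration {
	done := make(chan struct{})
	before := time.Now()
	for i := 0; i < n; i++ {
		go func() { done <- struct{}{} }()
		<-done // wait for this goroutine to finish
	}
	return time.Since(before) / time.Duration(n)
}

func main() {
	fmt.Println(createCost(10000), "per goroutine")
}
```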
\begin{multicols}{2} \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{}{}} \begin{cfa} void main( MyThread & ) {} int main() { BENCH( for ( size_t i = 0; i < N; i += 1 ) { @MyThread m;@ } ) BENCH( for ( N ) { @MyThread m;@ } ) sout | resultns; } \label{f:creation} \centering \captionof{table}{Creation comparison (nanoseconds)} \columnbreak \vspace*{-16pt} \captionof{table}{Object creation comparison (nanoseconds)} \label{tab:creation} \bigskip \begin{tabular}{|r|*{3}{D{.}{.}{5.2}|}} \cline{2-4} \multicolumn{1}{c|}{} & \multicolumn{1}{c|}{Median} & \multicolumn{1}{c|}{Average} & \multicolumn{1}{c|}{Std Dev} \\ \hline Pthreads                                & 28091         & 28073.39      & 163.1  \\ \CFA Coroutine Lazy             & 6                     & 6.07          & 0.26   \\ \CFA Coroutine Eager    & 520           & 520.61        & 2.04   \\ \CFA Thread                             & 2032  & 2016.29       & 112.07  \\ \uC Coroutine                   & 106           & 107.36        & 1.47   \\ \uC Thread                              & 536.5 & 537.07        & 4.64   \\ Goroutine                               & 3103  & 3086.29       & 90.25  \\ Java Thread                             & 103416.5      & 103732.29     & 1137 \\ \hline \begin{tabular}[t]{@{}r*{3}{D{.}{.}{5.2}}@{}} \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\ Pthreads                                & 28091         & 28073.39      & 163.1         \\ \CFA Coroutine Lazy             & 6                     & 6.07          & 0.26          \\ \CFA Coroutine Eager    & 520           & 520.61        & 2.04          \\ \CFA Thread                             & 2032          & 2016.29       & 112.07        \\ \uC Coroutine                   & 106           & 107.36        & 1.47          \\ \uC Thread                              & 536.5         & 537.07        & 4.64          \\ Goroutine                     
          & 3103          & 3086.29       & 90.25         \\ Java Thread                             & 103416.5      & 103732.29     & 1137          \\ \end{tabular} \end{figure} \paragraph{Object Creation} Object creation is measured by creating/deleting the specific kind of concurrent object. Figure~\ref{f:creation} shows the code for \CFA, with results in Table~\ref{tab:creation}. The only note here is that the call stacks of \CFA coroutines are lazily created, therefore without priming the coroutine to force stack creation, the creation cost is artificially low. \end{multicols} \section{Conclusion} This paper demonstrates a concurrency API that is simple, efficient, and able to build higher-level concurrency features. The approach provides concurrency based on a preemptive M:N user-level threading-system, executing in clusters, which encapsulate scheduling of work on multiple kernel threads providing parallelism. Advanced control-flow will always be difficult, especially when there is temporal ordering and nondeterminism. However, many systems exacerbate the difficulty through their presentation mechanisms. This paper shows it is possible to present a hierarchy of control-flow features, generator, coroutine, thread, and monitor, providing an integrated set of high-level, efficient, and maintainable control-flow features. Eliminated from \CFA are spurious wakeup and barging, which are nonintuitive and lead to errors, and having to work with a bewildering set of low-level locks and acquisition techniques. \CFA high-level race-free monitors and tasks provide the core mechanisms for mutual exclusion and synchronization, without having to resort to magic qualifiers like @volatile@/@atomic@. Extending these mechanisms to handle high-level deadlock-free bulk acquire across both mutual exclusion and synchronization is a unique contribution. 
The \CFA runtime provides concurrency based on a preemptive M:N user-level threading-system, executing in clusters, which encapsulate scheduling of work on multiple kernel threads providing parallelism. The M:N model is judged to be efficient and to provide greater flexibility than a 1:1 threading model. High-level objects (monitor/task) are the core mechanism for mutual exclusion and synchronization. A novel aspect is allowing multiple mutex-objects to be accessed simultaneously, reducing the potential for deadlock for this complex scenario. These concepts and the entire \CFA runtime-system are written in the \CFA language, demonstrating the expressiveness of the \CFA language. These concepts and the entire \CFA runtime-system are written in the \CFA language, extensively leveraging the \CFA type-system, which demonstrates the expressiveness of the \CFA language. Performance comparisons with other concurrent systems/languages show the \CFA approach is competitive across all low-level operations, which translates directly into good performance in well-written concurrent applications. C programmers should feel comfortable using these mechanisms for developing concurrent applications, with the ability to obtain maximum available performance by mechanisms at the appropriate level. C programmers should feel comfortable using these mechanisms for developing complex control-flow in applications, with the ability to obtain maximum available performance by selecting mechanisms at the appropriate level of need. \section{Future Work} While concurrency in \CFA has a strong start, development is still underway and there are missing features. While control flow in \CFA has a strong start, development is still underway to complete a number of missing features. \paragraph{Flexible Scheduling} Different scheduling algorithms can affect performance (both in terms of average and variation). 
However, no single scheduler is optimal for all workloads and therefore there is value in being able to change the scheduler for given programs. One solution is to offer various tweaking options, allowing the scheduler to be adjusted to the requirements of the workload. One solution is to offer various tuning options, allowing the scheduler to be adjusted to the requirements of the workload. However, to be truly flexible, a pluggable scheduler is necessary. Currently, the \CFA pluggable scheduler is too simple to handle complex scheduling, \eg quality of service and real-time, where the scheduler must interact with mutex objects to deal with issues like priority inversion. While monitors offer flexible and powerful concurrency for \CFA, other concurrency tools are also necessary for a complete multi-paradigm concurrency package. Examples of such tools can include futures and promises~\cite{promises}, executors and actors. These additional features are useful when monitors offer a level of abstraction that is inadequate for certain tasks. These additional features are useful for applications that can be constructed without shared data and direct blocking. As well, new \CFA extensions should make it possible to create a uniform interface for virtually all mutual exclusion, including monitors and low-level locks.
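Futures of the kind mentioned above can be approximated with a one-shot buffered channel; a minimal sketch (the @future@ helper is illustrative, not a full promise API):

```go
package main

import "fmt"

// future returns a one-shot channel that will deliver f's result: a
// minimal future/promise built from a buffered channel, so the
// producing goroutine never blocks on delivery.
func future(f func() int) <-chan int {
	ch := make(chan int, 1)
	go func() { ch <- f() }()
	return ch
}

func main() {
	fut := future(func() int { return 6 * 7 })
	// ... other work may proceed here before the result is needed ...
	fmt.Println(<-fut) // blocks until the result is ready; prints 42
}
```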
• ## doc/papers/concurrency/examples/PingPong.c

 r6625727 typedef struct PingPong { const char * name; int N, i; struct PingPong * partner; int N, i; void * next; } PingPong; #define PPCtor( name, N ) { name, NULL, N, 0, NULL } #define PPCtor( name, N ) { name, N, 0, NULL, NULL } void comain( PingPong * pp ) __attribute__(( noinline )); void comain( PingPong * pp ) {
• ## doc/papers/concurrency/examples/Pingpong.cfa

 r6625727 }; void ?{}( PingPong & this, const char * name, unsigned int N, PingPong & part ) { this.[name, N] = [name, N];  &this.part = ∂ } void ?{}( PingPong & this, const char * name, unsigned int N ) { this{ name, N, *0p };                                                           // call first constructor } // void ?{}( PingPong & this, const char * name, unsigned int N, PingPong & part ) { //      this.[name, N] = [name, N];  &this.part = ∂ // } // void ?{}( PingPong & this, const char * name, unsigned int N ) { //      this{ name, N, *0p };                                                           // call first constructor // } void cycle( PingPong & pingpong ) { resume( pingpong );
• ## doc/papers/concurrency/figures/corlayout.fig

 r6625727 (xfig source diff: text labels revised — ``fields'' becomes ``program fields'', a ``descriptor'' label is added, the ``coroutine'' label is dropped, and the ``handle'' label is repositioned; the ``stack$_2$'' and ``descriptor pointer'' labels are unchanged; coordinate data omitted)
• ## doc/papers/concurrency/figures/ext_monitor.fig

 r6625727 (xfig source diff: all figure elements are repositioned — the legend (active task, blocked task, routine mask), entry-queue entries a/b/c/d with arrival-order note, the X/Y call labels, condition queues A and B, the acceptor/signalled stack, the urgent queue, the exit arrow, and the shared-variables box all shift, and the separate ``entry''/``queue'' labels merge into a single ``entry queue'' label; coordinate data omitted)
• ## doc/papers/concurrency/figures/monitor.fig

 r6625727 (xfig source diff: figure elements are repositioned — the legend (active task, blocked task, duplicate), entry-queue entries a/b/c/d with arrival-order note, the X/Y labels, condition queues A and B, the acceptor/signalled stack, the urgent queue, the exit arrow, and the shared-variables box all shift; the ``mutex''/``queues'' labels merge into ``mutex queues'' and ``entry''/``queue'' merge into ``entry queue''; coordinate data omitted)