# Changeset bd12159 for doc

Timestamp:
Jun 17, 2019, 3:35:22 PM
Branches:
arm-eh, jacob/cs343-translation, jenkins-sandbox, master, new-ast, new-ast-unique-expr
Children:
89faa82
Parents:
b4d34fa
Message:

complete draft for second version of concurrency paper

Location:
doc/papers/concurrency
Files:
3 edited

\usepackage{epic,eepic}
\usepackage{xspace}
\usepackage{enumitem}
\usepackage{comment}
\usepackage{upquote}						% switch curled quotes to straight
\usepackage{dcolumn}						% align decimal points in tables
\usepackage{capt-of}
\setlength{\multicolsep}{6.0pt plus 2.0pt minus 1.5pt}
\hypersetup{breaklinks=true}

\abstract[Summary]{
\CFA is a polymorphic, non-object-oriented, concurrent, backwards-compatible extension of the C programming language.
This paper discusses the design philosophy and implementation of its advanced control-flow and concurrent/parallel features, along with the supporting runtime written in \CFA.
These features are created from scratch as ISO C has only low-level and/or unimplemented concurrency, so C programmers continue to rely on library features like pthreads.
\CFA introduces modern language-level control-flow mechanisms, like generators, coroutines, user-level threading, and monitors for mutual exclusion and synchronization.
% Library extensions for executors, futures, and actors are built on these basic mechanisms.
The runtime provides significant programmer simplification and safety by eliminating spurious wakeup and monitor barging.
The runtime also ensures multiple monitors can be safely acquired \emph{simultaneously} (deadlock free), and this feature is fully integrated with all monitor synchronization mechanisms.
All control-flow features integrate with the \CFA polymorphic type-system and exception handling, while respecting the expectations and style of C programmers.}

\section{Introduction}

This paper discusses the design philosophy and implementation of advanced language-level control-flow and concurrent/parallel features in \CFA~\cite{Moss18,Cforall} and its runtime, which is written entirely in \CFA.
\CFA is a modern, polymorphic, non-object-oriented\footnote{
\CFA has features often associated with object-oriented programming languages, such as constructors, destructors, virtuals and simple inheritance.
However, functions \emph{cannot} be nested in structures, so there is no lexical binding between a structure and set of functions (member/method) implemented by an implicit \lstinline@this@ (receiver) parameter.},
backwards-compatible extension of the C programming language.
In many ways, \CFA is to C as Scala~\cite{Scala} is to Java, providing a \emph{research vehicle} for new typing and control-flow capabilities on top of a highly popular programming language, allowing immediate dissemination.
Within the \CFA framework, new control-flow features are created from scratch because ISO \Celeven defines only a subset of the \CFA extensions, where the overlapping features are concurrency~\cite[\S~7.26]{C11}.
However, \Celeven concurrency is largely wrappers for a subset of the pthreads library~\cite{Butenhof97,Pthreads}, and \Celeven and pthreads concurrency is simple, based on thread fork/join in a function and a few locks, which is low-level and error prone;
no high-level language concurrency features are defined.
Interestingly, almost a decade after publication of the \Celeven standard, neither gcc-8, clang-8 nor msvc-19 (most recent versions) support the \Celeven include @threads.h@, indicating little interest in the \Celeven concurrency approach.
The main argument for user-level threading is that it is lighter weight than kernel threading (locking and context switching do not cross the kernel boundary), so there is less restriction on programming styles that encourage large numbers of threads performing medium-sized work-units to facilitate load balancing by the runtime~\cite{Verch12}.
As well, user-threading facilitates a simpler concurrency approach using thread objects that leverage sequential patterns versus events with call-backs~\cite{vonBehren03}.
Finally, performant user-threading implementations (both time and space) meet or exceed direct kernel-threading implementations, while achieving the programming advantages of high concurrency levels and safety.
A further effort over the past two decades is the development of language memory-models to deal with the conflict between language features and compiler/hardware optimizations, \ie some language features are unsafe in the presence of aggressive sequential optimizations~\cite{Buhr95a,Boehm05}.
Finally, language solutions allow matching constructs with language paradigm, \ie imperative and functional languages often have different presentations of the same concept to fit their programming model.
Finally, it is important for a language to provide safety over performance \emph{as the default}, allowing careful reduction of safety for performance when necessary.
Two concurrency violations of this philosophy are \emph{spurious wakeup} and \emph{barging}, \ie random wakeup~\cite[\S~8]{Buhr05a} and signals-as-hints~\cite[\S~8]{Buhr05a}, where one is a consequence of the other, \ie once there is spurious wakeup, signals-as-hints follows.
However, spurious wakeup is \emph{not} a foundational concurrency property~\cite[\S~8]{Buhr05a};
it is a performance design choice.
Similarly, signals-as-hints is often a performance decision.
We argue removing spurious wakeup and signals-as-hints makes concurrent programming significantly safer because it removes local non-determinism and matches with programmer expectation.
(Experience by the authors teaching concurrency is that students are highly confused by these semantics.)
Clawing back performance, when local non-determinism is unimportant, should be an option, not the default.
\begin{comment}
For example, it is possible to provide exceptions, coroutines, monitors, and tasks as specialized types in an object-oriented language, integrating these constructs to allow leveraging the type-system (static type-checking) and all other object-oriented capabilities~\cite{uC++}.
It is also possible to leverage call/return for blocking communication via new control structures, versus switching to alternative communication paradigms, like channels or message passing.
As well, user threading is often a complementary feature, allowing light-weight threading to match with low-cost objects, while hiding the application/kernel boundary.
User threading also allows layering of implicit concurrency models (no explicit thread creation), such as executors, data-flow, and actors, into a single language, so programmers can choose the model that best fits an algorithm.\footnote{
All implicit concurrency models have explicit threading in their implementation, and hence, can be built from explicit threading;
however, the reverse is seldom true, \ie given implicit concurrency, \eg actors, it is virtually impossible to create explicit concurrency, \eg blocking thread objects.}
Finally, with extended language features and user-level threading it is possible to discretely fold locking and non-blocking I/O multiplexing into the language's I/O libraries, so threading implicitly dovetails with the I/O subsystem.
\end{comment}
\CFA embraces language extensions and user-level threading to provide advanced control-flow (exception handling\footnote{
\CFA exception handling will be presented in a separate paper.
The key feature that dovetails with this paper is non-local exceptions allowing exceptions to be raised across stacks, with synchronous exceptions raised among coroutines and asynchronous exceptions raised among threads, similar to that in \uC~\cite[\S~5]{uC++}.}
and coroutines) and concurrency.
Most augmented traditional (Fortran 18~\cite{Fortran18}, Cobol 14~\cite{Cobol14}, Ada 12~\cite{Ada12}, Java 11~\cite{Java11}) and new languages (Go~\cite{Go}, Rust~\cite{Rust}, and D~\cite{D}), except \CC, diverge from C with different syntax and semantics, only interoperate indirectly with C, and are not systems languages, for those with managed memory.
As a result, there is a significant learning curve to move to these languages, and C legacy-code must be rewritten.
\CFA embraces user-level threading, language extensions for advanced control-flow, and safety as the default.
We present comparative examples so the reader can judge if the \CFA control-flow extensions are better and safer than those in other concurrent, imperative programming-languages, and perform experiments to show the \CFA runtime is competitive with other similar mechanisms.
The main contributions of this work are:
\begin{itemize}
The name \lstinline|main| has special meaning in C, specifically the function where a program starts execution.
Hence, overloading this name for other starting points (generator/coroutine/thread) is a logical extension.}
called a \emph{generator main}, which takes as its only parameter a reference to the generator type.
The generator main contains @suspend@ statements that suspend execution without ending the generator versus @return@.
For the Fibonacci generator-main,\footnote{
Having to manually create the generator closure by moving local-state variables into the generator type is an additional programmer burden.
(This restriction is removed by the coroutine in Section~\ref{s:Coroutine}.)
This requirement follows from the generality of variable-size local-state, \eg local state with a variable-length array requires dynamic allocation because the array size is unknown at compile time.
However, dynamic allocation significantly increases the cost of generator creation/destruction and is a show-stopper for embedded real-time programming.
Note, the cost of creating and resuming the device-driver generator, @Driver@, is virtually identical to call/return, so performance in an operating-system kernel is excellent.
As well, the data state is small, where variables @byte@ and @msg@ are communication variables for passing in message bytes and returning the message, and variables @lnth@, @crc@, and @sum@ are local variables that must be retained between calls and are manually hoisted into the generator type.
% Manually, detecting and hoisting local-state variables is easy when the number is small.
Finally, the execution state is large, with one @resume@ and seven @suspend@s.
Hence, the key benefits of the generator are correctness, safety, and maintenance because the execution states are transcribed directly into the programming language rather than using a table-driven approach.
Because FSMs can be complex and occur frequently in important domains, direct support of the generator is crucial in a systems programming-language.
Figure~\ref{f:CPingPongSim} shows the implementation of the symmetric generator, where the complexity is the @resume@, which needs an extension to the calling convention to perform a forward rather than backward jump.
This jump starts at the top of the next generator main to re-execute the normal calling convention to make space on the stack for its local variables.
However, before the jump, the caller must reset its stack (and any registers) equivalent to a @return@, but subsequently jump forward.
This semantics is basically a tail-call optimization, which compilers already perform.
The example shows the assembly code to undo the generator's entry code before the direct jump.
This assembly code depends on what entry code is generated, specifically if there are local variables and the level of optimization.
To provide this new calling convention requires a mechanism built into the compiler, which is beyond the scope of \CFA at this time.
Nevertheless, it is possible to hand generate any symmetric generators for proof of concept and performance testing.
Finally, part of this generator work was inspired by the recent \CCtwenty generator proposal~\cite{C++20Coroutine19} (which they call coroutines).
Our work provides the same high-performance asymmetric-generators as \CCtwenty, and extends their work with symmetric generators.
An additional \CCtwenty generator feature allows @suspend@ and @resume@ to be followed by a restricted compound-statement that is executed after the current generator has reset its stack but before calling the next generator, specified with \CFA syntax:
\begin{cfa}
... suspend{ ... };
\end{cfa}
Stackful coroutines extend generator semantics, \ie there is an implicit closure and @suspend@ may appear in a helper function called from the coroutine main.
A coroutine is specified by replacing @generator@ with @coroutine@ for the type.
Coroutine generality results in higher cost for creation, due to dynamic stack allocation, execution, due to context switching among stacks, and termination, due to possible stack unwinding and dynamic stack deallocation.
A series of different kinds of coroutines and their implementation demonstrate how coroutines extend generators.
Hence, if a coroutine's destructor detects the coroutine is not ended, it implicitly raises a cancellation exception (uncatchable exception) at the coroutine and resumes it, so the cancellation exception can propagate to the root of the coroutine's stack, destroying all local variables on the stack.
So the \CFA semantics for the generator and coroutine ensure both can be safely deallocated at any time, regardless of their current state, like any other aggregate object.
Explicitly raising normal exceptions at another coroutine can replace flag variables, like @stop@, \eg @prod@ raises a @stop@ exception at @cons@ after it finishes generating values and resumes @cons@, which catches the @stop@ exception to terminate its loop.
Finally, there is an interesting effect for @suspend@ with symmetric coroutines.

\subsection{(Generator) Coroutine Implementation}

A significant implementation challenge for generators/coroutines (and threads in Section~\ref{s:threads}) is adding extra fields to the custom types and related functions, \eg inserting code after/before the coroutine constructor/destructor and @main@ to create/initialize/de-initialize/destroy any extra fields, \eg stack.
There are several solutions to this problem, which follow from the object-oriented flavour of adopting custom types.
For object-oriented languages, inheritance is used to provide extra fields and code via explicit inheritance:
As well, some special properties are not handled by existing language semantics, \eg the execution of constructors/destructors is in the wrong order to implicitly start threads because the thread must start \emph{after} all constructors, as it relies on a completely initialized object, but the inherited constructor runs \emph{before} the derived.
Alternatives, such as explicitly starting threads as in Java, are repetitive, and forgetting to call start is a common source of errors.
An alternative is composition:
\begin{cfa}
The downside of this approach is that it makes custom types a special case in the language.
Users wanting to extend custom types or build their own can only do so in ways offered by the language.
Furthermore, implementing custom types without language support may display the power of a programming language.
\CFA blends the two approaches, providing custom types for idiomatic \CFA code, while extending and building new custom types is still possible, similar to the Java approach with builtin and library concurrency.
Part of the mechanism to generalize custom types is the \CFA trait~\cite[\S~2.3]{Moss18}, \eg the definition for custom-type @coroutine@ is anything satisfying the trait @is_coroutine@, and this trait both enforces and restricts the coroutine-interface functions.
\begin{cfa}
trait is_coroutine( dtype T ) {
	void main( T & );
	coroutine_desc * get_coroutine( T & );
};
forall( dtype T | is_coroutine(T) ) void $suspend$( T & ), resume( T & );
\end{cfa}
Note, copying generators/coroutines/threads is not meaningful.
Furthermore, two coroutines cannot logically execute on the same stack.
A deep coroutine copy, which copies the stack, is also meaningless in an unmanaged language (no garbage collection), like C, because the stack may contain pointers to objects within it that require updating for the copy.
The \CFA @dtype@ property provides no \emph{implicit} copying operations and the @is_coroutine@ trait provides no \emph{explicit} copying operations, so all coroutines must be passed by reference (pointer).
The function definitions ensure there is a statically-typed @main@ function that is the starting point (first stack frame) of a coroutine, and a mechanism to get (read) the currently executing coroutine handle.
The @main@ function has no return value or additional parameters because the coroutine type allows an arbitrary number of interface functions with corresponding arbitrary typed input/output values versus fixed ones.
The advantage of this approach is that users can easily create different types of coroutines, \eg changing the memory layout of a coroutine is trivial when implementing the @get_coroutine@ function, and possibly redefining @suspend@ and @resume@.
The \CFA custom-type @coroutine@ implicitly implements the getter and forward declarations for the coroutine main.
\begin{cquote}
\begin{tabular}{@{}ccc@{}}
\end{tabular}
\end{cquote}
The combination of custom types and fundamental @trait@ description of these types allows a concise specification for programmers and tools, while more advanced programmers can have tighter control over memory layout and initialization.
Figure~\ref{f:CoroutineMemoryLayout} shows different memory-layout options for a coroutine (where a task is similar).
The coroutine handle is the @coroutine@ instance containing the programmer-specified type's global/communication variables across interface functions.
The coroutine descriptor contains all implicit declarations needed by the runtime, \eg @suspend@/@resume@, and can be part of the coroutine handle or separate.
The coroutine stack can appear in a number of locations and be fixed or variable sized.
Hence, the coroutine's stack could be a VLS\footnote{
We are examining variable-sized structures (VLS), where fields can be variable-sized structures or arrays.
For heap stack allocation, allocation/deallocation is an expensive heap allocation (where the heap can be a shared resource), modulo any stack constructor costs.
With heap stack allocation, it is also possible to use a split (segmented) stack calling-convention, available with gcc and clang, so the stack is variable sized.
Currently, \CFA supports stack/heap allocated descriptors but only fixed-sized heap allocated stacks.
In \CFA debug-mode, the fixed-sized stack is terminated with a write-only page, which catches most stack overflows.
Experience teaching concurrency with \uC~\cite{CS343} shows fixed-sized stacks are rarely an issue for students.
Split-stack allocation is under development but requires recompilation of legacy code, which may be impossible.
\begin{figure}
However, coroutines are a stepping stone towards concurrency.
The transition to concurrency, even for a single thread with multiple stacks, occurs when coroutines context switch to a \newterm{scheduling coroutine}, introducing non-determinism from the coroutine perspective~\cite[\S~3]{Buhr05a}\cite{Adya02}.
Therefore, a minimal concurrency system requires coroutines \emph{in conjunction with a nondeterministic scheduler}.
The resulting execution system now follows a cooperative threading-model, called \newterm{non-preemptive scheduling}.
For stackless coroutines, the scheduler performs scheduling on the stack of the current coroutine and switches directly to the next coroutine, so there is one context switch.
For stackful coroutines, the current coroutine switches to the scheduler, which performs scheduling, and it then switches to the next coroutine, so there are two context switches.
The \CFA runtime uses a stackful scheduler for uniformity and security.
\begin{tabular}{@{}lll@{}}
\multicolumn{1}{c}{\textbf{Java}} & \multicolumn{1}{c}{\textbf{\Celeven}} & \multicolumn{1}{c}{\textbf{pthreads}} \\
\begin{cfa}
class MyTask extends Thread {...}
mytask t = new MyTask(...);
\end{cfa}
&
\begin{cfa}
class MyTask { ... } // functor
MyTask mytask;
\end{cfa}
&
\begin{cfa}
void * rtn( void * arg ) {...}
pthread_t t;  int i = 3;

\subsection{Thread Implementation}

Threads in \CFA are user level, run by runtime kernel threads (see Section~\ref{s:CFARuntimeStructure}), where user threads provide concurrency and kernel threads provide parallelism.
Like coroutines, and for the same design reasons, \CFA provides a custom @thread@ type and a @trait@ to enforce and restrict the task-interface functions.
\begin{cquote}
Similarly, the function definitions ensure there is a statically-typed @main@ function that is the thread starting point (first stack frame), a mechanism to get (read) the currently executing thread handle, and a special destructor to prevent deallocation while the thread is executing.
(The qualifier @mutex@ for the destructor parameter is discussed in Section~\ref{s:Monitor}.)
The difference between the coroutine and thread is that a coroutine borrows a thread from its caller, so the first thread resuming a coroutine creates the coroutine's stack and starts running the coroutine main on the stack;
whereas, a thread is scheduled for execution in @main@ immediately after its constructor is run.
No return value or additional parameters are necessary for this function because the @thread@ type allows an arbitrary number of interface functions with corresponding arbitrary typed input/output values.
\begin{comment} % put in appendix with coroutine version ???
As such the @main@ function of a thread can be defined as
\begin{cfa}
thread foo {};
void main(foo & this) {
	sout | "Hello World!";
}
\end{cfa}
In this example, threads of type @foo@ start execution in the @void main(foo &)@ function, which prints @"Hello World!"@.
While this paper encourages this approach to enforce strongly typed programming, users may prefer to use the function-based thread semantics for the sake of simplicity.
With the static semantics it is trivial to write a thread type that takes a function pointer as a parameter and executes it on its stack asynchronously.
\begin{cfa}
typedef void (*voidRtn)(int);
thread RtnRunner {
	voidRtn func;
	int arg;
};
void ?{}(RtnRunner & this, voidRtn inRtn, int arg) {
	this.func = inRtn;
	this.arg  = arg;
}
void main(RtnRunner & this) {
	// thread starts here and runs the function
	this.func( this.arg );
}
void hello(/*unused*/ int) {
	sout | "Hello World!";
}
int main() {
	RtnRunner f = {hello, 42};
	return 0;
}
\end{cfa}
A consequence of the strongly typed approach to main is that memory layout of parameters and return values to/from a thread are now explicitly specified in the \textbf{API}.
\end{comment}

\section{Mutual Exclusion / Synchronization}

Unrestricted nondeterminism is meaningless as there is no way to know when a result is completed without synchronization.
To produce meaningful execution requires clawing back some determinism using mutual exclusion and synchronization, where mutual exclusion provides access control for threads using shared data, and synchronization is a timing relationship among threads~\cite[\S~4]{Buhr05a}.
Some concurrent systems eliminate mutable shared-state by switching to stateless communication like message passing~\cite{Thoth,Harmony,V-Kernel,MPI} (Erlang, MPI), channels~\cite{CSP} (CSP, Go), actors~\cite{Akka} (Akka, Scala), or functional techniques (Haskell).
However, these approaches introduce a new communication mechanism for concurrency different from the standard communication using function call/return.
Hence, a programmer must learn and manipulate two sets of design/programming patterns.
While this distinction can be hidden away in library code, effective use of the library still has to take both paradigms into account.
The generalization is called a \newterm{group critical-section}~\cite{Joung00}, where multiple tasks with the same session may use the resource simultaneously, but different sessions may not use the resource simultaneously.
The readers/writer problem~\cite{Courtois71} is an instance of a group critical-section, where readers share a session but writers have a unique session.
\newterm{Mutual exclusion} enforces the correct kind and number of threads using a critical section.

However, many solutions exist for mutual exclusion, which vary in terms of performance, flexibility, and ease of use.
A \textbf{monitor} is a set of functions that ensure mutual exclusion when accessing shared state.
More precisely, a monitor is a programming technique that implicitly binds mutual exclusion to static function scope, as opposed to locks, where mutual exclusion is defined by acquire/release calls, independent of lexical context (analogous to block and heap storage allocation).
Restricting acquire/release points eases programming, comprehension, and maintenance, at a slight cost in flexibility and efficiency.
\CFA uses a custom @monitor@ type and leverages declaration semantics (deallocation) to protect active or waiting threads in a monitor.
The following is a \CFA monitor implementation of an atomic counter.
\newpage
\begin{cfa}[morekeywords=nomutex]
monitor Aint { int cnt; }; $\C[4.25in]{// atomic integer counter}$
int ++?( Aint & mutex$_{opt}$ this ) with( this ) { return ++cnt; } $\C{// increment}$
int ?=?( Aint & mutex$_{opt}$ lhs, int rhs ) with( lhs ) { cnt = rhs; } $\C{// conversions with int}\CRT$
int ?=?( int & lhs, Aint & mutex$_{opt}$ rhs ) with( rhs ) { lhs = cnt; }
\end{cfa}
}
\end{cfa}
\CFA monitors also ensure the monitor lock is released regardless of how an acquiring function ends (normal or exceptional), and returning a shared variable is safe via copying before the lock is released.
Similar safety is offered by \emph{explicit} mechanisms like \CC RAII;
monitor \emph{implicit} safety ensures no programmer usage errors.
Furthermore, RAII mechanisms cannot handle complex synchronization within a monitor, where the monitor lock may not be released on function exit because it is passed to an unblocking thread;
RAII is purely a mutual-exclusion mechanism (see Section~\ref{s:Scheduling}).
\end{tabular}
\end{cquote}
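For readers more familiar with pthreads, the atomic-counter monitor above can be approximated in C; this is only a hand-written sketch, where the mutex-parameter semantics are simulated by acquiring a lock on entry to each function and releasing it on every exit, and the names are illustrative.

```c
#include <pthread.h>

/* Sketch of the Aint monitor using pthreads: the lock acquired on
   entry and released on every exit plays the role of the implicit
   monitor lock on the mutex parameter. */
typedef struct {
	pthread_mutex_t lock;
	int cnt;
} Aint;

void Aint_init( Aint * a ) {
	pthread_mutex_init( &a->lock, NULL );
	a->cnt = 0;
}

int Aint_inc( Aint * a ) {				// counterpart of ++?
	pthread_mutex_lock( &a->lock );
	int v = ++a->cnt;					// copy result before releasing the lock
	pthread_mutex_unlock( &a->lock );
	return v;
}

int Aint_read( Aint * a ) {				// counterpart of the int conversion
	pthread_mutex_lock( &a->lock );
	int v = a->cnt;
	pthread_mutex_unlock( &a->lock );
	return v;
}
```

Note how the result is copied into a local before the unlock, mirroring the paper's point that returning a shared variable is only safe via copying before the lock is released.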
The @dtype@ property prevents \emph{implicit} copy operations and the @is_monitor@ trait provides no \emph{explicit} copy operations, so monitors must be passed by reference (pointer).
% Copying a lock is insecure because it is possible to copy an open lock and then use the open copy when the original lock is closed to simultaneously access the shared data.
% Copying a monitor is secure because both the lock and shared data are copies, but copying the shared data is meaningless because it no longer represents a unique entity.
Similarly, the function definitions ensure there is a mechanism to get (read) the currently executing monitor handle, and a special destructor to prevent deallocation if a thread is using the shared data.
The custom monitor type also inserts any locks needed to implement the mutual-exclusion semantics.

The benefit of mandatory monitor qualifiers is self-documentation, but requiring both @mutex@ and \lstinline[morekeywords=nomutex]@nomutex@ for all monitor parameters is redundant.
Instead, the semantics have one qualifier as the default and the other required.
For example, make the safe @mutex@ qualifier the default because assuming \lstinline[morekeywords=nomutex]@nomutex@ may cause subtle errors.
Alternatively, make the unsafe \lstinline[morekeywords=nomutex]@nomutex@ qualifier the default because it is the \emph{normal} parameter semantics while @mutex@ parameters are rare.
Providing a default qualifier implies knowing whether a parameter is a monitor.
Since \CFA relies heavily on traits as an abstraction mechanism, types can coincidentally match the monitor trait but not be a monitor, similar to inheritance where a shape and a playing card can both be drawable.
For this reason, \CFA requires programmers to identify the kind of parameter with the @mutex@ keyword and uses no keyword to mean \lstinline[morekeywords=nomutex]@nomutex@.

\newpage
The next semantic decision is establishing which parameter \emph{types} may be qualified with @mutex@.
The following has monitor parameter types that are composed of multiple objects.
A positive consequence of this design decision is the ability to support multi-monitor functions,\footnote{
While object-oriented monitors can be extended with a mutex qualifier for multiple-monitor members, no prior example of this feature could be found.}
called \newterm{bulk acquire}.
\CFA guarantees acquisition order is consistent across calls to @mutex@ functions using the same monitors as arguments, so acquiring multiple monitors is safe from deadlock.
Figure~\ref{f:BankTransfer} shows a trivial solution to the bank transfer problem, where two resources must be locked simultaneously, using \CFA monitors with implicit locking and \CC with explicit locking.
A \CFA programmer only has to manage when to acquire mutual exclusion;
a \CC programmer must select the correct lock and acquisition mechanism from a panoply of locking options.
\end{figure}
Users can still force the acquiring order by using @mutex@/\lstinline[morekeywords=nomutex]@nomutex@.
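For comparison, the explicit-locking side of the bank transfer can be sketched in C with pthreads, acquiring both account locks in a fixed (address) order so that two concurrent transfers in opposite directions cannot deadlock; bulk acquire provides this consistent-ordering guarantee implicitly, whereas here it must be hand-coded. The names are illustrative, not from the paper's figure.

```c
#include <pthread.h>

typedef struct {
	pthread_mutex_t lock;
	int balance;
} Account;

void account_init( Account * a, int balance ) {
	pthread_mutex_init( &a->lock, NULL );
	a->balance = balance;
}

// acquire both locks in a consistent (address) order to avoid deadlock
static void lock_both( Account * a, Account * b ) {
	if ( a < b ) { pthread_mutex_lock( &a->lock ); pthread_mutex_lock( &b->lock ); }
	else { pthread_mutex_lock( &b->lock ); pthread_mutex_lock( &a->lock ); }
}

void transfer( Account * from, Account * to, int amount ) {
	lock_both( from, to );				// manual analogue of bulk acquire
	from->balance -= amount;
	to->balance += amount;
	pthread_mutex_unlock( &from->lock );
	pthread_mutex_unlock( &to->lock );
}
```

The ordering decision (and the risk of getting it wrong) is exactly the bookkeeping a \CC programmer must perform that the \CFA @mutex@ parameters remove.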
\newpage
\begin{cfa}
void foo( M & mutex m1, M & mutex m2 ); $\C{// acquire m1 and m2}$
\end{cfa}
The bulk-acquire semantics allow @bar@ or @baz@ to acquire a monitor lock and reacquire it in @foo@.
In the calls to @bar@ and @baz@, the monitors are acquired in opposite order, possibly resulting in deadlock.
However, this case is the simplest instance of the \emph{nested-monitor problem}~\cite{Lister77}, where monitors are acquired in sequence versus bulk.
Detecting the nested-monitor problem requires dynamic tracking of monitor calls, and dealing with it requires rollback semantics~\cite{Dice10}.
\CFA does not deal with this fundamental problem.

\subsection{\protect\lstinline@mutex@ statement}
\label{mutex-stmt}

The monitor call semantics associate all locking with functions.
Finally, like Java, \CFA offers an alternative @mutex@ statement to reduce refactoring and naming.
\begin{cquote}
\renewcommand{\arraystretch}{0.0}
\begin{tabular}{@{}l@{\hspace{3\parindentlnth}}l@{}}
\multicolumn{1}{c}{\textbf{\lstinline@mutex@ call}} & \multicolumn{1}{c}{\lstinline@mutex@ \textbf{statement}} \\
\begin{cfa}[aboveskip=0pt,belowskip=0pt]
monitor M { ... };
void foo( M & mutex m1, M & mutex m2 ) {
\end{cfa}
&
\begin{cfa}[aboveskip=0pt,belowskip=0pt]
void bar( M & m1, M & m2 ) {
\label{s:Scheduling}

% There are many aspects of scheduling in a concurrency system, all related to resource utilization by waiting threads, \ie which thread gets the resource next.
% Different forms of scheduling include access to processors by threads (see Section~\ref{s:RuntimeStructureCluster}), another is access to a shared resource by a lock or monitor.
This section discusses monitor scheduling for waiting threads eligible for entry, \ie which thread gets the shared resource next.
(See Section~\ref{s:RuntimeStructureCluster} for scheduling threads on virtual processors.)
While monitor mutual-exclusion provides safe access to shared data, the monitor data may indicate that a thread accessing it cannot proceed, \eg a bounded buffer may be full/empty so producer/consumer threads must block.
Leaving the monitor and trying again (busy waiting) is impractical for high-level programming.
Monitors eliminate busy waiting by providing synchronization to schedule threads needing access to the shared data, where threads block versus spinning.
Synchronization is generally achieved with internal~\cite{Hoare74} or external~\cite[\S~2.9.2]{uC++} scheduling.
\newterm{Internal scheduling} is characterized by each thread entering the monitor and making an individual decision about proceeding or blocking, while \newterm{external scheduling} is characterized by an entering thread making a decision about proceeding for itself and on behalf of other threads attempting entry.

Finally, \CFA monitors do not allow calling threads to barge ahead of signalled threads, which simplifies synchronization among threads in the monitor and increases correctness.
If barging is allowed, synchronization between a signaller and signallee is difficult, often requiring additional flags and multiple unblock/block cycles.
In fact, signals-as-hints is completely opposite from that proposed by Hoare in the seminal paper on monitors~\cite[p.~550]{Hoare74}.
% \begin{cquote}
% However, we decree that a signal operation be followed immediately by resumption of a waiting program, without possibility of an intervening procedure call from yet a third program.
% It is only in this way that a waiting program has an absolute guarantee that it can acquire the resource just released by the signalling program without any danger that a third program will interpose a monitor entry and seize the resource instead.~\cite[p.~550]{Hoare74}
% \end{cquote}
Furthermore, \CFA concurrency has no spurious wakeup~\cite[\S~9]{Buhr05a}, which eliminates an implicit form of barging.
Hence, a \CFA @wait@ statement is not enclosed in a @while@ loop retesting a blocking predicate, which can cause thread starvation due to barging.

Figure~\ref{f:MonitorScheduling} shows internal/external scheduling (for the bounded-buffer example in Figure~\ref{f:InternalExternalScheduling}).
External calling threads block on the calling queue, if the monitor is occupied, otherwise they enter in FIFO order.
Internal threads block on condition queues via @wait@ and reenter from the condition in FIFO order.
There are three signalling mechanisms to unblock waiting threads to enter the monitor.
Note, signalling cannot have the signaller and signalled thread in the monitor simultaneously because of the mutual exclusion, so only one can proceed.
For internal scheduling, threads are unblocked from condition queues using @signal@, where the signallee is moved to urgent and the signaller continues (solid line).
Multiple signals move multiple signallees to urgent, until the condition is empty.
When the signaller exits or waits, a thread blocked on urgent is processed before calling threads to prevent barging.
(Java conceptually moves the signalled thread to the calling queue, and hence, allows barging.)
The alternative unblock is in the opposite order using @signal_block@, where the signaller is moved to urgent and the signallee continues (dashed line), and is implicitly unblocked from urgent when the signallee exits or waits.

For external scheduling, the condition queues are not used;
instead threads are unblocked directly from the calling queue using @waitfor@ based on function names requesting mutual exclusion.
(The linear search through the calling queue to locate a particular call can be reduced to $O(1)$.)
The @waitfor@ has the same semantics as @signal_block@, where the signalled thread executes before the signaller, which waits on urgent.
Executing multiple @waitfor@s from different signalled functions causes the calling threads to move to urgent.
External scheduling requires urgent to be a stack, because the signaller expects to execute immediately after the specified monitor call has exited or waited.
Internal scheduling behaves the same for an urgent stack or queue, except for signalling multiple threads, where the threads unblock from urgent in reverse order from signalling.
If the restart order is important, multiple signalling by a signalling thread can be transformed into shared signalling among threads, where each thread signals the next thread.
Hence, \CFA uses an urgent stack.
\begin{figure}
\centering
% \subfloat[Scheduling Statements] {
% \label{fig:SchedulingStatements}
% {\resizebox{0.45\textwidth}{!}{\input{CondSigWait.pstex_t}}}
\input{CondSigWait.pstex_t}
% }% subfloat
% \quad
% \subfloat[Bulk acquire monitor] {
% \label{fig:BulkMonitor}
% {\resizebox{0.45\textwidth}{!}{\input{ext_monitor.pstex_t}}}
% }% subfloat
\caption{Monitor Scheduling}
\label{f:MonitorScheduling}
\end{figure}

Figure~\ref{f:BBInt} shows a \CFA generic bounded-buffer with internal scheduling, where producers/consumers enter the monitor, see the buffer is full/empty, and block on an appropriate condition variable, @full@/@empty@.
The @wait@ function atomically blocks the calling thread and implicitly releases the monitor lock(s) for all monitors in the function's parameter list.
The appropriate condition variable is signalled to unblock an opposite kind of thread after an element is inserted/removed from the buffer.
Signalling is unconditional, because signalling an empty condition variable does nothing.
It is common to declare condition variables as monitor fields to prevent shared access, hence no locking is required for access as the conditions are protected by the monitor lock.
In \CFA, a condition variable can be created/stored independently.
To still prevent expensive locking on access, a condition variable is tied to a \emph{group} of monitors on first use, called \newterm{branding}, resulting in a low-cost boolean test to detect sharing from other monitors.

% Signalling semantics cannot have the signaller and signalled thread in the monitor simultaneously, which means:
% \begin{enumerate}
% \item
% The signalling thread returns immediately and the signalled thread continues.
% \item
% The signalling thread continues and the signalled thread is marked for urgent unblocking at the next scheduling point (exit/wait).
% \item
% The signalling thread blocks but is marked for urgent unblocking at the next scheduling point and the signalled thread continues.
% \end{enumerate}
% The first approach is too restrictive, as it precludes solving a reasonable class of problems, \eg dating service (see Figure~\ref{f:DatingService}).
% \CFA supports the next two semantics as both are useful.
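For contrast, a pthreads version of the bounded buffer must retest each blocking predicate in a @while@ loop, because pthreads permits both barging and spurious wakeups; the \CFA @wait@ needs no such loop. A minimal sketch, with illustrative names:

```c
#include <pthread.h>

#define SIZE 10
typedef struct {
	pthread_mutex_t lock;
	pthread_cond_t notfull, notempty;
	int elements[SIZE], front, back, count;
} Buffer;

void buf_init( Buffer * b ) {
	pthread_mutex_init( &b->lock, NULL );
	pthread_cond_init( &b->notfull, NULL );
	pthread_cond_init( &b->notempty, NULL );
	b->front = b->back = b->count = 0;
}

void buf_insert( Buffer * b, int elem ) {
	pthread_mutex_lock( &b->lock );
	while ( b->count == SIZE )				// must be while, not if
		pthread_cond_wait( &b->notfull, &b->lock );
	b->elements[b->back] = elem;
	b->back = ( b->back + 1 ) % SIZE;
	b->count += 1;
	pthread_cond_signal( &b->notempty );
	pthread_mutex_unlock( &b->lock );
}

int buf_remove( Buffer * b ) {
	pthread_mutex_lock( &b->lock );
	while ( b->count == 0 )					// must be while, not if
		pthread_cond_wait( &b->notempty, &b->lock );
	int elem = b->elements[b->front];
	b->front = ( b->front + 1 ) % SIZE;
	b->count -= 1;
	pthread_cond_signal( &b->notfull );
	pthread_mutex_unlock( &b->lock );
	return elem;
}
```

The @while@ loops are the implicit form of barging the \CFA runtime eliminates: between the signal and the signallee's reentry, another thread may enter and invalidate the predicate.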
\begin{figure}
};
void ?{}( Buffer(T) & buffer ) with(buffer) {
	front = back = count = 0;
}
void insert( Buffer(T) & mutex buffer, T elem ) with(buffer) {
\end{lrbox}

% \newbox\myboxB
% \begin{lrbox}{\myboxB}
% \begin{cfa}[aboveskip=0pt,belowskip=0pt]
% forall( otype T ) { // distribute forall
%	monitor Buffer {
%		int front, back, count;
%		T elements[10];
%	};
%	void ?{}( Buffer(T) & buffer ) with(buffer) {
%		[front, back, count] = 0;
%	}
%	T remove( Buffer(T) & mutex buffer ); // forward
%	void insert( Buffer(T) & mutex buffer, T elem )
%				with(buffer) {
%		if ( count == 10 ) waitfor( remove, buffer );
%		// insert elem into buffer
%	}
%	T remove( Buffer(T) & mutex buffer ) with(buffer) {
%		if ( count == 0 ) waitfor( insert, buffer );
%		// remove elem from buffer
%		return elem;
%	}
% }
% \end{cfa}
% \end{lrbox}

\newbox\myboxB
\begin{lrbox}{\myboxB}
\begin{cfa}[aboveskip=0pt,belowskip=0pt]
monitor ReadersWriter {
	int rcnt, wcnt; // readers/writer using resource
};
void ?{}( ReadersWriter & rw ) with(rw) {
	rcnt = wcnt = 0;
}
void EndRead( ReadersWriter & mutex rw ) with(rw) {
	rcnt -= 1;
}
void EndWrite( ReadersWriter & mutex rw ) with(rw) {
	wcnt = 0;
}
void StartRead( ReadersWriter & mutex rw ) with(rw) {
	if ( wcnt > 0 ) waitfor( EndWrite, rw );
	rcnt += 1;
}
void StartWrite( ReadersWriter & mutex rw ) with(rw) {
	if ( wcnt > 0 ) waitfor( EndWrite, rw );
	else while ( rcnt > 0 ) waitfor( EndRead, rw );
	wcnt = 1;
}
\end{cfa}
\end{lrbox}

\subfloat[Generic bounded buffer, internal scheduling]{\label{f:BBInt}\usebox\myboxA}
\hspace{3pt}
\vrule
\hspace{3pt}
\subfloat[Readers / writer lock, external scheduling]{\label{f:RWExt}\usebox\myboxB}
\caption{Internal / external scheduling}
\label{f:InternalExternalScheduling}
\end{figure}

Figure~\ref{f:BBInt} can be transformed into external scheduling by removing the condition variables and signals/waits, and adding the following lines at the locations of the current @wait@s in @insert@/@remove@, respectively.
\begin{cfa}[aboveskip=2pt,belowskip=1pt]
if ( count == 10 ) waitfor( remove, buffer );	|	if ( count == 0 ) waitfor( insert, buffer );
\end{cfa}
Here, the producer/consumer detects a full/\-empty buffer and prevents more producers/consumers from entering the monitor until there is a free/empty slot in the buffer.
External scheduling is controlled by the @waitfor@ statement, which atomically blocks the calling thread, releases the monitor lock, and restricts the function calls that can next acquire mutual exclusion.
If the buffer is full, only calls to @remove@ can acquire the buffer, and if the buffer is empty, only calls to @insert@ can acquire the buffer.
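pthreads has no external scheduling, so a readers/writer lock like the one in Figure~\ref{f:RWExt} must instead be written with internal scheduling: blocked readers and writers wait on a condition variable and retest their predicates. The following is only a sketch of that contrast, with illustrative names, not the paper's implementation:

```c
#include <pthread.h>

typedef struct {
	pthread_mutex_t lock;
	pthread_cond_t ok;		// single condition; waiters retest predicates
	int rcnt, wcnt;			// readers/writer using resource
} ReadersWriter;

void RW_init( ReadersWriter * rw ) {
	pthread_mutex_init( &rw->lock, NULL );
	pthread_cond_init( &rw->ok, NULL );
	rw->rcnt = rw->wcnt = 0;
}
void StartRead( ReadersWriter * rw ) {
	pthread_mutex_lock( &rw->lock );
	while ( rw->wcnt > 0 ) pthread_cond_wait( &rw->ok, &rw->lock );
	rw->rcnt += 1;
	pthread_mutex_unlock( &rw->lock );
}
void EndRead( ReadersWriter * rw ) {
	pthread_mutex_lock( &rw->lock );
	rw->rcnt -= 1;
	if ( rw->rcnt == 0 ) pthread_cond_broadcast( &rw->ok );
	pthread_mutex_unlock( &rw->lock );
}
void StartWrite( ReadersWriter * rw ) {
	pthread_mutex_lock( &rw->lock );
	while ( rw->wcnt > 0 || rw->rcnt > 0 ) pthread_cond_wait( &rw->ok, &rw->lock );
	rw->wcnt = 1;
	pthread_mutex_unlock( &rw->lock );
}
void EndWrite( ReadersWriter * rw ) {
	pthread_mutex_lock( &rw->lock );
	rw->wcnt = 0;
	pthread_cond_broadcast( &rw->ok );
	pthread_mutex_unlock( &rw->lock );
}
```

Unlike the @waitfor@ version, this sketch cannot prevent new calls from entering while a waiter blocks, so admission control is re-expressed as predicate loops.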
Threads making calls to functions that are currently excluded block outside of (external to) the monitor on the calling queue, versus blocking on condition queues inside of (internal to) the monitor.
Figure~\ref{f:RWExt} shows a readers/writer lock written using external scheduling, where a waiting reader detects a writer using the resource and restricts further calls until the writer exits by calling @EndWrite@.
The writer does a similar action for each reader or writer using the resource.
Note, no new calls to @StartRead@/@StartWrite@ may occur when waiting for the call to @EndRead@/@EndWrite@.
External scheduling allows waiting for events from other threads while restricting unrelated events.
The mechanism can be done in terms of control flow, \eg Ada @accept@ or \uC @_Accept@, or in terms of data, \eg Go @select@ on channels.
While both mechanisms have strengths and weaknesses, this project uses the control-flow mechanism to be consistent with other language features.
% Two challenges specific to \CFA for external scheduling are loose object-definitions (see Section~\ref{s:LooseObjectDefinitions}) and multiple-monitor functions (see Section~\ref{s:Multi-MonitorScheduling}).

Figure~\ref{f:DatingService} shows a dating service demonstrating non-blocking and blocking signalling.
	int GirlPhNo, BoyPhNo;
	condition Girls[CCodes], Boys[CCodes];
	condition exchange;
};
int girl( DS & mutex ds, int phNo, int ccode ) {
		signal( Boys[ccode] );
		wait( exchange );
	} // if
	}
	return BoyPhNo;
}
\end{figure}
In summation, for internal scheduling, non-blocking signalling (as in the producer/consumer example) is used when the signaller is providing the cooperation for a waiting thread;
the signaller enters the monitor and changes state, detects a waiting thread that can use the state, performs a non-blocking signal on the condition queue for the waiting thread, and exits the monitor to run concurrently.
The waiter unblocks next from the urgent queue, uses/takes the state, and exits the monitor.
Blocking signalling is the reverse, where the waiter is providing the cooperation for the signalling thread;
the signaller enters the monitor, detects a waiting thread providing the necessary state, performs a blocking signal to place it on the urgent queue and unblock the waiter.
The waiter changes state and exits the monitor, and the signaller unblocks next from the urgent queue to use/take the state.

Both internal and external scheduling extend to multiple monitors in a natural way.
\begin{cquote}
\end{tabular}
\end{cquote}
For @wait( e )@, the default semantics is to atomically block the caller and release all acquired mutex parameters, \ie @wait( e, m1, m2 )@.
To override the implicit multi-monitor wait, specific mutex parameter(s) can be specified, \eg @wait( e, m1 )@.
Wait statically verifies the released monitors are the acquired mutex-parameters so unconditional release is safe.
While \CC supports bulk locking, @wait@ only accepts a single lock for a condition variable, so bulk locking with condition variables is asymmetric.
Finally, a signaller,
\newpage
\begin{cfa}
void baz( M & mutex m1, M & mutex m2 ) {
must have acquired at least the same locks as the waiting thread signalled from the condition queue.

Similarly, for @waitfor( rtn )@, the default semantics is to atomically block the acceptor and release all acquired mutex parameters, \ie @waitfor( rtn, m1, m2 )@.
To override the implicit multi-monitor wait, specific mutex parameter(s) can be specified, \eg @waitfor( rtn, m1 )@.
@waitfor@ statically verifies the released monitors are the same as the acquired mutex-parameters of the given function or function pointer.
While deadlock can occur with multiple/nesting acquisition, this is a consequence of locks, and by extension monitors, not being perfectly composable.
% Supporting barging prevention as well as extending internal scheduling to multiple monitors is the main source of complexity in the design and implementation of \CFA concurrency.

\subsection{Extended \protect\lstinline@waitfor@}

Figure~\ref{f:ExtendedWaitfor} shows the extended form of the @waitfor@ statement to conditionally accept one of a group of mutex functions, with an optional statement to be performed \emph{after} the mutex function finishes.
For a @waitfor@ clause to be executed, its @when@ must be true and an outstanding call to its corresponding member(s) must exist.
The \emph{conditional-expression} of a @when@ may call a function, but the function must not block or context switch.
If there are multiple acceptable mutex calls, selection occurs top-to-bottom (prioritized) among the @waitfor@ clauses, whereas some programming languages with similar mechanisms accept nondeterministically for this case, \eg Go \lstinline[morekeywords=select]@select@.
If some accept guards are true and there are no outstanding calls to these members, the acceptor is accept-blocked until a call to one of these members is made.
If there is a @timeout@ clause, it provides an upper bound on waiting.
If all the accept guards are false, the statement does nothing, unless there is a terminating @else@ clause with a true guard, which is executed instead.
Hence, the terminating @else@ clause allows a conditional attempt to accept a call without blocking.
If both @timeout@ and @else@ clause are present, the @else@ must be conditional, or the @timeout@ is never triggered.
\begin{figure} \centering \begin{cfa} when ( $\emph{conditional-expression}$ )      $\C{// optional guard}$ waitfor( $\emph{mutex-member-name}$ ) $\emph{statement}$ $\C{// action after call}$ or when ( $\emph{conditional-expression}$ ) $\C{// any number of functions}$ waitfor( $\emph{mutex-member-name}$ ) $\emph{statement}$ or    ... when ( $\emph{conditional-expression}$ ) $\C{// optional guard}$ timeout $\emph{statement}$ $\C{// optional terminating timeout clause}$ when ( $\emph{conditional-expression}$ ) $\C{// optional guard}$ else  $\emph{statement}$ $\C{// optional terminating clause}$ \end{cfa} \caption{Extended \protect\lstinline@waitfor@} \label{f:ExtendedWaitfor} \end{figure} Note, a group of conditional @waitfor@ clauses is \emph{not} the same as a group of @if@ statements, e.g.: \begin{cfa} if ( C1 ) waitfor( mem1 );                       when ( C1 ) waitfor( mem1 ); else if ( C2 ) waitfor( mem2 );         or when ( C2 ) waitfor( mem2 ); \end{cfa} The left example only accepts @mem1@ if @C1@ is true or only @mem2@ if @C2@ is true. The right example accepts either @mem1@ or @mem2@ if @C1@ and @C2@ are true. An interesting use of @waitfor@ is accepting the @mutex@ destructor to know when an object is deallocated. \begin{cfa} void insert( Buffer(T) & mutex buffer, T elem ) with( buffer ) { if ( count == 10 ) waitfor( remove, buffer ) { // insert elem into buffer } or waitfor( ^?{}, buffer ) throw insertFail; } \end{cfa} When the buffer is deallocated, the current waiter is unblocked and informed, so it can perform an appropriate action. However, the basic @waitfor@ semantics do not support this functionality, since using an object after its destructor is called is undefined. 
Therefore, to make this useful capability work, the semantics for accepting the destructor is the same as @signal@, \ie the call to the destructor is placed on the urgent queue and the acceptor continues execution, which throws an exception to the acceptor and then the caller is unblocked from the urgent queue to deallocate the object. Accepting the destructor is an idiomatic way to terminate a thread in \CFA. \subsection{Bulk Barging Prevention} Figure~\ref{f:BulkBargingPrevention} shows \CFA code where bulk acquire adds complexity to the internal-signalling semantics. The complexity begins at the end of the inner @mutex@ statement, where the semantics of internal scheduling need to be extended for multiple monitors. The problem is that bulk acquire is used in the inner @mutex@ statement where one of the monitors is already acquired. When the signalling thread reaches the end of the inner @mutex@ statement, it should transfer ownership of @m1@ and @m2@ to the waiting threads to prevent barging into the outer @mutex@ statement by another thread. However, both the signalling and waiting thread W1 still need monitor @m1@. However, both the signalling and waiting threads W1 and W2 need some subset of monitors @m1@ and @m2@. \begin{cquote} condition c: (order 1) W2(@m2@), W1(@m1@,@m2@)\ \ \ or\ \ \ (order 2) W1(@m1@,@m2@), W2(@m2@) \\ S: acq. @m1@ $\rightarrow$ acq. @m1,m2@ $\rightarrow$ @signal(c)@ $\rightarrow$ rel. @m2@ $\rightarrow$ pass @m2@ unblock W2 (order 2) $\rightarrow$ rel. @m1@ $\rightarrow$ pass @m1,m2@ unblock W1 \\ \hspace*{2.75in}$\rightarrow$ rel. @m1@ $\rightarrow$ pass @m1,m2@ unblock W1 (order 1) \end{cquote} \begin{figure} mutex( m1, m2 ) { // $\LstCommentStyle{\color{red}inner}$ ... signal( c ); ... // m1, m2 acquired // m1, m2 still acquired } // $\LstCommentStyle{\color{red}release m2}$ // m1 acquired ... mutex( m1, m2 ) { ... wait( c ); // block and release m1, m2 // m1, m2 acquired ... 
wait( c ); // release m1, m2
	// m1, m2 reacquired
} // $\LstCommentStyle{\color{red}release m2}$
// m1 acquired
mutex( m2 ) {
	... wait( c ); // release m2
	// m2 reacquired
} // $\LstCommentStyle{\color{red}release m2}$
\begin{cquote}
\subfloat[Signalling Thread (S)]{\label{f:SignallingThread}\usebox\myboxA}
\hspace{3\parindentlnth}
\subfloat[Waiting Thread (W1)]{\label{f:WaitingThread}\usebox\myboxB}
\subfloat[Waiting Thread (W2)]{\label{f:OtherWaitingThread}\usebox\myboxC}
\end{cquote}
\caption{Bulk Barging Prevention}
\label{f:BulkBargingPrevention}
\end{figure}

One scheduling solution is for the signaller S to keep ownership of all locks until the last lock is ready to be transferred, because this semantics fits most closely to the behaviour of single-monitor scheduling.
However, this solution is inefficient if W2 waited first and can be immediately passed @m2@ when released, while S retains @m1@ until completion of the outer mutex statement.
If W1 waited first, the signaller must retain @m1@ and @m2@ until completion of the outer mutex statement and then pass both to W1.
% Furthermore, there is an execution sequence where the signaller always finds waiter W2, and hence, waiter W1 starves.
To support this efficient semantics (and prevent barging), the implementation maintains a list of monitors acquired for each blocked thread.
When a signaller exits or waits in a monitor function/statement, the front waiter on urgent is unblocked if all its monitors are released.
Implementing a fast subset check for the necessary released monitors is important.
% The benefit is encapsulating complexity into only two actions: passing monitors to the next owner when they should be released and conditionally waking threads if all conditions are met.

In an object-oriented programming language, a class includes an exhaustive list of operations.
However, new members can be added via static inheritance or dynamic members, \eg JavaScript~\cite{JavaScript}.
Similarly, monitor functions can be added at any time in \CFA, making it less clear for programmers and more difficult to implement.
\begin{cfa}
monitor M { ...
};
void f( M & mutex m );
void g( M & mutex m ) { waitfor( f ); } $\C{// clear which f}$
void f( M & mutex m, int ); $\C{// different f}$
void h( M & mutex m ) { waitfor( f ); } $\C{// unclear which f}$
\end{cfa}
Hence, the \CFA code for entering a monitor looks like:
\begin{cfa}
if ( $\textrm{\textit{monitor is free}}$ ) $\C{// enter}$
else if ( $\textrm{\textit{already own monitor}}$ ) $\C{// continue}$
else if ( $\textrm{\textit{monitor accepts me}}$ ) $\C{// enter}$
else $\C{// block}$
\end{cfa}
For the first two conditions, it is easy to implement a check that can evaluate the condition in a few instructions.
However, a fast check for \emph{monitor accepts me} is much harder to implement, depending on the constraints put on the monitors.
Figure~\ref{fig:ClassicalMonitor} shows monitors are often expressed as an entry (calling) queue, some acceptor queues, and an urgent stack/queue.

\begin{figure}
\centering
\subfloat[Classical monitor] {
\label{fig:ClassicalMonitor}
{\resizebox{0.45\textwidth}{!}{\input{monitor.pstex_t}}}
}% subfloat
\quad
\subfloat[Bulk acquire monitor] {
\label{fig:BulkMonitor}
{\resizebox{0.45\textwidth}{!}{\input{ext_monitor.pstex_t}}}
}% subfloat
\caption{Monitor implementation}
\label{f:MonitorImplementation}
\end{figure}

For a fixed (small) number of mutex functions (\eg 128), the accept check reduces to a bitmask of allowed callers, which can be checked with a single instruction.
This approach requires a unique dense ordering of functions with a small upper-bound and the ordering must be consistent across translation units.
For object-oriented languages these constraints are common, but \CFA mutex functions can be added in any scope and are only visible in certain translation units, precluding program-wide dense-ordering among mutex functions.

Figure~\ref{fig:BulkMonitor} shows the \CFA monitor implementation.
The mutex function called is associated with each thread on the entry queue, while a list of acceptable functions is kept separately.
The accepted list is a variable-sized array of accepted function pointers, so the single instruction bitmask comparison is replaced by dereferencing a pointer followed by a (usually short) linear search.

A new class can add members via static inheritance but the subclass still has an exhaustive list of operations.
(Dynamic member adding, \eg JavaScript~\cite{JavaScript}, is not considered.)
In the object-oriented scenario, the type and all its operators are always present at compilation (even separate compilation), so it is possible to number the operations in a bit mask and use an $O(1)$ compare with a similar bit mask created for the operations specified in a @waitfor@.

In \CFA, monitor functions can be statically added/removed in translation units, so it is impossible to apply an $O(1)$ approach.
\begin{cfa}
monitor M { ... }; // common type, included in .h file
translation unit 1
void f( M & mutex m );
void g( M & mutex m ) { waitfor( f, m ); }
translation unit 2
void f( M & mutex m );
void g( M & mutex m );
void h( M & mutex m ) { waitfor( f, m ) or waitfor( g, m ); }
\end{cfa}
The @waitfor@ statements in each translation unit cannot form a unique bit-mask because the monitor type does not carry that information.
Hence, function pointers are used to identify the functions listed in the @waitfor@ statement, stored in a variable-sized array.
Then, the same implementation approach used for the urgent stack is used for the calling queue.
Each caller has a list of monitors acquired, and the @waitfor@ statement performs a (usually short) linear search matching functions in the @waitfor@ list with called functions, and then verifying the associated mutex locks can be transferred.
(A possible way to construct a dense mapping is at link or load-time.)
\label{f:UnmatchedMutexSets}
\end{figure}

\subsection{Extended \protect\lstinline@waitfor@}

Figure~\ref{f:ExtendedWaitfor} shows the extended form of the @waitfor@ statement to conditionally accept one of a group of mutex functions, with a specific action to be performed \emph{after} the mutex function finishes.
For a @waitfor@ clause to be executed, its @when@ must be true and an outstanding call to its corresponding member(s) must exist.
The \emph{conditional-expression} of a @when@ may call a function, but the function must not block or context switch.
If there are multiple acceptable mutex calls, selection occurs top-to-bottom (prioritized) in the @waitfor@ clauses, whereas some programming languages with similar mechanisms accept nondeterministically for this case, \eg Go \lstinline[morekeywords=select]@select@.
If some accept guards are true and there are no outstanding calls to these members, the acceptor is accept-blocked until a call to one of these members is made.
If all the accept guards are false, the statement does nothing, unless there is a terminating @else@ clause with a true guard, which is executed instead.
Hence, the terminating @else@ clause allows a conditional attempt to accept a call without blocking.
If there is a @timeout@ clause, it provides an upper bound on waiting.
If both a @timeout@ clause and an @else@ clause are present, the @else@ must be conditional, or the @timeout@ is never triggered.
In all cases, the statement following a clause is executed \emph{after} the clause, which indicates exactly which clause executed.

\begin{figure}
\centering
\begin{cfa}
when ( $\emph{conditional-expression}$ )	$\C{// optional guard}$
  waitfor( $\emph{mutex-member-name}$ )
	$\emph{statement}$					$\C{// action after call}$
or when ( $\emph{conditional-expression}$ )	$\C{// optional guard}$
  waitfor( $\emph{mutex-member-name}$ )
	$\emph{statement}$					$\C{// action after call}$
or	...
                                                                $\C{// list of waitfor clauses}$ when ( $\emph{conditional-expression}$ )      $\C{// optional guard}$ timeout                                                               $\C{// optional terminating timeout clause}$ $\emph{statement}$                                      $\C{// action after timeout}$ when ( $\emph{conditional-expression}$ )      $\C{// optional guard}$ else                                                                  $\C{// optional terminating clause}$ $\emph{statement}$                                      $\C{// action when no immediate calls}$ \end{cfa} \caption{Extended \protect\lstinline@waitfor@} \label{f:ExtendedWaitfor} \end{figure} Note, a group of conditional @waitfor@ clauses is \emph{not} the same as a group of @if@ statements, e.g.: \begin{cfa} if ( C1 ) waitfor( mem1 );                       when ( C1 ) waitfor( mem1 ); else if ( C2 ) waitfor( mem2 );         or when ( C2 ) waitfor( mem2 ); \end{cfa} The left example only accepts @mem1@ if @C1@ is true or only @mem2@ if @C2@ is true. The right example accepts either @mem1@ or @mem2@ if @C1@ and @C2@ are true. An interesting use of @waitfor@ is accepting the @mutex@ destructor to know when an object is deallocated. \begin{cfa} void insert( Buffer(T) & mutex buffer, T elem ) with( buffer ) { if ( count == 10 ) waitfor( remove, buffer ) { // insert elem into buffer } or waitfor( ^?{}, buffer ) throw insertFail; } \end{cfa} When the buffer is deallocated, the current waiter is unblocked and informed, so it can perform an appropriate action. However, the basic @waitfor@ semantics do not support this functionality, since using an object after its destructor is called is undefined. 
Therefore, to make this useful capability work, the semantics for accepting the destructor is the same as @signal@, \ie the call to the destructor is placed on the urgent queue and the acceptor continues execution, which throws an exception to the acceptor and then the caller is unblocked from the urgent queue to deallocate the object.
Accepting the destructor is an idiomatic way to terminate a thread in \CFA.

For situations where threads do not require direct communication, case 9 provides faster creation/destruction by eliminating @mutex@ setup.

\begin{table}
\caption{Object property composition}
\centering
No		& No					& \textbf{1}\ \ \ aggregate type		& \textbf{2}\ \ \ @monitor@ aggregate type \\
\hline
No		& Yes (stackless)		& \textbf{3}\ \ \ @generator@			& \textbf{4}\ \ \ @monitor@ @generator@ \\
\hline
No		& Yes (stackful)		& \textbf{5}\ \ \ @coroutine@			& \textbf{6}\ \ \ @monitor@ @coroutine@ \\

% \section{Parallelism}
% \label{s:Parallelism}
%
% Historically, computer performance was about processor speeds.
% However, with heat dissipation being a direct consequence of speed increase, parallelism is the new source for increased performance~\cite{Sutter05, Sutter05b}.
% Therefore, high-performance applications must care about parallelism, which requires concurrency.
% The lowest-level approach of parallelism is to use \newterm{kernel threads} in combination with semantics like @fork@, @join@, \etc.
% However, kernel threads are better as an implementation tool because of complexity and higher cost.
% Therefore, different abstractions are often layered onto kernel threads to simplify them, \eg pthreads.
%
%
% \subsection{User Threads}
%
% A direct improvement on kernel threads is user threads, \eg Erlang~\cite{Erlang} and \uC~\cite{uC++book}.
% This approach provides an interface that matches the language paradigms, gives more control over concurrency by the language runtime, and an abstract (and portable) interface to the underlying kernel threads across operating systems.
% In many cases, user threads can be used on a much larger scale (100,000 threads).
% Like kernel threads, user threads support preemption, which maximizes nondeterminism, but increases the potential for concurrency errors: race, livelock, starvation, and deadlock.
% \CFA adopts user-threads to provide more flexibility and a low-cost mechanism to build any other concurrency approach, \eg thread pools and actors~\cite{Actors}.
%
% A variant of user thread is \newterm{fibres}, which removes preemption, \eg Go~\cite{Go} @goroutine@s.
% Like functional programming, which removes mutation and its associated problems, removing preemption from concurrency reduces nondeterminism, making race and deadlock errors more difficult to generate.
% However, preemption is necessary for fairness and to reduce tail-latency.
% For concurrency that relies on spinning, if all cores spin the system is livelocked, whereas preemption breaks the livelock.
%
%
% \subsection{Thread Pools}
%
% In contrast to direct threading is indirect \newterm{thread pools}, \eg Java @executor@, where small jobs (work units) are inserted into a work pool for execution.
% If the jobs are dependent, \ie interact, there is an implicit/explicit dependency graph that ties them together.
% While removing direct concurrency, and hence the amount of context switching, thread pools significantly limit the interaction that can occur among jobs.
% Indeed, jobs should not block because that also blocks the underlying thread, which effectively means the CPU utilization, and therefore throughput, suffers.
% While it is possible to tune the thread pool with sufficient threads, it becomes difficult to obtain high throughput and good core utilization as job interaction increases.
% As well, concurrency errors return, which thread pools are supposed to mitigate.

\section{\protect\CFA Runtime Structure}
\label{s:CFARuntimeStructure}

Figure~\ref{f:RunTimeStructure} illustrates the runtime structure of a \CFA program.

\label{s:RuntimeStructureCluster}

A \newterm{cluster} is a collection of threads and virtual processors (abstract kernel-thread) that execute the threads from its own ready queue (like an OS).
The purpose of a cluster is to control the amount of parallelism that is possible among threads, plus scheduling and other execution defaults.
The default cluster-scheduler is single-queue multi-server, which provides automatic load-balancing of threads on processors.
However, the scheduler is pluggable, supporting alternative schedulers, such as multi-queue multi-server, with work-stealing/sharing.
If several clusters exist, both threads and virtual processors can be explicitly migrated from one cluster to another.
No automatic load balancing among clusters is performed by \CFA.

\subsection{Debug Kernel}

There are two versions of the \CFA runtime kernel: debug and non-debug.
The debugging version has many runtime checks and internal assertions, \eg stack (non-writable) guard page, and checks for stack overflow whenever context switches occur among coroutines and threads, which catches most stack overflows.
After a program is debugged, the non-debugging version can be used to decrease space and increase performance.

\begin{comment}
\section{Implementation}
\label{s:Implementation}

The exception to this rule is the program main, \ie the initial kernel thread that is given to any program.
In order to respect C expectations, the stack of the initial kernel thread is used by program main rather than the main processor, allowing it to grow dynamically as in a normal C program.

Finally, an important aspect for a complete threading system is preemption, which introduces extra non-determinism via transparent interleaving, rather than cooperation among threads for proper scheduling and processor fairness from long-running threads.
Because preemption frequency is usually long (1 millisecond) performance cost is negligible.
\end{comment}


\subsection{Preemption}

Nondeterministic preemption provides fairness for long-running threads, and forces concurrent programmers to write more robust programs, rather than relying on sections of code between cooperative-scheduling points being atomic.
A separate reason for not supporting preemption is that it significantly complicates the runtime system.
Preemption is normally handled by setting a count-down timer on each virtual processor.
When the timer expires, an interrupt is delivered and the interrupt handler resets the count-down timer.
If the virtual processor is executing in user code, the signal handler performs a user-level context-switch; if executing in the language runtime-kernel, the preemption is ignored or rolled forward to the point where the runtime kernel context switches back to user code.
The only issue with this approach is that signal masks from one kernel thread may be restored on another as part of returning from the signal handler; therefore, the same signal mask is required for all virtual processors in a cluster.
Because preemption frequency is usually long (1 millisecond), the performance cost is negligible.
However, on current Linux systems:
\begin{cquote}
A process-directed signal may be delivered to any one of the threads that does not currently have the signal blocked.
SIGNAL(7) - Linux Programmer's Manual
\end{cquote}
Hence, the timer-expiry signal, which is generated \emph{externally} by the Linux kernel to the Linux process, is delivered to any of its Linux subprocesses (kernel threads).
To ensure each virtual processor receives its own preemption signals, a discrete-event simulation is run on a special virtual processor, and only it sets and receives timer events.
Virtual processors register an expiration time with the discrete-event simulator, which is inserted in sorted order.
The simulation sets the count-down timer to the value at the head of the event list, and when the timer expires, all events less than or equal to the current time are processed.
Processing a preemption event sends an \emph{internal} @SIGUSR1@ signal to the registered virtual processor, which is always delivered to that processor.

All benchmarks are run using the following harness.
\newpage
\begin{cfa}
unsigned int N = 10_000_000;
Each benchmark is performed @N@ times, where @N@ varies depending on the benchmark;
the total time is divided by @N@ to obtain the average time for a benchmark.
All omitted tests for other languages are functionally identical to the \CFA tests and available online~\cite{CforallBenchMarks}.
\begin{cfa}[aboveskip=0pt,belowskip=0pt]
coroutine C {} c;
void main( C & ) { for ( ;; ) { @suspend;@ } }
int main() { // coroutine test
	BENCH( for ( N ) { @resume( c );@ } )

\paragraph{Mutual-Exclusion}

Uncontended mutual exclusion, which occurs frequently, is measured by entering/leaving a critical section.
For monitors, entering and leaving a monitor function is measured.
To put the results in context, the cost of entering a non-inline function and the cost of acquiring and releasing a @pthread_mutex@ lock is also measured.
\paragraph{Internal Scheduling}

Internal scheduling is measured using a cycle of two threads signalling and waiting.
Figure~\ref{f:int-sched} shows the code for \CFA, with results in Table~\ref{tab:int-sched}.
Note, the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects.

monitor M { condition c; } m;
void __attribute__((noinline))
do_call( M & mutex a1 ) { @signal( c );@ }
thread T {};
void main( T & this ) {
	while ( go == 0 ) { yield(); }
	while ( go == 1 ) { do_call( m ); }
}
int  __attribute__((noinline))

\paragraph{External Scheduling}

External scheduling is measured using a cycle of two threads calling and accepting the call using the @waitfor@ statement.
Figure~\ref{f:ext-sched} shows the code for \CFA, with results in Table~\ref{tab:ext-sched}.
Note, the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects.

\paragraph{Object Creation}

Object creation is measured by creating/deleting the specific kind of concurrent object.
Figure~\ref{f:creation} shows the code for \CFA, with results in Table~\ref{tab:creation}.
The only note here is that the call stacks of \CFA coroutines are lazily created, therefore without priming the coroutine to force stack creation, the creation cost is artificially low.
\begin{multicols}{2}
\lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{}{}}
\begin{cfa}
thread MyThread {};
void main( MyThread & ) {}
int main() {
	BENCH( for ( N ) { @MyThread m;@ } )
	sout | result`ns;
}
\end{cfa}
\captionof{figure}{\CFA object-creation benchmark}
\label{f:creation}

\columnbreak

\vspace*{-16pt}
\captionof{table}{Object creation comparison (nanoseconds)}
\label{tab:creation}

\begin{tabular}[t]{@{}r*{3}{D{.}{.}{5.2}}@{}}
\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
Pthreads				& 28091		& 28073.39	& 163.1		\\
\CFA Coroutine Lazy		& 6			& 6.07		& 0.26		\\
\CFA Coroutine Eager	& 520		& 520.61	& 2.04		\\
\CFA Thread				& 2032		& 2016.29	& 112.07	\\
\uC Coroutine			& 106		& 107.36	& 1.47		\\
\uC Thread				& 536.5		& 537.07	& 4.64		\\
Goroutine				& 3103		& 3086.29	& 90.25		\\
Java Thread				& 103416.5	& 103732.29	& 1137		\\
\end{tabular}
\end{multicols}

\section{Conclusion}

The \CFA runtime provides concurrency based on a preemptive M:N user-level threading-system, executing in clusters, which encapsulate scheduling of work on multiple kernel threads providing parallelism.
The M:N model is judged to be efficient and to provide greater flexibility than a 1:1 threading model.
These concepts and the \CFA runtime-system are written in the \CFA language, extensively leveraging the \CFA type-system, which demonstrates the expressiveness of the \CFA language.
Performance comparisons with other concurrent systems/languages show the \CFA approach is competitive across all low-level operations, which translates directly into good performance in well-written concurrent applications.
C programmers should feel comfortable using these mechanisms for developing complex control-flow in applications, with the ability to obtain maximum available performance by selecting mechanisms at the appropriate level of need.

The authors would like to recognize the design assistance of Aaron Moss, Rob Schluntz and Andrew Beach on the features described in this paper.
Funding for this project has been provided by Huawei Ltd.\ (\url{http://www.huawei.com}).
%, and Peter Buhr is partially funded by the Natural Sciences and Engineering Research Council of Canada.

{%