Index: doc/papers/concurrency/Paper.tex
===================================================================
--- doc/papers/concurrency/Paper.tex	(revision 04b4a7113ee67169f02e5f1dd053025c3ef887de)
+++ doc/papers/concurrency/Paper.tex	(revision fe9cf9e29436b67438b8b8c2b0d3101ee2a18df1)
@@ -269,5 +269,5 @@
 
 \abstract[Summary]{
-\CFA is a polymorphic, non-object-oriented, concurrent, backwards-compatible extension of the C programming language.
+\CFA is a polymorphic, non-object-oriented, concurrent, backwards compatible extension of the C programming language.
 This paper discusses the design philosophy and implementation of its advanced control-flow and concurrent/parallel features, along with the supporting runtime written in \CFA.
 These features are created from scratch as ISO C has only low-level and/or unimplemented concurrency, so C programmers continue to rely on library approaches like pthreads.
@@ -301,12 +301,12 @@
 The TIOBE index~\cite{TIOBE} for May 2020 ranks the top five \emph{popular} programming languages as C 17\%, Java 16\%, Python 9\%, \CC 6\%, and \Csharp 4\% = 52\%, and over the past 30 years, C has always ranked either first or second in popularity.}
 allowing immediate dissemination.
-This paper discusses the design philosophy and implementation of advanced language-level control-flow and concurrent/parallel features in \CFA and its runtime, which is written entirely in \CFA.
-The \CFA control-flow framework extends ISO \Celeven~\cite{C11} with new call/return and concurrent/parallel control-flow.
+This paper discusses the design philosophy and implementation of \CFA's advanced control-flow and concurrent/parallel features, along with the supporting runtime written in \CFA.
 
 % The call/return extensions retain state between callee and caller versus losing the callee's state on return;
 % the concurrency extensions allow high-level management of threads.
 
+The \CFA control-flow framework extends ISO \Celeven~\cite{C11} with new call/return and concurrent/parallel control-flow.
 Call/return control-flow with argument and parameter passing appeared in the first programming languages.
-Over the past 50 years, call/return has been augmented with features like static/dynamic call, exceptions (multi-level return) and generators/coroutines (retain state between calls).
+Over the past 50 years, call/return has been augmented with features like static and dynamic call, exceptions (multi-level return) and generators/coroutines (see Section~\ref{s:StatefulFunction}).
 While \CFA has mechanisms for dynamic call (algebraic effects~\cite{Zhang19}) and exceptions\footnote{
 \CFA exception handling will be presented in a separate paper.
@@ -314,5 +314,5 @@
 \newterm{Coroutining} was introduced by Conway~\cite{Conway63} (1963), discussed by Knuth~\cite[\S~1.4.2]{Knuth73V1}, implemented in Simula67~\cite{Simula67}, formalized by Marlin~\cite{Marlin80}, and is now popular and appears in old and new programming languages: CLU~\cite{CLU}, \Csharp~\cite{Csharp}, Ruby~\cite{Ruby}, Python~\cite{Python}, JavaScript~\cite{JavaScript}, Lua~\cite{Lua}, \CCtwenty~\cite{C++20Coroutine19}.
 Coroutining is sequential execution requiring direct handoff among coroutines, \ie only the programmer is controlling execution order.
-If coroutines transfer to an internal event-engine for scheduling the next coroutines, the program transitions into the realm of concurrency~\cite[\S~3]{Buhr05a}.
+If coroutines transfer to an internal event-engine for scheduling the next coroutine (as in async-await), the program transitions into the realm of concurrency~\cite[\S~3]{Buhr05a}.
 Coroutines are only a stepping stone towards concurrency where the commonality is that coroutines and threads retain state between calls.
 
@@ -324,9 +324,9 @@
 Kernel threading was chosen, largely because of its simplicity and fit with the simpler operating systems and hardware architectures at the time, which gave it a performance advantage~\cite{Drepper03}.
 Libraries like pthreads were developed for C, and the Solaris operating-system switched from user (JDK 1.1~\cite{JDK1.1}) to kernel threads.
-As a result, many current languages implementations adopt the 1:1 kernel-threading model, like Java (Scala), Objective-C~\cite{obj-c-book}, \CCeleven~\cite{C11}, C\#~\cite{Csharp} and Rust~\cite{Rust}, with a variety of presentation mechanisms.
+As a result, many languages adopt the 1:1 kernel-threading model, like Java (Scala), Objective-C~\cite{obj-c-book}, \CCeleven~\cite{C11}, C\#~\cite{Csharp} and Rust~\cite{Rust}, with a variety of presentation mechanisms.
 From 2000 onwards, several language implementations have championed the M:N user-threading model, like Go~\cite{Go}, Erlang~\cite{Erlang}, Haskell~\cite{Haskell}, D~\cite{D}, and \uC~\cite{uC++,uC++book}, including putting green threads back into Java~\cite{Quasar}, and many user-threading libraries have appeared~\cite{Qthreads,MPC,Marcel}.
 The main argument for user-level threading is that it is lighter weight than kernel threading because locking and context switching do not cross the kernel boundary, so there is less restriction on programming styles that encourages large numbers of threads performing medium-sized work to facilitate load balancing by the runtime~\cite{Verch12}.
 As well, user-threading facilitates a simpler concurrency approach using thread objects that leverage sequential patterns versus events with call-backs~\cite{Adya02,vonBehren03}.
-Finally, performant user-threading implementations, both time and space, meet or exceed direct kernel-threading implementations, while achieving the programming advantages of high concurrency levels and safety.
+Finally, performant user-threading implementations, both in time and space, meet or exceed direct kernel-threading implementations, while achieving the programming advantages of high concurrency levels and safety.
 
 A further effort over the past two decades is the development of language memory models to deal with the conflict between language features and compiler/hardware optimizations, \eg some language features are unsafe in the presence of aggressive sequential optimizations~\cite{Buhr95a,Boehm05}.
@@ -338,5 +338,5 @@
 
 Finally, it is important for a language to provide safety over performance \emph{as the default}, allowing careful reduction of safety for performance when necessary.
-Two concurrency violations of this philosophy are \emph{spurious or random wakeup}~\cite[\S~9]{Buhr05a}) and \emph{barging}\footnote{
+Two concurrency violations of this philosophy are \emph{spurious} or \emph{random wakeup}~\cite[\S~9]{Buhr05a}, and \emph{barging}\footnote{
 Barging is competitive succession instead of direct handoff, \ie after a lock is released both arriving and preexisting waiter threads compete to acquire the lock.
 Hence, an arriving thread can temporally \emph{barge} ahead of threads already waiting for an event, which can repeat indefinitely leading to starvation of waiter threads.
@@ -386,5 +386,5 @@
 Section~\ref{s:Monitor} shows how both mutual exclusion and synchronization are safely embedded in the @monitor@ and @thread@ constructs.
 Section~\ref{s:CFARuntimeStructure} describes the large-scale mechanism to structure threads and virtual processors (kernel threads).
-Section~\ref{s:Performance} uses a series of microbenchmarks to compare \CFA threading with pthreads, Java 11.0.6, Go 1.12.6, Rust 1.37.0, Python 3.7.6, Node.js 12.14.1, and \uC 7.0.0.
+Section~\ref{s:Performance} uses microbenchmarks to compare \CFA threading with pthreads, Java 11.0.6, Go 1.12.6, Rust 1.37.0, Python 3.7.6, Node.js 12.14.1, and \uC 7.0.0.
 
 
@@ -395,5 +395,5 @@
 To this end, the control-flow features created for \CFA are based on the fundamental properties of any language with function-stack control-flow (see also \uC~\cite[pp.~140-142]{uC++}).
 The fundamental properties are execution state, thread, and mutual-exclusion/synchronization (MES).
-These independent properties can be used to compose different language features, forming a compositional hierarchy, where the combination of all three is the most advanced feature, called a thread/task/process.
+These independent properties can be used to compose different language features, forming a compositional hierarchy, where the combination of all three is the most advanced feature, called a thread.
 While it is possible for a language to only provide threads for composing programs~\cite{Hermes90}, this unnecessarily complicates and makes inefficient solutions to certain classes of problems.
 As is shown, each of the non-rejected composed language features solves a particular set of problems, and hence, has a defensible position in a programming language.
@@ -405,5 +405,5 @@
 is the state information needed by a control-flow feature to initialize, manage compute data and execution location(s), and de-initialize, \eg calling a function initializes a stack frame including contained objects with constructors, manages local data in blocks and return locations during calls, and de-initializes the frame by running any object destructors and management operations.
 State is retained in fixed-sized aggregate structures (objects) and dynamic-sized stack(s), often allocated in the heap(s) managed by the runtime system.
-The lifetime of the state varies with the control-flow feature, where longer life-time and dynamic size provide greater power but also increase usage complexity and cost.
+The lifetime of state varies with the control-flow feature, where longer lifetime and dynamic size provide greater power but also increase usage complexity and cost.
 Control-flow transfers among execution states in multiple ways, such as function call, context switch, asynchronous await, etc.
 Because the programming language determines what constitutes an execution state, implicitly manages this state, and defines movement mechanisms among states, execution state is an elementary property of the semantics of a programming language.
@@ -411,8 +411,8 @@
 
 \item[\newterm{threading}:]
-is execution of code that occurs independently of other execution, \ie the execution resulting from a thread is sequential.
+is execution of code that occurs independently of other execution, where an individual thread's execution is sequential.
 Multiple threads provide \emph{concurrent execution};
 concurrent execution becomes parallel when run on multiple processing units, \eg hyper-threading, cores, or sockets.
-There must be language mechanisms to create, block and unblock, and join with a thread.
+There must be language mechanisms to create, block and unblock, and join with a thread, even if the mechanism is indirect.
 
 \item[\newterm{MES}:]
@@ -421,5 +421,5 @@
 Limiting MES, \eg no access to shared data, results in contrived solutions and inefficiency on multi-core von Neumann computers where shared memory is a foundational aspect of its design.
 \end{description}
-These properties are fundamental because they cannot be built from existing language features, \eg a basic programming language like C99~\cite{C99} cannot create new control-flow features, concurrency, or provide MES using atomic hardware mechanisms.
+These properties are fundamental because they cannot be built from existing language features, \eg a basic programming language like C99~\cite{C99} cannot create new control-flow features, concurrency, or provide MES without atomic hardware mechanisms.
 
 
@@ -432,5 +432,5 @@
 Table~\ref{t:ExecutionPropertyComposition} shows all combinations of the three fundamental execution properties available to language designers.
 (When doing combination case-analysis, not all combinations are meaningful.)
-The combinations of state, thread, and mutual exclusion compose a hierarchy of control-flow features all of which have appeared in prior programming languages, where each of these languages have found the feature useful.
+The combinations of state, thread, and MES compose a hierarchy of control-flow features, all of which have appeared in prior programming languages, where each of these languages has found the feature useful.
 To understand the table, it is important to review the basic von Neumann execution requirement of at least one thread and execution state providing some form of call stack.
 For table entries missing these minimal components, the property is borrowed from the invoker (caller).
@@ -468,5 +468,5 @@
 A @mutex@ structure, often called a \newterm{monitor}, provides a high-level interface for race-free access of shared data in concurrent programming-languages.
 Case 3 is case 1 where the structure can implicitly retain execution state and access functions use this execution state to resume/suspend across \emph{callers}, but resume/suspend does not retain a function's local state.
-A stackless structure, often called a \newterm{generator} or \emph{iterator}, is \newterm{stackless} because it still borrow the caller's stack and thread, where the stack is used only to preserve state across its callees not callers.
+A stackless structure, often called a \newterm{generator} or \emph{iterator}, is \newterm{stackless} because it still borrows the caller's stack and thread, but the stack is used only to preserve state across its callees, not callers.
 Generators provide the first step toward directly solving problems like finite-state machines that retain data and execution state between calls, whereas normal functions restart on each call.
 Case 4 is cases 2 and 3 with thread safety during execution of the generator's access functions.
@@ -480,5 +480,5 @@
 Cases 11 and 12 are a stackful thread with and without safe access to shared state.
 A thread is the language mechanism to start another thread of control in a program with growable execution state for call/return execution.
-In general, more execution properties increase the cost of creation and execution along with complexity of usage.
+In general, language constructs with more execution properties increase the cost of creation and execution along with complexity of usage.
 
 Given the execution-properties taxonomy, programmers now ask three basic questions: is state necessary across callers and how much, is a separate thread necessary, is thread safety necessary.
@@ -594,5 +594,5 @@
 &
 \begin{cfa}
-void * rtn( void * arg ) { ... }
+void * `rtn`( void * arg ) { ... }
 int i = 3, rc;
 pthread_t t; $\C{// thread id}$
@@ -3037,4 +3037,143 @@
 \end{multicols}
 
+\vspace*{-10pt}
+\paragraph{Internal Scheduling}
+
+Internal scheduling is measured using a cycle of two threads signalling and waiting.
+Figure~\ref{f:schedint} shows the code for \CFA, with results in Table~\ref{t:schedint}.
+Note the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects.
+Java scheduling cost is significantly greater because the benchmark explicitly creates multiple threads to prevent the JIT from making the program sequential, \ie removing all locking.
+
+\begin{multicols}{2}
+\lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
+\begin{cfa}
+volatile int go = 0;
+@condition c;@
+@monitor@ M {} m1/*, m2, m3, m4*/;
+void call( M & @mutex p1/*, p2, p3, p4*/@ ) {
+	@signal( c );@
+}
+void wait( M & @mutex p1/*, p2, p3, p4*/@ ) {
+	go = 1;	// continue other thread
+	for ( N ) { @wait( c );@ }
+}
+thread T {};
+void main( T & ) {
+	while ( go == 0 ) { yield(); } // waiter must start first
+	BENCH( for ( N ) { call( m1/*, m2, m3, m4*/ ); } )
+	sout | result;
+}
+int main() {
+	T t;
+	wait( m1/*, m2, m3, m4*/ );
+}
+\end{cfa}
+\vspace*{-8pt}
+\captionof{figure}{\CFA internal-scheduling benchmark}
+\label{f:schedint}
+
+\columnbreak
+
+\vspace*{-16pt}
+\captionof{table}{Internal-scheduling comparison (nanoseconds)}
+\label{t:schedint}
+\bigskip
+
+\begin{tabular}{@{}r*{3}{D{.}{.}{5.2}}@{}}
+\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
+\CFA @signal@, 1 monitor	& 364.4		& 364.2		& 4.4		\\
+\CFA @signal@, 2 monitor	& 484.4		& 483.9		& 8.8		\\
+\CFA @signal@, 4 monitor	& 709.1		& 707.7		& 15.0		\\
+\uC @signal@ monitor		& 328.3		& 327.4		& 2.4		\\
+Rust cond. variable			& 7514.0	& 7437.4	& 397.2		\\
+Java @notify@ monitor		& 9623.0	& 9654.6	& 236.2		\\
+Pthreads cond. variable		& 5553.7	& 5576.1	& 345.6
+\end{tabular}
+\end{multicols}
+
+
+\paragraph{External Scheduling}
+
+External scheduling is measured using a cycle of two threads calling and accepting the call using the @waitfor@ statement.
+Figure~\ref{f:schedext} shows the code for \CFA with results in Table~\ref{t:schedext}.
+Note the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects.
+
+\begin{multicols}{2}
+\lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
+\vspace*{-16pt}
+\begin{cfa}
+@monitor@ M {} m1/*, m2, m3, m4*/;
+void call( M & @mutex p1/*, p2, p3, p4*/@ ) {}
+void wait( M & @mutex p1/*, p2, p3, p4*/@ ) {
+	for ( N ) { @waitfor( call : p1/*, p2, p3, p4*/ );@ }
+}
+thread T {};
+void main( T & ) {
+	BENCH( for ( N ) { call( m1/*, m2, m3, m4*/ ); } )
+	sout | result;
+}
+int main() {
+	T t;
+	wait( m1/*, m2, m3, m4*/ );
+}
+\end{cfa}
+\captionof{figure}{\CFA external-scheduling benchmark}
+\label{f:schedext}
+
+\columnbreak
+
+\vspace*{-16pt}
+\captionof{table}{External-scheduling comparison (nanoseconds)}
+\label{t:schedext}
+\begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
+\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
+\CFA @waitfor@, 1 monitor	& 367.1	& 365.3	& 5.0	\\
+\CFA @waitfor@, 2 monitor	& 463.0	& 464.6	& 7.1	\\
+\CFA @waitfor@, 4 monitor	& 689.6	& 696.2	& 21.5	\\
+\uC \lstinline[language=uC++]|_Accept| monitor	& 328.2	& 329.1	& 3.4	\\
+Go \lstinline[language=Golang]|select| channel	& 365.0	& 365.5	& 1.2
+\end{tabular}
+\end{multicols}
+
+\paragraph{Mutual-Exclusion}
+
+Uncontended mutual exclusion, which frequently occurs, is measured by entering and leaving a critical section.
+For monitors, entering and leaving a mutex function is measured; otherwise, the language-appropriate mutex-lock is measured.
+For comparison, a spinning (versus blocking) test-and-test-set lock is presented.
+Figure~\ref{f:mutex} shows the code for \CFA with results in Table~\ref{t:mutex}.
+Note the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects.
+
+\begin{multicols}{2}
+\lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
+\begin{cfa}
+@monitor@ M {} m1/*, m2, m3, m4*/;
+void call( M & @mutex p1/*, p2, p3, p4*/@ ) {}
+int main() {
+	BENCH( for( N ) call( m1/*, m2, m3, m4*/ ); )
+	sout | result;
+}
+\end{cfa}
+\captionof{figure}{\CFA acquire/release mutex benchmark}
+\label{f:mutex}
+
+\columnbreak
+
+\vspace*{-16pt}
+\captionof{table}{Mutex comparison (nanoseconds)}
+\label{t:mutex}
+\begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
+\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
+test-and-test-set lock			& 19.1	& 18.9	& 0.4	\\
+\CFA @mutex@ function, 1 arg.	& 48.3	& 47.8	& 0.9	\\
+\CFA @mutex@ function, 2 arg.	& 86.7	& 87.6	& 1.9	\\
+\CFA @mutex@ function, 4 arg.	& 173.4	& 169.4	& 5.9	\\
+\uC @monitor@ member rtn.		& 54.8	& 54.8	& 0.1	\\
+Goroutine mutex lock			& 34.0	& 34.0	& 0.0	\\
+Rust mutex lock					& 33.0	& 33.2	& 0.8	\\
+Java synchronized method		& 31.0	& 31.0	& 0.0	\\
+Pthreads mutex lock				& 31.0	& 31.1	& 0.4
+\end{tabular}
+\end{multicols}
+
 \paragraph{Context Switching}
 
@@ -3104,143 +3243,4 @@
 \end{multicols}
 
-\vspace*{-10pt}
-\paragraph{Internal Scheduling}
-
-Internal scheduling is measured using a cycle of two threads signalling and waiting.
-Figure~\ref{f:schedint} shows the code for \CFA, with results in Table~\ref{t:schedint}.
-Note, the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects.
-Java scheduling is significantly greater because the benchmark explicitly creates multiple thread in order to prevent the JIT from making the program sequential, \ie removing all locking.
-
-\begin{multicols}{2}
-\lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
-\begin{cfa}
-volatile int go = 0;
-@condition c;@
-@monitor@ M {} m1/*, m2, m3, m4*/;
-void call( M & @mutex p1/*, p2, p3, p4*/@ ) {
-	@signal( c );@
-}
-void wait( M & @mutex p1/*, p2, p3, p4*/@ ) {
-	go = 1;	// continue other thread
-	for ( N ) { @wait( c );@ } );
-}
-thread T {};
-void main( T & ) {
-	while ( go == 0 ) { yield(); } // waiter must start first
-	BENCH( for ( N ) { call( m1/*, m2, m3, m4*/ ); } )
-	sout | result;
-}
-int main() {
-	T t;
-	wait( m1/*, m2, m3, m4*/ );
-}
-\end{cfa}
-\vspace*{-8pt}
-\captionof{figure}{\CFA Internal-scheduling benchmark}
-\label{f:schedint}
-
-\columnbreak
-
-\vspace*{-16pt}
-\captionof{table}{Internal-scheduling comparison (nanoseconds)}
-\label{t:schedint}
-\bigskip
-
-\begin{tabular}{@{}r*{3}{D{.}{.}{5.2}}@{}}
-\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
-\CFA @signal@, 1 monitor	& 364.4		& 364.2		& 4.4		\\
-\CFA @signal@, 2 monitor	& 484.4		& 483.9		& 8.8		\\
-\CFA @signal@, 4 monitor	& 709.1		& 707.7		& 15.0		\\
-\uC @signal@ monitor		& 328.3		& 327.4		& 2.4		\\
-Rust cond. variable			& 7514.0	& 7437.4	& 397.2		\\
-Java @notify@ monitor		& 9623.0	& 9654.6	& 236.2		\\
-Pthreads cond. variable		& 5553.7	& 5576.1	& 345.6
-\end{tabular}
-\end{multicols}
-
-
-\paragraph{External Scheduling}
-
-External scheduling is measured using a cycle of two threads calling and accepting the call using the @waitfor@ statement.
-Figure~\ref{f:schedext} shows the code for \CFA with results in Table~\ref{t:schedext}.
-Note, the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects.
-
-\begin{multicols}{2}
-\lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
-\vspace*{-16pt}
-\begin{cfa}
-@monitor@ M {} m1/*, m2, m3, m4*/;
-void call( M & @mutex p1/*, p2, p3, p4*/@ ) {}
-void wait( M & @mutex p1/*, p2, p3, p4*/@ ) {
-	for ( N ) { @waitfor( call : p1/*, p2, p3, p4*/ );@ }
-}
-thread T {};
-void main( T & ) {
-	BENCH( for ( N ) { call( m1/*, m2, m3, m4*/ ); } )
-	sout | result;
-}
-int main() {
-	T t;
-	wait( m1/*, m2, m3, m4*/ );
-}
-\end{cfa}
-\captionof{figure}{\CFA external-scheduling benchmark}
-\label{f:schedext}
-
-\columnbreak
-
-\vspace*{-16pt}
-\captionof{table}{External-scheduling comparison (nanoseconds)}
-\label{t:schedext}
-\begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
-\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
-\CFA @waitfor@, 1 monitor	& 367.1	& 365.3	& 5.0	\\
-\CFA @waitfor@, 2 monitor	& 463.0	& 464.6	& 7.1	\\
-\CFA @waitfor@, 4 monitor	& 689.6	& 696.2	& 21.5	\\
-\uC \lstinline[language=uC++]|_Accept| monitor	& 328.2	& 329.1	& 3.4	\\
-Go \lstinline[language=Golang]|select| channel	& 365.0	& 365.5	& 1.2
-\end{tabular}
-\end{multicols}
-
-\paragraph{Mutual-Exclusion}
-
-Uncontented mutual exclusion, which frequently occurs, is measured by entering and leaving a critical section.
-For monitors, entering and leaving a mutex function is measured, otherwise the language-appropriate mutex-lock is measured.
-For comparison, a spinning (versus blocking) test-and-test-set lock is presented.
-Figure~\ref{f:mutex} shows the code for \CFA with results in Table~\ref{t:mutex}.
-Note the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects.
-
-\begin{multicols}{2}
-\lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
-\begin{cfa}
-@monitor@ M {} m1/*, m2, m3, m4*/;
-call( M & @mutex p1/*, p2, p3, p4*/@ ) {}
-int main() {
-	BENCH( for( N ) call( m1/*, m2, m3, m4*/ ); )
-	sout | result;
-}
-\end{cfa}
-\captionof{figure}{\CFA acquire/release mutex benchmark}
-\label{f:mutex}
-
-\columnbreak
-
-\vspace*{-16pt}
-\captionof{table}{Mutex comparison (nanoseconds)}
-\label{t:mutex}
-\begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
-\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
-test-and-test-set lock			& 19.1	& 18.9	& 0.4	\\
-\CFA @mutex@ function, 1 arg.	& 48.3	& 47.8	& 0.9	\\
-\CFA @mutex@ function, 2 arg.	& 86.7	& 87.6	& 1.9	\\
-\CFA @mutex@ function, 4 arg.	& 173.4	& 169.4	& 5.9	\\
-\uC @monitor@ member rtn.		& 54.8	& 54.8	& 0.1	\\
-Goroutine mutex lock			& 34.0	& 34.0	& 0.0	\\
-Rust mutex lock					& 33.0	& 33.2	& 0.8	\\
-Java synchronized method		& 31.0	& 31.0	& 0.0	\\
-Pthreads mutex Lock				& 31.0	& 31.1	& 0.4
-\end{tabular}
-\end{multicols}
-
 
 \subsection{Discussion}
