Index: doc/papers/concurrency/Paper.tex
===================================================================
--- doc/papers/concurrency/Paper.tex	(revision c6dc7f21316a7e361444c525b229e2e99fcf5004)
+++ doc/papers/concurrency/Paper.tex	(revision 54532378b72c2a8d3797e5f161bae32815a50349)
@@ -307,7 +307,7 @@
 In many ways, \CFA is to C as Scala~\cite{Scala} is to Java, providing a \emph{research vehicle} for new typing and control-flow capabilities on top of a highly popular programming language allowing immediate dissemination.
 Within the \CFA framework, new control-flow features are created from scratch because ISO \Celeven defines only a subset of the \CFA extensions, where the overlapping features are concurrency~\cite[\S~7.26]{C11}.
-However, \Celeven concurrency is largely wrappers for a subset of the pthreads library~\cite{Butenhof97,Pthreads}, and \Celeven and pthreads concurrency is simple, based on thread fork/join in a function and a few locks, which is low-level and error-prone;
+However, \Celeven concurrency is largely wrappers for a subset of the pthreads library~\cite{Butenhof97,Pthreads}, and \Celeven and pthreads concurrency is simple, based on thread fork/join in a function and mutex/condition locks, which is low-level and error-prone;
 no high-level language concurrency features are defined.
-Interestingly, almost a decade after publication of the \Celeven standard, neither gcc-8, clang-9 nor msvc-19 (most recent versions) support the \Celeven include @threads.h@, indicating little interest in the C11 concurrency approach.
+Interestingly, almost a decade after publication of the \Celeven standard, neither gcc-8, clang-9 nor msvc-19 (most recent versions) support the \Celeven include @threads.h@, indicating little interest in the C11 concurrency approach (possibly because of the effort to add concurrency to \CC).
 Finally, while the \Celeven standard does not state a threading model, the historical association with pthreads suggests implementations would adopt kernel-level threading (1:1)~\cite{ThreadModel}.
 
@@ -333,5 +333,7 @@
 
 Finally, it is important for a language to provide safety over performance \emph{as the default}, allowing careful reduction of safety for performance when necessary.
-Two concurrency violations of this philosophy are \emph{spurious wakeup} (random wakeup~\cite[\S~8]{Buhr05a}) and \emph{barging} (signals-as-hints~\cite[\S~8]{Buhr05a}), where one is a consequence of the other, \ie once there is spurious wakeup, signals-as-hints follow.
+Two concurrency violations of this philosophy are \emph{spurious wakeup} (random wakeup~\cite[\S~8]{Buhr05a}) and \emph{barging}\footnote{
+The notion of competitive succession instead of direct handoff, \ie a lock owner releases the lock and an arriving thread acquires it ahead of preexisting waiter threads.
+} (signals-as-hints~\cite[\S~8]{Buhr05a}), where one is a consequence of the other, \ie once there is spurious wakeup, signals-as-hints follow.
 However, spurious wakeup is \emph{not} a foundational concurrency property~\cite[\S~8]{Buhr05a}, it is a performance design choice.
 Similarly, signals-as-hints are often a performance decision.
@@ -351,5 +353,5 @@
 We present comparative examples so the reader can judge if the \CFA control-flow extensions are better and safer than those in other concurrent, imperative programming languages, and perform experiments to show the \CFA runtime is competitive with other similar mechanisms.
 The main contributions of this work are:
-\begin{itemize}
+\begin{itemize}[topsep=3pt,itemsep=1pt]
 \item
 language-level generators, coroutines and user-level threading, which respect the expectations of C programmers.
@@ -370,6 +372,14 @@
 \end{itemize}
 
+Section~\ref{s:StatefulFunction} begins advanced control by introducing sequential functions that retain data and execution state between calls, which produces the constructs @generator@ and @coroutine@.
+Section~\ref{s:Concurrency} begins concurrency, or how to create (fork) and destroy (join) a thread, which produces the @thread@ construct.
+Section~\ref{s:MutualExclusionSynchronization} discusses the two mechanisms to restrict nondeterminism when controlling shared access to resources (mutual exclusion) and timing relationships among threads (synchronization).
+Section~\ref{s:Monitor} shows how both mutual exclusion and synchronization are safely embedded in the @monitor@ and @thread@ constructs.
+Section~\ref{s:CFARuntimeStructure} describes the large-scale mechanism to structure (cluster) threads and virtual processors (kernel threads).
+Section~\ref{s:Performance} uses a series of microbenchmarks to compare \CFA threading with pthreads, Java OpenJDK-9, Go 1.12.6 and \uC 7.0.0.
+
 
 \section{Stateful Function}
+\label{s:StatefulFunction}
 
 The stateful function is an old idea~\cite{Conway63,Marlin80} that is new again~\cite{C++20Coroutine19}, where execution is temporarily suspended and later resumed, \eg plugin, device driver, finite-state machine.
@@ -617,5 +627,7 @@
 Figure~\ref{f:CFibonacciSim} shows the C implementation of the \CFA generator only needs one additional field, @next@, to handle retention of execution state.
 The computed @goto@ at the start of the generator main, which branches after the previous suspend, adds very little cost to the resume call.
-Finally, an explicit generator type provides both design and performance benefits, such as multiple type-safe interface functions taking and returning arbitrary types.
+Finally, an explicit generator type provides both design and performance benefits, such as multiple type-safe interface functions taking and returning arbitrary types.\footnote{
+The \CFA operator syntax uses \lstinline|?| to denote operands, which allows precise definitions for prefix, postfix, and infix operators, \eg \lstinline|++?|, \lstinline|?++|, and \lstinline|?+?|; in addition, \lstinline|?\{\}| denotes a constructor, as in \lstinline|foo `f` = `\{`...`\}`|, \lstinline|^?\{\}| denotes a destructor, and \lstinline|?()| is the \CC function call \lstinline|operator()|.
+}%
 \begin{cfa}
 int ?()( Fib & fib ) { return `resume( fib )`.fn; } $\C[3.9in]{// function-call interface}$
@@ -1511,4 +1523,5 @@
 
 \section{Mutual Exclusion / Synchronization}
+\label{s:MutualExclusionSynchronization}
 
 Unrestricted nondeterminism is meaningless as there is no way to know when the result is completed without synchronization.
@@ -1551,5 +1564,5 @@
 higher-level mechanisms often simplify usage by adding better coupling between synchronization and data, \eg receive-specific versus receive-any thread in message passing or offering specialized solutions, \eg barrier lock.
 Often synchronization is used to order access to a critical section, \eg ensuring a waiting writer thread enters the critical section before a calling reader thread.
-If the calling reader is scheduled before the waiting writer, the reader has \newterm{barged}.
+If the calling reader is scheduled before the waiting writer, the reader has barged.
 Barging can result in staleness/freshness problems, where a reader barges ahead of a writer and reads temporally stale data, or a writer barges ahead of another writer overwriting data with a fresh value preventing the previous value from ever being read (lost computation).
 Preventing or detecting barging is an involved challenge with low-level locks, which is made easier through higher-level constructs.
@@ -2120,7 +2133,7 @@
 
 
-\subsection{Extended \protect\lstinline@waitfor@}
-
-Figure~\ref{f:ExtendedWaitfor} show the extended form of the @waitfor@ statement to conditionally accept one of a group of mutex functions, with an optional statement to be performed \emph{after} the mutex function finishes.
+\subsection{\texorpdfstring{Extended \protect\lstinline@waitfor@}{Extended waitfor}}
+
+Figure~\ref{f:ExtendedWaitfor} shows the extended form of the @waitfor@ statement to conditionally accept one of a group of mutex functions, with an optional statement to be performed \emph{after} the mutex function finishes.
 For a @waitfor@ clause to be executed, its @when@ must be true and an outstanding call to its corresponding member(s) must exist.
 The \emph{conditional-expression} of a @when@ may call a function, but the function must not block or context switch.
@@ -2131,4 +2144,5 @@
 Hence, the terminating @else@ clause allows a conditional attempt to accept a call without blocking.
 If both @timeout@ and @else@ clause are present, the @else@ must be conditional, or the @timeout@ is never triggered.
+There is also a traditional future wait queue (not shown), \eg Microsoft @WaitForMultipleObjects@, to wait for a specified number of future elements in the queue.
 
 \begin{figure}
@@ -2355,5 +2369,5 @@
 
 
-\subsection{\protect\lstinline@mutex@ Threads}
+\subsection{\texorpdfstring{\protect\lstinline@mutex@ Threads}{mutex Threads}}
 
 Threads in \CFA can also be monitors to allow \emph{direct communication} among threads, \ie threads can have mutex functions that are called by other threads.
@@ -2499,6 +2513,6 @@
 \renewcommand{\arraystretch}{1.25}
 %\setlength{\tabcolsep}{5pt}
-\begin{tabular}{c|c|l|l}
-\multicolumn{2}{c|}{object properties} & \multicolumn{2}{c}{mutual exclusion} \\
+\begin{tabular}{c|c||l|l}
+\multicolumn{2}{c||}{object properties} & \multicolumn{2}{c}{mutual exclusion} \\
 \hline
 thread	& stateful				& \multicolumn{1}{c|}{No} & \multicolumn{1}{c}{Yes} \\
@@ -2605,5 +2619,5 @@
 
 
-\section{\protect\CFA Runtime Structure}
+\section{Runtime Structure}
 \label{s:CFARuntimeStructure}
 
@@ -2709,5 +2723,5 @@
 
 \section{Performance}
-\label{results}
+\label{s:Performance}
 
 To verify the implementation of the \CFA runtime, a series of microbenchmarks are performed comparing \CFA with pthreads, Java OpenJDK-9, Go 1.12.6 and \uC 7.0.0.
@@ -2715,5 +2729,5 @@
 The benchmark computer is an AMD Opteron\texttrademark\ 6380 NUMA 64-core, 8 socket, 2.5 GHz processor, running Ubuntu 16.04.6 LTS, and \CFA/\uC are compiled with gcc 6.5.
 
-All benchmarks are run using the following harness.
+All benchmarks are run using the following harness. (The Java harness is augmented to circumvent JIT issues.)
 \begin{cfa}
 unsigned int N = 10_000_000;
@@ -2754,13 +2768,150 @@
 \begin{tabular}[t]{@{}r*{3}{D{.}{.}{5.2}}@{}}
 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
-\CFA Coroutine Lazy		& 14.3		& 14.3		& 0.32		\\
-\CFA Coroutine Eager	& 522.8		& 525.3		& 5.81		\\
-\CFA Thread				& 1257.8	& 1291.2	& 86.19		\\
-\uC Coroutine			& 92.2		& 91.4		& 1.58		\\
-\uC Thread				& 499.5		& 500.1		& 5.67		\\
-Goroutine				& 4397.0	& 4362.8	& 390.77	\\
-Java Thread				& 107405.0	& 107794.8	& 1601.33	\\
-% Qthreads				& 159.9		& 159.6		& 0.73		\\
-Pthreads				& 32920.9	& 32882.7	& 213.55
+\CFA Coroutine Lazy		& 13.2		& 13.1		& 0.44		\\
+\CFA Coroutine Eager	& 531.3		& 536.0		& 26.54		\\
+\CFA Thread				& 2074.9	& 2066.5	& 170.76	\\
+\uC Coroutine			& 89.6		& 90.5		& 1.83		\\
+\uC Thread				& 528.2		& 528.5		& 4.94		\\
+Goroutine				& 4068.0	& 4113.1	& 414.55	\\
+Java Thread				& 103848.5	& 104295.4	& 2637.57	\\
+Pthreads				& 33112.6	& 33127.1	& 165.90
+\end{tabular}
+\end{multicols}
+
+
+\paragraph{Context-Switching}
+
+In procedural programming, the cost of a function call is important as modularization (refactoring) increases.
+(In many cases, a compiler inlines function calls to eliminate this cost.)
+Similarly, when modularization extends to coroutines/tasks, the time for a context switch becomes a relevant factor.
+The coroutine test is from resumer to suspender and from suspender to resumer, which is two context switches.
+The thread test is using yield to enter and return from the runtime kernel, which is two context switches.
+The difference in performance between coroutine and thread context-switch is the cost of scheduling for threads, whereas coroutines are self-scheduling.
+Figure~\ref{f:ctx-switch} only shows the \CFA code for coroutines/threads (other systems are similar) with all results in Table~\ref{tab:ctx-switch}.
+
+\begin{multicols}{2}
+\lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
+\begin{cfa}[aboveskip=0pt,belowskip=0pt]
+@coroutine@ C {} c;
+void main( C & ) { for ( ;; ) { @suspend;@ } }
+int main() { // coroutine test
+	BENCH( for ( N ) { @resume( c );@ } )
+	sout | result`ns;
+}
+int main() { // task test
+	BENCH( for ( N ) { @yield();@ } )
+	sout | result`ns;
+}
+\end{cfa}
+\captionof{figure}{\CFA context-switch benchmark}
+\label{f:ctx-switch}
+
+\columnbreak
+
+\vspace*{-16pt}
+\captionof{table}{Context switch comparison (nanoseconds)}
+\label{tab:ctx-switch}
+\begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
+\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
+C function		& 1.8	& 1.8	& 0.01	\\
+\CFA generator	& 2.4	& 2.2	& 0.25	\\
+\CFA Coroutine	& 36.2	& 36.2	& 0.25	\\
+\CFA Thread		& 93.2	& 93.5	& 2.09	\\
+\uC Coroutine	& 52.0	& 52.1	& 0.51	\\
+\uC Thread		& 96.2	& 96.3	& 0.58	\\
+Goroutine		& 141.0	& 141.3	& 3.39	\\
+Java Thread		& 374.0	& 375.8	& 10.38	\\
+Pthreads Thread	& 361.0	& 365.3 & 13.19
+\end{tabular}
+\end{multicols}
+
+
+\paragraph{Mutual-Exclusion}
+
+Uncontended mutual exclusion, which frequently occurs, is measured by entering/leaving a critical section.
+For monitors, entering and leaving a monitor function is measured.
+To put the results in context, the cost of entering a non-inline function and the cost of acquiring and releasing a @pthread_mutex@ lock are also measured.
+Figure~\ref{f:mutex} shows the code for \CFA with all results in Table~\ref{tab:mutex}.
+Note the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects.
+
+\begin{multicols}{2}
+\lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
+\begin{cfa}
+@monitor@ M {} m1/*, m2, m3, m4*/;
+void __attribute__((noinline))
+do_call( M & @mutex m/*, m2, m3, m4*/@ ) {}
+int main() {
+	BENCH(
+		for( N ) do_call( m1/*, m2, m3, m4*/ );
+	)
+	sout | result`ns;
+}
+\end{cfa}
+\captionof{figure}{\CFA acquire/release mutex benchmark}
+\label{f:mutex}
+
+\columnbreak
+
+\vspace*{-16pt}
+\captionof{table}{Mutex comparison (nanoseconds)}
+\label{tab:mutex}
+\begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
+\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
+test-and-test-and-set lock		& 19.1	& 18.9	& 0.40	\\
+\CFA @mutex@ function, 1 arg.	& 45.9	& 46.6	& 1.45	\\
+\CFA @mutex@ function, 2 arg.	& 105.0	& 104.7	& 3.08	\\
+\CFA @mutex@ function, 4 arg.	& 165.0	& 167.6	& 5.65	\\
+\uC @monitor@ member rtn.		& 54.0	& 53.7	& 0.82	\\
+Java synchronized method		& 31.0	& 31.1	& 0.50	\\
+Pthreads Mutex Lock				& 33.6	& 32.6	& 1.14
+\end{tabular}
+\end{multicols}
+
+
+\paragraph{External Scheduling}
+
+External scheduling is measured using a cycle of two threads calling and accepting the call using the @waitfor@ statement.
+Figure~\ref{f:ext-sched} shows the code for \CFA, with results in Table~\ref{tab:ext-sched}.
+Note the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects.
+
+\begin{multicols}{2}
+\lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
+\vspace*{-16pt}
+\begin{cfa}
+volatile int go = 0;
+@monitor@ M {} m;
+thread T {};
+void __attribute__((noinline))
+do_call( M & @mutex@ ) {}
+void main( T & ) {
+	while ( go == 0 ) { yield(); }
+	while ( go == 1 ) { do_call( m ); }
+}
+int __attribute__((noinline))
+do_wait( M & @mutex@ m ) {
+	go = 1;	// continue other thread
+	BENCH( for ( N ) { @waitfor( do_call, m );@ } )
+	go = 0;	// stop other thread
+	sout | result`ns;
+}
+int main() {
+	T t;
+	do_wait( m );
+}
+\end{cfa}
+\captionof{figure}{\CFA external-scheduling benchmark}
+\label{f:ext-sched}
+
+\columnbreak
+
+\vspace*{-16pt}
+\captionof{table}{External-scheduling comparison (nanoseconds)}
+\label{tab:ext-sched}
+\begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
+\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
+\CFA @waitfor@, 1 @monitor@	& 376.4	& 376.8	& 7.63	\\
+\CFA @waitfor@, 2 @monitor@	& 491.4	& 492.0	& 13.31	\\
+\CFA @waitfor@, 4 @monitor@	& 681.0	& 681.7	& 19.10	\\
+\uC @_Accept@				& 331.1	& 331.4	& 2.66
 \end{tabular}
 \end{multicols}
@@ -2810,149 +2961,10 @@
 \begin{tabular}{@{}r*{3}{D{.}{.}{5.2}}@{}}
 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
-\CFA @signal@, 1 @monitor@	& 367.0		& 371.5		& 17.34		\\
-\CFA @signal@, 2 @monitor@	& 477.2		& 478.6		& 8.31		\\
-\CFA @signal@, 4 @monitor@	& 725.8		& 734.0		& 17.98		\\
-\uC @signal@				& 322.8		& 323.0 	& 3.64		\\
-Java @notify@				& 16520.0	& 20096.7	& 9378.53	\\
-Pthreads Cond. Variable		& 4931.3	& 5057.0 	& 326.80
-\end{tabular}
-\end{multicols}
-
-
-\paragraph{External Scheduling}
-
-External scheduling is measured using a cycle of two threads calling and accepting the call using the @waitfor@ statement.
-Figure~\ref{f:ext-sched} shows the code for \CFA, with results in Table~\ref{tab:ext-sched}.
-Note, the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects.
-
-\begin{multicols}{2}
-\lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
-\vspace*{-16pt}
-\begin{cfa}
-volatile int go = 0;
-@monitor@ M {} m;
-thread T {};
-void __attribute__((noinline))
-do_call( M & @mutex@ ) {}
-void main( T & ) {
-	while ( go == 0 ) { yield(); }
-	while ( go == 1 ) { do_call( m ); }
-}
-int __attribute__((noinline))
-do_wait( M & @mutex@ m ) {
-	go = 1;	// continue other thread
-	BENCH( for ( N ) { @waitfor( do_call, m );@ } )
-	go = 0;	// stop other thread
-	sout | result`ns;
-}
-int main() {
-	T t;
-	do_wait( m );
-}
-\end{cfa}
-\captionof{figure}{\CFA external-scheduling benchmark}
-\label{f:ext-sched}
-
-\columnbreak
-
-\vspace*{-16pt}
-\captionof{table}{External-scheduling comparison (nanoseconds)}
-\label{tab:ext-sched}
-\begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
-\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
-\CFA @waitfor@, 1 @monitor@	& 366.7		& 369.5	& 7.52	\\
-\CFA @waitfor@, 2 @monitor@	& 453.6		& 455.8	& 12.38	\\
-\CFA @waitfor@, 4 @monitor@	& 671.6		& 672.4	& 14.16	\\
-\uC @_Accept@				& 336.0		& 335.8		& 3.22
-\end{tabular}
-\end{multicols}
-
-
-\paragraph{Context-Switching}
-
-In procedural programming, the cost of a function call is important as modularization (refactoring) increases.
-(In many cases, a compiler inlines function calls to eliminate this cost.)
-Similarly, when modularization extends to coroutines/tasks, the time for a context switch becomes a relevant factor.
-The coroutine test is from resumer to suspender and from suspender to resumer, which is two context switches.
-The thread test is using yield to enter and return from the runtime kernel, which is two context switches.
-The difference in performance between coroutine and thread context-switch is the cost of scheduling for threads, whereas coroutines are self-scheduling.
-Figure~\ref{f:ctx-switch} only shows the \CFA code for coroutines/threads (other systems are similar) with all results in Table~\ref{tab:ctx-switch}.
-
-\begin{multicols}{2}
-\lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
-\begin{cfa}[aboveskip=0pt,belowskip=0pt]
-@coroutine@ C {} c;
-void main( C & ) { for ( ;; ) { @suspend;@ } }
-int main() { // coroutine test
-	BENCH( for ( N ) { @resume( c );@ } )
-	sout | result`ns;
-}
-int main() { // task test
-	BENCH( for ( N ) { @yield();@ } )
-	sout | result`ns;
-}
-\end{cfa}
-\captionof{figure}{\CFA context-switch benchmark}
-\label{f:ctx-switch}
-
-\columnbreak
-
-\vspace*{-16pt}
-\captionof{table}{Context switch comparison (nanoseconds)}
-\label{tab:ctx-switch}
-\begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
-\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
-C function		& 1.8		& 1.8	& 0		\\
-\CFA generator	& 2.7		& 2.4	& 0.27	\\
-\CFA Coroutine	& 37.8		& 37.7	& 0.22	\\
-\CFA Thread		& 93.6		& 93.8	& 1.46	\\
-\uC Coroutine	& 52.7		& 52.8	& 0.28	\\
-\uC Thread		& 93.4		& 93.7	& 1.04	\\
-Goroutine		& 140.0		& 139.7	& 2.93	\\
-Java Thread		& 374.0		& 375.8	& 10.38	\\
-% Qthreads Thread	& 159.5		& 159.3	& 0.71	\\
-Pthreads Thread	& 334.4		& 335.0	& 1.95	\\
-\end{tabular}
-\end{multicols}
-
-
-\paragraph{Mutual-Exclusion}
-
-Uncontented mutual exclusion, which frequently occurs, is measured by entering/leaving a critical section.
-For monitors, entering and leaving a monitor function is measured.
-To put the results in context, the cost of entering a non-inline function and the cost of acquiring and releasing a @pthread_mutex@ lock is also measured.
-Figure~\ref{f:mutex} shows the code for \CFA with all results in Table~\ref{tab:mutex}.
-Note, the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects.
-
-\begin{multicols}{2}
-\lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
-\begin{cfa}
-@monitor@ M {} m1/*, m2, m3, m4*/;
-void __attribute__((noinline))
-do_call( M & @mutex m/*, m2, m3, m4*/@ ) {}
-int main() {
-	BENCH(
-		for( N ) do_call( m1/*, m2, m3, m4*/ );
-	)
-	sout | result`ns;
-}
-\end{cfa}
-\captionof{figure}{\CFA acquire/release mutex benchmark}
-\label{f:mutex}
-
-\columnbreak
-
-\vspace*{-16pt}
-\captionof{table}{Mutex comparison (nanoseconds)}
-\label{tab:mutex}
-\begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
-\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
-test and test-and-test lock		& 19.1	& 19.0	& 0.36	\\
-\CFA @mutex@ function, 1 arg.	& 46.6	& 46.8	& 0.86	\\
-\CFA @mutex@ function, 2 arg.	& 84.1	& 85.3	& 1.86	\\
-\CFA @mutex@ function, 4 arg.	& 158.6	& 160.7	& 3.07	\\
-\uC @monitor@ member rtn.		& 54.0	& 53.7	& 0.83	\\
-Java synchronized method		& 27.0	& 27.1	& 0.25	\\
-Pthreads Mutex Lock				& 33.6	& 32.7	& 1.12
+\CFA @signal@, 1 @monitor@	& 372.6		& 374.3		& 14.17		\\
+\CFA @signal@, 2 @monitor@	& 492.7		& 494.1		& 12.99		\\
+\CFA @signal@, 4 @monitor@	& 749.4		& 750.4		& 24.74		\\
+\uC @signal@				& 320.5		& 321.0		& 3.36		\\
+Java @notify@				& 10160.5	& 10169.4	& 267.71	\\
+Pthreads Cond. Variable		& 4949.6	& 5065.2	& 363
 \end{tabular}
 \end{multicols}