Index: doc/papers/concurrency/Paper.tex
===================================================================
--- doc/papers/concurrency/Paper.tex	(revision 6b6b9ba93df695bae2af50bdf40f462d3d60f856)
+++ doc/papers/concurrency/Paper.tex	(revision d052a2c4663fcee729dd27db78b77852e89b7c36)
@@ -224,17 +224,17 @@
 {}
 \lstnewenvironment{C++}[1][]                            % use C++ style
-{\lstset{language=C++,moredelim=**[is][\protect\color{red}]{`}{`},#1}\lstset{#1}}
+{\lstset{language=C++,moredelim=**[is][\protect\color{red}]{`}{`}}\lstset{#1}}
 {}
 \lstnewenvironment{uC++}[1][]
-{\lstset{language=uC++,moredelim=**[is][\protect\color{red}]{`}{`},#1}\lstset{#1}}
+{\lstset{language=uC++,moredelim=**[is][\protect\color{red}]{`}{`}}\lstset{#1}}
 {}
 \lstnewenvironment{Go}[1][]
-{\lstset{language=Golang,moredelim=**[is][\protect\color{red}]{`}{`},#1}\lstset{#1}}
+{\lstset{language=Golang,moredelim=**[is][\protect\color{red}]{`}{`}}\lstset{#1}}
 {}
 \lstnewenvironment{python}[1][]
-{\lstset{language=python,moredelim=**[is][\protect\color{red}]{`}{`},#1}\lstset{#1}}
+{\lstset{language=python,moredelim=**[is][\protect\color{red}]{`}{`}}\lstset{#1}}
 {}
 \lstnewenvironment{java}[1][]
-{\lstset{language=java,moredelim=**[is][\protect\color{red}]{`}{`},#1}\lstset{#1}}
+{\lstset{language=java,moredelim=**[is][\protect\color{red}]{`}{`}}\lstset{#1}}
 {}
 
@@ -284,5 +284,5 @@
 
 \begin{document}
-\linenumbers				% comment out to turn off line numbering
+%\linenumbers				% comment out to turn off line numbering
 
 \maketitle
@@ -2896,5 +2896,5 @@
 \label{s:RuntimeStructureCluster}
 
-A \newterm{cluster} is a collection of user and kernel threads, where the kernel threads run the user threads from the cluster's ready queue, and the operating system runs the kernel threads on the processors from its ready queue.
+A \newterm{cluster} is a collection of user and kernel threads, where the kernel threads run the user threads from the cluster's ready queue, and the operating system runs the kernel threads on the processors from its ready queue~\cite{Buhr90a}.
 The term \newterm{virtual processor} is introduced as a synonym for kernel thread to disambiguate between user and kernel thread.
 From the language perspective, a virtual processor is an actual processor (core).
@@ -2992,7 +2992,10 @@
 \end{cfa}
 where CPU time in nanoseconds is from the appropriate language clock.
-Each benchmark is performed @N@ times, where @N@ is selected so the benchmark runs in the range of 2--20 seconds for the specific programming language.
+Each benchmark is performed @N@ times, where @N@ is selected so the benchmark runs in the range of 2--20 seconds for the specific programming language;
+each @N@ appears after the experiment name in the following tables.
 The total time is divided by @N@ to obtain the average time for a benchmark.
 Each benchmark experiment is run 13 times and the average appears in the table.
+For languages with a runtime JIT (Java, Node.js, Python), a single half-hour-long experiment is run to check stability;
+all long-experiment results are statistically equivalent, \ie median/average/standard-deviation correlate with the short-experiment results, indicating the short experiments reached a steady state.
 All omitted tests for other languages are functionally identical to the \CFA tests and available online~\cite{CforallConcurrentBenchmarks}.
 % tar --exclude-ignore=exclude -cvhf benchmark.tar benchmark
@@ -3006,7 +3009,6 @@
 
 \begin{multicols}{2}
-\lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
-\begin{cfa}
-@coroutine@ MyCoroutine {};
+\begin{cfa}[xleftmargin=0pt]
+`coroutine` MyCoroutine {};
 void ?{}( MyCoroutine & this ) {
 #ifdef EAGER
@@ -3016,5 +3018,5 @@
 void main( MyCoroutine & ) {}
 int main() {
-	BENCH( for ( N ) { @MyCoroutine c;@ } )
+	BENCH( for ( N ) { `MyCoroutine c;` } )
 	sout | result;
 }
@@ -3030,19 +3032,19 @@
 
 \begin{tabular}[t]{@{}r*{3}{D{.}{.}{5.2}}@{}}
-\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
-\CFA generator			& 0.6		& 0.6		& 0.0		\\
-\CFA coroutine lazy		& 13.4		& 13.1		& 0.5		\\
-\CFA coroutine eager	& 144.7		& 143.9		& 1.5		\\
-\CFA thread				& 466.4		& 468.0		& 11.3		\\
-\uC coroutine			& 155.6		& 155.7		& 1.7		\\
-\uC thread				& 523.4		& 523.9		& 7.7		\\
-Python generator		& 123.2		& 124.3		& 4.1		\\
-Node.js generator		& 33.4		& 33.5		& 0.3		\\
-Goroutine thread		& 751.0		& 750.5		& 3.1		\\
-Rust tokio thread		& 1860.0	& 1881.1	& 37.6		\\
-Rust thread				& 53801.0	& 53896.8	& 274.9		\\
-Java thread (   10 000)		& 119256.0	& 119679.2	& 2244.0	\\
-Java thread (1 000 000)		& 123100.0	& 123052.5	& 751.6 	\\
-Pthreads thread			& 31465.5	& 31419.5	& 140.4
+\multicolumn{1}{@{}r}{N\hspace*{10pt}} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
+\CFA generator (1B)			& 0.6		& 0.6		& 0.0		\\
+\CFA coroutine lazy	(100M)	& 13.4		& 13.1		& 0.5		\\
+\CFA coroutine eager (10M)	& 144.7		& 143.9		& 1.5		\\
+\CFA thread (10M)			& 466.4		& 468.0		& 11.3		\\
+\uC coroutine (10M)			& 155.6		& 155.7		& 1.7		\\
+\uC thread (10M)			& 523.4		& 523.9		& 7.7		\\
+Python generator (10M)		& 123.2		& 124.3		& 4.1		\\
+Node.js generator (10M)		& 33.4		& 33.5		& 0.3		\\
+Goroutine thread (10M)		& 751.0		& 750.5		& 3.1		\\
+Rust tokio thread (10M)		& 1860.0	& 1881.1	& 37.6		\\
+Rust thread	(250K)			& 53801.0	& 53896.8	& 274.9		\\
+Java thread (250K)			& 119256.0	& 119679.2	& 2244.0	\\
+% Java thread (1 000 000)		& 123100.0	& 123052.5	& 751.6 	\\
+Pthreads thread	(250K)		& 31465.5	& 31419.5	& 140.4
 \end{tabular}
 \end{multicols}
@@ -3053,19 +3055,20 @@
 Internal scheduling is measured using a cycle of two threads signalling and waiting.
 Figure~\ref{f:schedint} shows the code for \CFA, with results in Table~\ref{t:schedint}.
-Note, the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects.
-Java scheduling is significantly greater because the benchmark explicitly creates multiple threads in order to prevent the JIT from making the program sequential, \ie removing all locking.
+Note, the \CFA incremental cost for bulk acquire is a fixed cost for small numbers of mutex objects.
+User-level threading has one kernel thread, eliminating contention between the threads (direct handoff of the kernel thread).
+Kernel-level threading has two kernel threads, allowing some contention.
 
 \begin{multicols}{2}
-\lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
-\begin{cfa}
+\setlength{\tabcolsep}{3pt}
+\begin{cfa}[xleftmargin=0pt]
 volatile int go = 0;
-@condition c;@
-@monitor@ M {} m1/*, m2, m3, m4*/;
-void call( M & @mutex p1/*, p2, p3, p4*/@ ) {
-	@signal( c );@
-}
-void wait( M & @mutex p1/*, p2, p3, p4*/@ ) {
+`condition c;`
+`monitor` M {} m1/*, m2, m3, m4*/;
+void call( M & `mutex p1/*, p2, p3, p4*/` ) {
+	`signal( c );`
+}
+void wait( M & `mutex p1/*, p2, p3, p4*/` ) {
 	go = 1;	// continue other thread
-	for ( N ) { @wait( c );@ } );
+	for ( N ) { `wait( c );` }
 }
 thread T {};
@@ -3092,13 +3095,13 @@
 
 \begin{tabular}{@{}r*{3}{D{.}{.}{5.2}}@{}}
-\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
-\CFA @signal@, 1 monitor	& 364.4		& 364.2		& 4.4		\\
-\CFA @signal@, 2 monitor	& 484.4		& 483.9		& 8.8		\\
-\CFA @signal@, 4 monitor	& 709.1		& 707.7		& 15.0		\\
-\uC @signal@ monitor		& 328.3		& 327.4		& 2.4		\\
-Rust cond. variable			& 7514.0	& 7437.4	& 397.2		\\
-Java @notify@ monitor (  1 000 000)		& 8717.0	& 8774.1	& 471.8		\\
-Java @notify@ monitor (100 000 000)		& 8634.0	& 8683.5	& 330.5		\\
-Pthreads cond. variable		& 5553.7	& 5576.1	& 345.6
+\multicolumn{1}{@{}r}{N\hspace*{10pt}} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
+\CFA @signal@, 1 monitor (10M)	& 364.4		& 364.2		& 4.4		\\
+\CFA @signal@, 2 monitor (10M)	& 484.4		& 483.9		& 8.8		\\
+\CFA @signal@, 4 monitor (10M)	& 709.1		& 707.7		& 15.0		\\
+\uC @signal@ monitor (10M)		& 328.3		& 327.4		& 2.4		\\
+Rust cond. variable	(1M)		& 7514.0	& 7437.4	& 397.2		\\
+Java @notify@ monitor (1M)		& 8717.0	& 8774.1	& 471.8		\\
+% Java @notify@ monitor (100 000 000)		& 8634.0	& 8683.5	& 330.5		\\
+Pthreads cond. variable (1M)	& 5553.7	& 5576.1	& 345.6
 \end{tabular}
 \end{multicols}
@@ -3109,14 +3112,14 @@
 External scheduling is measured using a cycle of two threads calling and accepting the call using the @waitfor@ statement.
 Figure~\ref{f:schedext} shows the code for \CFA with results in Table~\ref{t:schedext}.
-Note, the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects.
+Note, the \CFA incremental cost for bulk acquire is a fixed cost for small numbers of mutex objects.
 
 \begin{multicols}{2}
-\lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
+\setlength{\tabcolsep}{5pt}
 \vspace*{-16pt}
-\begin{cfa}
-@monitor@ M {} m1/*, m2, m3, m4*/;
-void call( M & @mutex p1/*, p2, p3, p4*/@ ) {}
-void wait( M & @mutex p1/*, p2, p3, p4*/@ ) {
-	for ( N ) { @waitfor( call : p1/*, p2, p3, p4*/ );@ }
+\begin{cfa}[xleftmargin=0pt]
+`monitor` M {} m1/*, m2, m3, m4*/;
+void call( M & `mutex p1/*, p2, p3, p4*/` ) {}
+void wait( M & `mutex p1/*, p2, p3, p4*/` ) {
+	for ( N ) { `waitfor( call : p1/*, p2, p3, p4*/ );` }
 }
 thread T {};
@@ -3135,14 +3138,14 @@
 \columnbreak
 
-\vspace*{-16pt}
+\vspace*{-18pt}
 \captionof{table}{External-scheduling comparison (nanoseconds)}
 \label{t:schedext}
 \begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
-\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
-\CFA @waitfor@, 1 monitor	& 367.1	& 365.3	& 5.0	\\
-\CFA @waitfor@, 2 monitor	& 463.0	& 464.6	& 7.1	\\
-\CFA @waitfor@, 4 monitor	& 689.6	& 696.2	& 21.5	\\
-\uC \lstinline[language=uC++]|_Accept| monitor	& 328.2	& 329.1	& 3.4	\\
-Go \lstinline[language=Golang]|select| channel	& 365.0	& 365.5	& 1.2
+\multicolumn{1}{@{}r}{N\hspace*{10pt}} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
+\CFA @waitfor@, 1 monitor (10M)	& 367.1	& 365.3	& 5.0	\\
+\CFA @waitfor@, 2 monitor (10M)	& 463.0	& 464.6	& 7.1	\\
+\CFA @waitfor@, 4 monitor (10M)	& 689.6	& 696.2	& 21.5	\\
+\uC \lstinline[language=uC++]|_Accept| monitor (10M)	& 328.2	& 329.1	& 3.4	\\
+Go \lstinline[language=Golang]|select| channel (10M)	& 365.0	& 365.5	& 1.2
 \end{tabular}
 \end{multicols}
@@ -3157,8 +3160,8 @@
 
 \begin{multicols}{2}
-\lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
-\begin{cfa}
-@monitor@ M {} m1/*, m2, m3, m4*/;
-call( M & @mutex p1/*, p2, p3, p4*/@ ) {}
+\setlength{\tabcolsep}{3pt}
+\begin{cfa}[xleftmargin=0pt]
+`monitor` M {} m1/*, m2, m3, m4*/;
+call( M & `mutex p1/*, p2, p3, p4*/` ) {}
 int main() {
 	BENCH( for( N ) call( m1/*, m2, m3, m4*/ ); )
@@ -3175,15 +3178,15 @@
 \label{t:mutex}
 \begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
-\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
-test-and-test-set lock			& 19.1	& 18.9	& 0.4	\\
-\CFA @mutex@ function, 1 arg.	& 48.3	& 47.8	& 0.9	\\
-\CFA @mutex@ function, 2 arg.	& 86.7	& 87.6	& 1.9	\\
-\CFA @mutex@ function, 4 arg.	& 173.4	& 169.4	& 5.9	\\
-\uC @monitor@ member rtn.		& 54.8	& 54.8	& 0.1	\\
-Goroutine mutex lock			& 34.0	& 34.0	& 0.0	\\
-Rust mutex lock					& 33.0	& 33.2	& 0.8	\\
-Java synchronized method (   100 000 000)		& 31.0	& 30.9	& 0.5	\\
-Java synchronized method (10 000 000 000)		& 31.0 & 30.2 & 0.9 \\
-Pthreads mutex Lock				& 31.0	& 31.1	& 0.4
+\multicolumn{1}{@{}r}{N\hspace*{10pt}} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
+test-and-test-set lock (50M)		& 19.1	& 18.9	& 0.4	\\
+\CFA @mutex@ function, 1 arg. (50M)	& 48.3	& 47.8	& 0.9	\\
+\CFA @mutex@ function, 2 arg. (50M)	& 86.7	& 87.6	& 1.9	\\
+\CFA @mutex@ function, 4 arg. (50M)	& 173.4	& 169.4	& 5.9	\\
+\uC @monitor@ member rtn. (50M)		& 54.8	& 54.8	& 0.1	\\
+Goroutine mutex lock (50M)			& 34.0	& 34.0	& 0.0	\\
+Rust mutex lock (50M)				& 33.0	& 33.2	& 0.8	\\
+Java synchronized method (50M)		& 31.0	& 30.9	& 0.5	\\
+% Java synchronized method (10 000 000 000)		& 31.0 & 30.2 & 0.9 \\
+Pthreads mutex Lock (50M)			& 31.0	& 31.1	& 0.4
 \end{tabular}
 \end{multicols}
@@ -3214,15 +3217,14 @@
 
 \begin{multicols}{2}
-\lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
-\begin{cfa}[aboveskip=0pt,belowskip=0pt]
-@coroutine@ C {};
-void main( C & ) { for () { @suspend;@ } }
+\begin{cfa}[xleftmargin=0pt]
+`coroutine` C {};
+void main( C & ) { for () { `suspend;` } }
 int main() { // coroutine test
 	C c;
-	BENCH( for ( N ) { @resume( c );@ } )
+	BENCH( for ( N ) { `resume( c );` } )
 	sout | result;
 }
 int main() { // thread test
-	BENCH( for ( N ) { @yield();@ } )
+	BENCH( for ( N ) { `yield();` } )
 	sout | result;
 }
@@ -3237,22 +3239,22 @@
 \label{t:ctx-switch}
 \begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
-\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
-C function			& 1.8		& 1.8		& 0.0	\\
-\CFA generator		& 1.8		& 2.0		& 0.3	\\
-\CFA coroutine		& 32.5		& 32.9		& 0.8	\\
-\CFA thread			& 93.8		& 93.6		& 2.2	\\
-\uC coroutine		& 50.3		& 50.3		& 0.2	\\
-\uC thread			& 97.3		& 97.4		& 1.0	\\
-Python generator	& 40.9		& 41.3		& 1.5	\\
-Node.js await		& 1852.2	& 1854.7	& 16.4	\\
-Node.js generator	& 33.3		& 33.4		& 0.3	\\
-Goroutine thread	& 143.0		& 143.3		& 1.1	\\
-Rust async await	& 32.0		& 32.0		& 0.0	\\
-Rust tokio thread	& 143.0		& 143.0		& 1.7	\\
-Rust thread			& 332.0		& 331.4		& 2.4	\\
-Java thread	(      100 000)		& 405.0		& 415.0		& 17.6	\\
-Java thread (  100 000 000)			& 413.0 & 414.2 & 6.2 \\
-Java thread (5 000 000 000)			& 415.0 & 415.2 & 6.1 \\
-Pthreads thread		& 334.3		& 335.2		& 3.9
+\multicolumn{1}{@{}r}{N\hspace*{10pt}} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
+C function (10B)			& 1.8		& 1.8		& 0.0	\\
+\CFA generator (5B)			& 1.8		& 2.0		& 0.3	\\
+\CFA coroutine (100M)		& 32.5		& 32.9		& 0.8	\\
+\CFA thread (100M)			& 93.8		& 93.6		& 2.2	\\
+\uC coroutine (100M)		& 50.3		& 50.3		& 0.2	\\
+\uC thread (100M)			& 97.3		& 97.4		& 1.0	\\
+Python generator (100M)		& 40.9		& 41.3		& 1.5	\\
+Node.js await (5M)			& 1852.2	& 1854.7	& 16.4	\\
+Node.js generator (100M)	& 33.3		& 33.4		& 0.3	\\
+Goroutine thread (100M)		& 143.0		& 143.3		& 1.1	\\
+Rust async await (100M)		& 32.0		& 32.0		& 0.0	\\
+Rust tokio thread (100M)	& 143.0		& 143.0		& 1.7	\\
+Rust thread (25M)			& 332.0		& 331.4		& 2.4	\\
+Java thread (100M)			& 405.0		& 415.0		& 17.6	\\
+% Java thread (  100 000 000)			& 413.0 & 414.2 & 6.2 \\
+% Java thread (5 000 000 000)			& 415.0 & 415.2 & 6.1 \\
+Pthreads thread (25M)		& 334.3		& 335.2		& 3.9
 \end{tabular}
 \end{multicols}
@@ -3263,8 +3265,11 @@
 Languages using 1:1 threading based on pthreads can at best meet or exceed, due to language overhead, the pthread results.
 Note, pthreads has a fast zero-contention mutex lock checked in user space.
-Languages with M:N threading have better performance than 1:1 because there is no operating-system interactions.
+Languages with M:N threading have better performance than 1:1 because there are no operating-system interactions (context-switching or locking).
+As well, for locking experiments, M:N threading has less contention if only one kernel thread is used.
 Languages with stackful coroutines have higher cost than stackless coroutines because of stack allocation and context switching;
 however, stackful \uC and \CFA coroutines have approximately the same performance as stackless Python and Node.js generators.
 The \CFA stackless generator is approximately 25 times faster for suspend/resume and 200 times faster for creation than stackless Python and Node.js generators.
+The Node.js context-switch is costly when asynchronous await must enter the event engine because a promise is not fulfilled.
+Finally, the benchmark results correlate across programming languages with and without JIT, indicating the JIT has completed any runtime optimizations.
 
 
@@ -3324,5 +3329,5 @@
 
 The authors recognize the design assistance of Aaron Moss, Rob Schluntz, Andrew Beach, and Michael Brooks; David Dice for commenting and helping with the Java benchmarks; and Gregor Richards for helping with the Node.js benchmarks.
-This research is funded by a grant from Waterloo-Huawei (\url{http://www.huawei.com}) Joint Innovation Lab. %, and Peter Buhr is partially funded by the Natural Sciences and Engineering Research Council of Canada.
+This research is funded by the NSERC/Waterloo-Huawei (\url{http://www.huawei.com}) Joint Innovation Lab. %, and Peter Buhr is partially funded by the Natural Sciences and Engineering Research Council of Canada.
 
 {%
