Index: doc/papers/concurrency/Paper.tex
===================================================================
--- doc/papers/concurrency/Paper.tex	(revision c8ad5d9ee9d1ad9027a536667adcb82740622d91)
+++ doc/papers/concurrency/Paper.tex	(revision a87c86f9e6e35299eacf4b1ad3dbeb024027be3e)
@@ -22,5 +22,5 @@
 \captionsetup{justification=raggedright,singlelinecheck=false}
 \usepackage{siunitx}
-\sisetup{ binary-units=true }
+\sisetup{binary-units=true}
 
 \hypersetup{breaklinks=true}
@@ -32,5 +32,5 @@
 \renewcommand{\linenumberfont}{\scriptsize\sffamily}
 
-\renewcommand{\textfraction}{0.0}	% the entire page maybe devoted to floats with no text on the page at all
+\renewcommand{\textfraction}{0.0}			% the entire page maybe devoted to floats with no text on the page at all
 
 \lefthyphenmin=3							% hyphen only after 4 characters
@@ -70,4 +70,6 @@
 %\DeclareTextCommandDefault{\textunderscore}{\leavevmode\makebox[1.2ex][c]{\rule{1ex}{0.1ex}}}
 \renewcommand{\textunderscore}{\leavevmode\makebox[1.2ex][c]{\rule{1ex}{0.075ex}}}
+%\def\myCHarFont{\fontencoding{T1}\selectfont}%
+% \def\{{\ttfamily\upshape\myCHarFont \char`\}}}%
 
 \makeatletter
@@ -244,9 +246,6 @@
 \maketitle
 
-% ======================================================================
-% ======================================================================
+
 \section{Introduction}
-% ======================================================================
-% ======================================================================
 
 This paper provides a minimal concurrency \newterm{Abstract Program Interface} (API) that is simple, efficient and can be used to build other concurrency features.
@@ -254,5 +253,5 @@
 An easier approach for programmers is to support higher-level constructs as the basis of concurrency.
 Indeed, for highly productive concurrent programming, high-level approaches are much more popular~\cite{Hochstein05}.
-Examples of high-level approaches are task based~\cite{TBB}, message passing~\cite{Erlang,MPI}, and implicit threading~\cite{OpenMP}.
+Examples of high-level approaches are task (work) based~\cite{TBB}, implicit threading~\cite{OpenMP}, monitors~\cite{Java}, channels~\cite{CSP,Go}, and message passing~\cite{Erlang,MPI}.
 
 This paper uses the following terminology.
@@ -272,202 +271,280 @@
 The paper discusses how the language features are added to the \CFA translator with respect to parsing, semantic, and type checking, and the corresponding high-perforamnce runtime-library to implement the concurrency features.
 
-% ======================================================================
-% ======================================================================
+
 \section{\CFA Overview}
-% ======================================================================
-% ======================================================================
 
 The following is a quick introduction to the \CFA language, specifically tailored to the features needed to support concurrency.
-Most of the following code examples can be found on the \CFA website~\cite{Cforall}.
-
-\CFA is an extension of ISO-C, and therefore, supports all of the same paradigms as C.
+Extended versions of the following code examples are available at the \CFA website~\cite{Cforall} or Moss~\etal~\cite{XXX}.
+
+\CFA is an extension of ISO-C, and hence, supports all C paradigms.
 %It is a non-object-oriented system-language, meaning most of the major abstractions have either no runtime overhead or can be opted out easily.
-Like C, the basics of \CFA revolve around structures and routines, which are thin abstractions over machine code.
-The vast majority of the code produced by the \CFA translator respects memory layouts and calling conventions laid out by C.
-Interestingly, while \CFA is not an object-oriented language, lacking the concept of a receiver (\eg @this@) and inheritance, it does have some notion of objects\footnote{C defines the term objects as : ``region of data storage in the execution environment, the contents of which can represent
-values''~\cite[3.15]{C11}}, most importantly construction and destruction of objects.
+Like C, the basics of \CFA revolve around structures and functions.
+Virtually all of the code generated by the \CFA translator respects C memory layouts and calling conventions.
+While \CFA is not an object-oriented language, lacking the concept of a receiver (\eg @this@) and nominal inheritance-relationships, C does have a notion of objects: ``region of data storage in the execution environment, the contents of which can represent values''~\cite[3.15]{C11}.
 
 
 \subsection{References}
 
-Like \CC, \CFA introduces rebind-able references providing multiple dereferencing as an alternative to pointers.
-In regards to concurrency, the semantic difference between pointers and references are not particularly relevant, but since this document uses mostly references, here is a quick overview of the semantics:
-\begin{cfa}
-int x, y, z;
-int * p1 = &x, ** p2 = &p1, *** p3 = &p2,	$\C{// pointers to x}$
-	& r1 = x,   && r2 = r1, &&& r3 = r2;	$\C{// references to x}$
-
-*p1 = 3; **p2 = 3; ***p3 = 3;				$\C{// change x}$
-  r1 = 3;    r2 = 3;      r3 = 3;			$\C{// change x}$
-**p3 = &y; *p3 = &z;						$\C{// change p1, p2}$
-&&r3 = &y; &r3 = &z;						$\C{// change p1, p2}$
-int & ar[3] = {x, y, z};					$\C{// initialize array of references}$
-
-typeof( ar[1]) p;							$\C{// is int, referenced object type}$
-typeof(&ar[1]) q;							$\C{// is int \&, reference type}$
-sizeof( ar[1]) == sizeof(int);				$\C{// is true, referenced object size}$
-sizeof(&ar[1]) == sizeof(int *);			$\C{// is true, reference size}$
-\end{cfa}
-The important take away from this code example is that a reference offers a handle to an object, much like a pointer, but which is automatically dereferenced for convenience.
-
-% ======================================================================
+\CFA provides multi-level rebindable references, as an alternative to pointers, which significantly reduces syntactic noise.
+\begin{cfa}
+int x = 1, y = 2, z = 3;
+int * p1 = &x, ** p2 = &p1,  *** p3 = &p2,	$\C{// pointers to x}$
+	`&` r1 = x,  `&&` r2 = r1,  `&&&` r3 = r2;	$\C{// references to x}$
+int * p4 = &z, `&` r4 = z;
+
+*p1 = 3; **p2 = 3; ***p3 = 3;       // change x
+r1 =  3;     r2 = 3;      r3 = 3;        // change x: implicit dereferences *r1, **r2, ***r3
+**p3 = &y; *p3 = &p4;                // change p1, p2
+`&`r3 = &y; `&&`r3 = &`&`r4;             // change r1, r2: cancel implicit deferences (&*)**r3, (&(&*)*)*r3, &(&*)r4
+\end{cfa}
+A reference is a handle to an object, like a pointer, but is automatically dereferenced the specified number of levels.
+Referencing (address-of @&@) a reference variable cancels one of the implicit dereferences, until there are no more implicit references, after which normal expression behaviour applies.
+
+
+\subsection{\texorpdfstring{\protect\lstinline{with} Statement}{with Statement}}
+\label{s:WithStatement}
+
+Heterogeneous data is often aggregated into a structure/union.
+To reduce syntactic noise, \CFA provides a @with@ statement (see Pascal~\cite[\S~4.F]{Pascal}) to elide aggregate field-qualification by opening a scope containing the field identifiers.
+\begin{cquote}
+\vspace*{-\baselineskip}%???
+\lstDeleteShortInline@%
+\begin{cfa}
+struct S { char c; int i; double d; };
+struct T { double m, n; };
+// multiple aggregate parameters
+\end{cfa}
+\begin{tabular}{@{}l@{\hspace{\parindentlnth}}|@{\hspace{\parindentlnth}}l@{}}
+\begin{cfa}
+void f( S & s, T & t ) {
+	`s.`c; `s.`i; `s.`d;
+	`t.`m; `t.`n;
+}
+\end{cfa}
+&
+\begin{cfa}
+void f( S & s, T & t ) `with ( s, t )` {
+	c; i; d;		// no qualification
+	m; n;
+}
+\end{cfa}
+\end{tabular}
+\lstMakeShortInline@%
+\end{cquote}
+Object-oriented programming languages only provide implicit qualification for the receiver.
+
+In detail, the @with@ statement has the form:
+\begin{cfa}
+$\emph{with-statement}$:
+	'with' '(' $\emph{expression-list}$ ')' $\emph{compound-statement}$
+\end{cfa}
+and may appear as the body of a function or nested within a function body.
+Each expression in the expression-list provides a type and object.
+The type must be an aggregate type.
+(Enumerations are already opened.)
+The object is the implicit qualifier for the open structure-fields.
+All expressions in the expression list are open in parallel within the compound statement, which is different from Pascal, which nests the openings from left to right.
+
+
 \subsection{Overloading}
 
-Another important feature of \CFA is function overloading as in Java and \CC, where routines with the same name are selected based on the number and type of the arguments.
-As well, \CFA uses the return type as part of the selection criteria, as in Ada~\cite{Ada}.
-For routines with multiple parameters and returns, the selection is complex.
+\CFA maximizes the ability to reuse names via overloading to aggressively address the naming problem.
+Both variables and functions may be overloaded, where selection is based on types, and number of returns (as in Ada~\cite{Ada}) and arguments.
+\begin{cquote}
+\vspace*{-\baselineskip}%???
+\lstDeleteShortInline@%
+\begin{cfa}
+// selection based on type
+\end{cfa}
+\begin{tabular}{@{}l@{\hspace{\parindentlnth}}|@{\hspace{\parindentlnth}}l@{}}
+\begin{cfa}
+const short int MIN = -32768;
+const int MIN = -2147483648;
+const long int MIN = -9223372036854775808L;
+\end{cfa}
+&
+\begin{cfa}
+short int si = MIN;
+int i = MIN;
+long int li = MIN;
+\end{cfa}
+\end{tabular}
 \begin{cfa}
 // selection based on type and number of parameters
-void f(void);			$\C{// (1)}$
-void f(char);			$\C{// (2)}$
-void f(int, double);	$\C{// (3)}$
-f();					$\C{// select (1)}$
-f('a');					$\C{// select (2)}$
-f(3, 5.2);				$\C{// select (3)}$
-
-// selection based on  type and number of returns
-char   f(int);			$\C{// (1)}$
-double f(int);			$\C{// (2)}$
-char   c = f(3);		$\C{// select (1)}$
-double d = f(4);		$\C{// select (2)}$
-\end{cfa}
-This feature is particularly important for concurrency since the runtime system relies on creating different types to represent concurrency objects.
+\end{cfa}
+\begin{tabular}{@{}l@{\hspace{1.7\parindentlnth}}|@{\hspace{\parindentlnth}}l@{}}
+\begin{cfa}
+void f( void );
+void f( char );
+void f( int, double );
+\end{cfa}
+&
+\begin{cfa}
+f();
+f( 'a' );
+f( 3, 5.2 );
+\end{cfa}
+\end{tabular}
+\begin{cfa}
+// selection based on type and number of returns
+\end{cfa}
+\begin{tabular}{@{}l@{\hspace{\parindentlnth}}|@{\hspace{\parindentlnth}}l@{}}
+\begin{cfa}
+char f( int );
+double f( int );
+[char, double] f( int );
+\end{cfa}
+&
+\begin{cfa}
+char c = f( 3 );
+double d = f( 3 );
+[d, c] = f( 3 );
+\end{cfa}
+\end{tabular}
+\lstMakeShortInline@%
+\end{cquote}
+Overloading is important for \CFA concurrency since the runtime system relies on creating different types to represent concurrency objects.
 Therefore, overloading is necessary to prevent the need for long prefixes and other naming conventions that prevent name clashes.
-As seen in section \ref{basics}, routine @main@ is an example that benefits from overloading.
-
-% ======================================================================
+As seen in Section~\ref{basics}, function @main@ is heavily overloaded.
+
+Variable overloading is useful in the parallel semantics of the @with@ statement for fields with the same name:
+\begin{cfa}
+struct S { int `i`; int j; double m; } s;
+struct T { int `i`; int k; int m; } t;
+with ( s, t ) {
+	j + k;									$\C{// unambiguous, s.j + t.k}$
+	m = 5.0;								$\C{// unambiguous, t.m = 5.0}$
+	m = 1;									$\C{// unambiguous, s.m = 1}$
+	int a = m;								$\C{// unambiguous, a = s.i }$
+	double b = m;							$\C{// unambiguous, b = t.m}$
+	int c = `s.i` + `t.i`;					$\C{// unambiguous, qualification}$
+	(double)m;								$\C{// unambiguous, cast}$
+}
+\end{cfa}
+For parallel semantics, both @s.i@ and @t.i@ are visible with the same type, so only @i@ is ambiguous without qualification.
+
+
 \subsection{Operators}
+
 Overloading also extends to operators.
-The syntax for denoting operator-overloading is to name a routine with the symbol of the operator and question marks where the arguments of the operation appear, \eg:
-\begin{cfa}
-int ++? (int op);              		$\C{// unary prefix increment}$
-int ?++ (int op);              		$\C{// unary postfix increment}$
-int ?+? (int op1, int op2);    		$\C{// binary plus}$
-int ?<=?(int op1, int op2);   		$\C{// binary less than}$
-int ?=? (int & op1, int op2);  		$\C{// binary assignment}$
-int ?+=?(int & op1, int op2); 		$\C{// binary plus-assignment}$
-
-struct S {int i, j;};
-S ?+?(S op1, S op2) {				$\C{// add two structures}$
+Operator-overloading syntax names a routine with the operator symbol and question marks for the operands:
+\begin{cquote}
+\lstDeleteShortInline@%
+\begin{tabular}{@{}ll@{\hspace{\parindentlnth}}|@{\hspace{\parindentlnth}}l@{}}
+\begin{cfa}
+int ++? (int op);
+int ?++ (int op);
+int `?+?` (int op1, int op2);
+int ?<=?(int op1, int op2);
+int ?=? (int & op1, int op2);
+int ?+=?(int & op1, int op2);
+\end{cfa}
+&
+\begin{cfa}
+// unary prefix increment
+// unary postfix increment
+// binary plus
+// binary less than
+// binary assignment
+// binary plus-assignment
+\end{cfa}
+&
+\begin{cfa}
+struct S { int i, j; };
+S `?+?`( S op1, S op2) { // add two structures
 	return (S){op1.i + op2.i, op1.j + op2.j};
 }
 S s1 = {1, 2}, s2 = {2, 3}, s3;
-s3 = s1 + s2;						$\C{// compute sum: s3 == {2, 5}}$
-\end{cfa}
-While concurrency does not use operator overloading directly, this feature is more important as an introduction for the syntax of constructors.
-
-% ======================================================================
-\subsection{Constructors/Destructors}
-Object lifetime is often a challenge in concurrency. \CFA uses the approach of giving concurrent meaning to object lifetime as a means of synchronization and/or mutual exclusion.
-Since \CFA relies heavily on the lifetime of objects, constructors and destructors is a core feature required for concurrency and parallelism. \CFA uses the following syntax for constructors and destructors:
-\begin{cfa}
-struct S {
-	size_t size;
-	int * ia;
-};
-void ?{}(S & s, int asize) {	$\C{// constructor operator}$
-	s.size = asize;				$\C{// initialize fields}$
-	s.ia = calloc(size, sizeof(S));
-}
-void ^?{}(S & s) {				$\C{// destructor operator}$
-	free(ia);					$\C{// de-initialization fields}$
-}
-int main() {
-	S x = {10}, y = {100};		$\C{// implicit calls: ?\{\}(x, 10), ?\{\}(y, 100)}$
-	...							$\C{// use x and y}$
-	^x{};  ^y{};				$\C{// explicit calls to de-initialize}$
-	x{20};  y{200};				$\C{// explicit calls to reinitialize}$
-	...							$\C{// reuse x and y}$
-}								$\C{// implicit calls: \^?\{\}(y), \^?\{\}(x)}$
-\end{cfa}
-The language guarantees that every object and all their fields are constructed.
-Like \CC, construction of an object is automatically done on allocation and destruction of the object is done on deallocation.
-Allocation and deallocation can occur on the stack or on the heap.
-\begin{cfa}
-{
-	struct S s = {10};	$\C{// allocation, call constructor}$
-	...
-}						$\C{// deallocation, call destructor}$
-struct S * s = new();	$\C{// allocation, call constructor}$
-...
-delete(s);				$\C{// deallocation, call destructor}$
-\end{cfa}
-Note that like \CC, \CFA introduces @new@ and @delete@, which behave like @malloc@ and @free@ in addition to constructing and destructing objects, after calling @malloc@ and before calling @free@, respectively.
-
-% ======================================================================
+s3 = s1 `+` s2;		// compute sum: s3 == {2, 5}
+\end{cfa}
+\end{tabular}
+\lstMakeShortInline@%
+\end{cquote}
+While concurrency does not use operator overloading directly, it provides an introduction for the syntax of constructors.
+
+
 \subsection{Parametric Polymorphism}
 \label{s:ParametricPolymorphism}
-Routines in \CFA can also be reused for multiple types.
-This capability is done using the @forall@ clauses, which allow separately compiled routines to support generic usage over multiple types.
+
+The signature feature of \CFA is parametric-polymorphic functions~\cite{} with functions generalized using a @forall@ clause (giving the language its name), which allow separately compiled routines to support generic usage over multiple types.
 For example, the following sum function works for any type that supports construction from 0 and addition:
 \begin{cfa}
-// constraint type, 0 and +
-forall(otype T | { void ?{}(T *, zero_t); T ?+?(T, T); })
-T sum(T a[ ], size_t size) {
-	T total = 0;				$\C{// construct T from 0}$
-	for(size_t i = 0; i < size; i++)
-		total = total + a[i];	$\C{// select appropriate +}$
+forall( otype T | { void `?{}`( T *, zero_t ); T `?+?`( T, T ); } ) // constraint type, 0 and +
+T sum( T a[$\,$], size_t size ) {
+	`T` total = { `0` };					$\C{// initialize by 0 constructor}$
+	for ( size_t i = 0; i < size; i += 1 )
+		total = total `+` a[i];				$\C{// select appropriate +}$
 	return total;
 }
-
 S sa[5];
-int i = sum(sa, 5);				$\C{// use S's 0 construction and +}$
-\end{cfa}
-
-Since writing constraints on types can become cumbersome for more constrained functions, \CFA also has the concept of traits.
-Traits are named collection of constraints that can be used both instead and in addition to regular constraints:
-\begin{cfa}
-trait summable( otype T ) {
-	void ?{}(T *, zero_t);		$\C{// constructor from 0 literal}$
-	T ?+?(T, T);				$\C{// assortment of additions}$
-	T ?+=?(T *, T);
-	T ++?(T *);
-	T ?++(T *);
+int i = sum( sa, 5 );						$\C{// use S's 0 construction and +}$
+\end{cfa}
+
+\CFA provides \newterm{traits} to name a group of type assertions, where the trait name allows specifying the same set of assertions in multiple locations, preventing repetition mistakes at each function declaration:
+\begin{cfa}
+trait `sumable`( otype T ) {
+	void `?{}`( T &, zero_t );				$\C{// 0 literal constructor}$
+	T `?+?`( T, T );						$\C{// assortment of additions}$
+	T ?+=?( T &, T );
+	T ++?( T & );
+	T ?++( T & );
 };
-forall( otype T | summable(T) )	$\C{// use trait}$
-T sum(T a[], size_t size);
-\end{cfa}
-
-Note that the type use for assertions can be either an @otype@ or a @dtype@.
-Types declared as @otype@ refer to ``complete'' objects, \ie objects with a size, a default constructor, a copy constructor, a destructor and an assignment operator.
-Using @dtype@, on the other hand, has none of these assumptions but is extremely restrictive, it only guarantees the object is addressable.
-
-% ======================================================================
-\subsection{with Clause/Statement}
-Since \CFA lacks the concept of a receiver, certain functions end up needing to repeat variable names often.
-To remove this inconvenience, \CFA provides the @with@ statement, which opens an aggregate scope making its fields directly accessible (like Pascal).
-\begin{cfa}
-struct S { int i, j; };
-int mem(S & this) with (this)		$\C{// with clause}$
-	i = 1;							$\C{// this->i}$
-	j = 2;							$\C{// this->j}$
-}
-int foo() {
-	struct S1 { ... } s1;
-	struct S2 { ... } s2;
-	with (s1) 						$\C{// with statement}$
-	{
-		// access fields of s1 without qualification
-		with (s2)					$\C{// nesting}$
-		{
-			// access fields of s1 and s2 without qualification
-		}
-	}
-	with (s1, s2) 					$\C{// scopes open in parallel}$
-	{
-		// access fields of s1 and s2 without qualification
-	}
-}
-\end{cfa}
-
-For more information on \CFA see \cite{cforall-ug,Schluntz17,www-cfa}.
-
-% ======================================================================
-% ======================================================================
+forall( otype T `| sumable( T )` )			$\C{// use trait}$
+T sum( T a[$\,$], size_t size );
+\end{cfa}
+
+Assertions can be @otype@ or @dtype@.
+@otype@ refers to a ``complete'' object, \ie an object has a size, default constructor, copy constructor, destructor and an assignment operator.
+@dtype@ only guarantees an object has a size and alignment.
+
+Using the return type for discrimination, it is possible to write a type-safe @alloc@ based on the C @malloc@:
+\begin{cfa}
+forall( dtype T | sized(T) ) T * alloc( void ) { return (T *)malloc( sizeof(T) ); }
+int * ip = alloc();							$\C{// select type and size from left-hand side}$
+double * dp = alloc();
+struct S {...} * sp = alloc();
+\end{cfa}
+where the return type supplies the type/size of the allocation, which is impossible in most type systems.
+
+
+\subsection{Constructors / Destructors}
+
+Object lifetime is a challenge in non-managed programming languages.
+\CFA responds with \CC-like constructors and destructors:
+\begin{cfa}
+struct VLA { int len, * data; };			$\C{// variable length array of integers}$
+void ?{}( VLA & vla ) with ( vla ) { len = 10;  data = alloc( len ); }  // default constructor
+void ?{}( VLA & vla, int size, char fill ) with ( vla ) { len = size;  data = alloc( len, fill ); } // initialization
+void ?{}( VLA & vla, VLA other ) { vla.len = other.len;  vla.data = other.data; } // copy, shallow
+void ^?{}( VLA & vla ) with ( vla ) { free( data ); } $\C{// destructor}$
+{
+	VLA  x,            y = { 20, 0x01 },     z = y;	$\C{// z points to y}$
+	//    ?{}( x );   ?{}( y, 20, 0x01 );   ?{}( z, y ); 
+	^x{};									$\C{// deallocate x}$
+	x{};									$\C{// reallocate x}$
+	z{ 5, 0xff };							$\C{// reallocate z, not pointing to y}$
+	^y{};									$\C{// deallocate y}$
+	y{ x };									$\C{// reallocate y, points to x}$
+	x{};									$\C{// reallocate x, not pointing to y}$
+	// ^?{}(z);  ^?{}(y);  ^?{}(x);
+}
+\end{cfa}
+Like \CC, construction is implicit on allocation (stack/heap) and destruction is implicit on deallocation.
+The object and all their fields are constructed/destructed.
+\CFA also provides @new@ and @delete@, which behave like @malloc@ and @free@, in addition to constructing and destructing objects:
+\begin{cfa}
+{	struct S s = {10};						$\C{// allocation, call constructor}$
+	...
+}											$\C{// deallocation, call destructor}$
+struct S * s = new();						$\C{// allocation, call constructor}$
+...
+delete( s );								$\C{// deallocation, call destructor}$
+\end{cfa}
+\CFA concurrency uses object lifetime as a means of synchronization and/or mutual exclusion.
+
+
 \section{Concurrency Basics}\label{basics}
-% ======================================================================
-% ======================================================================
-
-At its core, concurrency is based on having multiple call-stacks and scheduling among threads of execution executing on these stacks.
+
+At its core, concurrency is based on multiple call-stacks and scheduling among threads executing on these stacks.
 Multiple call stacks (or contexts) and a single thread of execution does \emph{not} imply concurrency.
 Execution with a single thread and multiple stacks where the thread is deterministically self-scheduling across the stacks is called \newterm{coroutining};
@@ -585,4 +662,5 @@
 \begin{lstlisting}[aboveskip=0pt,belowskip=0pt]
 `coroutine` Fib { int fn; };
+void ?{}( Fibonacci & fib ) with( fib ) { fn = 0; }
 void main( Fib & f ) with( f ) {
 	int f1, f2;