Index: doc/papers/concurrency/Paper.tex
===================================================================
--- doc/papers/concurrency/Paper.tex	(revision 59c034c60dc140fa869761785dbeccbd2624f249)
+++ doc/papers/concurrency/Paper.tex	(revision 6f9bc099899d715ff786f4df583b95df6850f3b1)
@@ -741,12 +741,12 @@
 The coroutine main's stack holds the state for the next generation, @f1@ and @f2@, and the code has the three suspend points, representing the three states in the Fibonacci formula, to context switch back to the caller's resume.
 The interface function, @next@, takes a Fibonacci instance and context switches to it using @resume@;
-on return, the Fibonacci field, @fn@, contains the next value in the sequence, which is returned.
+on restart, the Fibonacci field, @fn@, contains the next value in the sequence, which is returned.
 The first @resume@ is special because it cocalls the coroutine at its coroutine main and allocates the stack;
 when the coroutine main returns, its stack is deallocated.
 Hence, @Fib@ is an object at creation, transitions to a coroutine on its first resume, and transitions back to an object when the coroutine main finishes.
 Figure~\ref{f:Coroutine1State} shows the coroutine version of the C version in Figure~\ref{f:ExternalState}.
-Coroutine generators are called \newterm{output coroutines} because values are returned by the coroutine.
-
-Figure~\ref{f:CFAFmt} shows an \newterm{input coroutine}, @Format@, for restructuring text into groups of character blocks of fixed size.
+Coroutine generators are called \newterm{output coroutines} because values are only returned.
+
+Figure~\ref{f:CFAFmt} shows an \newterm{input coroutine}, @Format@, for restructuring text into groups of fixed-size character blocks.
 For example, the input of the left is reformatted into the output on the right.
 \begin{quote}
@@ -763,5 +763,5 @@
 \end{tabular}
 \end{quote}
-The example takes advantage of resuming coroutines in the constructor to prime the coroutine loops so the first character sent for formatting appears inside the nested loops.
+The example takes advantage of resuming a coroutine in the constructor to prime the loops so the first character sent for formatting appears inside the nested loops.
 The destruction provides a newline if formatted text ends with a full line.
 Figure~\ref{f:CFmt} shows the C equivalent formatter, where the loops of the coroutine are flattened (linearized) and rechecked on each call because execution location is not retained between calls.
@@ -778,5 +778,5 @@
 void main( Format & fmt ) with( fmt ) {
 	for ( ;; ) {	
-		for ( g = 0; g < 5; g += 1 ) {  // group
+		for ( g = 0; g < 5; g += 1 ) {      // group
 			for ( b = 0; b < 4; b += 1 ) { // block
 				`suspend();`
@@ -814,5 +814,5 @@
 };
 void format( struct Format * fmt ) {
-	if ( fmt->ch != -1 ) { // not EOF
+	if ( fmt->ch != -1 ) {      // not EOF ?
 		printf( "%c", fmt->ch );
 		fmt->b += 1;
@@ -823,5 +823,5 @@
 		}
 		if ( fmt->g == 5 ) {  // group
-			printf( "\n" );      // separator
+			printf( "\n" );     // separator
 			fmt->g = 0;
 		}
@@ -850,6 +850,6 @@
 
 The previous examples are \newterm{asymmetric (semi) coroutine}s because one coroutine always calls a resuming function for another coroutine, and the resumed coroutine always suspends back to its last resumer, similar to call/return for normal functions.
-However, there is no stack growth because @resume@/@suspend@ context switch to an existing stack frames rather than create a new one.
-\newterm{Symmetric (full) coroutine}s have a coroutine call a resuming function for another coroutine, which eventually forms a cycle.
+However, there is no stack growth because @resume@/@suspend@ context switch to existing stack-frames rather than create new ones.
+\newterm{Symmetric (full) coroutine}s have a coroutine call a resuming function for another coroutine, which eventually forms a resuming-call cycle.
 (The trivial cycle is a coroutine resuming itself.)
 This control flow is similar to recursion for normal routines, but again there is no stack growth from the context switch.
@@ -935,17 +935,17 @@
 The @start@ function communicates both the number of elements to be produced and the consumer into the producer's coroutine structure.
 Then the @resume@ to @prod@ creates @prod@'s stack with a frame for @prod@'s coroutine main at the top, and context switches to it.
-@prod@'s coroutine main starts, creates local variables that are retained between coroutine activations, and executes $N$ iterations, each generating two random vales, calling the consumer to deliver the values, and printing the status returned from the consumer.
+@prod@'s coroutine main starts, creates local variables that are retained between coroutine activations, and executes $N$ iterations, each generating two random values, calling the consumer to deliver the values, and printing the status returned from the consumer.
 
 The producer call to @delivery@ transfers values into the consumer's communication variables, resumes the consumer, and returns the consumer status.
 For the first resume, @cons@'s stack is initialized, creating local variables retained between subsequent activations of the coroutine.
-The consumer iterates until the @done@ flag is set, prints, increments status, and calls back to the producer's @payment@ member, and on return prints the receipt from the producer and increments the money for the next payment.
-The call from the consumer to the producer's @payment@ member introduces the cycle between producer and consumer.
+The consumer iterates until the @done@ flag is set, prints, increments status, and calls back to the producer via @payment@, and on return from @payment@, prints the receipt from the producer and increments @money@ (inflation).
+The call from the consumer to @payment@ introduces the cycle between producer and consumer.
 When @payment@ is called, the consumer copies values into the producer's communication variable and a resume is executed.
-The context switch restarts the producer at the point where it was last context switched and it continues in member @delivery@ after the resume.
-
-The @delivery@ member returns the status value in @prod@'s @main@ member, where the status is printed.
+The context switch restarts the producer at the point where it was last context switched, so it continues in @delivery@ after the resume.
+
+@delivery@ returns the status value in @prod@'s coroutine main, where the status is printed.
 The loop then repeats calling @delivery@, where each call resumes the consumer coroutine.
 The context switch to the consumer continues in @payment@.
-The consumer increments and returns the receipt to the call in @cons@'s @main@ member.
+The consumer increments and returns the receipt to the call in @cons@'s coroutine main.
 The loop then repeats calling @payment@, where each call resumes the producer coroutine.
 
@@ -954,105 +954,34 @@
 The context switch restarts @cons@ in @payment@ and it returns with the last receipt.
 The consumer terminates its loops because @done@ is true, its @main@ terminates, so @cons@ transitions from a coroutine back to an object, and @prod@ reactivates after the resume in @stop@.
-The @stop@ member returns and @prod@'s @main@ member terminates.
+@stop@ returns and @prod@'s coroutine main terminates.
 The program main restarts after the resume in @start@.
-The @start@ member returns and the program main terminates.
-
-
-\subsubsection{Construction}
-
-One important design challenge for implementing coroutines and threads (shown in section \ref{threads}) is that the runtime system needs to run code after the user-constructor runs to connect the fully constructed object into the system.
-In the case of coroutines, this challenge is simpler since there is no non-determinism from preemption or scheduling.
-However, the underlying challenge remains the same for coroutines and threads.
-
-The runtime system needs to create the coroutine's stack and, more importantly, prepare it for the first resumption.
-The timing of the creation is non-trivial since users expect both to have fully constructed objects once execution enters the coroutine main and to be able to resume the coroutine from the constructor.
-There are several solutions to this problem but the chosen option effectively forces the design of the coroutine.
-
-Furthermore, \CFA faces an extra challenge as polymorphic routines create invisible thunks when cast to non-polymorphic routines and these thunks have function scope.
-For example, the following code, while looking benign, can run into undefined behaviour because of thunks:
-
-\begin{cfa}
-// async: Runs function asynchronously on another thread
-forall(otype T)
-extern void async(void (*func)(T*), T* obj);
-
-forall(otype T)
-void noop(T*) {}
-
-void bar() {
-	int a;
-	async(noop, &a); // start thread running noop with argument a
-}
-\end{cfa}
-
-The generated C code\footnote{Code trimmed down for brevity} creates a local thunk to hold type information:
-
-\begin{cfa}
-extern void async(/* omitted */, void (*func)(void*), void* obj);
-
-void noop(/* omitted */, void* obj){}
-
-void bar(){
-	int a;
-	void _thunk0(int* _p0){
-		/* omitted */
-		noop(/* omitted */, _p0);
-	}
-	/* omitted */
-	async(/* omitted */, ((void (*)(void*))(&_thunk0)), (&a));
-}
-\end{cfa}
-The problem in this example is a storage management issue, the function pointer @_thunk0@ is only valid until the end of the block, which limits the viable solutions because storing the function pointer for too long causes undefined behaviour; \ie the stack-based thunk being destroyed before it can be used.
-This challenge is an extension of challenges that come with second-class routines.
-Indeed, GCC nested routines also have the limitation that nested routine cannot be passed outside of the declaration scope.
-The case of coroutines and threads is simply an extension of this problem to multiple call stacks.
-
-
-\subsubsection{Alternative: Composition}
-
-One solution to this challenge is to use composition/containment, where coroutine fields are added to manage the coroutine.
-
-\begin{cfa}
-struct Fibonacci {
-	int fn; // used for communication
-	coroutine c; // composition
-};
-
-void FibMain(void*) {
-	//...
-}
-
-void ?{}(Fibonacci& this) {
-	this.fn = 0;
-	// Call constructor to initialize coroutine
-	(this.c){myMain};
-}
-\end{cfa}
-The downside of this approach is that users need to correctly construct the coroutine handle before using it.
-Like any other objects, the user must carefully choose construction order to prevent usage of objects not yet constructed.
-However, in the case of coroutines, users must also pass to the coroutine information about the coroutine main, like in the previous example.
-This opens the door for user errors and requires extra runtime storage to pass at runtime information that can be known statically.
-
-
-\subsubsection{Alternative: Reserved keyword}
-
-The next alternative is to use language support to annotate coroutines as follows:
-\begin{cfa}
-coroutine Fibonacci {
-	int fn; // used for communication
-};
-\end{cfa}
-The @coroutine@ keyword means the compiler can find and inject code where needed.
-The downside of this approach is that it makes coroutine a special case in the language.
-Users wanting to extend coroutines or build their own for various reasons can only do so in ways offered by the language.
-Furthermore, implementing coroutines without language supports also displays the power of the programming language used.
-While this is ultimately the option used for idiomatic \CFA code, coroutines and threads can still be constructed by users without using the language support.
-The reserved keywords are only present to improve ease of use for the common cases.
-
-
-\subsubsection{Alternative: Lambda Objects}
+@start@ returns and the program main terminates.
+
+
+\subsection{Coroutine Implementation}
+
+A significant implementation challenge for coroutines (and threads, see Section~\ref{threads}) is adding extra fields, and executing code after/before the coroutine constructor/destructor and coroutine main, to create/initialize/de-initialize/destroy the extra fields and the stack.
+There are several solutions to this problem and the chosen option forced the \CFA coroutine design.
+
+Object-oriented inheritance provides extra fields and code in a restricted context, but it requires programmers to explicitly perform the inheritance:
+\begin{cfa}
+struct mycoroutine $\textbf{\textsf{inherits}}$ baseCoroutine { ... }
+\end{cfa}
+and the programming language (and possibly its tool set, \eg debugger) may need to understand @baseCoroutine@ because of the stack.
+Furthermore, the execution of constructors/destructors is in the wrong order for certain operations, \eg for threads;
+if the thread is implicitly started, it must start \emph{after} all constructors, because the thread relies on a completely initialized object, but the inherited constructor runs \emph{before} the derived.
+
+An alternative is composition:
+\begin{cfa}
+struct mycoroutine {
+	... // declarations
+	baseCoroutine dummy; // composition, last declaration
+}
+\end{cfa}
+which also requires an explicit declaration that must be the last one to ensure correct initialization order.
+However, there is nothing preventing wrong placement or multiple declarations.
 
 For coroutines as for threads, many implementations are based on routine pointers or function objects~\cite{Butenhof97, C++14, MS:VisualC++, BoostCoroutines15}.
-For example, Boost implements coroutines in terms of four functor object types:
+For example, Boost implements coroutines in terms of four functor object-types:
 \begin{cfa}
 asymmetric_coroutine<>::pull_type
@@ -1061,94 +990,115 @@
 symmetric_coroutine<>::yield_type
 \end{cfa}
-Often, the canonical threading paradigm in languages is based on function pointers, @pthread@ being one of the most well-known examples.
-The main problem of this approach is that the thread usage is limited to a generic handle that must otherwise be wrapped in a custom type.
-Since the custom type is simple to write in \CFA and solves several issues, added support for routine/lambda based coroutines adds very little.
-
-A variation of this would be to use a simple function pointer in the same way @pthread@ does for threads:
-\begin{cfa}
-void foo( coroutine_t cid, void* arg ) {
-	int* value = (int*)arg;
+Similarly, the canonical threading paradigm is often based on function pointers, \eg @pthread@~\cite{pthreads}, \Csharp~\cite{Csharp}, Go~\cite{Go}, and Scala~\cite{Scala}.
+However, the generic thread-handle (identifier) is limited (few operations), unless it is wrapped in a custom type.
+\begin{cfa}
+void mycor( coroutine_t cid, void * arg ) {
+	int * value = (int *)arg;				$\C{// type unsafe, pointer-size only}$
 	// Coroutine body
 }
-
 int main() {
-	int value = 0;
-	coroutine_t cid = coroutine_create( &foo, (void*)&value );
-	coroutine_resume( &cid );
-}
-\end{cfa}
-This semantics is more common for thread interfaces but coroutines work equally well.
-As discussed in section \ref{threads}, this approach is superseded by static approaches in terms of expressivity.
-
-
-\subsubsection{Alternative: Trait-Based Coroutines}
-
-Finally, the underlying approach, which is the one closest to \CFA idioms, is to use trait-based lazy coroutines.
-This approach defines a coroutine as anything that satisfies the trait @is_coroutine@ (as defined below) and is used as a coroutine.
-
-\begin{cfa}
-trait is_coroutine(dtype T) {
-      void main(T& this);
-      coroutine_desc* get_coroutine(T& this);
+	int input = 0, output;
+	coroutine_t cid = coroutine_create( &mycor, (void *)&input ); $\C{// type unsafe, pointer-size only}$
+	coroutine_resume( cid, (void *)input, (void **)&output ); $\C{// type unsafe, pointer-size only}$
+}
+\end{cfa}
+Since the custom type is simple to write in \CFA and solves several issues, added support for routine/lambda-based coroutines adds very little.
+
+The selected approach is to use language support by introducing a new kind of aggregate (structure):
+\begin{cfa}
+coroutine Fibonacci {
+	int fn; // communication variables
 };
-
-forall( dtype T | is_coroutine(T) ) void suspend(T&);
-forall( dtype T | is_coroutine(T) ) void resume (T&);
-\end{cfa}
-This ensures that an object is not a coroutine until @resume@ is called on the object.
-Correspondingly, any object that is passed to @resume@ is a coroutine since it must satisfy the @is_coroutine@ trait to compile.
+\end{cfa}
+The @coroutine@ keyword means the compiler (and tool set) can find and inject code where needed.
+The downside of this approach is that it makes coroutine a special case in the language.
+Users wanting to extend coroutines or build their own for various reasons can only do so in ways offered by the language.
+Furthermore, implementing coroutines without language support also demonstrates the power of a programming language.
+While this is ultimately the option used for idiomatic \CFA code, coroutines and threads can still be constructed without using the language support.
+The reserved keyword eases use for the common cases.
+
+Part of the mechanism to generalize coroutines is using a \CFA trait, which defines a coroutine as anything satisfying the trait @is_coroutine@, and this trait is used to restrict coroutine-manipulation functions:
+\begin{cfa}
+trait is_coroutine( dtype T ) {
+      void main( T & this );
+      coroutine_desc * get_coroutine( T & this );
+};
+forall( dtype T | is_coroutine(T) ) coroutine_desc * get_coroutine( T & );
+forall( dtype T | is_coroutine(T) ) void suspend( T & );
+forall( dtype T | is_coroutine(T) ) void resume( T & );
+\end{cfa}
+This definition ensures there is a statically-typed @main@ function that is the starting point (first stack frame) of a coroutine.
+No return value or additional parameters are necessary for this function, because the coroutine type allows an arbitrary number of interface functions with corresponding arbitrary typed input/output values.
+As well, any object passed to @suspend@ and @resume@ is a coroutine since it must satisfy the @is_coroutine@ trait to compile.
 The advantage of this approach is that users can easily create different types of coroutines, for example, changing the memory layout of a coroutine is trivial when implementing the @get_coroutine@ routine.
-The \CFA keyword @coroutine@ simply has the effect of implementing the getter and forward declarations required for users to implement the main routine.
-
-\begin{center}
-\begin{tabular}{c c c}
-\begin{cfa}[tabsize=3]
-coroutine MyCoroutine {
-	int someValue;
+The \CFA keyword @coroutine@ implicitly implements the getter and forward declarations required for implementing the coroutine main:
+\begin{cquote}
+\begin{tabular}{@{}ccc@{}}
+\begin{cfa}
+coroutine MyCor {
+	int value;
+
 };
-\end{cfa} & == & \begin{cfa}[tabsize=3]
-struct MyCoroutine {
-	int someValue;
-	coroutine_desc __cor;
+\end{cfa}
+& {\Large $\Rightarrow$} &
+\begin{tabular}{@{}ccc@{}}
+\begin{cfa}
+struct MyCor {
+	int value;
+	coroutine_desc cor;
 };
-
-static inline
-coroutine_desc* get_coroutine(
-	struct MyCoroutine& this
-) {
-	return &this.__cor;
-}
-
-void main(struct MyCoroutine* this);
+\end{cfa}
+&
+\begin{cfa}
+static inline coroutine_desc *
+get_coroutine( MyCor & this ) {
+	return &this.cor;
+}
+\end{cfa}
+&
+\begin{cfa}
+void main( MyCor & this );
+
+
+
 \end{cfa}
 \end{tabular}
-\end{center}
-
-The combination of these two approaches allows users new to coroutining and concurrency to have an easy and concise specification, while more advanced users have tighter control on memory layout and initialization.
-
-\subsection{Thread Interface}\label{threads}
-The basic building blocks of multithreading in \CFA are \textbf{cfathread}.
-Both user and kernel threads are supported, where user threads are the concurrency mechanism and kernel threads are the parallel mechanism.
-User threads offer a flexible and lightweight interface.
-A thread can be declared using a struct declaration @thread@ as follows:
-
-\begin{cfa}
-thread foo {};
-\end{cfa}
-
-As for coroutines, the keyword is a thin wrapper around a \CFA trait:
-
-\begin{cfa}
-trait is_thread(dtype T) {
-      void ^?{}(T & mutex this);
-      void main(T & this);
-      thread_desc* get_thread(T & this);
+\end{tabular}
+\end{cquote}
+The combination of these two approaches allows an easy and concise specification for coroutining (and concurrency) for normal users, while more advanced users have tighter control on memory layout and initialization.
+
+
+\subsection{Thread Interface}
+\label{threads}
+
+Both user and kernel threads are supported, where user threads provide concurrency and kernel threads provide parallelism.
+Like coroutines and for the same design reasons, the selected approach for user threads is to use language support by introducing a new kind of aggregate (structure) and a \CFA trait:
+\begin{cquote}
+\begin{tabular}{@{}c@{\hspace{2\parindentlnth}}c@{}}
+\begin{cfa}
+thread myThread {
+	// communication variables
 };
-\end{cfa}
-
-Obviously, for this thread implementation to be useful it must run some user code.
-Several other threading interfaces use a function-pointer representation as the interface of threads (for example \Csharp~\cite{Csharp} and Scala~\cite{Scala}).
-However, this proposal considers that statically tying a @main@ routine to a thread supersedes this approach.
-Since the @main@ routine is already a special routine in \CFA (where the program begins), it is a natural extension of the semantics to use overloading to declare mains for different threads (the normal main being the main of the initial thread).
+
+
+\end{cfa}
+&
+\begin{cfa}
+trait is_thread( dtype T ) {
+      void main( T & this );
+      thread_desc * get_thread( T & this );
+      void ^?{}( T & `mutex` this );
+};
+\end{cfa}
+\end{tabular}
+\end{cquote}
+(The qualifier @mutex@ for the destructor parameter is discussed in Section~\ref{s:Monitors}.)
+Like a coroutine, the statically-typed @main@ function is the starting point (first stack frame) of a user thread.
+The difference is that a coroutine borrows a thread from its caller, so the first thread resuming a coroutine creates an instance of @main@;
+whereas, a user thread receives its own thread from the runtime system, which starts in @main@ at some point after the thread constructor is run.\footnote{
+The \lstinline@main@ function is already a special routine in C (where the program begins), so it is a natural extension of the semantics to use overloading to declare mains for different coroutines/threads (the normal main being the main of the initial thread).}
+No return value or additional parameters are necessary for this function, because the task type allows an arbitrary number of interface functions with corresponding arbitrary typed input/output values.
+
+\begin{comment} % put in appendix with coroutine version ???
 As such the @main@ routine of a thread can be defined as
 \begin{cfa}
@@ -1189,108 +1139,58 @@
 }
 \end{cfa}
-
 A consequence of the strongly typed approach to main is that memory layout of parameters and return values to/from a thread are now explicitly specified in the \textbf{api}.
-
-Of course, for threads to be useful, it must be possible to start and stop threads and wait for them to complete execution.
-While using an \textbf{api} such as @fork@ and @join@ is relatively common in the literature, such an interface is unnecessary.
-Indeed, the simplest approach is to use \textbf{raii} principles and have threads @fork@ after the constructor has completed and @join@ before the destructor runs.
-\begin{cfa}
-thread World;
-
-void main(World & this) {
+\end{comment}
+
+For user threads to be useful, it must be possible to start and stop the underlying thread, and wait for it to complete execution.
+While using an API such as @fork@ and @join@ is relatively common, such an interface is awkward and unnecessary.
+A simple approach is to use allocation/deallocation principles, and have threads implicitly @fork@ after construction and @join@ before destruction.
+\begin{cfa}
+thread World {};
+void main( World & this ) {
 	sout | "World!" | endl;
 }
-
-void main() {
-	World w;
-	// Thread forks here
-
-	// Printing "Hello " and "World!" are run concurrently
-	sout | "Hello " | endl;
-
-	// Implicit join at end of scope
-}
-\end{cfa}
-
-This semantic has several advantages over explicit semantics: a thread is always started and stopped exactly once, users cannot make any programming errors, and it naturally scales to multiple threads meaning basic synchronization is very simple.
-
-\begin{cfa}
-thread MyThread {
-	//...
-};
-
-// main
-void main(MyThread& this) {
-	//...
-}
-
-void foo() {
-	MyThread thrds[10];
-	// Start 10 threads at the beginning of the scope
-
-	DoStuff();
-
-	// Wait for the 10 threads to finish
-}
-\end{cfa}
-
-However, one of the drawbacks of this approach is that threads always form a tree where nodes must always outlive their children, \ie they are always destroyed in the opposite order of construction because of C scoping rules.
-This restriction is relaxed by using dynamic allocation, so threads can outlive the scope in which they are created, much like dynamically allocating memory lets objects outlive the scope in which they are created.
-
-\begin{cfa}
-thread MyThread {
-	//...
-};
-
-void main(MyThread& this) {
-	//...
-}
-
-void foo() {
-	MyThread* long_lived;
+int main() {
+	World w`[10]`;							$\C{// implicit forks after creation}$
+	sout | "Hello " | endl;					$\C{// "Hello " and 10 "World!" printed concurrently}$
+}											$\C{// implicit joins before destruction}$
+\end{cfa}
+This semantics ensures a thread is started and stopped exactly once, eliminating some programming errors, and scales to multiple threads for basic (termination) synchronization.
+This tree-structured (lattice) create/delete from C block-structure is generalized by using dynamic allocation, so threads can outlive the scope in which they are created, much like dynamically allocating memory lets objects outlive the scope in which they are created.
+\begin{cfa}
+int main() {
+	MyThread * heapLived;
 	{
-		// Start a thread at the beginning of the scope
-		MyThread short_lived;
-
-		// create another thread that will outlive the thread in this scope
-		long_lived = new MyThread;
-
-		DoStuff();
-
-		// Wait for the thread short_lived to finish
-	}
-	DoMoreStuff();
-
-	// Now wait for the long_lived to finish
-	delete long_lived;
-}
-\end{cfa}
-
-
-% ======================================================================
-% ======================================================================
-\section{Concurrency}
-% ======================================================================
-% ======================================================================
-Several tools can be used to solve concurrency challenges.
-Since many of these challenges appear with the use of mutable shared state, some languages and libraries simply disallow mutable shared state (Erlang~\cite{Erlang}, Haskell~\cite{Haskell}, Akka (Scala)~\cite{Akka}).
-In these paradigms, interaction among concurrent objects relies on message passing~\cite{Thoth,Harmony,V-Kernel} or other paradigms closely relate to networking concepts (channels~\cite{CSP,Go} for example).
-However, in languages that use routine calls as their core abstraction mechanism, these approaches force a clear distinction between concurrent and non-concurrent paradigms (\ie message passing versus routine calls).
-This distinction in turn means that, in order to be effective, programmers need to learn two sets of design patterns.
+		MyThread blockLived;				$\C{// fork block-based thread}$
+		heapLived = `new`( MyThread );		$\C{// fork heap-based thread}$
+		...
+	}										$\C{// join block-based thread}$
+	...
+	`delete`( heapLived );					$\C{// join heap-based thread}$
+}
+\end{cfa}
+The heap-based approach allows arbitrary thread-creation topologies for fork/join-style concurrency.
+
+
+\section{Synchronization / Mutual Exclusion}
+
+Uncontrolled non-deterministic execution is meaningless.
+To reestablish meaningful execution requires mechanisms to reintroduce determinism (control non-determinism), called synchronization and mutual exclusion, where \newterm{synchronization} is a timing relationship among threads and \newterm{mutual exclusion} is an access-control mechanism on data shared by threads.
+Since many determinism challenges arise from mutable shared state, some languages/libraries disallow it (Erlang~\cite{Erlang}, Haskell~\cite{Haskell}, Akka (Scala)~\cite{Akka}).
+In these paradigms, interaction among concurrent objects is performed by stateless message-passing~\cite{Thoth,Harmony,V-Kernel} or other paradigms closely related to networking concepts (\eg channels~\cite{CSP,Go}).
+However, in call/return-based languages, these approaches force a clear distinction (\ie introduce a new programming paradigm) between non-concurrent and concurrent computation (\ie function call versus message passing).
+This distinction means a programmer needs to learn two sets of design patterns.
 While this distinction can be hidden away in library code, effective use of the library still has to take both paradigms into account.
-
-Approaches based on shared memory are more closely related to non-concurrent paradigms since they often rely on basic constructs like routine calls and shared objects.
-At the lowest level, concurrent paradigms are implemented as atomic operations and locks.
-Many such mechanisms have been proposed, including semaphores~\cite{Dijkstra68b} and path expressions~\cite{Campbell74}.
-However, for productivity reasons it is desirable to have a higher-level construct be the core concurrency paradigm~\cite{Hochstein05}.
-
-An approach that is worth mentioning because it is gaining in popularity is transactional memory~\cite{Herlihy93}.
-While this approach is even pursued by system languages like \CC~\cite{Cpp-Transactions}, the performance and feature set is currently too restrictive to be the main concurrency paradigm for system languages, which is why it was rejected as the core paradigm for concurrency in \CFA.
-
-One of the most natural, elegant, and efficient mechanisms for synchronization and communication, especially for shared-memory systems, is the \emph{monitor}.
+In contrast, approaches based on stateful models more closely resemble the standard call/return programming-model, resulting in a single programming paradigm.
+
+At the lowest level, concurrency control is implemented as atomic operations, upon which different kinds of locks/approaches are constructed, \eg semaphores~\cite{Dijkstra68b} and path expressions~\cite{Campbell74}.
+However, for productivity it is always desirable to use the highest-level construct that provides the necessary efficiency~\cite{Hochstein05}.
+A newer approach worth mentioning is transactional memory~\cite{Herlihy93}.
+While this approach is pursued in hardware~\cite{} and system languages, like \CC~\cite{Cpp-Transactions}, the performance and feature set are still too restrictive to be the main concurrency paradigm for system languages, which is why it was rejected as the core paradigm for concurrency in \CFA.
+
+One of the most natural, elegant, and efficient mechanisms for synchronization and mutual exclusion for shared-memory systems is the \emph{monitor}.
 Monitors were first proposed by Brinch Hansen~\cite{Hansen73} and later described and extended by C.A.R.~Hoare~\cite{Hoare74}.
 Many programming languages---\eg Concurrent Pascal~\cite{ConcurrentPascal}, Mesa~\cite{Mesa}, Modula~\cite{Modula-2}, Turing~\cite{Turing:old}, Modula-3~\cite{Modula-3}, NeWS~\cite{NeWS}, Emerald~\cite{Emerald}, \uC~\cite{Buhr92a} and Java~\cite{Java}---provide monitors as explicit language constructs.
 In addition, operating-system kernels and device drivers have a monitor-like structure, although they often use lower-level primitives such as semaphores or locks to simulate monitors.
-For these reasons, this project proposes monitors as the core concurrency construct.
+For these reasons, this project proposes monitors as the core concurrency construct, upon which even higher-level approaches can be easily constructed.
 
 
@@ -1329,9 +1229,7 @@
 
 
-% ======================================================================
-% ======================================================================
 \section{Monitors}
-% ======================================================================
-% ======================================================================
+\label{s:Monitors}
+
 A \textbf{monitor} is a set of routines that ensure mutual-exclusion when accessing shared state.
 More precisely, a monitor is a programming technique that associates mutual-exclusion to routine scopes, as opposed to mutex locks, where mutual-exclusion is defined by lock/release calls independently of any scoping of the calling routine.
@@ -2501,5 +2399,5 @@
 Given these building blocks, it is possible to reproduce all three of the popular paradigms.
 Indeed, \textbf{uthread} is the default paradigm in \CFA.
-However, disabling \textbf{preemption} on the \textbf{cfacluster} means \textbf{cfathread} effectively become \textbf{fiber}.
+However, disabling \textbf{preemption} on a cluster means threads effectively become fibers.
 Since several \textbf{cfacluster} with different scheduling policies can coexist in the same application, this allows \textbf{fiber} and \textbf{uthread} to coexist in the runtime of an application.
 Finally, it is possible to build executors for thread pools from \textbf{uthread} or \textbf{fiber}, which includes specialized jobs like actors~\cite{Actors}.
