Index: doc/papers/concurrency/.gitignore
===================================================================
--- doc/papers/concurrency/.gitignore	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/concurrency/.gitignore	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,29 @@
+build/*.aux
+build/*.acn
+build/*.acr
+build/*.alg
+build/*.bbl
+build/*.blg
+build/*.brf
+build/*.dvi
+build/*.glg
+build/*.glo
+build/*.gls
+build/*.idx
+build/*.ind
+build/*.ist
+build/*.lof
+build/*.log
+build/*.lol
+build/*.lot
+build/*.out
+build/*.ps
+build/*.pstex
+build/*.pstex_t
+build/*.tex
+build/*.toc
+*.pdf
+*.png
+figures/*.tex
+
+examples
Index: doc/papers/concurrency/Paper.tex
===================================================================
--- doc/papers/concurrency/Paper.tex	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/concurrency/Paper.tex	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,2745 @@
+% requires tex packages: texlive-base texlive-latex-base tex-common texlive-humanities texlive-latex-extra texlive-fonts-recommended
+
+% inline code ©...© (copyright symbol) emacs: C-q M-)
+% red highlighting ®...® (registered trademark symbol) emacs: C-q M-.
+% blue highlighting ß...ß (sharp s symbol) emacs: C-q M-_
+% green highlighting ¢...¢ (cent symbol) emacs: C-q M-"
+% LaTex escape §...§ (section symbol) emacs: C-q M-'
+% keyword escape ¶...¶ (pilcrow symbol) emacs: C-q M-^
+% math escape $...$ (dollar symbol)
+
+\documentclass[10pt]{article}
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+% Latex packages used in the document.
+\usepackage[T1]{fontenc}					% allow Latin1 (extended ASCII) characters
+\usepackage{textcomp}
+\usepackage[latin1]{inputenc}
+\usepackage{fullpage,times,comment}
+\usepackage{epic,eepic}
+\usepackage{upquote}						% switch curled `'" to straight
+\usepackage{dirtytalk}
+\usepackage{calc}
+\usepackage{xspace}
+\usepackage[labelformat=simple]{subfig}
+\renewcommand{\thesubfigure}{(\alph{subfigure})}
+\usepackage{graphicx}
+\usepackage{tabularx}
+\usepackage{multicol}
+\usepackage{varioref}
+\usepackage{listings}						% format program code
+\usepackage[flushmargin]{footmisc}				% support label/reference in footnote
+\usepackage{latexsym}						% \Box glyph
+\usepackage{mathptmx}						% better math font with "times"
+\usepackage[usenames]{color}
+\usepackage[pagewise]{lineno}
+\renewcommand{\linenumberfont}{\scriptsize\sffamily}
+\usepackage{fancyhdr}
+\usepackage{float}
+\usepackage{siunitx}
+\sisetup{ binary-units=true }
+\input{style}							% bespoke macros used in the document
+\usepackage{url}
+\usepackage[dvips,plainpages=false,pdfpagelabels,pdfpagemode=UseNone,colorlinks=true,pagebackref=true,linkcolor=blue,citecolor=blue,urlcolor=blue,pagebackref=true,breaklinks=true]{hyperref}
+\usepackage{breakurl}
+\urlstyle{rm}
+
+\usepackage{tikz}
+\def\checkmark{\tikz\fill[scale=0.4](0,.35) -- (.25,0) -- (1,.7) -- (.25,.15) -- cycle;}
+
+\setlength{\topmargin}{-0.45in}				% move running title into header
+\setlength{\headsep}{0.25in}
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+% Names used in the document.
+
+\newcommand{\Version}{1.0.0}
+\newcommand{\CS}{C\raisebox{-0.9ex}{\large$^\sharp$}\xspace}
+
+\newcommand{\Textbf}[2][red]{{\color{#1}{\textbf{#2}}}}
+\newcommand{\Emph}[2][red]{{\color{#1}\textbf{\emph{#2}}}}
+\newcommand{\R}[1]{\Textbf{#1}}
+\newcommand{\B}[1]{{\Textbf[blue]{#1}}}
+\newcommand{\G}[1]{{\Textbf[OliveGreen]{#1}}}
+\newcommand{\uC}{$\mu$\CC}
+\newcommand{\cit}{\textsuperscript{[Citation Needed]}\xspace}
+\newcommand{\TODO}{{\Textbf{TODO}}}
+
+
+\newsavebox{\LstBox}
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\setcounter{secnumdepth}{2}                           % number subsubsections
+\setcounter{tocdepth}{2}                              % subsubsections in table of contents
+% \linenumbers                                       	% comment out to turn off line numbering
+
+\title{Concurrency in \CFA}
+\author{Thierry Delisle, Waterloo, Ontario, Canada, 2018}
+
+
+\begin{document}
+\maketitle
+
+\begin{abstract}
+\CFA is a modern, non-object-oriented extension of the C programming language. This thesis serves as a definition and an implementation for the concurrency and parallelism \CFA offers. These features are created from scratch due to the lack of concurrency in ISO C. Lightweight threads are introduced into the language. In addition, monitors are introduced as a high-level tool for control-flow based synchronization and mutual-exclusion. The main contributions of this thesis are two-fold: it extends the existing semantics of monitors introduced by~\cite{Hoare74} to handle monitors in groups and also details the engineering effort needed to introduce these features as core language features. Indeed, these features are added with respect to expectations of C programmers, and integrate with the \CFA type-system and other language features.
+\end{abstract}
+
+%----------------------------------------------------------------------
+% MAIN BODY
+%----------------------------------------------------------------------
+
+% ======================================================================
+\section{Introduction}
+% ======================================================================
+This thesis provides a minimal concurrency \textbf{API} that is simple, efficient and can be reused to build higher-level features. The simplest possible concurrency system is a thread and a lock, but this low-level approach is hard to master. An easier approach for users is to support higher-level constructs as the basis of concurrency. Indeed, for highly productive concurrent programming, high-level approaches are much more popular~\cite{HPP:Study}. Examples are task-based, message-passing and implicit-threading models. The high-level approach and its minimal \textbf{API} are tested in a dialect of C, called \CFA. Furthermore, the proposed \textbf{API} doubles as an early definition of the \CFA language and library. This thesis also provides an implementation of the concurrency library for \CFA as well as all the required language features added to the source-to-source translator.
+
+There are actually two problems that need to be solved in the design of concurrency for a programming language: which concurrency tools and which parallelism tools are available to the programmer. While these two concepts are often combined, they are in fact distinct, requiring different tools~\cite{Buhr05a}. Concurrency tools need to handle mutual exclusion and synchronization, while parallelism tools are about performance, cost and resource utilization.
+
+In the context of this thesis, a \textbf{thread} is a fundamental unit of execution that runs a sequence of code, generally on a program stack. Having multiple simultaneous threads gives rise to concurrency and generally requires some kind of locking mechanism to ensure proper execution. Correspondingly, \textbf{concurrency} is defined as the concepts and challenges that occur when multiple interdependent (sharing memory, timing dependencies, etc.) concurrent threads are introduced. Accordingly, \textbf{locking} (and by extension locks) is defined as a mechanism that prevents the progress of certain threads in order to avoid problems due to concurrency. Finally, in this thesis \textbf{parallelism} is distinct from concurrency and is defined as running multiple threads simultaneously. More precisely, parallelism implies \emph{actual} simultaneous execution as opposed to concurrency, which only requires \emph{apparent} simultaneous execution. As such, parallelism is only observable in the differences in performance or, more generally, differences in timing.
+
+% ======================================================================
+% ======================================================================
+\section{\CFA Overview}
+% ======================================================================
+% ======================================================================
+
+The following is a quick introduction to the \CFA language, specifically tailored to the features needed to support concurrency.
+
+\CFA is an extension of ISO-C and therefore supports all of the same paradigms as C. It is a non-object-oriented system language, meaning most of the major abstractions have either no runtime overhead or can be opted out of easily. Like C, the basics of \CFA revolve around structures and routines, which are thin abstractions over machine code. The vast majority of the code produced by the \CFA translator respects the memory layouts and calling conventions laid out by C. Interestingly, while \CFA is not an object-oriented language, lacking the concept of a receiver (e.g., {\tt this}), it does have some notion of objects\footnote{C defines the term objects as: ``region of data storage in the execution environment, the contents of which can represent values''~\cite[3.15]{C11}}, most importantly construction and destruction of objects. Most of the following code examples can be found on the \CFA website~\cite{www-cfa}.
+
+% ======================================================================
+\section{References}
+
+Like \CC, \CFA introduces rebindable references providing multiple levels of dereferencing as an alternative to pointers. In regards to concurrency, the semantic difference between pointers and references is not particularly relevant, but since this document uses mostly references, here is a quick overview of the semantics:
+\begin{cfacode}
+int x, *p1 = &x, **p2 = &p1, ***p3 = &p2,
+	&r1 = x,    &&r2 = r1,   &&&r3 = r2;
+***p3 = 3;							//change x
+r3    = 3;							//change x, ***r3
+**p3  = ...;						//change p1
+*p3   = ...;						//change p2
+int y, z, & ar[3] = {x, y, z};		//initialize array of references
+typeof( ar[1]) p;					//is int, referenced object type
+typeof(&ar[1]) q;					//is int &, reference type
+sizeof( ar[1]) == sizeof(int);		//is true, referenced object size
+sizeof(&ar[1]) == sizeof(int *);	//is true, reference size
+\end{cfacode}
+The important take away from this code example is that a reference offers a handle to an object, much like a pointer, but which is automatically dereferenced for convenience.
+
+% ======================================================================
+\section{Overloading}
+
+Another important feature of \CFA is function overloading as in Java and \CC, where routines with the same name are selected based on the number and type of the arguments. As well, \CFA uses the return type as part of the selection criteria, as in Ada~\cite{Ada}. For routines with multiple parameters and returns, the selection is complex.
+\begin{cfacode}
+//selection based on type and number of parameters
+void f(void);			//(1)
+void f(char);			//(2)
+void f(int, double);	//(3)
+f();					//select (1)
+f('a');					//select (2)
+f(3, 5.2);				//select (3)
+
+//selection based on  type and number of returns
+char   f(int);			//(1)
+double f(int);			//(2)
+char   c = f(3);		//select (1)
+double d = f(4);		//select (2)
+\end{cfacode}
+This feature is particularly important for concurrency since the runtime system relies on creating different types to represent concurrency objects. Therefore, overloading is necessary to prevent the need for long prefixes and other naming conventions that prevent name clashes. As seen in section \ref{basics}, routine \code{main} is an example that benefits from overloading.
+
+% ======================================================================
+\section{Operators}
+Overloading also extends to operators. The syntax for denoting operator-overloading is to name a routine with the symbol of the operator and question marks where the arguments of the operation appear, e.g.:
+\begin{cfacode}
+int ++? (int op);              		//unary prefix increment
+int ?++ (int op);              		//unary postfix increment
+int ?+? (int op1, int op2);    		//binary plus
+int ?<=?(int op1, int op2);   		//binary less than
+int ?=? (int & op1, int op2);  		//binary assignment
+int ?+=?(int & op1, int op2); 		//binary plus-assignment
+
+struct S {int i, j;};
+S ?+?(S op1, S op2) {				//add two structures
+	return (S){op1.i + op2.i, op1.j + op2.j};
+}
+S s1 = {1, 2}, s2 = {2, 3}, s3;
+s3 = s1 + s2;						//compute sum: s3 == {2, 5}
+\end{cfacode}
+While concurrency does not use operator overloading directly, this feature is important as an introduction to the syntax used for constructors.
+
+% ======================================================================
+\section{Constructors/Destructors}
+Object lifetime is often a challenge in concurrency. \CFA uses the approach of giving concurrent meaning to object lifetime as a means of synchronization and/or mutual exclusion. Since \CFA relies heavily on the lifetime of objects, constructors and destructors are a core feature required for concurrency and parallelism. \CFA uses the following syntax for constructors and destructors:
+\begin{cfacode}
+struct S {
+	size_t size;
+	int * ia;
+};
+void ?{}(S & s, int asize) {	//constructor operator
+	s.size = asize;				//initialize fields
+	s.ia = calloc(asize, sizeof(int));
+}
+void ^?{}(S & s) {				//destructor operator
+	free(s.ia);					//de-initialize fields
+}
+int main() {
+	S x = {10}, y = {100};		//implicit calls: ?{}(x, 10), ?{}(y, 100)
+	...							//use x and y
+	^x{};  ^y{};				//explicit calls to de-initialize
+	x{20};  y{200};				//explicit calls to reinitialize
+	...							//reuse x and y
+}								//implicit calls: ^?{}(y), ^?{}(x)
+\end{cfacode}
+The language guarantees that every object and all its fields are constructed. Like \CC, construction of an object is automatically done on allocation and destruction of the object is done on deallocation. Allocation and deallocation can occur on the stack or on the heap.
+\begin{cfacode}
+{
+	struct S s = {10};	//allocation, call constructor
+	...
+}						//deallocation, call destructor
+struct S * s = new();	//allocation, call constructor
+...
+delete(s);				//deallocation, call destructor
+\end{cfacode}
+Note that like \CC, \CFA introduces \code{new} and \code{delete}, which behave like \code{malloc} and \code{free} in addition to constructing and destructing objects, after calling \code{malloc} and before calling \code{free}, respectively.
+
+% ======================================================================
+\section{Parametric Polymorphism}
+\label{s:ParametricPolymorphism}
+Routines in \CFA can also be reused for multiple types. This capability is done using the \code{forall} clauses, which allow separately compiled routines to support generic usage over multiple types. For example, the following sum function works for any type that supports construction from 0 and addition:
+\begin{cfacode}
+//constraint type, 0 and +
+forall(otype T | { void ?{}(T *, zero_t); T ?+?(T, T); })
+T sum(T a[ ], size_t size) {
+	T total = 0;				//construct T from 0
+	for(size_t i = 0; i < size; i++)
+		total = total + a[i];	//select appropriate +
+	return total;
+}
+
+S sa[5];
+int i = sum(sa, 5);				//use S's 0 construction and +
+\end{cfacode}
+
+Since writing constraints on types can become cumbersome for more constrained functions, \CFA also has the concept of traits. Traits are named collections of constraints that can be used both instead of and in addition to regular constraints:
+\begin{cfacode}
+trait summable( otype T ) {
+	void ?{}(T *, zero_t);		//constructor from 0 literal
+	T ?+?(T, T);				//assortment of additions
+	T ?+=?(T *, T);
+	T ++?(T *);
+	T ?++(T *);
+};
+forall( otype T | summable(T) )	//use trait
+T sum(T a[], size_t size);
+\end{cfacode}
+
+Note that the type used for assertions can be either an \code{otype} or a \code{dtype}. Types declared as \code{otype} refer to ``complete'' objects, i.e., objects with a size, a default constructor, a copy constructor, a destructor and an assignment operator. Using \code{dtype}, on the other hand, makes none of these assumptions but is extremely restrictive: it only guarantees that the object is addressable.
+
+% ======================================================================
+\section{with Clause/Statement}
+Since \CFA lacks the concept of a receiver, certain functions end up needing to repeat variable names often. To remove this inconvenience, \CFA provides the \code{with} statement, which opens an aggregate scope making its fields directly accessible (like Pascal).
+\begin{cfacode}
+struct S { int i, j; };
+int mem(S & this) with (this) {	//with clause
+	i = 1;							//this->i
+	j = 2;							//this->j
+}
+int foo() {
+	struct S1 { ... } s1;
+	struct S2 { ... } s2;
+	with (s1) 						//with statement
+	{
+		//access fields of s1 without qualification
+		with (s2)					//nesting
+		{
+			//access fields of s1 and s2 without qualification
+		}
+	}
+	with (s1, s2) 					//scopes open in parallel
+	{
+		//access fields of s1 and s2 without qualification
+	}
+}
+\end{cfacode}
+
+For more information on \CFA see \cite{cforall-ug,rob-thesis,www-cfa}.
+
+% ======================================================================
+% ======================================================================
+\section{Concurrency Basics}\label{basics}
+% ======================================================================
+% ======================================================================
+Before any detailed discussion of the concurrency and parallelism in \CFA, it is important to describe the basics of concurrency and how they are expressed in \CFA user code.
+
+\section{Basics of concurrency}
+At its core, concurrency is based on having multiple call-stacks and scheduling among threads of execution executing on these stacks. Concurrency without parallelism only requires having multiple call stacks (or contexts) for a single thread of execution.
+
+Execution with a single thread and multiple stacks where the thread is self-scheduling deterministically across the stacks is called coroutining. Execution with a single thread and multiple stacks but where the thread is scheduled by an oracle (non-deterministic from the thread's perspective) across the stacks is called concurrency.
+
+Therefore, a minimal concurrency system can be achieved by creating coroutines (see Section \ref{coroutine}), which instead of context-switching among each other, always ask an oracle where to context-switch next. While coroutines can execute on the caller's stack-frame, stack-full coroutines allow full generality and are sufficient as the basis for concurrency. The aforementioned oracle is a scheduler and the whole system now follows a cooperative threading-model (a.k.a., non-preemptive scheduling). The oracle/scheduler can either be a stack-less or stack-full entity and correspondingly require one or two context-switches to run a different coroutine. In any case, a subset of concurrency related challenges start to appear. For the complete set of concurrency challenges to occur, the only feature missing is preemption.
+
+A scheduler introduces order of execution uncertainty, while preemption introduces uncertainty about where context switches occur. Mutual exclusion and synchronization are ways of limiting non-determinism in a concurrent system. Now it is important to understand that uncertainty is desirable; uncertainty can be used by runtime systems to significantly increase performance and is often the basis of giving a user the illusion that tasks are running in parallel. Optimal performance in concurrent applications is often obtained by having as much non-determinism as correctness allows.
+
+\section{\protect\CFA's Thread Building Blocks}
+One of the important features missing in C is threading\footnote{While the C11 standard defines a ``threads.h'' header, it is minimal and defined as optional. As such, library support for threading is far from widespread. At the time of writing this thesis, neither \texttt{gcc} nor \texttt{clang} support ``threads.h'' in their respective standard libraries.}. On modern architectures, a lack of threading is unacceptable~\cite{Sutter05, Sutter05b}, and therefore modern programming languages must have the proper tools to allow users to write efficient concurrent programs that take advantage of parallelism. As an extension of C, \CFA needs to express these concepts in a way that is as natural as possible to programmers familiar with imperative languages. Being a system-level language also means programmers expect to choose precisely which features they need and what cost they are willing to pay.
+
+\section{Coroutines: A Stepping Stone}\label{coroutine}
+While the main focus of this proposal is concurrency and parallelism, it is important to address coroutines, which are actually a significant building block of a concurrency system. \textbf{Coroutine}s are generalized routines with predefined points where execution is suspended and can be resumed at a later time. Therefore, they need to deal with context switches and other context-management operations. This proposal includes coroutines both as an intermediate step for the implementation of threads, and as a first-class feature of \CFA. Furthermore, many design challenges of threads are at least partially present in designing coroutines, which makes the design effort that much more relevant. The core \textbf{API} of coroutines revolves around two features: independent call-stacks and \code{suspend}/\code{resume}.
+
+\begin{table}
+\begin{center}
+\begin{tabular}{c @{\hskip 0.025in}|@{\hskip 0.025in} c @{\hskip 0.025in}|@{\hskip 0.025in} c}
+\begin{ccode}[tabsize=2]
+//Using callbacks
+void fibonacci_func(
+	int n,
+	void (*callback)(int)
+) {
+	int f1 = 0;
+	int f2 = 1;
+	int next, i;
+	for(i = 0; i < n; i++)
+	{
+		if(i <= 1)
+			next = i;
+		else {
+			next = f1 + f2;
+			f1 = f2;
+			f2 = next;
+		}
+		callback(next);
+	}
+}
+
+int main() {
+	void print_fib(int n) {
+		printf("%d\n", n);
+	}
+
+	fibonacci_func(
+		10, print_fib
+	);
+
+
+
+}
+\end{ccode}&\begin{ccode}[tabsize=2]
+//Using output array
+void fibonacci_array(
+	int n,
+	int* array
+) {
+	int f1 = 0; int f2 = 1;
+	int next, i;
+	for(i = 0; i < n; i++)
+	{
+		if(i <= 1)
+			next = i;
+		else {
+			next = f1 + f2;
+			f1 = f2;
+			f2 = next;
+		}
+		array[i] = next;
+	}
+}
+
+
+int main() {
+	int a[10];
+
+	fibonacci_array(
+		10, a
+	);
+
+	for(int i=0;i<10;i++){
+		printf("%d\n", a[i]);
+	}
+
+}
+\end{ccode}&\begin{ccode}[tabsize=2]
+//Using external state
+typedef struct {
+	int f1, f2;
+} Iterator_t;
+
+int fibonacci_state(
+	Iterator_t* it
+) {
+	int f;
+	f = it->f1 + it->f2;
+	it->f2 = it->f1;
+	it->f1 = max(f,1);
+	return f;
+}
+
+
+
+
+
+
+
+int main() {
+	Iterator_t it={0,0};
+
+	for(int i=0;i<10;i++){
+		printf("%d\n",
+			fibonacci_state(
+				&it
+			)
+		);
+	}
+
+}
+\end{ccode}
+\end{tabular}
+\end{center}
+\caption{Different implementations of a Fibonacci sequence generator in C.}
+\label{lst:fibonacci-c}
+\end{table}
+
+A good example of a problem made easier with coroutines is generators, e.g., generating the Fibonacci sequence. This problem comes with the challenge of decoupling how a sequence is generated and how it is used. Listing \ref{lst:fibonacci-c} shows conventional approaches to writing generators in C. All three of these approaches suffer from strong coupling. The left and centre approaches require that the generator have knowledge of how the sequence is used, while the rightmost approach requires holding internal state between calls on behalf of the generator and makes it much harder to handle corner cases like the Fibonacci seed.
+
+Listing \ref{lst:fibonacci-cfa} is an example of a solution to the Fibonacci problem using \CFA coroutines, where the coroutine stack holds sufficient state for the next generation. This solution has the advantage of having very strong decoupling between how the sequence is generated and how it is used. Indeed, this version is as easy to use as the \code{fibonacci_state} solution, while the implementation is very similar to the \code{fibonacci_func} example.
+
+\begin{figure}
+\begin{cfacode}[caption={Implementation of Fibonacci using coroutines},label={lst:fibonacci-cfa}]
+coroutine Fibonacci {
+	int fn; //used for communication
+};
+
+void ?{}(Fibonacci& this) { //constructor
+	this.fn = 0;
+}
+
+//main automatically called on first resume
+void main(Fibonacci& this) with (this) {
+	int fn1, fn2; 		//retained between resumes
+	fn  = 0;
+	fn1 = fn;
+	suspend(this); 		//return to last resume
+
+	fn  = 1;
+	fn2 = fn1;
+	fn1 = fn;
+	suspend(this); 		//return to last resume
+
+	for ( ;; ) {
+		fn  = fn1 + fn2;
+		fn2 = fn1;
+		fn1 = fn;
+		suspend(this); 	//return to last resume
+	}
+}
+
+int next(Fibonacci& this) {
+	resume(this); //transfer to last suspend
+	return this.fn;
+}
+
+int main() { //regular program main
+	Fibonacci f1, f2;
+	for ( int i = 1; i <= 10; i += 1 ) {
+		sout | next( f1 ) | next( f2 ) | endl;
+	}
+}
+\end{cfacode}
+\end{figure}
+
+Listing \ref{lst:fmt-line} shows the \code{Format} coroutine for restructuring text into groups of character blocks of fixed size. The example takes advantage of resuming coroutines in the constructor to simplify the code and highlights the idea that interesting control flow can occur in the constructor.
+
+\begin{figure}
+\begin{cfacode}[tabsize=3,caption={Formatting text into lines of 5 blocks of 4 characters.},label={lst:fmt-line}]
+//format characters into blocks of 4 and groups of 5 blocks per line
+coroutine Format {
+	char ch;									//used for communication
+	int g, b;								//global because used in destructor
+};
+
+void  ?{}(Format& fmt) {
+	resume( fmt );  						//prime (start) coroutine
+}
+
+void ^?{}(Format& fmt) with (fmt) {
+	if ( fmt.g != 0 || fmt.b != 0 )
+		sout | endl;
+}
+
+void main(Format& fmt) with (fmt) {
+	for ( ;; ) {							//for as many characters
+		for(g = 0; g < 5; g++) {		//groups of 5 blocks
+			for(b = 0; b < 4; b++) {	//blocks of 4 characters
+				suspend();
+				sout | ch;					//print character
+			}
+			sout | "  ";					//print block separator
+		}
+		sout | endl;						//print group separator
+	}
+}
+
+void prt(Format & fmt, char ch) {
+	fmt.ch = ch;
+	resume(fmt);
+}
+
+int main() {
+	Format fmt;
+	char ch;
+	Eof: for ( ;; ) {						//read until end of file
+		sin | ch;							//read one character
+		if(eof(sin)) break Eof;			//eof ?
+		prt(fmt, ch);						//push character for formatting
+	}
+}
+\end{cfacode}
+\end{figure}
+
+\subsection{Construction}
+One important design challenge for implementing coroutines and threads (shown in section \ref{threads}) is that the runtime system needs to run code after the user-constructor runs to connect the fully constructed object into the system. In the case of coroutines, this challenge is simpler since there is no non-determinism from preemption or scheduling. However, the underlying challenge remains the same for coroutines and threads.
+
+The runtime system needs to create the coroutine's stack and, more importantly, prepare it for the first resumption. The timing of the creation is non-trivial since users expect both to have fully constructed objects once execution enters the coroutine main and to be able to resume the coroutine from the constructor. There are several solutions to this problem but the chosen option effectively forces the design of the coroutine.
+
+Furthermore, \CFA faces an extra challenge as polymorphic routines create invisible thunks when cast to non-polymorphic routines and these thunks have function scope. For example, the following code, while looking benign, can run into undefined behaviour because of thunks:
+
+\begin{cfacode}
+//async: Runs function asynchronously on another thread
+forall(otype T)
+extern void async(void (*func)(T*), T* obj);
+
+forall(otype T)
+void noop(T*) {}
+
+void bar() {
+	int a;
+	async(noop, &a); //start thread running noop with argument a
+}
+\end{cfacode}
+
+The generated C code\footnote{Code trimmed down for brevity} creates a local thunk to hold type information:
+
+\begin{ccode}
+extern void async(/* omitted */, void (*func)(void*), void* obj);
+
+void noop(/* omitted */, void* obj){}
+
+void bar(){
+	int a;
+	void _thunk0(int* _p0){
+		/* omitted */
+		noop(/* omitted */, _p0);
+	}
+	/* omitted */
+	async(/* omitted */, ((void (*)(void*))(&_thunk0)), (&a));
+}
+\end{ccode}
+The problem in this example is a storage management issue: the function pointer \code{_thunk0} is only valid until the end of the block, which limits the viable solutions because storing the function pointer for too long causes undefined behaviour; i.e., the stack-based thunk is destroyed before it can be used. This challenge is an extension of challenges that come with second-class routines. Indeed, GCC nested routines have the same limitation: a nested routine cannot be passed outside of its declaration scope. The case of coroutines and threads is simply an extension of this problem to multiple call stacks.
+
+\subsection{Alternative: Composition}
+One solution to this challenge is to use composition/containment, where coroutine fields are added to manage the coroutine.
+
+\begin{cfacode}
+struct Fibonacci {
+	int fn; //used for communication
+	coroutine c; //composition
+};
+
+void FibMain(void*) {
+	//...
+}
+
+void ?{}(Fibonacci& this) {
+	this.fn = 0;
+	//Call constructor to initialize coroutine
+	(this.c){FibMain};
+}
+\end{cfacode}
+The downside of this approach is that users need to correctly construct the coroutine handle before using it. As with any other object, the user must carefully choose the construction order to prevent usage of objects not yet constructed. However, in the case of coroutines, users must also pass to the coroutine information about the coroutine main, as in the previous example. This opens the door for user errors and requires extra runtime storage to pass at runtime information that can be known statically.
+
+\subsection{Alternative: Reserved keyword}
+The next alternative is to use language support to annotate coroutines as follows:
+
+\begin{cfacode}
+coroutine Fibonacci {
+	int fn; //used for communication
+};
+\end{cfacode}
+The \code{coroutine} keyword means the compiler can find and inject code where needed. The downside of this approach is that it makes coroutines a special case in the language. Users wanting to extend coroutines or build their own for various reasons can only do so in ways offered by the language. Furthermore, implementing coroutines without language support also displays the power of the programming language used. While this is ultimately the option used for idiomatic \CFA code, coroutines and threads can still be constructed by users without using the language support. The reserved keywords are only present to improve ease of use for the common cases.
+
+\subsection{Alternative: Lambda Objects}
+
+For coroutines as for threads, many implementations are based on routine pointers or function objects~\cite{Butenhof97, C++14, MS:VisualC++, BoostCoroutines15}. For example, Boost implements coroutines in terms of four functor object types:
+\begin{cfacode}
+asymmetric_coroutine<>::pull_type
+asymmetric_coroutine<>::push_type
+symmetric_coroutine<>::call_type
+symmetric_coroutine<>::yield_type
+\end{cfacode}
+Often, the canonical threading paradigm in languages is based on function pointers, \texttt{pthread} being one of the most well-known examples. The main problem with this approach is that the thread usage is limited to a generic handle that must otherwise be wrapped in a custom type. Since the custom type is simple to write in \CFA and solves several issues, added support for routine/lambda based coroutines adds very little.
+
+A variation of this would be to use a simple function pointer in the same way \texttt{pthread} does for threads:
+\begin{cfacode}
+void foo( coroutine_t cid, void* arg ) {
+	int* value = (int*)arg;
+	//Coroutine body
+}
+
+int main() {
+	int value = 0;
+	coroutine_t cid = coroutine_create( &foo, (void*)&value );
+	coroutine_resume( &cid );
+}
+\end{cfacode}
+This semantics is more common for thread interfaces but coroutines work equally well. As discussed in section \ref{threads}, this approach is superseded by static approaches in terms of expressivity.
+
+\subsection{Alternative: Trait-Based Coroutines}
+
+Finally, the underlying approach, which is the one closest to \CFA idioms, is to use trait-based lazy coroutines. This approach defines a coroutine as anything that satisfies the trait \code{is_coroutine} (as defined below) and is used as a coroutine.
+
+\begin{cfacode}
+trait is_coroutine(dtype T) {
+      void main(T& this);
+      coroutine_desc* get_coroutine(T& this);
+};
+
+forall( dtype T | is_coroutine(T) ) void suspend(T&);
+forall( dtype T | is_coroutine(T) ) void resume (T&);
+\end{cfacode}
+This ensures that an object is not a coroutine until \code{resume} is called on the object. Correspondingly, any object that is passed to \code{resume} is a coroutine since it must satisfy the \code{is_coroutine} trait to compile. The advantage of this approach is that users can easily create different types of coroutines, for example, changing the memory layout of a coroutine is trivial when implementing the \code{get_coroutine} routine. The \CFA keyword \code{coroutine} simply has the effect of implementing the getter and forward declarations required for users to implement the main routine.
+
+\begin{center}
+\begin{tabular}{c c c}
+\begin{cfacode}[tabsize=3]
+coroutine MyCoroutine {
+	int someValue;
+};
+\end{cfacode} & == & \begin{cfacode}[tabsize=3]
+struct MyCoroutine {
+	int someValue;
+	coroutine_desc __cor;
+};
+
+static inline
+coroutine_desc* get_coroutine(
+	struct MyCoroutine& this
+) {
+	return &this.__cor;
+}
+
+void main(struct MyCoroutine* this);
+\end{cfacode}
+\end{tabular}
+\end{center}
+
+The combination of these two approaches allows users new to coroutining and concurrency to have an easy and concise specification, while more advanced users have tighter control on memory layout and initialization.
+
+\section{Thread Interface}\label{threads}
+The basic building block of multithreading in \CFA is the \textbf{cfathread}. Both user and kernel threads are supported, where user threads are the concurrency mechanism and kernel threads are the parallel mechanism. User threads offer a flexible and lightweight interface. A thread can be declared using a struct declaration with the \code{thread} keyword as follows:
+
+\begin{cfacode}
+thread foo {};
+\end{cfacode}
+
+As for coroutines, the keyword is a thin wrapper around a \CFA trait:
+
+\begin{cfacode}
+trait is_thread(dtype T) {
+      void ^?{}(T & mutex this);
+      void main(T & this);
+      thread_desc* get_thread(T & this);
+};
+\end{cfacode}
+
+Obviously, for this thread implementation to be useful it must run some user code. Several other threading interfaces use a function-pointer representation as the interface of threads (for example \Csharp~\cite{Csharp} and Scala~\cite{Scala}). However, this proposal considers that statically tying a \code{main} routine to a thread supersedes this approach. Since the \code{main} routine is already a special routine in \CFA (where the program begins), it is a natural extension of the semantics to use overloading to declare mains for different threads (the normal main being the main of the initial thread). As such the \code{main} routine of a thread can be defined as
+\begin{cfacode}
+thread foo {};
+
+void main(foo & this) {
+	sout | "Hello World!" | endl;
+}
+\end{cfacode}
+
+In this example, threads of type \code{foo} start execution in the \code{void main(foo &)} routine, which prints \code{"Hello World!"}. While this thesis encourages this approach to enforce strongly typed programming, users may prefer to use the routine-based thread semantics for the sake of simplicity. With the static semantics, it is trivial to write a thread type that takes a function pointer as a parameter and executes it on its stack asynchronously.
+\begin{cfacode}
+typedef void (*voidFunc)(int);
+
+thread FuncRunner {
+	voidFunc func;
+	int arg;
+};
+
+void ?{}(FuncRunner & this, voidFunc inFunc, int arg) {
+	this.func = inFunc;
+	this.arg  = arg;
+}
+
+void main(FuncRunner & this) {
+	//thread starts here and runs the function
+	this.func( this.arg );
+}
+
+void hello(/*unused*/ int) {
+	sout | "Hello World!" | endl;
+}
+
+int main() {
+	FuncRunner f = {hello, 42};
+	return 0;
+}
+\end{cfacode}
+
+A consequence of the strongly typed approach to main is that the memory layout of parameters and return values to/from a thread is now explicitly specified in the \textbf{api}.
+
+Of course, for threads to be useful, it must be possible to start and stop threads and wait for them to complete execution. While using an \textbf{api} such as \code{fork} and \code{join} is relatively common in the literature, such an interface is unnecessary. Indeed, the simplest approach is to use \textbf{raii} principles and have threads \code{fork} after the constructor has completed and \code{join} before the destructor runs.
+\begin{cfacode}
+thread World;
+
+void main(World & this) {
+	sout | "World!" | endl;
+}
+
+void main() {
+	World w;
+	//Thread forks here
+
+	//Printing "Hello " and "World!" are run concurrently
+	sout | "Hello " | endl;
+
+	//Implicit join at end of scope
+}
+\end{cfacode}
+
+This semantics has several advantages over explicit semantics: a thread is always started and stopped exactly once, users cannot forget to start or join a thread, and it naturally scales to multiple threads, meaning basic synchronization is very simple.
+
+\begin{cfacode}
+thread MyThread {
+	//...
+};
+
+//main
+void main(MyThread& this) {
+	//...
+}
+
+void foo() {
+	MyThread thrds[10];
+	//Start 10 threads at the beginning of the scope
+
+	DoStuff();
+
+	//Wait for the 10 threads to finish
+}
+\end{cfacode}
+
+However, one of the drawbacks of this approach is that threads always form a tree where nodes must always outlive their children, i.e., they are always destroyed in the opposite order of construction because of C scoping rules. This restriction is relaxed by using dynamic allocation, so threads can outlive the scope in which they are created, much like dynamically allocating memory lets objects outlive the scope in which they are created.
+
+\begin{cfacode}
+thread MyThread {
+	//...
+};
+
+void main(MyThread& this) {
+	//...
+}
+
+void foo() {
+	MyThread* long_lived;
+	{
+		//Start a thread at the beginning of the scope
+		MyThread short_lived;
+
+		//create another thread that will outlive the thread in this scope
+		long_lived = new MyThread;
+
+		DoStuff();
+
+		//Wait for the thread short_lived to finish
+	}
+	DoMoreStuff();
+
+	//Now wait for the long_lived to finish
+	delete long_lived;
+}
+\end{cfacode}
+
+
+% ======================================================================
+% ======================================================================
+\section{Concurrency}
+% ======================================================================
+% ======================================================================
+Several tools can be used to solve concurrency challenges. Since many of these challenges appear with the use of mutable shared state, some languages and libraries simply disallow mutable shared state (Erlang~\cite{Erlang}, Haskell~\cite{Haskell}, Akka (Scala)~\cite{Akka}). In these paradigms, interaction among concurrent objects relies on message passing~\cite{Thoth,Harmony,V-Kernel} or other paradigms closely related to networking concepts (channels~\cite{CSP,Go} for example). However, in languages that use routine calls as their core abstraction mechanism, these approaches force a clear distinction between concurrent and non-concurrent paradigms (i.e., message passing versus routine calls). This distinction in turn means that, in order to be effective, programmers need to learn two sets of design patterns. While this distinction can be hidden away in library code, effective use of the library still has to take both paradigms into account.
+
+Approaches based on shared memory are more closely related to non-concurrent paradigms since they often rely on basic constructs like routine calls and shared objects. At the lowest level, concurrent paradigms are implemented as atomic operations and locks. Many such mechanisms have been proposed, including semaphores~\cite{Dijkstra68b} and path expressions~\cite{Campbell74}. However, for productivity reasons it is desirable to have a higher-level construct be the core concurrency paradigm~\cite{HPP:Study}.
+
+An approach that is worth mentioning because it is gaining in popularity is transactional memory~\cite{Herlihy93}. While this approach is even pursued by system languages like \CC~\cite{Cpp-Transactions}, the performance and feature set is currently too restrictive to be the main concurrency paradigm for system languages, which is why it was rejected as the core paradigm for concurrency in \CFA.
+
+One of the most natural, elegant, and efficient mechanisms for synchronization and communication, especially for shared-memory systems, is the \emph{monitor}. Monitors were first proposed by Brinch Hansen~\cite{Hansen73} and later described and extended by C.A.R.~Hoare~\cite{Hoare74}. Many programming languages---e.g., Concurrent Pascal~\cite{ConcurrentPascal}, Mesa~\cite{Mesa}, Modula~\cite{Modula-2}, Turing~\cite{Turing:old}, Modula-3~\cite{Modula-3}, NeWS~\cite{NeWS}, Emerald~\cite{Emerald}, \uC~\cite{Buhr92a} and Java~\cite{Java}---provide monitors as explicit language constructs. In addition, operating-system kernels and device drivers have a monitor-like structure, although they often use lower-level primitives such as semaphores or locks to simulate monitors. For these reasons, this project proposes monitors as the core concurrency construct.
+
+\section{Basics}
+Non-determinism requires concurrent systems to offer support for mutual-exclusion and synchronization. Mutual-exclusion is the concept that only a fixed number of threads can access a critical section at any given time, where a critical section is a group of instructions on an associated portion of data that requires the restricted access. On the other hand, synchronization enforces relative ordering of execution and synchronization tools provide numerous mechanisms to establish timing relationships among threads.
+
+\subsection{Mutual-Exclusion}
+As mentioned above, mutual-exclusion is the guarantee that only a fixed number of threads can enter a critical section at once. However, many solutions exist for mutual exclusion, which vary in terms of performance, flexibility and ease of use. Methods range from low-level locks, which are fast and flexible but require significant attention to be correct, to higher-level concurrency techniques, which sacrifice some performance in order to improve ease of use. Ease of use comes by either guaranteeing some problems cannot occur (e.g., being deadlock free) or by offering a more explicit coupling between data and corresponding critical section. For example, the \CC \code{std::atomic<T>} offers an easy way to express mutual-exclusion on a restricted set of operations (e.g., reading/writing large types atomically). Another challenge with low-level locks is composability. Locks have restricted composability because it takes careful organizing for multiple locks to be used while preventing deadlocks. Easing composability is another feature higher-level mutual-exclusion mechanisms often offer.
+
+\subsection{Synchronization}
+As with mutual-exclusion, low-level synchronization primitives often offer good performance and good flexibility at the cost of ease of use. Again, higher-level mechanisms often simplify usage by adding either better coupling between synchronization and data (e.g., message passing) or offering a simpler solution to otherwise involved challenges. As mentioned above, synchronization can be expressed as guaranteeing that event \textit{X} always happens before \textit{Y}. Most of the time, synchronization happens within a critical section, where threads must acquire mutual-exclusion in a certain order. However, it may also be desirable to guarantee that event \textit{Z} does not occur between \textit{X} and \textit{Y}. Not satisfying this property is called \textbf{barging}. For example, where event \textit{X} tries to effect event \textit{Y} but another thread acquires the critical section and emits \textit{Z} before \textit{Y}. The classic example is the thread that finishes using a resource and unblocks a thread waiting to use the resource, but the unblocked thread must compete to acquire the resource. Preventing or detecting barging is an involved challenge with low-level locks, which can be made much easier by higher-level constructs. This challenge is often split into two different methods, barging avoidance and barging prevention. Algorithms that use flag variables to detect barging threads are said to be using barging avoidance, while algorithms that baton-pass locks~\cite{Andrews89} between threads instead of releasing the locks are said to be using barging prevention.
+
+% ======================================================================
+% ======================================================================
+\section{Monitors}
+% ======================================================================
+% ======================================================================
+A \textbf{monitor} is a set of routines that ensure mutual-exclusion when accessing shared state. More precisely, a monitor is a programming technique that associates mutual-exclusion to routine scopes, as opposed to mutex locks, where mutual-exclusion is defined by lock/release calls independently of any scoping of the calling routine. This strong association eases readability and maintainability, at the cost of flexibility. Note that both monitors and mutex locks require an abstract handle to identify them. This concept is generally associated with object-oriented languages like Java~\cite{Java} or \uC~\cite{uC++book} but does not strictly require OO semantics. The only requirement is the ability to declare a handle to a shared object and a set of routines that act on it:
+\begin{cfacode}
+typedef /*some monitor type*/ monitor;
+int f(monitor & m);
+
+int main() {
+	monitor m;  //Handle m
+	f(m);       //Routine using handle
+}
+\end{cfacode}
+
+% ======================================================================
+% ======================================================================
+\subsection{Call Semantics} \label{call}
+% ======================================================================
+% ======================================================================
+The above monitor example displays some of the intrinsic characteristics. First, it is necessary to use pass-by-reference over pass-by-value for monitor routines. This semantics is important, because at their core, monitors are implicit mutual-exclusion objects (locks), and these objects cannot be copied. Therefore, monitors are non-copyable objects (\code{dtype}).
+
+Another aspect to consider is when a monitor acquires its mutual exclusion. For example, a monitor may need to be passed through multiple helper routines that do not acquire the monitor mutual-exclusion on entry. Passthrough can occur for generic helper routines (\code{swap}, \code{sort}, etc.) or specific helper routines like the following to implement an atomic counter:
+
+\begin{cfacode}
+monitor counter_t { /*...see section $\ref{data}$...*/ };
+
+void ?{}(counter_t & nomutex this); //constructor
+size_t ++?(counter_t & mutex this); //increment
+
+//need for mutex is platform dependent
+void ?{}(size_t * this, counter_t & mutex cnt); //conversion
+\end{cfacode}
+This counter is used as follows:
+\begin{center}
+\begin{tabular}{c @{\hskip 0.35in} c @{\hskip 0.35in} c}
+\begin{cfacode}
+//shared counter
+counter_t cnt1, cnt2;
+
+//multiple threads access counter
+thread 1 : cnt1++; cnt2++;
+thread 2 : cnt1++; cnt2++;
+thread 3 : cnt1++; cnt2++;
+	...
+thread N : cnt1++; cnt2++;
+\end{cfacode}
+\end{tabular}
+\end{center}
+Notice how the counter is used without any explicit synchronization and yet supports thread-safe semantics for both reading and writing, which is similar in usage to the \CC template \code{std::atomic}.
+
+Here, the constructor (\code{?\{\}}) uses the \code{nomutex} keyword to signify that it does not acquire the monitor mutual-exclusion when constructing. This semantics is because an object not yet con\-structed should never be shared and therefore does not require mutual exclusion. Furthermore, it allows the implementation greater freedom when it initializes the monitor locking. The prefix increment operator uses \code{mutex} to protect the incrementing process from race conditions. Finally, there is a conversion operator from \code{counter_t} to \code{size_t}. This conversion may or may not require the \code{mutex} keyword depending on whether or not reading a \code{size_t} is an atomic operation.
+
+For maximum usability, monitors use \textbf{multi-acq} semantics, which means a single thread can acquire the same monitor multiple times without deadlock. For example, listing \ref{fig:search} uses recursion and \textbf{multi-acq} to print values inside a binary tree.
+\begin{figure}
+\begin{cfacode}[caption={Recursive printing algorithm using \textbf{multi-acq}.},label={fig:search}]
+monitor printer { ... };
+struct tree {
+	tree * left, right;
+	char * value;
+};
+void print(printer & mutex p, char * v);
+
+void print(printer & mutex p, tree * t) {
+	print(p, t->value);
+	print(p, t->left );
+	print(p, t->right);
+}
+\end{cfacode}
+\end{figure}
+
+Having both \code{mutex} and \code{nomutex} keywords can be redundant, depending on the meaning of a routine having neither of these keywords. For example, it is reasonable that it should default to the safest option (\code{mutex}) when given a routine without qualifiers \code{void foo(counter_t & this)}, whereas assuming \code{nomutex} is unsafe and may cause subtle errors. On the other hand, \code{nomutex} is the ``normal'' parameter behaviour: it explicitly states that ``this routine is not special''. Another alternative is making exactly one of these keywords mandatory, which provides the same semantics but without the ambiguity of supporting routines with neither keyword. Mandatory keywords would also have the added benefit of being self-documenting, but at the cost of extra typing. While there are several benefits to mandatory keywords, they do bring a few challenges. Mandatory keywords in \CFA would imply that the compiler must know without doubt whether or not a parameter is a monitor. Since \CFA relies heavily on traits as an abstraction mechanism, the distinction between a type that is a monitor and a type that looks like a monitor can become blurred. For this reason, \CFA only has the \code{mutex} keyword, and uses no keyword to mean \code{nomutex}.
+
+The next semantic decision is to establish when \code{mutex} may be used as a type qualifier. Consider the following declarations:
+\begin{cfacode}
+int f1(monitor & mutex m);
+int f2(const monitor & mutex m);
+int f3(monitor ** mutex m);
+int f4(monitor * mutex m []);
+int f5(graph(monitor *) & mutex m);
+\end{cfacode}
+The problem is to identify which object(s) should be acquired. Furthermore, each object needs to be acquired only once. In the case of simple routines like \code{f1} and \code{f2} it is easy to identify an exhaustive list of objects to acquire on entry. Adding indirections (\code{f3}) still allows the compiler and programmer to identify which object is acquired. However, adding in arrays (\code{f4}) makes it much harder. Array lengths are not necessarily known in C, and even then, making sure objects are only acquired once becomes non-trivial. This problem can be extended to absurd limits like \code{f5}, which uses a graph of monitors. To make the issue tractable, this project imposes the requirement that a routine may only acquire one monitor per parameter and it must be the type of the parameter with at most one level of indirection (ignoring potential qualifiers). Also note that while routine \code{f3} can be supported, meaning that monitor \code{**m} is acquired, passing an array to this routine would be type-safe and yet result in undefined behaviour because only the first element of the array is acquired. However, this ambiguity is part of the C type-system with respect to arrays. For this reason, \code{mutex} is disallowed in the context where arrays may be passed:
+\begin{cfacode}
+int f1(monitor & mutex m);    //Okay : recommended case
+int f2(monitor * mutex m);    //Not Okay : Could be an array
+int f3(monitor mutex m []);  //Not Okay : Array of unknown length
+int f4(monitor ** mutex m);   //Not Okay : Could be an array
+int f5(monitor * mutex m []); //Not Okay : Array of unknown length
+\end{cfacode}
+Note that not all array functions are actually distinct in the type system. However, even if the code generation could tell the difference, the extra information is still not sufficient to meaningfully extend the monitor call semantics.
+
+Unlike object-oriented monitors, where calling a mutex member \emph{implicitly} acquires mutual-exclusion of the receiver object, \CFA uses an explicit mechanism to specify the object that acquires mutual-exclusion. A consequence of this approach is that it extends naturally to multi-monitor calls.
+\begin{cfacode}
+int f(MonitorA & mutex a, MonitorB & mutex b);
+
+MonitorA a;
+MonitorB b;
+f(a,b);
+\end{cfacode}
+While OO monitors could be extended with a mutex qualifier for multiple-monitor calls, no example of this feature could be found. The capability to acquire multiple locks before entering a critical section is called \emph{\textbf{bulk-acq}}. In practice, writing multi-locking routines that do not lead to deadlocks is tricky. Having language support for such a feature is therefore a significant asset for \CFA. In the case presented above, \CFA guarantees that the order of acquisition is consistent across calls to different routines using the same monitors as arguments. This consistent ordering means acquiring multiple monitors is safe from deadlock when using \textbf{bulk-acq}. However, users can still force the acquiring order. For example, notice which routines use \code{mutex}/\code{nomutex} and how this affects acquiring order:
+\begin{cfacode}
+void foo(A& mutex a, B& mutex b) { //acquire a & b
+	...
+}
+
+void bar(A& mutex a, B& /*nomutex*/ b) { //acquire a
+	... foo(a, b); ... //acquire b
+}
+
+void baz(A& /*nomutex*/ a, B& mutex b) { //acquire b
+	... foo(a, b); ... //acquire a
+}
+\end{cfacode}
+The \textbf{multi-acq} monitor lock allows a monitor lock to be acquired by both \code{bar} or \code{baz} and acquired again in \code{foo}. In the calls to \code{bar} and \code{baz} the monitors are acquired in opposite order.
+
+However, such use leads to lock acquiring order problems. In the example above, the user uses implicit ordering in the case of function \code{foo} but explicit ordering in the case of \code{bar} and \code{baz}. This subtle difference means that calling these routines concurrently may lead to deadlock and is therefore undefined behaviour. As shown~\cite{Lister77}, solving this problem requires:
+\begin{enumerate}
+	\item Dynamically tracking the monitor-call order.
+	\item Implementing rollback semantics.
+\end{enumerate}
+While the first requirement is already a significant constraint on the system, implementing a general rollback semantics in a C-like language is still prohibitively complex~\cite{Dice10}. In \CFA, users simply need to be careful when acquiring multiple monitors at the same time or only use \textbf{bulk-acq} of all the monitors. While \CFA provides only a partial solution, most systems provide no solution and the \CFA partial solution handles many useful cases.
+
+For example, \textbf{multi-acq} and \textbf{bulk-acq} can be used together in interesting ways:
+\begin{cfacode}
+monitor bank { ... };
+
+void deposit( bank & mutex b, int deposit );
+
+void transfer( bank & mutex mybank, bank & mutex yourbank, int me2you) {
+	deposit( mybank, -me2you );
+	deposit( yourbank, me2you );
+}
+\end{cfacode}
+This example shows a trivial solution to the bank-account transfer problem~\cite{BankTransfer}. Without \textbf{multi-acq} and \textbf{bulk-acq}, the solution to this problem is much more involved and requires careful engineering.
+
+\subsection{\code{mutex} statement} \label{mutex-stmt}
+
+The call semantics discussed above have one software engineering issue: only a routine can acquire the mutual-exclusion of a set of monitors. \CFA offers the \code{mutex} statement to work around the need for unnecessary names, avoiding a major software engineering problem~\cite{2FTwoHardThings}. Table \ref{lst:mutex-stmt} shows an example of the \code{mutex} statement, which introduces a new scope in which the mutual-exclusion of a set of monitors is acquired. Beyond naming, the \code{mutex} statement has no semantic difference from a routine call with \code{mutex} parameters.
+
+\begin{table}
+\begin{center}
+\begin{tabular}{|c|c|}
+function call & \code{mutex} statement \\
+\hline
+\begin{cfacode}[tabsize=3]
+monitor M {};
+void foo( M & mutex m1, M & mutex m2 ) {
+	//critical section
+}
+
+void bar( M & m1, M & m2 ) {
+	foo( m1, m2 );
+}
+\end{cfacode}&\begin{cfacode}[tabsize=3]
+monitor M {};
+void bar( M & m1, M & m2 ) {
+	mutex(m1, m2) {
+		//critical section
+	}
+}
+
+
+\end{cfacode}
+\end{tabular}
+\end{center}
+\caption{Regular call semantics vs. \code{mutex} statement}
+\label{lst:mutex-stmt}
+\end{table}
+
+% ======================================================================
+% ======================================================================
+\subsection{Data semantics} \label{data}
+% ======================================================================
+% ======================================================================
+Once the call semantics are established, the next step is to establish data semantics. Indeed, until now a monitor is used simply as a generic handle but in most cases monitors contain shared data. This data should be intrinsic to the monitor declaration to prevent any accidental use of data without its appropriate protection. For example, here is a complete version of the counter shown in section \ref{call}:
+\begin{cfacode}
+monitor counter_t {
+	int value;
+};
+
+void ?{}(counter_t & this) {
+	this.value = 0;
+}
+
+int ?++(counter_t & mutex this) {
+	return ++this.value;
+}
+
+//need for mutex is platform dependent here
+void ?{}(int * this, counter_t & mutex cnt) {
+	*this = (int)cnt;
+}
+\end{cfacode}
+
+Like threads and coroutines, monitors are defined in terms of traits with some additional language support in the form of the \code{monitor} keyword. The monitor trait is:
+\begin{cfacode}
+trait is_monitor(dtype T) {
+	monitor_desc * get_monitor( T & );
+	void ^?{}( T & mutex );
+};
+\end{cfacode}
+Note that the destructor of a monitor must be a \code{mutex} routine to prevent deallocation while a thread is accessing the monitor. As with any object, calls to a monitor, using \code{mutex} or otherwise, are undefined behaviour after the destructor has run.
+
+% ======================================================================
+% ======================================================================
+\section{Internal Scheduling} \label{intsched}
+% ======================================================================
+% ======================================================================
+In addition to mutual exclusion, the monitors at the core of \CFA's concurrency can also be used to achieve synchronization. With monitors, this capability is generally achieved with internal or external scheduling as in~\cite{Hoare74}. With \textbf{scheduling} loosely defined as deciding which thread acquires the critical section next, \textbf{internal scheduling} means making the decision from inside the critical section (i.e., with access to the shared state), while \textbf{external scheduling} means making the decision when entering the critical section (i.e., without access to the shared state). Since internal scheduling within a single monitor is mostly a solved problem, this thesis concentrates on extending internal scheduling to multiple monitors. Indeed, like the \textbf{bulk-acq} semantics, internal scheduling extends to multiple monitors in a way that is natural to the user but requires additional complexity on the implementation side.
+
+First, here is a simple example of internal scheduling:
+
+\begin{cfacode}
+monitor A {
+	condition e;
+}
+
+void foo(A& mutex a1, A& mutex a2) {
+	...
+	//Wait for cooperation from bar()
+	wait(a1.e);
+	...
+}
+
+void bar(A& mutex a1, A& mutex a2) {
+	//Provide cooperation for foo()
+	...
+	//Unblock foo
+	signal(a1.e);
+}
+\end{cfacode}
+There are two details to note here. First, \code{signal} is a delayed operation; it only unblocks the waiting thread when it reaches the end of the critical section. This semantics is needed to respect mutual-exclusion, i.e., the signaller and signalled thread cannot be in the monitor simultaneously. The alternative is to return immediately after the call to \code{signal}, which is significantly more restrictive. Second, in \CFA, while it is common to store a \code{condition} as a field of the monitor, a \code{condition} variable can be stored/created independently of a monitor. Here routine \code{foo} waits for the \code{signal} from \code{bar} before making further progress, ensuring a basic ordering.
+
+An important aspect of the implementation is that \CFA does not allow barging, which means that once function \code{bar} releases the monitor, \code{foo} is guaranteed to be the next thread to acquire the monitor (unless some other thread waited on the same condition). This guarantee offers the benefit of not having to loop around waits to recheck that a condition is met. The main reason \CFA offers this guarantee is that users can easily introduce barging if it becomes a necessity but adding barging prevention or barging avoidance is more involved without language support. Supporting barging prevention as well as extending internal scheduling to multiple monitors is the main source of complexity in the design and implementation of \CFA concurrency.
+
+% ======================================================================
+% ======================================================================
+\subsection{Internal Scheduling - Multi-Monitor}
+% ======================================================================
+% ======================================================================
+It is easy to understand the problem of multi-monitor scheduling using a series of pseudo-code examples. Note that for simplicity in the following snippets of pseudo-code, waiting and signalling are done using an implicit condition variable, like Java's built-in monitors. Indeed, \code{wait} statements always use the implicit condition variable as their parameter and explicitly name the monitors (A and B) associated with the condition. Note that in \CFA, condition variables are tied to a \emph{group} of monitors on first use (called branding), which means that using internal scheduling with distinct sets of monitors requires one condition variable per set of monitors. The example below shows the simple case of having two threads (one for each column) and a single monitor A.
+
+\begin{multicols}{2}
+thread 1
+\begin{pseudo}
+acquire A
+	wait A
+release A
+\end{pseudo}
+
+\columnbreak
+
+thread 2
+\begin{pseudo}
+acquire A
+	signal A
+release A
+\end{pseudo}
+\end{multicols}
+One thread acquires before waiting (atomically blocking and releasing A) and the other acquires before signalling. It is important to note here that both \code{wait} and \code{signal} must be called with the proper monitor(s) already acquired. This semantics is a logical requirement for barging prevention.
+
+A direct extension of the previous example is a \textbf{bulk-acq} version:
+\begin{multicols}{2}
+\begin{pseudo}
+acquire A & B
+	wait A & B
+release A & B
+\end{pseudo}
+\columnbreak
+\begin{pseudo}
+acquire A & B
+	signal A & B
+release A & B
+\end{pseudo}
+\end{multicols}
+\noindent This version uses \textbf{bulk-acq} (denoted using the {\sf\&} symbol), but the presence of multiple monitors does not fundamentally change the meaning. Synchronization happens between the two threads in exactly the same way and order. The only difference is that mutual exclusion covers a group of monitors. On the implementation side, handling multiple monitors does add a degree of complexity, as the next few examples demonstrate.
+
+While deadlock issues can occur when nesting monitors, these issues are only a symptom of the fact that locks, and by extension monitors, are not perfectly composable. For monitors, a well-known deadlock problem is the Nested Monitor Problem~\cite{Lister77}, which occurs when a \code{wait} is made by a thread that holds more than one monitor. For example, the following pseudo-code runs into the nested-monitor problem:
+\begin{multicols}{2}
+\begin{pseudo}
+acquire A
+	acquire B
+		wait B
+	release B
+release A
+\end{pseudo}
+
+\columnbreak
+
+\begin{pseudo}
+acquire A
+	acquire B
+		signal B
+	release B
+release A
+\end{pseudo}
+\end{multicols}
+\noindent The \code{wait} only releases monitor \code{B} so the signalling thread cannot acquire monitor \code{A} to get to the \code{signal}. Attempting to release all acquired monitors at the \code{wait} introduces a different set of problems, such as releasing monitor \code{C}, which has nothing to do with the \code{signal}.
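The blockage can be observed concretely in C with POSIX primitives. In this hedged sketch (illustrative names, not \CFA semantics), the waiter's \code{pthread_cond_wait} releases only the mutex playing the role of \code{B}, so a second thread that also needs \code{A} finds it locked; \code{pthread_mutex_trylock} is used so the example reports the problem instead of actually deadlocking:

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <unistd.h>

static pthread_mutex_t A = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t B = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cB = PTHREAD_COND_INITIALIZER;
static int waiting = 0, signalled = 0, saw_busy = 0;

// Waiter: acquire A, then B, then wait on a condition tied to B.
// pthread_cond_wait releases ONLY B; A remains held while blocked.
static void *waiter(void *arg) {
	pthread_mutex_lock(&A);
	pthread_mutex_lock(&B);
	waiting = 1;
	while (!signalled)
		pthread_cond_wait(&cB, &B);  // releases B, not A
	pthread_mutex_unlock(&B);
	pthread_mutex_unlock(&A);
	return NULL;
}

// Signaller: a thread that also needed A would deadlock here;
// trylock lets the example observe the problem and still
// terminate by signalling through B alone.
static void *signaller(void *arg) {
	for (;;) {                       // wait until the waiter blocks
		pthread_mutex_lock(&B);
		if (waiting) break;
		pthread_mutex_unlock(&B);
		usleep(1000);
	}
	if (pthread_mutex_trylock(&A) == EBUSY)
		saw_busy = 1;                // A still held by the waiter
	else
		pthread_mutex_unlock(&A);
	signalled = 1;
	pthread_cond_signal(&cB);
	pthread_mutex_unlock(&B);
	return NULL;
}

int run_nested_demo(void) {
	pthread_t w, s;
	pthread_create(&w, NULL, waiter, NULL);
	pthread_create(&s, NULL, signaller, NULL);
	pthread_join(w, NULL);
	pthread_join(s, NULL);
	return saw_busy;  // 1: signaller observed A still locked
}
```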
+
+However, for monitors as for locks, it is possible to write a program using nesting without encountering any problems if nesting is done correctly. For example, the next pseudo-code snippet acquires monitors {\sf A} then {\sf B} before waiting, while only acquiring {\sf B} when signalling, effectively avoiding the Nested Monitor Problem~\cite{Lister77}.
+
+\begin{multicols}{2}
+\begin{pseudo}
+acquire A
+	acquire B
+		wait B
+	release B
+release A
+\end{pseudo}
+
+\columnbreak
+
+\begin{pseudo}
+
+acquire B
+	signal B
+release B
+
+\end{pseudo}
+\end{multicols}
+
+\noindent However, this simple refactoring may not be possible, forcing more complex restructuring.
+
+% ======================================================================
+% ======================================================================
+\subsection{Internal Scheduling - In Depth}
+% ======================================================================
+% ======================================================================
+
+A larger example is presented to show complex issues for \textbf{bulk-acq}, and its implementation options are analyzed. Listing \ref{lst:int-bulk-pseudo} shows an example where \textbf{bulk-acq} adds a significant layer of complexity to the internal signalling semantics, and listing \ref{lst:int-bulk-cfa} shows the corresponding \CFA code implementing the pseudo-code in listing \ref{lst:int-bulk-pseudo}. For the purpose of translating the given pseudo-code into \CFA code, any method of introducing a monitor is acceptable, e.g., \code{mutex} parameters, global variables, pointer parameters, or locals with the \code{mutex} statement.
+
+\begin{figure}[!t]
+\begin{multicols}{2}
+Waiting thread
+\begin{pseudo}[numbers=left]
+acquire A
+	//Code Section 1
+	acquire A & B
+		//Code Section 2
+		wait A & B
+		//Code Section 3
+	release A & B
+	//Code Section 4
+release A
+\end{pseudo}
+\columnbreak
+Signalling thread
+\begin{pseudo}[numbers=left, firstnumber=10,escapechar=|]
+acquire A
+	//Code Section 5
+	acquire A & B
+		//Code Section 6
+		|\label{line:signal1}|signal A & B
+		//Code Section 7
+	|\label{line:releaseFirst}|release A & B
+	//Code Section 8
+|\label{line:lastRelease}|release A
+\end{pseudo}
+\end{multicols}
+\begin{cfacode}[caption={Internal scheduling with \textbf{bulk-acq}},label={lst:int-bulk-pseudo}]
+\end{cfacode}
+\begin{center}
+\begin{cfacode}[xleftmargin=.4\textwidth]
+monitor A a;
+monitor B b;
+condition c;
+\end{cfacode}
+\end{center}
+\begin{multicols}{2}
+Waiting thread
+\begin{cfacode}
+mutex(a) {
+	//Code Section 1
+	mutex(a, b) {
+		//Code Section 2
+		wait(c);
+		//Code Section 3
+	}
+	//Code Section 4
+}
+\end{cfacode}
+\columnbreak
+Signalling thread
+\begin{cfacode}
+mutex(a) {
+	//Code Section 5
+	mutex(a, b) {
+		//Code Section 6
+		signal(c);
+		//Code Section 7
+	}
+	//Code Section 8
+}
+\end{cfacode}
+\end{multicols}
+\begin{cfacode}[caption={Equivalent \CFA code for listing \ref{lst:int-bulk-pseudo}},label={lst:int-bulk-cfa}]
+\end{cfacode}
+\begin{multicols}{2}
+Waiter
+\begin{pseudo}[numbers=left]
+acquire A
+	acquire A & B
+		wait A & B
+	release A & B
+release A
+\end{pseudo}
+
+\columnbreak
+
+Signaller
+\begin{pseudo}[numbers=left, firstnumber=6,escapechar=|]
+acquire A
+	acquire A & B
+		signal A & B
+	release A & B
+	|\label{line:secret}|//Secretly keep B here
+release A
+//Wakeup waiter and transfer A & B
+\end{pseudo}
+\end{multicols}
+\begin{cfacode}[caption={Listing \ref{lst:int-bulk-pseudo}, with delayed signalling comments},label={lst:int-secret}]
+\end{cfacode}
+\end{figure}
+
+The complexity begins at code sections 4 and 8 in listing \ref{lst:int-bulk-pseudo}, which are where the existing semantics of internal scheduling needs to be extended for multiple monitors. The root of the problem is that \textbf{bulk-acq} is used in a context where one of the monitors is already acquired, which is why it is important to define the behaviour of the previous pseudo-code. When the signaller thread reaches the location where it should ``release \code{A & B}'' (listing \ref{lst:int-bulk-pseudo} line \ref{line:releaseFirst}), it must actually transfer ownership of monitor \code{B} to the waiting thread. This ownership transfer is required in order to prevent barging into \code{B} by another thread, since both the signalling and signalled threads still need monitor \code{A}. There are three options:
+
+\subsubsection{Delaying Signals}
+The obvious solution to the problem of multi-monitor scheduling is to keep ownership of all locks until the last lock is ready to be transferred. It can be argued that that moment is when the last lock is no longer needed, because this semantics fits most closely with the behaviour of single-monitor scheduling. This solution has the main benefit of transferring ownership of groups of monitors, which simplifies the semantics from multiple objects to a single group of objects, effectively making the existing single-monitor semantics viable by simply changing monitors to monitor groups. This solution releases the monitors once every monitor in the group can be released. However, since some monitors are never released (e.g., the monitor of a thread), this interpretation means a group might never be released. A more interesting interpretation is to transfer the group until all its monitors are released, which means the group is not passed further and a thread can retain its locks.
+
+However, listing \ref{lst:int-secret} shows this solution can become much more complicated depending on what is executed while secretly holding B at line \ref{line:secret}, while avoiding the need to transfer ownership of a subset of the condition monitors. Listing \ref{lst:dependency} shows a slightly different example where a third thread is waiting on monitor \code{A}, using a different condition variable. Because the third thread is signalled when secretly holding \code{B}, the goal becomes unreachable. Depending on the order of signals (listing \ref{lst:dependency} lines \ref{line:signal-ab} and \ref{line:signal-a}), two cases can happen:
+
+\paragraph{Case 1: thread $\alpha$ goes first.} In this case, the problem is that monitor \code{A} needs to be passed to thread $\beta$ when thread $\alpha$ is done with it.
+\paragraph{Case 2: thread $\beta$ goes first.} In this case, the problem is that monitor \code{B} needs to be retained and passed to thread $\alpha$ along with monitor \code{A}, which can be done directly or possibly using thread $\beta$ as an intermediate.
+\\
+
+Note that ordering is not determined by a race condition but by whether signalled threads are enqueued in FIFO or FILO order. However, regardless of the answer, users can move line \ref{line:signal-a} before line \ref{line:signal-ab} and get the reverse effect for listing \ref{lst:dependency}.
+
+In both cases, the threads need to be able to distinguish, on a per-monitor basis, which ones need to be released and which ones need to be transferred, which means knowing when to release a group becomes complex and inefficient (see next section), effectively precluding this approach.
+
+\subsubsection{Dependency graphs}
+
+
+\begin{figure}
+\begin{multicols}{3}
+Thread $\alpha$
+\begin{pseudo}[numbers=left, firstnumber=1]
+acquire A
+	acquire A & B
+		wait A & B
+	release A & B
+release A
+\end{pseudo}
+\columnbreak
+Thread $\gamma$
+\begin{pseudo}[numbers=left, firstnumber=6, escapechar=|]
+acquire A
+	acquire A & B
+		|\label{line:signal-ab}|signal A & B
+	|\label{line:release-ab}|release A & B
+	|\label{line:signal-a}|signal A
+|\label{line:release-a}|release A
+\end{pseudo}
+\columnbreak
+Thread $\beta$
+\begin{pseudo}[numbers=left, firstnumber=12, escapechar=|]
+acquire A
+	wait A
+|\label{line:release-aa}|release A
+\end{pseudo}
+\end{multicols}
+\begin{cfacode}[caption={Pseudo-code for the three thread example.},label={lst:dependency}]
+\end{cfacode}
+\begin{center}
+\input{dependency}
+\end{center}
+\caption{Dependency graph of the statements in listing \ref{lst:dependency}}
+\label{fig:dependency}
+\end{figure}
+
+In listing \ref{lst:int-bulk-pseudo}, there is a solution that satisfies both barging prevention and mutual exclusion. If ownership of both monitors is transferred to the waiter when the signaller releases \code{A & B}, and the waiter then transfers ownership of \code{A} back to the signaller when it releases it, then the problem is solved (\code{B} is no longer in use at this point). Dynamically finding the correct order is therefore the second possible solution. The problem is effectively resolving a dependency graph of ownership requirements, where even the simplest of code snippets requires two transfers and has super-linear complexity. This complexity can be seen in listing \ref{lst:explosion}, which is just a direct extension to three monitors but requires at least three ownership transfers and has multiple solutions. Furthermore, the presence of multiple solutions for ownership transfer can cause deadlock problems if a specific solution is not consistently picked, in the same way that inconsistent lock-acquisition orders can cause deadlocks.
+\begin{figure}
+\begin{multicols}{2}
+\begin{pseudo}
+acquire A
+	acquire B
+		acquire C
+			wait A & B & C
+		release C
+	release B
+release A
+\end{pseudo}
+
+\columnbreak
+
+\begin{pseudo}
+acquire A
+	acquire B
+		acquire C
+			signal A & B & C
+		release C
+	release B
+release A
+\end{pseudo}
+\end{multicols}
+\begin{cfacode}[caption={Extension to three monitors of listing \ref{lst:int-bulk-pseudo}},label={lst:explosion}]
+\end{cfacode}
+\end{figure}
+
+Given the three-thread example in listing \ref{lst:dependency}, figure \ref{fig:dependency} shows the corresponding dependency graph that results, where every node is a statement of one of the three threads, and the arrows show the dependencies between statements (e.g., $\alpha1$ must happen before $\alpha2$). The extra challenge is that this dependency graph is effectively post-mortem, but the runtime system needs to be able to build and solve these graphs as the dependencies unfold. Since resolving dependency graphs is a complex and expensive endeavour, this solution is not the preferred one.
+
+\subsubsection{Partial Signalling} \label{partial-sig}
+Finally, the solution that is chosen for \CFA is to use partial signalling. Again using listing \ref{lst:int-bulk-pseudo}, the partial signalling solution transfers ownership of monitor \code{B} at line \ref{line:signal1} to the waiter but does not wake the waiting thread since it is still using monitor \code{A}. Only when it reaches line \ref{line:lastRelease} does it actually wake up the waiting thread. This solution has the benefit that complexity is encapsulated into only two actions: passing monitors to the next owner when they should be released and conditionally waking threads if all conditions are met. This solution has a much simpler implementation than a dependency-graph solving algorithm, which is why it was chosen. Furthermore, after being fully implemented, this solution does not appear to have any significant downsides.
+
+Using partial signalling, listing \ref{lst:dependency} can be solved easily:
+\begin{itemize}
+	\item When thread $\gamma$ reaches line \ref{line:release-ab} it transfers monitor \code{B} to thread $\alpha$ and continues to hold monitor \code{A}.
+	\item When thread $\gamma$ reaches line \ref{line:release-a}  it transfers monitor \code{A} to thread $\beta$  and wakes it up.
+	\item When thread $\beta$  reaches line \ref{line:release-aa} it transfers monitor \code{A} to thread $\alpha$ and wakes it up.
+\end{itemize}
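These three transfers can be mimicked in a single-threaded C toy model of the bookkeeping (hypothetical \code{owner}/\code{promised} fields, not the actual \CFA runtime structures): a signal records the promised next owner, and a release either transfers the monitor or, when the releaser still needs it, secretly retains it:

```c
#include <assert.h>

enum { NOBODY, ALPHA, BETA, GAMMA };

// Toy monitor: who owns it now, and who it is promised to.
typedef struct { int owner, promised; } toy_monitor;

// Release: transfer to the promised owner, or free the monitor.
static void release(toy_monitor *m) {
	m->owner = m->promised;      // NOBODY when nothing is pending
	m->promised = NOBODY;
}

int run_partial_signalling(void) {
	toy_monitor A = { GAMMA, NOBODY }, B = { GAMMA, NOBODY };

	// gamma signals A & B: promise both monitors to alpha.
	A.promised = ALPHA; B.promised = ALPHA;

	// gamma releases A & B (line release-ab): B transfers to alpha
	// now, but A is secretly retained, since gamma still holds it
	// from the outer acquire.
	release(&B);
	assert(B.owner == ALPHA && A.owner == GAMMA);

	// gamma signals A: the toy model simply re-promises A to beta
	// (the real runtime queues the promises).
	A.promised = BETA;
	release(&A);                 // line release-a: wake beta
	assert(A.owner == BETA);

	// beta releases A (line release-aa): transfer to alpha, which
	// now owns A & B and finally wakes up.
	A.promised = ALPHA;
	release(&A);
	assert(A.owner == ALPHA && B.owner == ALPHA);
	return 0;
}
```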
+
+% ======================================================================
+% ======================================================================
+\subsection{Signalling: Now or Later}
+% ======================================================================
+% ======================================================================
+\begin{table}
+\begin{tabular}{|c|c|}
+\code{signal} & \code{signal_block} \\
+\hline
+\begin{cfacode}[tabsize=3]
+monitor DatingService
+{
+	//compatibility codes
+	enum{ CCodes = 20 };
+
+	int girlPhoneNo;
+	int boyPhoneNo;
+};
+
+condition girls[CCodes];
+condition boys [CCodes];
+condition exchange;
+
+int girl(int phoneNo, int ccode)
+{
+	//no compatible boy ?
+	if(empty(boys[ccode]))
+	{
+		//wait for boy
+		wait(girls[ccode]);
+
+		//make phone number available
+		girlPhoneNo = phoneNo;
+
+		//wake boy from chair
+		signal(exchange);
+	}
+	else
+	{
+		//make phone number available
+		girlPhoneNo = phoneNo;
+
+		//wake boy
+		signal(boys[ccode]);
+
+		//sit in chair
+		wait(exchange);
+	}
+	return boyPhoneNo;
+}
+
+int boy(int phoneNo, int ccode)
+{
+	//same as above
+	//with boy/girl interchanged
+}
+\end{cfacode}&\begin{cfacode}[tabsize=3]
+monitor DatingService
+{
+	//compatibility codes
+	enum{ CCodes = 20 };
+
+	int girlPhoneNo;
+	int boyPhoneNo;
+};
+
+condition girls[CCodes];
+condition boys [CCodes];
+//exchange is not needed
+
+int girl(int phoneNo, int ccode)
+{
+	//no compatible boy ?
+	if(empty(boys[ccode]))
+	{
+		//wait for boy
+		wait(girls[ccode]);
+
+		//make phone number available
+		girlPhoneNo = phoneNo;
+
+		//boy, blocked in signal_block,
+		//resumes when girl exits
+	}
+	else
+	{
+		//make phone number available
+		girlPhoneNo = phoneNo;
+
+		//wake boy
+		signal_block(boys[ccode]);
+
+		//second handshake unnecessary
+
+	}
+	return boyPhoneNo;
+}
+
+int boy(int phoneNo, int ccode)
+{
+	//same as above
+	//with boy/girl interchanged
+}
+\end{cfacode}
+\end{tabular}
+\caption{Dating service example using \code{signal} and \code{signal_block}. }
+\label{tbl:datingservice}
+\end{table}
+An important note is that, until now, signalling a monitor was a delayed operation. The ownership of the monitor is transferred only when the monitor would have otherwise been released, not at the point of the \code{signal} statement. However, in some cases, it may be more convenient for users to immediately transfer ownership to the thread that is waiting for cooperation, which is achieved using the \code{signal_block} routine.
+
+The example in table \ref{tbl:datingservice} highlights the difference in behaviour. As mentioned, \code{signal} only transfers ownership once the current critical section exits; this behaviour requires additional synchronization when a two-way handshake is needed. To avoid this explicit synchronization, the \code{condition} type offers the \code{signal_block} routine, which handles the two-way handshake as shown in the example. This feature removes the need for a second condition variable and simplifies programming. Like every other monitor semantic, \code{signal_block} uses barging prevention, which means mutual-exclusion is baton-passed both on the front end and the back end of the call to \code{signal_block}, meaning no other thread can acquire the monitor either before or after the call.
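For comparison, the left column's explicit two-way handshake can be sketched in C with POSIX threads (illustrative, not the \CFA monitor implementation). Note the predicate flags and recheck loops, which pthreads requires precisely because it permits barging, and the \code{exchange} condition that \code{signal_block} renders unnecessary:

```c
#include <pthread.h>
#include <stddef.h>

// State for one compatibility code; names mirror the CFA example.
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t girls_cv = PTHREAD_COND_INITIALIZER;
static pthread_cond_t boys_cv  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t exchange = PTHREAD_COND_INITIALIZER;
static int girlPhoneNo, boyPhoneNo;
static int girls_waiting, boys_waiting;
static int girl_posted, boy_posted, exchanged;

static int girl(int phoneNo) {
	pthread_mutex_lock(&m);
	if (boys_waiting == 0) {            // no compatible boy?
		girls_waiting++;
		while (!boy_posted)             // wait for boy
			pthread_cond_wait(&girls_cv, &m);
		boy_posted = 0; girls_waiting--;
		girlPhoneNo = phoneNo;          // make number available
		exchanged = 1;
		pthread_cond_signal(&exchange); // wake boy from chair
	} else {
		girlPhoneNo = phoneNo;          // make number available
		girl_posted = 1;
		pthread_cond_signal(&boys_cv);  // wake boy
		while (!exchanged)              // sit in chair
			pthread_cond_wait(&exchange, &m);
		exchanged = 0;
	}
	int partner = boyPhoneNo;
	pthread_mutex_unlock(&m);
	return partner;
}

static int boy(int phoneNo) {           // girl with roles swapped
	pthread_mutex_lock(&m);
	if (girls_waiting == 0) {
		boys_waiting++;
		while (!girl_posted)
			pthread_cond_wait(&boys_cv, &m);
		girl_posted = 0; boys_waiting--;
		boyPhoneNo = phoneNo;
		exchanged = 1;
		pthread_cond_signal(&exchange);
	} else {
		boyPhoneNo = phoneNo;
		boy_posted = 1;
		pthread_cond_signal(&girls_cv);
		while (!exchanged)
			pthread_cond_wait(&exchange, &m);
		exchanged = 0;
	}
	int partner = girlPhoneNo;
	pthread_mutex_unlock(&m);
	return partner;
}

static void *girl_thread(void *out) { *(int *)out = girl(1111); return NULL; }
static void *boy_thread(void *out)  { *(int *)out = boy(2222);  return NULL; }

int run_dating(void) {
	int by_girl = 0, by_boy = 0;
	pthread_t g, b;
	pthread_create(&g, NULL, girl_thread, &by_girl);
	pthread_create(&b, NULL, boy_thread, &by_boy);
	pthread_join(g, NULL);
	pthread_join(b, NULL);
	return by_girl == 2222 && by_boy == 1111;  // numbers exchanged
}
```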
+
+% ======================================================================
+% ======================================================================
+\section{External scheduling} \label{extsched}
+% ======================================================================
+% ======================================================================
+An alternative to internal scheduling is external scheduling (see Table~\ref{tbl:sched}).
+\begin{table}
+\begin{tabular}{|c|c|c|}
+Internal Scheduling & External Scheduling & Go\\
+\hline
+\begin{ucppcode}[tabsize=3]
+_Monitor Semaphore {
+	condition c;
+	bool inUse;
+public:
+	void P() {
+		if(inUse)
+			wait(c);
+		inUse = true;
+	}
+	void V() {
+		inUse = false;
+		signal(c);
+	}
+}
+\end{ucppcode}&\begin{ucppcode}[tabsize=3]
+_Monitor Semaphore {
+
+	bool inUse;
+public:
+	void P() {
+		if(inUse)
+			_Accept(V);
+		inUse = true;
+	}
+	void V() {
+		inUse = false;
+
+	}
+}
+\end{ucppcode}&\begin{gocode}[tabsize=3]
+type MySem struct {
+	inUse bool
+	c     chan bool
+}
+
+// acquire
+func (s MySem) P() {
+	if s.inUse {
+		select {
+		case <-s.c:
+		}
+	}
+	s.inUse = true
+}
+
+// release
+func (s MySem) V() {
+	s.inUse = false
+
+	//This actually deadlocks
+	//when single thread
+	s.c <- false
+}
+\end{gocode}
+\end{tabular}
+\caption{Different forms of scheduling.}
+\label{tbl:sched}
+\end{table}
+This method is more constrained and explicit, which helps users reduce the non-deterministic nature of concurrency. Indeed, as the following examples demonstrate, external scheduling allows users to wait for events from other threads without the concern of unrelated events occurring. External scheduling can generally be done either in terms of control flow (e.g., Ada with \code{accept}, \uC with \code{_Accept}) or in terms of data (e.g., Go with channels). Of course, both of these paradigms have their own strengths and weaknesses, but for this project, control-flow semantics was chosen to stay consistent with the rest of the language's semantics. Two challenges specific to \CFA arise when trying to add external scheduling with loose object definitions and multiple-monitor routines. The previous example shows a simple use of \code{_Accept} versus \code{wait}/\code{signal} and its advantages. Note that while other languages often use \code{accept}/\code{select} as the core external scheduling keyword, \CFA uses \code{waitfor} to prevent name collisions with existing socket \textbf{api}s.
+
+For the \code{P} member above using internal scheduling, the call to \code{wait} only guarantees that \code{V} is the last routine to access the monitor, allowing a third routine, say \code{isInUse()}, to acquire mutual exclusion several times while routine \code{P} is waiting. On the other hand, external scheduling guarantees that while routine \code{P} is waiting, no routine other than \code{V} can acquire the monitor.
+
+% ======================================================================
+% ======================================================================
+\subsection{Loose Object Definitions}
+% ======================================================================
+% ======================================================================
+In \uC, a monitor class declaration includes an exhaustive list of monitor operations. Since \CFA is not object oriented, monitors become both more difficult to implement and less clear for a user:
+
+\begin{cfacode}
+monitor A {};
+
+void f(A & mutex a);
+void g(A & mutex a) {
+	waitfor(f); //Obvious which f() to wait for
+}
+
+void f(A & mutex a, int); //New different F added in scope
+void h(A & mutex a) {
+	waitfor(f); //Less obvious which f() to wait for
+}
+\end{cfacode}
+
+Furthermore, external scheduling is an example where implementation constraints become visible from the interface. Here is the pseudo-code for the entering phase of a monitor:
+\begin{center}
+\begin{tabular}{l}
+\begin{pseudo}
+	if monitor is free
+		enter
+	elif already own the monitor
+		continue
+	elif monitor accepts me
+		enter
+	else
+		block
+\end{pseudo}
+\end{tabular}
+\end{center}
+For the first two conditions, it is easy to implement a check that can evaluate the condition in a few instructions. However, a fast check for \pscode{monitor accepts me} is much harder to implement depending on the constraints put on the monitors. Indeed, monitors are often expressed as an entry queue and some acceptor queue as in Figure~\ref{fig:ClassicalMonitor}.
+
+\begin{figure}
+\centering
+\subfloat[Classical Monitor] {
+\label{fig:ClassicalMonitor}
+{\resizebox{0.45\textwidth}{!}{\input{monitor}}}
+}% subfloat
+\qquad
+\subfloat[\textbf{bulk-acq} Monitor] {
+\label{fig:BulkMonitor}
+{\resizebox{0.45\textwidth}{!}{\input{ext_monitor}}}
+}% subfloat
+\caption{External Scheduling Monitor}
+\end{figure}
+
+There are other alternatives to these pictures, but in the case of the left picture, implementing a fast accept check is relatively easy. Restricted to a fixed number of mutex members, N, the accept check reduces to updating a bitmask when the acceptor queue changes, a check that executes in a single instruction even with a fairly large number (e.g., 128) of mutex members. This approach requires a unique dense ordering of routines with an upper bound, and that ordering must be consistent across translation units. For OO languages these constraints are common, since objects can only add member routines consistently across translation units via inheritance. However, in \CFA users can extend objects with mutex routines that are only visible in certain translation units. This means that establishing a program-wide dense ordering among mutex routines can only be done in the program linking phase, and even then could have issues when using dynamically shared objects.
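A minimal C sketch of this fast path (hypothetical names; a 128-member monitor would simply use two mask words) shows why the check is a single instruction once a dense ordering exists:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Hypothetical dense ordering of a monitor's mutex members.
enum { FN_P, FN_V, FN_IS_IN_USE, FN_COUNT };

typedef struct {
	uint64_t accept_mask;  // bit i set => member i is accepted
} fast_monitor;

// An accept (waitfor) updates the mask in O(1)...
static void accept_only(fast_monitor *m, int fn) {
	m->accept_mask = UINT64_C(1) << fn;
}

// ...and the "monitor accepts me" check is a shift and an AND.
static bool accepts(const fast_monitor *m, int fn) {
	return (m->accept_mask >> fn) & 1;
}

int run_mask_demo(void) {
	fast_monitor m = { 0 };
	accept_only(&m, FN_V);      // P blocks, accepting only V
	assert(accepts(&m, FN_V));
	assert(!accepts(&m, FN_P));
	assert(!accepts(&m, FN_IS_IN_USE));
	return 0;
}
```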
+
+The alternative is to alter the implementation as in Figure~\ref{fig:BulkMonitor}.
+Here, the mutex routine called is associated with a thread on the entry queue while a list of acceptable routines is kept separate. Generating a mask dynamically means that the storage for the mask information can vary between calls to \code{waitfor}, allowing for more flexibility and extensions. Storing an array of accepted function pointers replaces the single-instruction bitmask comparison with dereferencing a pointer followed by a linear search. Furthermore, supporting nested external scheduling (e.g., listing \ref{lst:nest-ext}) may now require additional searches on calls to \code{waitfor} to check if a routine is already queued.
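A rough C sketch of this flexible variant (hypothetical names, not the actual \CFA data structures) shows the trade-off: the accepted set is an array of function pointers supplied by the \code{waitfor} frame, and the accept check degrades to a linear search:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*mutex_fn)(void);

// Hypothetical flexible monitor: the accepted routines live in an
// array owned by the waitfor frame, so its size can vary per call.
typedef struct {
	const mutex_fn *accepted;
	size_t count;
} flex_monitor;

// Accept check: dereference the array, then linear search.
static bool accepts(const flex_monitor *m, mutex_fn fn) {
	for (size_t i = 0; i < m->count; i++)
		if (m->accepted[i] == fn) return true;
	return false;
}

static void P(void) {}
static void V(void) {}
static void is_in_use(void) {}

int run_flex_demo(void) {
	const mutex_fn accepted[] = { V };  // analogue of waitfor(V)
	flex_monitor m = { accepted, 1 };
	assert(accepts(&m, V));
	assert(!accepts(&m, P) && !accepts(&m, is_in_use));
	return 0;
}
```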
+
+\begin{figure}
+\begin{cfacode}[caption={Example of nested external scheduling},label={lst:nest-ext}]
+monitor M {};
+void foo( M & mutex a ) {}
+void bar( M & mutex b ) {
+	//Nested in the waitfor(bar, c) call
+	waitfor(foo, b);
+}
+void baz( M & mutex c ) {
+	waitfor(bar, c);
+}
+
+\end{cfacode}
+\end{figure}
+
+Note that in the right picture, tasks always need to keep track of the monitors associated with mutex routines, and the routine mask needs to hold both a function pointer and a set of monitors, as is discussed in the next section. These details are omitted from the picture for the sake of simplicity.
+
+At this point, a decision must be made between flexibility and performance. Many design decisions in \CFA achieve both flexibility and performance, for example polymorphic routines add significant flexibility but inlining them means the optimizer can easily remove any runtime cost. Here, however, the cost of flexibility cannot be trivially removed. In the end, the most flexible approach has been chosen since it allows users to write programs that would otherwise be hard to write. This decision is based on the assumption that writing fast but inflexible locks is closer to a solved problem than writing locks that are as flexible as external scheduling in \CFA.
+
+% ======================================================================
+% ======================================================================
+\subsection{Multi-Monitor Scheduling}
+% ======================================================================
+% ======================================================================
+
+External scheduling, like internal scheduling, becomes significantly more complex when introducing multi-monitor syntax. Even in the simplest possible case, some new semantics needs to be established:
+\begin{cfacode}
+monitor M {};
+
+void f(M & mutex a);
+
+void g(M & mutex b, M & mutex c) {
+	waitfor(f); //two monitors M => unknown which to pass to f(M & mutex)
+}
+\end{cfacode}
+The obvious solution is to specify the correct monitor as follows:
+
+\begin{cfacode}
+monitor M {};
+
+void f(M & mutex a);
+
+void g(M & mutex a, M & mutex b) {
+	//wait for call to f with argument b
+	waitfor(f, b);
+}
+\end{cfacode}
+This syntax is unambiguous. Both locks are acquired and kept by \code{g}. When routine \code{f} is called, the lock for monitor \code{b} is temporarily transferred from \code{g} to \code{f} (while \code{g} still holds lock \code{a}). This behaviour can be extended to the multi-monitor \code{waitfor} statement as follows.
+
+\begin{cfacode}
+monitor M {};
+
+void f(M & mutex a, M & mutex b);
+
+void g(M & mutex a, M & mutex b) {
+	//wait for call to f with arguments a and b
+	waitfor(f, a, b);
+}
+\end{cfacode}
+
+Note that the set of monitors passed to the \code{waitfor} statement must be entirely contained in the set of monitors already acquired in the routine. \code{waitfor} used in any other context is undefined behaviour.
+
+An important behaviour to note is when a set of monitors only match partially:
+
+\begin{cfacode}
+mutex struct A {};
+
+mutex struct B {};
+
+void g(A & mutex a, B & mutex b) {
+	waitfor(f, a, b);
+}
+
+A a1, a2;
+B b;
+
+void foo() {
+	g(a1, b); //block on accept
+}
+
+void bar() {
+	f(a2, b); //fulfill cooperation
+}
+\end{cfacode}
+While the equivalent can happen when using internal scheduling, the fact that conditions are specific to a set of monitors means that users have to use two different condition variables. In both cases, partially matching monitor sets does not wake up the waiting thread. It is also important to note that in the case of external scheduling the order of parameters is irrelevant; \code{waitfor(f,a,b)} and \code{waitfor(f,b,a)} are indistinguishable waiting conditions.
+
+% ======================================================================
+% ======================================================================
+\subsection{\code{waitfor} Semantics}
+% ======================================================================
+% ======================================================================
+
+Syntactically, the \code{waitfor} statement takes a function identifier and a set of monitors. While the set of monitors can be any list of expressions, the function name is more restricted because the compiler validates at compile time the validity of the function type and the parameters used with the \code{waitfor} statement. It checks that the set of monitors passed in matches the requirements for a function call. Listing \ref{lst:waitfor} shows various usages of the \code{waitfor} statement, both acceptable and unacceptable. The choice of the function type is made ignoring any non-\code{mutex} parameter. One limitation of the current implementation is that it does not handle overloading, although overloading is possible.
+\begin{figure}
+\begin{cfacode}[caption={Various correct and incorrect uses of the waitfor statement},label={lst:waitfor}]
+monitor A{};
+monitor B{};
+
+void f1( A & mutex );
+void f2( A & mutex, B & mutex );
+void f3( A & mutex, int );
+void f4( A & mutex, int );
+void f4( A & mutex, double );
+
+void foo( A & mutex a1, A & mutex a2, B & mutex b1, B & b2 ) {
+	A * ap = & a1;
+	void (*fp)( A & mutex ) = f1;
+
+	waitfor(f1, a1);     //Correct : 1 monitor case
+	waitfor(f2, a1, b1); //Correct : 2 monitor case
+	waitfor(f3, a1);     //Correct : non-mutex arguments are ignored
+	waitfor(f1, *ap);    //Correct : expression as argument
+
+	waitfor(f1, a1, b1); //Incorrect : Too many mutex arguments
+	waitfor(f2, a1);     //Incorrect : Too few mutex arguments
+	waitfor(f2, a1, a2); //Incorrect : Mutex arguments don't match
+	waitfor(f1, 1);      //Incorrect : 1 not a mutex argument
+	waitfor(f9, a1);     //Incorrect : f9 function does not exist
+	waitfor(*fp, a1 );   //Incorrect : fp not an identifier
+	waitfor(f4, a1);     //Incorrect : f4 ambiguous
+
+	waitfor(f2, a1, b2); //Undefined behaviour : b2 not mutex
+}
+\end{cfacode}
+\end{figure}
+
+Finally, for added flexibility, \CFA supports constructing a complex \code{waitfor} statement using the \code{or}, \code{timeout} and \code{else} clauses. Indeed, multiple \code{waitfor} clauses can be chained together using \code{or}; this chain forms a single statement that uses baton passing to any function matching one of the function+monitor sets passed in. To enable users to tell which accepted function executed, \code{waitfor}s are followed by a statement (including the null statement \code{;}) or a compound statement, which is executed after the clause is triggered. A \code{waitfor} chain can also be followed by a \code{timeout}, to signify an upper bound on the wait, or an \code{else}, to signify that the call should be non-blocking, which checks whether a matching function call has already arrived and, if not, continues. Any and all of these clauses can be preceded by a \code{when} condition to dynamically toggle the accept clauses on or off based on some current state. Listing \ref{lst:waitfor2} demonstrates several complex masks and some incorrect ones.
+
+\begin{figure}
+\begin{cfacode}[caption={Various correct and incorrect uses of the or, else, and timeout clause around a waitfor statement},label={lst:waitfor2}]
+monitor A{};
+
+void f1( A & mutex );
+void f2( A & mutex );
+
+void foo( A & mutex a, bool b, int t ) {
+	//Correct : blocking case
+	waitfor(f1, a);
+
+	//Correct : block with statement
+	waitfor(f1, a) {
+		sout | "f1" | endl;
+	}
+
+	//Correct : block waiting for f1 or f2
+	waitfor(f1, a) {
+		sout | "f1" | endl;
+	} or waitfor(f2, a) {
+		sout | "f2" | endl;
+	}
+
+	//Correct : non-blocking case
+	waitfor(f1, a); or else;
+
+	//Correct : non-blocking case
+	waitfor(f1, a) {
+		sout | "blocked" | endl;
+	} or else {
+		sout | "didn't block" | endl;
+	}
+
+	//Correct : block at most 10 seconds
+	waitfor(f1, a) {
+		sout | "blocked" | endl;
+	} or timeout( 10`s ) {
+		sout | "didn't block" | endl;
+	}
+
+	//Correct : block only if b == true
+	//if b == false, don't even make the call
+	when(b) waitfor(f1, a);
+
+	//Correct : block only if b == true
+	//if b == false, make non-blocking call
+	waitfor(f1, a); or when(!b) else;
+
+	//Correct : block only if t > 1
+	waitfor(f1, a); or when(t > 1) timeout(t); or else;
+
+	//Incorrect : timeout clause is dead code
+	waitfor(f1, a); or timeout(t); or else;
+
+	//Incorrect : order must be
+	//waitfor [or waitfor... [or timeout] [or else]]
+	timeout(t); or waitfor(f1, a); or else;
+}
+\end{cfacode}
+\end{figure}
+
+% ======================================================================
+% ======================================================================
+\subsection{Waiting For The Destructor}
+% ======================================================================
+% ======================================================================
+An interesting use for the \code{waitfor} statement is destructor semantics. Indeed, the \code{waitfor} statement can accept any \code{mutex} routine, which includes the destructor (see section \ref{data}). However, with the semantics discussed until now, waiting for the destructor does not make any sense, since using an object after its destructor is called is undefined behaviour. The simplest approach is to disallow \code{waitfor} on a destructor. However, a more expressive approach is to flip ordering of execution when waiting for the destructor, meaning that waiting for the destructor allows the destructor to run after the current \code{mutex} routine, similarly to how a condition is signalled.
+\begin{figure}
+\begin{cfacode}[caption={Example of an executor that executes actions in series until its destructor is called.},label={lst:dtor-order}]
+monitor Executer {};
+struct  Action;
+
+void ^?{}   (Executer & mutex this);
+void execute(Executer & mutex this, const Action & );
+void run    (Executer & mutex this) {
+	while(true) {
+		   waitfor(execute, this);
+		or waitfor(^?{}   , this) {
+			break;
+		}
+	}
+}
+\end{cfacode}
+\end{figure}
+For example, listing \ref{lst:dtor-order} shows an example of an executor with an infinite loop, which waits for the destructor to break out of this loop. Switching the semantic meaning introduces an idiomatic way to terminate a task and/or wait for its termination via destruction.
+
+
+% ######     #    ######     #    #       #       ####### #       ###  #####  #     #
+% #     #   # #   #     #   # #   #       #       #       #        #  #     # ##   ##
+% #     #  #   #  #     #  #   #  #       #       #       #        #  #       # # # #
+% ######  #     # ######  #     # #       #       #####   #        #   #####  #  #  #
+% #       ####### #   #   ####### #       #       #       #        #        # #     #
+% #       #     # #    #  #     # #       #       #       #        #  #     # #     #
+% #       #     # #     # #     # ####### ####### ####### ####### ###  #####  #     #
+\section{Parallelism}
+Historically, computer performance was about processor speeds and instruction counts. However, with heat dissipation being a direct consequence of speed increases, parallelism has become the new source of increased performance~\cite{Sutter05, Sutter05b}. In this decade, it is no longer reasonable to create a high-performance application without caring about parallelism. Indeed, parallelism is an important aspect of performance and more specifically throughput and hardware utilization. The lowest-level approach to parallelism is to use \textbf{kthread} in combination with semantics like \code{fork}, \code{join}, etc. However, since these have significant costs and limitations, \textbf{kthread} are now mostly used as an implementation tool rather than a user-oriented one. There are several alternatives that address these issues, each with strengths and weaknesses. While there are many variations of the presented paradigms, most of these variations do not actually change the guarantees or the semantics; they simply move costs in order to achieve better performance for certain workloads.
+
+\section{Paradigms}
+\subsection{User-Level Threads}
+A direct improvement on the \textbf{kthread} approach is to use \textbf{uthread}. These threads offer most of the same features that the operating system already provides but can be used on a much larger scale. This approach is the most powerful solution as it allows all the features of multithreading, while removing several of the more expensive costs of kernel threads. The downside is that almost none of the low-level threading problems are hidden; users still have to think about data races, deadlocks and synchronization issues. These issues can be somewhat alleviated by a concurrency toolkit with strong guarantees, but the parallelism toolkit offers very little to reduce complexity in itself.
+
+Examples of languages that support \textbf{uthread} are Erlang~\cite{Erlang} and \uC~\cite{uC++book}.
+
+\subsection{Fibers : User-Level Threads Without Preemption} \label{fibers}
+A popular variant of \textbf{uthread} is what is often referred to as \textbf{fiber}. However, \textbf{fiber} do not present meaningful semantic differences with \textbf{uthread}. The significant difference between \textbf{uthread} and \textbf{fiber} is the lack of \textbf{preemption} in the latter. Advocates of \textbf{fiber} list their high performance and ease of implementation as major strengths, but the performance difference between \textbf{uthread} and \textbf{fiber} is controversial, and the ease of implementation, while true, is a weak argument in the context of language design. Therefore this proposal largely ignores fibers.
+
+An example of a language that uses fibers is Go~\cite{Go}.
+
+\subsection{Jobs and Thread Pools}
+An approach on the opposite end of the spectrum is to base parallelism on \textbf{pool}. Indeed, \textbf{pool} offer limited flexibility but at the benefit of a simpler user interface. In \textbf{pool} based systems, users express parallelism as units of work, called jobs, and a dependency graph (either explicit or implicit) that ties them together. This approach means users need not worry about concurrency but significantly limits the interaction that can occur among jobs. Indeed, any \textbf{job} that blocks also blocks the underlying worker, which effectively means the CPU utilization, and therefore throughput, suffers noticeably. It can be argued that a solution to this problem is to use more workers than available cores. However, unless the number of jobs and the number of workers are comparable, having a significant number of blocked jobs always results in idle cores.
+
+The gold standard of this implementation is Intel's TBB library~\cite{TBB}.
+
+\subsection{Paradigm Performance}
+While the choice between the three paradigms listed above may have significant performance implications, it is difficult to pin down the performance implications of choosing a model at the language level. Indeed, in many situations one of these paradigms may show better performance but it all strongly depends on the workload. Having a large number of mostly independent units of work to execute almost guarantees equivalent performance across paradigms and that the \textbf{pool}-based system has the best efficiency thanks to the lower memory overhead (i.e., no thread stack per job). However, interactions among jobs can easily exacerbate contention. User-level threads allow fine-grain context switching, which results in better resource utilization, but a context switch is more expensive and the extra control means users need to tweak more variables to get the desired performance. Finally, if the units of uninterrupted work are large enough, the paradigm choice is largely amortized by the actual work done.
+
+\section{The \protect\CFA\ Kernel : Processors, Clusters and Threads}\label{kernel}
+A \textbf{cfacluster} is a group of \textbf{kthread} executed in isolation. \textbf{uthread} are scheduled on the \textbf{kthread} of a given \textbf{cfacluster}, allowing organization between \textbf{uthread} and \textbf{kthread}. It is important that \textbf{kthread} belonging to a same \textbf{cfacluster} have homogeneous settings, otherwise migrating a \textbf{uthread} from one \textbf{kthread} to another can cause issues. A \textbf{cfacluster} also offers a pluggable scheduler that can optimize the workload generated by the \textbf{uthread}.
+
+\textbf{cfacluster} have not been fully implemented in the context of this thesis. Currently \CFA only supports one \textbf{cfacluster}, the initial one.
+
+\subsection{Future Work: Machine Setup}\label{machine}
+While this was not done in the context of this thesis, another important aspect of clusters is affinity. While many common desktop and laptop PCs have homogeneous CPUs, other devices often have more heterogeneous setups. For example, a system using \textbf{numa} configurations may benefit from users being able to tie clusters and/or kernel threads to certain CPU cores. OS support for CPU affinity is now common~\cite{affinityLinux, affinityWindows, affinityFreebsd, affinityNetbsd, affinityMacosx}, which means it is both possible and desirable for \CFA to offer an abstraction mechanism for portable CPU affinity.
+
+\subsection{Paradigms}\label{cfaparadigms}
+Given these building blocks, it is possible to reproduce all three of the popular paradigms. Indeed, \textbf{uthread} is the default paradigm in \CFA. However, disabling \textbf{preemption} on the \textbf{cfacluster} means \textbf{cfathread} effectively become \textbf{fiber}. Since several \textbf{cfacluster} with different scheduling policies can coexist in the same application, this allows \textbf{fiber} and \textbf{uthread} to coexist in the runtime of an application. Finally, it is possible to build executors for thread pools from \textbf{uthread} or \textbf{fiber}, which includes specialized jobs like actors~\cite{Actors}.
+
+
+
+\section{Behind the Scenes}
+There are several challenges specific to \CFA when implementing concurrency. These challenges are a direct result of \textbf{bulk-acq} and loose object definitions. These two constraints are the root cause of most design decisions in the implementation. Furthermore, to avoid contention from dynamically allocating memory in a concurrent environment, the internal-scheduling design is (almost) entirely free of mallocs. This approach avoids the chicken and egg problem~\cite{Chicken} of having a memory allocator that relies on the threading system and a threading system that relies on the runtime. This extra goal means that memory management is a constant concern in the design of the system.
+
+The main memory concern for concurrency is queues. All blocking operations are made by parking threads onto queues and all queues are designed with intrusive nodes, where each node has pre-allocated link fields for chaining, to avoid the need for memory allocation. Since several concurrency operations can use an unbound amount of memory (depending on \textbf{bulk-acq}), statically defining information in the intrusive fields of threads is insufficient. The only way to use a variable amount of memory without requiring memory allocation is to pre-allocate large buffers of memory eagerly and store the information in these buffers. Conveniently, the call stack fits that description and is easy to use, which is why it is used heavily in the implementation of internal scheduling, particularly through variable-length arrays. Since stack allocation is based on scopes, the first step of the implementation is to identify the scopes that are available to store the information, and which of these can have a variable-length array. The threads and the condition both have a fixed amount of memory, while \code{mutex} routines and blocking calls allow for an unbound amount, within the stack size.
+
+Note that since the major contributions of this thesis are extending monitor semantics to \textbf{bulk-acq} and loose object definitions, any challenges that do not result from these characteristics of \CFA are considered solved problems and therefore are not discussed.
+
+% ======================================================================
+% ======================================================================
+\section{Mutex Routines}
+% ======================================================================
+% ======================================================================
+
+The first step towards the monitor implementation is simple \code{mutex} routines. In the single monitor case, mutual-exclusion is done using the entry/exit procedure in listing \ref{lst:entry1}. The entry/exit procedures do not have to be extended to support multiple monitors. Indeed, it is sufficient to enter/leave monitors one-by-one as long as the order is correct to prevent deadlock~\cite{Havender68}. In \CFA, ordering of monitor acquisition relies on memory ordering. This approach is sufficient because all objects are guaranteed to have distinct non-overlapping memory layouts and mutual-exclusion for a monitor is only defined for its lifetime, meaning that destroying a monitor while it is acquired is undefined behaviour. When a mutex call is made, the concerned monitors are aggregated into a variable-length pointer array and sorted based on pointer values. This array persists for the entire duration of the mutual-exclusion and its ordering is reused extensively.
+\begin{figure}
+\begin{multicols}{2}
+Entry
+\begin{pseudo}
+if monitor is free
+	enter
+elif already own the monitor
+	continue
+else
+	block
+increment recursion
+\end{pseudo}
+\columnbreak
+Exit
+\begin{pseudo}
+decrement recursion
+if recursion == 0
+	if entry queue not empty
+		wake-up thread
+\end{pseudo}
+\end{multicols}
+\begin{pseudo}[caption={Initial entry and exit routine for monitors},label={lst:entry1}]
+\end{pseudo}
+\end{figure}
+
+\subsection{Details: Interaction with polymorphism}
+Depending on the choice of semantics for when monitor locks are acquired, interaction between monitors and \CFA's concept of polymorphism can be more complex to support. However, it is shown that entry-point locking solves most of the issues.
+
+First of all, interaction between \code{otype} polymorphism (see Section~\ref{s:ParametricPolymorphism}) and monitors is impossible since monitors do not support copying. Therefore, the main question is how to support \code{dtype} polymorphism. It is important to present the difference between the two acquiring options: \textbf{callsite-locking} and entry-point locking, i.e., acquiring the monitors before making a mutex routine-call or as the first operation of the mutex routine-call. For example:
+\begin{table}[H]
+\begin{center}
+\begin{tabular}{|c|c|c|}
+Mutex & \textbf{callsite-locking} & \textbf{entry-point-locking} \\
+call & pseudo-code & pseudo-code \\
+\hline
+\begin{cfacode}[tabsize=3]
+void foo(monitor& mutex a){
+
+	//Do Work
+	//...
+
+}
+
+void main() {
+	monitor a;
+
+	foo(a);
+
+}
+\end{cfacode} & \begin{pseudo}[tabsize=3]
+foo(& a) {
+
+	//Do Work
+	//...
+
+}
+
+main() {
+	monitor a;
+	acquire(a);
+	foo(a);
+	release(a);
+}
+\end{pseudo} & \begin{pseudo}[tabsize=3]
+foo(& a) {
+	acquire(a);
+	//Do Work
+	//...
+	release(a);
+}
+
+main() {
+	monitor a;
+
+	foo(a);
+
+}
+\end{pseudo}
+\end{tabular}
+\end{center}
+\caption{Call-site vs entry-point locking for mutex calls}
+\label{tbl:locking-site}
+\end{table}
+
+Note the \code{mutex} keyword relies on the type system, which means that in cases where a generic monitor-routine is desired, writing the mutex routine is possible with the proper trait, e.g.:
+\begin{cfacode}
+//Incorrect: T may not be monitor
+forall(dtype T)
+void foo(T * mutex t);
+
+//Correct: this function only works on monitors (any monitor)
+forall(dtype T | is_monitor(T))
+void bar(T * mutex t);
+\end{cfacode}
+
+Both entry-point and \textbf{callsite-locking} are feasible implementations. The current \CFA implementation uses entry-point locking because it requires less work when using \textbf{raii}, effectively transferring the burden of implementation to object construction/destruction. It is harder to use \textbf{raii} for call-site locking, as it does not necessarily have an existing scope that matches exactly the scope of the mutual exclusion, i.e., the function body. For example, the monitor call can appear in the middle of an expression. Furthermore, entry-point locking requires less code generation since any useful routine is called multiple times, but there is only one entry point for many call sites.
+
+% ======================================================================
+% ======================================================================
+\section{Threading} \label{impl:thread}
+% ======================================================================
+% ======================================================================
+
+Figure \ref{fig:system1} shows a high-level picture of the \CFA runtime system in regards to concurrency. Each component of the picture is explained in detail in the following sections.
+
+\begin{figure}
+\begin{center}
+{\resizebox{\textwidth}{!}{\input{system.pstex_t}}}
+\end{center}
+\caption{Overview of the entire system}
+\label{fig:system1}
+\end{figure}
+
+\subsection{Processors}
+Parallelism in \CFA is built around using processors to specify how much parallelism is desired. \CFA processors are object wrappers around kernel threads, specifically \texttt{pthread}s in the current implementation of \CFA. Indeed, any parallelism must go through operating-system libraries. However, \textbf{uthread} are still the main source of concurrency; processors are simply the underlying source of parallelism. Indeed, processor \textbf{kthread} simply fetch a \textbf{uthread} from the scheduler and run it; they are effectively executors for user-threads. The main benefit of this approach is that it offers a well-defined boundary between kernel code and user code, for example, kernel thread quiescing, scheduling and interrupt handling. Processors internally use coroutines to take advantage of the existing context-switching semantics.
+
+\subsection{Stack Management}
+One of the challenges of this system is to reduce the footprint as much as possible. Specifically, all \texttt{pthread}s created also have a stack created with them, which should be used as much as possible. Normally, coroutines also create their own stack to run on, however, in the case of the coroutines used for processors, these coroutines run directly on the \textbf{kthread} stack, effectively stealing the processor stack. The exception to this rule is the Main Processor, i.e., the initial \textbf{kthread} that is given to any program. In order to respect C user expectations, the stack of the initial kernel thread, i.e., the main stack of the program, which can grow very large, is used by the main user thread rather than the main processor.
+
+\subsection{Context Switching}
+As mentioned in section \ref{coroutine}, coroutines are a stepping stone for implementing threading, because they share the same mechanism for context-switching between different stacks. To improve performance and simplicity, context-switching is implemented using the following assumption: all context-switches happen inside a specific function call. This assumption means that the context-switch only has to copy the callee-saved registers onto the stack and then switch the stack registers with the ones of the target coroutine/thread. Note that the instruction pointer can be left untouched since the context-switch is always inside the same function. Threads, however, do not context-switch between each other directly. They context-switch to the scheduler. This method is called a 2-step context-switch and has the advantage of having a clear distinction between user code and the kernel where scheduling and other system operations happen. Obviously, this doubles the context-switch cost because threads must context-switch to an intermediate stack. The alternative 1-step context-switch uses the stack of the ``from'' thread to schedule and then context-switches directly to the ``to'' thread. However, the performance of the 2-step context-switch is still superior to a \code{pthread_yield} (see section \ref{results}). Additionally, for users in need of optimal performance, it is important to note that having a 2-step context-switch as the default does not prevent \CFA from offering a 1-step context-switch (akin to the Microsoft \code{SwitchToFiber}~\cite{switchToWindows} routine). This option is not currently present in \CFA, but the changes required to add it are strictly additive.
+
+\subsection{Preemption} \label{preemption}
+Finally, an important aspect for any complete threading system is preemption. As mentioned in section \ref{basics}, preemption introduces an extra degree of uncertainty, which enables users to have multiple threads interleave transparently, rather than having to cooperate among threads for proper scheduling and CPU distribution. Indeed, preemption is desirable because it adds a degree of isolation among threads. In a fully cooperative system, any thread that runs a long loop can starve other threads, while in a preemptive system, starvation can still occur but it does not rely on every thread having to yield or block on a regular basis, which significantly reduces the programmer's burden. Obviously, preemption is not optimal for every workload. However any preemptive system can become a cooperative system by making the time slices extremely large. Therefore, \CFA uses a preemptive threading system.
+
+Preemption in \CFA\footnote{Note that the implementation of preemption is strongly tied to the underlying threading system. For this reason, only the Linux implementation is covered; \CFA does not run on Windows at the time of writing} is based on kernel timers, which are used to run a discrete-event simulation. Every processor keeps track of the current time and registers an expiration time with the preemption system. When the preemption system receives a change in preemption, it inserts the time in a sorted order and sets a kernel timer for the closest one, effectively stepping through preemption events on each signal sent by the timer. These timers use the Linux signal {\tt SIGALRM}, which is delivered to the process rather than the kernel-thread. This results in an implementation problem, because when delivering signals to a process, the kernel can deliver the signal to any kernel thread for which the signal is not blocked, i.e.:
+\begin{quote}
+A process-directed signal may be delivered to any one of the threads that does not currently have the signal blocked. If more than one of the threads has the signal unblocked, then the kernel chooses an arbitrary thread to which to deliver the signal.
+SIGNAL(7) - Linux Programmer's Manual
+\end{quote}
+For the sake of simplicity, and in order to prevent the case of having two threads receiving alarms simultaneously, \CFA programs block the {\tt SIGALRM} signal on every kernel thread except one.
+
+Now because of how involuntary context-switches are handled, the kernel thread handling {\tt SIGALRM} cannot also be a processor thread. Hence, involuntary context-switching is done by sending signal {\tt SIGUSR1} to the corresponding proces\-sor and having the thread yield from inside the signal handler. This approach effectively context-switches away from the signal handler back to the kernel and the signal handler frame is eventually unwound when the thread is scheduled again. As a result, a signal handler can start on one kernel thread and terminate on a second kernel thread (but the same user thread). It is important to note that signal handlers save and restore signal masks because user-thread migration can cause a signal mask to migrate from one kernel thread to another. This behaviour is only a problem if all kernel threads, among which a user thread can migrate, differ in terms of signal masks\footnote{Sadly, official POSIX documentation is silent on what distinguishes ``async-signal-safe'' functions from other functions.}. However, since the kernel thread handling preemption requires a different signal mask, executing user threads on the kernel-alarm thread can cause deadlocks. For this reason, the alarm thread is in a tight loop around a system call to \code{sigwaitinfo}, requiring very little CPU time for preemption. One final detail about the alarm thread is how to wake it when additional communication is required (e.g., on thread termination). This unblocking is also done using {\tt SIGALRM}, but sent through \code{pthread_sigqueue}. Indeed, \code{sigwait} can differentiate signals sent from \code{pthread_sigqueue} from signals sent from alarms or the kernel.
+
+\subsection{Scheduler}
+Finally, an aspect not yet mentioned is the scheduling algorithm. Currently, the \CFA scheduler uses a single ready queue for all processors, which is the simplest approach to scheduling. Further discussion on scheduling is present in section \ref{futur:sched}.
+
+% ======================================================================
+% ======================================================================
+\section{Internal Scheduling} \label{impl:intsched}
+% ======================================================================
+% ======================================================================
+The following figure is the traditional illustration of a monitor (repeated from page~\pageref{fig:ClassicalMonitor} for convenience):
+
+\begin{figure}[H]
+\begin{center}
+{\resizebox{0.4\textwidth}{!}{\input{monitor}}}
+\end{center}
+\caption{Traditional illustration of a monitor}
+\end{figure}
+
+This picture has several components, the two most important being the entry queue and the AS-stack. The entry queue is an (almost) FIFO list where threads waiting to enter are parked, while the acceptor/signaller (AS) stack is a FILO list used for threads that have been signalled or otherwise marked as running next.
+
+For \CFA, this picture does not have support for blocking multiple monitors on a single condition. To support \textbf{bulk-acq} two changes to this picture are required. First, it is no longer helpful to attach the condition to \emph{a single} monitor. Secondly, the thread waiting on the condition has to be separated across multiple monitors, as seen in figure \ref{fig:monitor_cfa}.
+
+\begin{figure}[H]
+\begin{center}
+{\resizebox{0.8\textwidth}{!}{\input{int_monitor}}}
+\end{center}
+\caption{Illustration of \CFA Monitor}
+\label{fig:monitor_cfa}
+\end{figure}
+
+This picture and the proper entry and leave algorithms (see listing \ref{lst:entry2}) is the fundamental implementation of internal scheduling. Note that when a thread is moved from the condition to the AS-stack, it is conceptually split into N pieces, where N is the number of monitors specified in the parameter list. The thread is woken up when all the pieces have popped from the AS-stacks and made active. In this picture, the threads are split into halves but this is only because there are two monitors. For a specific signalling operation every monitor needs a piece of thread on its AS-stack.
+
+\begin{figure}[b]
+\begin{multicols}{2}
+Entry
+\begin{pseudo}
+if monitor is free
+	enter
+elif already own the monitor
+	continue
+else
+	block
+increment recursion
+
+\end{pseudo}
+\columnbreak
+Exit
+\begin{pseudo}
+decrement recursion
+if recursion == 0
+	if signal_stack not empty
+		set_owner to thread
+		if all monitors ready
+			wake-up thread
+
+	if entry queue not empty
+		wake-up thread
+\end{pseudo}
+\end{multicols}
+\begin{pseudo}[caption={Entry and exit routine for monitors with internal scheduling},label={lst:entry2}]
+\end{pseudo}
+\end{figure}
+
+The solution discussed in \ref{intsched} can be seen in the exit routine of listing \ref{lst:entry2}. Basically, the solution boils down to having a separate data structure for the condition queue and the AS-stack, and unconditionally transferring ownership of the monitors but only unblocking the thread when the last monitor has transferred ownership. This solution is deadlock-safe and prevents any potential barging. The data structures used for the AS-stack are reused extensively for external scheduling, but in the case of internal scheduling, the data is allocated using variable-length arrays on the call stack of the \code{wait} and \code{signal_block} routines.
+
+\begin{figure}[H]
+\begin{center}
+{\resizebox{0.8\textwidth}{!}{\input{monitor_structs.pstex_t}}}
+\end{center}
+\caption{Data structures involved in internal/external scheduling}
+\label{fig:structs}
+\end{figure}
+
+Figure \ref{fig:structs} shows a high-level representation of these data structures. The main idea behind them is that a thread cannot contain an arbitrary number of intrusive ``next'' pointers for linking onto monitors. The \code{condition node} is the data structure that is queued onto a condition variable and, when signalled, the condition queue is popped and each \code{condition criterion} is moved to the AS-stack. Once all the criteria have been popped from their respective AS-stacks, the thread is woken up, which is what is shown in listing \ref{lst:entry2}.
+
+% ======================================================================
+% ======================================================================
+\section{External Scheduling}
+% ======================================================================
+% ======================================================================
+Similarly to internal scheduling, external scheduling for multiple monitors relies on the idea that waiting-thread queues are no longer specific to a single monitor, as mentioned in section \ref{extsched}. For internal scheduling, these queues are part of condition variables, which are still unique for a given scheduling operation (i.e., no signal statement uses multiple conditions). However, in the case of external scheduling, there is no equivalent object which is associated with \code{waitfor} statements. This absence means the queues holding the waiting threads must be stored inside at least one of the monitors that is acquired, since these monitors are the only objects that have sufficient lifetime and are available on both sides of the \code{waitfor} statement. This requires an algorithm to choose which monitor holds the relevant queue. It is also important that said algorithm be independent of the order in which users list parameters. The proposed algorithm is to fall back on monitor lock ordering (sorting by address) and specify that the monitor that is acquired first is the one with the relevant waiting queue. This assumes that the lock acquiring order is static for the lifetime of all concerned objects, but that is a reasonable constraint.
+
+This algorithm choice has two consequences:
+\begin{itemize}
+	\item The queue of the monitor with the lowest address is no longer a true FIFO queue because threads can be moved to the front of the queue. These queues need to contain a set of monitors for each of the waiting threads. Therefore, another thread whose set contains the same lowest-address monitor but different remaining monitors may arrive first yet enter the critical section after a thread with the correct pairing.
+	\item The queue of the lowest-address monitor is both required and potentially unused. Indeed, since it is not known at compile time which monitor has the lowest address, every monitor needs to have the correct queues even though some queues may go unused for the entire duration of the program, for example if a monitor is only ever used in a specific pair.
+\end{itemize}
+Therefore, the following modifications need to be made to support external scheduling:
+\begin{itemize}
+	\item The threads waiting on the entry queue need to keep track of which routine they are trying to enter and with which set of monitors. The \code{mutex} routine already has all the required information on its stack, so the thread only needs to keep a pointer to that information.
+	\item The monitors need to keep a mask of acceptable routines. This mask contains, for each acceptable routine, a routine pointer and an array of monitors to go with it. It also needs storage to keep track of which routine was accepted. Since this information is not specific to any one monitor, the monitors actually contain a pointer to an integer on the stack of the waiting thread. Note that if a thread has acquired two monitors but executes a \code{waitfor} with only one monitor as a parameter, setting the mask of acceptable routines on both monitors causes no problems, since the extra monitor does not change ownership regardless. This becomes relevant when \code{when} clauses affect the number of monitors passed to a \code{waitfor} statement.
+	\item The entry/exit routines need to be updated as shown in listing \ref{lst:entry3}.
+\end{itemize}
+
+\subsection{External Scheduling - Destructors}
+Finally, to support the ordering inversion of destructors, the code generation needs to be modified to use a special entry routine. This routine is needed because of the storage requirements of the call-order inversion: when waiting for the destructors, storage is needed for the waiting context, and the lifetime of that storage must outlive the waiting operation it serves. For regular \code{waitfor} statements, the call stack of the routine itself satisfies this requirement, but that is no longer the case when waiting for the destructor, since the context is pushed onto the AS-stack for later. The \code{waitfor} semantics can then be adjusted correspondingly, as seen in listing \ref{lst:entry-dtor}.
+
+\begin{figure}
+\begin{multicols}{2}
+Entry
+\begin{pseudo}
+if monitor is free
+	enter
+elif already own the monitor
+	continue
+elif matches waitfor mask
+	push criteria to AS-stack
+	continue
+else
+	block
+increment recursion
+\end{pseudo}
+\columnbreak
+Exit
+\begin{pseudo}
+decrement recursion
+if recursion == 0
+	if signal_stack not empty
+		set_owner to thread
+		if all monitors ready
+			wake-up thread
+		endif
+	endif
+
+	if entry queue not empty
+		wake-up thread
+	endif
+\end{pseudo}
+\end{multicols}
+\begin{pseudo}[caption={Entry and exit routine for monitors with internal scheduling and external scheduling},label={lst:entry3}]
+\end{pseudo}
+\end{figure}
+
+\begin{figure}
+\begin{multicols}{2}
+Destructor Entry
+\begin{pseudo}
+if monitor is free
+	enter
+elif already own the monitor
+	increment recursion
+	return
+create wait context
+if matches waitfor mask
+	reset mask
+	push self to AS-stack
+	baton pass
+else
+	wait
+increment recursion
+\end{pseudo}
+\columnbreak
+Waitfor
+\begin{pseudo}
+if matching thread is already there
+	if found destructor
+		push destructor to AS-stack
+		unlock all monitors
+	else
+		push self to AS-stack
+		baton pass
+	endif
+	return
+endif
+if non-blocking
+	Unlock all monitors
+	Return
+endif
+
+push self to AS-stack
+set waitfor mask
+block
+return
+\end{pseudo}
+\end{multicols}
+\begin{pseudo}[caption={Pseudo code for the \code{waitfor} routine and the \code{mutex} entry routine for destructors},label={lst:entry-dtor}]
+\end{pseudo}
+\end{figure}
+
+
+% ======================================================================
+% ======================================================================
+\section{Putting It All Together}
+% ======================================================================
+% ======================================================================
+
+
+\section{Threads As Monitors}
+As subtly alluded to in section \ref{threads}, \code{thread}s in \CFA are in fact monitors, which means all monitor features are available when using threads. For example, here is a very simple two-thread pipeline that could be used for a simulator of a game engine:
+\begin{figure}[H]
+\begin{cfacode}[caption={Toy simulator using \code{thread}s and \code{monitor}s.},label={lst:engine-v1}]
+// Visualization declaration
+thread Renderer {} renderer;
+void render( Renderer & this );
+
+// Simulation declaration
+thread Simulator{} simulator;
+Frame * simulate( Simulator & this );
+
+// Blocking call used as communication
+void draw( Renderer & mutex this, Frame * frame );
+
+// Simulation loop
+void main( Simulator & this ) {
+	while( true ) {
+		Frame * frame = simulate( this );
+		draw( renderer, frame );
+	}
+}
+
+// Rendering loop
+void main( Renderer & this ) {
+	while( true ) {
+		waitfor( draw, this );
+		render( this );
+	}
+}
+\end{cfacode}
+\end{figure}
+One obvious complaint about the previous code snippet (other than its toy-like simplicity) is that it does not handle exit conditions and just runs forever. Luckily, the monitor semantics can also be used to enforce a shutdown order clearly and concisely:
+\begin{figure}[H]
+\begin{cfacode}[caption={Same toy simulator with proper termination condition.},label={lst:engine-v2}]
+// Visualization declaration
+thread Renderer {} renderer;
+void render( Renderer & this );
+
+// Simulation declaration
+thread Simulator{} simulator;
+Frame * simulate( Simulator & this );
+
+// Blocking call used as communication
+void draw( Renderer & mutex this, Frame * frame );
+
+// Simulation loop
+void main( Simulator & this ) {
+	while( true ) {
+		Frame * frame = simulate( this );
+		draw( renderer, frame );
+
+		// Exit main loop after the last frame
+		if( frame->is_last ) break;
+	}
+}
+
+// Rendering loop
+void main( Renderer & this ) {
+	while( true ) {
+		   waitfor( draw, this );
+		or waitfor( ^?{}, this ) {
+			// Add an exit condition
+			break;
+		}
+
+		render( this );
+	}
+}
+
+// Call destructor for simulator once simulator finishes
+// Call destructor for renderer to signify shutdown
+\end{cfacode}
+\end{figure}
+
+\section{Fibers \& Threads}
+As mentioned in section \ref{preemption}, \CFA uses preemptive threads by default but can use fibers on demand. Currently, fibers are enabled by adding the following code to the program:
+\begin{cfacode}
+unsigned int default_preemption() {
+	return 0;
+}
+\end{cfacode}
+This function is called by the kernel to fetch the default preemption rate, where 0 signifies an infinite time-slice, i.e., no preemption. However, once clusters are fully implemented, it will be possible to create fibers and \textbf{uthread}s in the same system, as in listing \ref{lst:fiber-uthread}.
+\begin{figure}
+\begin{cfacode}[caption={Using fibers and \textbf{uthread} side-by-side in \CFA},label={lst:fiber-uthread}]
+//Cluster forward declaration
+struct cluster;
+
+//Processor forward declaration
+struct processor;
+
+//Construct clusters with a preemption rate
+void ?{}(cluster& this, unsigned int rate);
+//Construct processor and add it to cluster
+void ?{}(processor& this, cluster& cluster);
+//Construct thread and schedule it on cluster
+void ?{}(thread& this, cluster& cluster);
+
+//Declare two clusters
+cluster thread_cluster = { 10`ms };			//Preempt every 10 ms
+cluster fibers_cluster = { 0 };				//Never preempt
+
+//Construct 4 processors
+processor processors[4] = {
+	//2 for the thread cluster
+	thread_cluster,
+	thread_cluster,
+	//2 for the fibers cluster
+	fibers_cluster,
+	fibers_cluster
+};
+
+//Declares thread
+thread UThread {};
+void ?{}(UThread& this) {
+	//Construct underlying thread to automatically
+	//be scheduled on the thread cluster
+	(this){ thread_cluster };
+}
+
+void main(UThread & this);
+
+//Declares fibers
+thread Fiber {};
+void ?{}(Fiber& this) {
+	//Construct underlying thread to automatically
+	//be scheduled on the fiber cluster
+	(this.__thread){ fibers_cluster };
+}
+
+void main(Fiber & this);
+\end{cfacode}
+\end{figure}
+
+
+% ======================================================================
+% ======================================================================
+\section{Performance Results} \label{results}
+% ======================================================================
+% ======================================================================
+\section{Machine Setup}
+Table \ref{tab:machine} shows the characteristics of the machine used to run the benchmarks. All tests were run on this machine.
+\begin{table}[H]
+\begin{center}
+\begin{tabular}{| l | r | l | r |}
+\hline
+Architecture		& x86\_64 			& NUMA node(s) 	& 8 \\
+\hline
+CPU op-mode(s)		& 32-bit, 64-bit 		& Model name 	& AMD Opteron\texttrademark  Processor 6380 \\
+\hline
+Byte Order			& Little Endian 		& CPU Freq 		& 2.5\si{\giga\hertz} \\
+\hline
+CPU(s)			& 64 				& L1d cache 	& \SI{16}{\kibi\byte} \\
+\hline
+Thread(s) per core	& 2 				& L1i cache 	& \SI{64}{\kibi\byte} \\
+\hline
+Core(s) per socket	& 8 				& L2 cache 		& \SI{2048}{\kibi\byte} \\
+\hline
+Socket(s)			& 4 				& L3 cache 		& \SI{6144}{\kibi\byte} \\
+\hline
+\hline
+Operating system		& Ubuntu 16.04.3 LTS	& Kernel		& Linux 4.4-97-generic \\
+\hline
+Compiler			& GCC 6.3 		& Translator	& CFA 1 \\
+\hline
+Java version		& OpenJDK-9 		& Go version	& 1.9.2 \\
+\hline
+\end{tabular}
+\end{center}
+\caption{Machine setup used for the tests}
+\label{tab:machine}
+\end{table}
+
+\section{Micro Benchmarks}
+All benchmarks are run using the same harness to produce the results, seen as the \code{BENCH()} macro in the following examples. This macro uses the following logic to benchmark the code:
+\begin{pseudo}
+#define BENCH(run, result) \
+	before = gettime(); \
+	run; \
+	after  = gettime(); \
+	result = (after - before) / N;
+\end{pseudo}
+The method used to get time is \code{clock_gettime(CLOCK_THREAD_CPUTIME_ID)}. Each benchmark runs many iterations of a simple call and divides by the iteration count to obtain the per-call cost. The specific number of iterations depends on the benchmark.
+
+\subsection{Context-Switching}
+The first interesting benchmark measures how long context switches take. The simplest approach is to yield on a thread, which executes a 2-step context switch. Yielding causes the thread to context-switch to the scheduler and back, more precisely: from the \textbf{uthread} to the \textbf{kthread} and then from the \textbf{kthread} back to the same \textbf{uthread} (or a different one in the general case). To make the comparison fair, coroutines also execute a 2-step context switch by resuming another coroutine that does nothing but suspend in a tight loop, i.e., a resume/suspend cycle instead of a yield. Listing \ref{lst:ctx-switch} shows the code for coroutines and threads, with the results in table \ref{tab:ctx-switch}. All omitted tests are functionally identical to one of these tests. The difference between coroutines and threads can be attributed to the cost of scheduling.
+\begin{figure}
+\begin{multicols}{2}
+\CFA Coroutines
+\begin{cfacode}
+coroutine GreatSuspender {};
+void main(GreatSuspender& this) {
+	while(true) { suspend(); }
+}
+int main() {
+	GreatSuspender s;
+	resume(s);
+	BENCH(
+		for(size_t i=0; i<n; i++) {
+			resume(s);
+		},
+		result
+	)
+	printf("%llu\n", result);
+}
+\end{cfacode}
+\columnbreak
+\CFA Threads
+\begin{cfacode}
+
+
+
+
+int main() {
+
+
+	BENCH(
+		for(size_t i=0; i<n; i++) {
+			yield();
+		},
+		result
+	)
+	printf("%llu\n", result);
+}
+\end{cfacode}
+\end{multicols}
+\begin{cfacode}[caption={\CFA benchmark code used to measure context-switches for coroutines and threads.},label={lst:ctx-switch}]
+\end{cfacode}
+\end{figure}
+
+\begin{table}
+\begin{center}
+\begin{tabular}{| l | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] |}
+\cline{2-4}
+\multicolumn{1}{c |}{} & \multicolumn{1}{c |}{ Median } &\multicolumn{1}{c |}{ Average } & \multicolumn{1}{c |}{ Standard Deviation} \\
+\hline
+Kernel Thread	& 241.5	& 243.86	& 5.08 \\
+\CFA Coroutine	& 38		& 38		& 0    \\
+\CFA Thread		& 103		& 102.96	& 2.96 \\
+\uC Coroutine	& 46		& 45.86	& 0.35 \\
+\uC Thread		& 98		& 99.11	& 1.42 \\
+Goroutine		& 150		& 149.96	& 3.16 \\
+Java Thread		& 289		& 290.68	& 8.72 \\
+\hline
+\end{tabular}
+\end{center}
+\caption{Context switch comparison. All numbers are in nanoseconds (\si{\nano\second}).}
+\label{tab:ctx-switch}
+\end{table}
+
+\subsection{Mutual-Exclusion}
+The next interesting benchmark measures the overhead to enter/leave a critical section. For monitors, the simplest approach is to measure how long it takes to enter and leave a monitor routine. Listing \ref{lst:mutex} shows the code for \CFA. To put the results in context, the cost of entering a non-inline function and the cost of acquiring and releasing a \code{pthread_mutex} lock are also measured. The results are shown in table \ref{tab:mutex}.
+
+\begin{figure}
+\begin{cfacode}[caption={\CFA benchmark code used to measure mutex routines.},label={lst:mutex}]
+monitor M {};
+void __attribute__((noinline)) call( M & mutex m /*, m2, m3, m4*/ ) {}
+
+int main() {
+	M m/*, m2, m3, m4*/;
+	BENCH(
+		for(size_t i=0; i<n; i++) {
+			call(m/*, m2, m3, m4*/);
+		},
+		result
+	)
+	printf("%llu\n", result);
+}
+\end{cfacode}
+\end{figure}
+
+\begin{table}
+\begin{center}
+\begin{tabular}{| l | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] |}
+\cline{2-4}
+\multicolumn{1}{c |}{} & \multicolumn{1}{c |}{ Median } &\multicolumn{1}{c |}{ Average } & \multicolumn{1}{c |}{ Standard Deviation} \\
+\hline
+C routine						& 2		& 2		& 0    \\
+FetchAdd + FetchSub				& 26		& 26		& 0    \\
+Pthreads Mutex Lock				& 31		& 31.86	& 0.99 \\
+\uC \code{monitor} member routine		& 30		& 30		& 0    \\
+\CFA \code{mutex} routine, 1 argument	& 41		& 41.57	& 0.9  \\
+\CFA \code{mutex} routine, 2 arguments	& 76		& 76.96	& 1.57 \\
+\CFA \code{mutex} routine, 4 arguments	& 145		& 146.68	& 3.85 \\
+Java synchronized routine			& 27		& 28.57	& 2.6  \\
+\hline
+\end{tabular}
+\end{center}
+\caption{Mutex routine comparison. All numbers are in nanoseconds (\si{\nano\second}).}
+\label{tab:mutex}
+\end{table}
+
+\subsection{Internal Scheduling}
+The internal-scheduling benchmark measures the cost of waiting on and signalling a condition variable. Listing \ref{lst:int-sched} shows the code for \CFA, with results in table \ref{tab:int-sched}. As with all other benchmarks, all omitted tests are functionally identical to one of these tests.
+
+\begin{figure}
+\begin{cfacode}[caption={Benchmark code for internal scheduling},label={lst:int-sched}]
+volatile int go = 0;
+condition c;
+monitor M {};
+M m1;
+
+void __attribute__((noinline)) do_call( M & mutex a1 ) { signal(c); }
+
+thread T {};
+void ^?{}( T & mutex this ) {}
+void main( T & this ) {
+	while(go == 0) { yield(); }
+	while(go == 1) { do_call(m1); }
+}
+int  __attribute__((noinline)) do_wait( M & mutex a1 ) {
+	go = 1;
+	BENCH(
+		for(size_t i=0; i<n; i++) {
+			wait(c);
+		},
+		result
+	)
+	printf("%llu\n", result);
+	go = 0;
+	return 0;
+}
+int main() {
+	T t;
+	return do_wait(m1);
+}
+\end{cfacode}
+\end{figure}
+
+\begin{table}
+\begin{center}
+\begin{tabular}{| l | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] |}
+\cline{2-4}
+\multicolumn{1}{c |}{} & \multicolumn{1}{c |}{ Median } &\multicolumn{1}{c |}{ Average } & \multicolumn{1}{c |}{ Standard Deviation} \\
+\hline
+Pthreads Condition Variable			& 5902.5	& 6093.29 	& 714.78 \\
+\uC \code{signal}					& 322		& 323 	& 3.36   \\
+\CFA \code{signal}, 1 \code{monitor}	& 352.5	& 353.11	& 3.66   \\
+\CFA \code{signal}, 2 \code{monitor}s	& 430		& 430.29	& 8.97   \\
+\CFA \code{signal}, 4 \code{monitor}s	& 594.5	& 606.57	& 18.33  \\
+Java \code{notify}				& 13831.5	& 15698.21	& 4782.3 \\
+\hline
+\end{tabular}
+\end{center}
+\caption{Internal scheduling comparison. All numbers are in nanoseconds (\si{\nano\second}).}
+\label{tab:int-sched}
+\end{table}
+
+\subsection{External Scheduling}
+The external-scheduling benchmark measures the cost of the \code{waitfor} statement (\code{_Accept} in \uC). Listing \ref{lst:ext-sched} shows the code for \CFA, with results in table \ref{tab:ext-sched}. As with all other benchmarks, all omitted tests are functionally identical to one of these tests.
+
+\begin{figure}
+\begin{cfacode}[caption={Benchmark code for external scheduling},label={lst:ext-sched}]
+volatile int go = 0;
+monitor M {};
+M m1;
+thread T {};
+
+void __attribute__((noinline)) do_call( M & mutex a1 ) {}
+
+void ^?{}( T & mutex this ) {}
+void main( T & this ) {
+	while(go == 0) { yield(); }
+	while(go == 1) { do_call(m1); }
+}
+int  __attribute__((noinline)) do_wait( M & mutex a1 ) {
+	go = 1;
+	BENCH(
+		for(size_t i=0; i<n; i++) {
+			waitfor(do_call, a1);
+		},
+		result
+	)
+	printf("%llu\n", result);
+	go = 0;
+	return 0;
+}
+int main() {
+	T t;
+	return do_wait(m1);
+}
+\end{cfacode}
+\end{figure}
+
+\begin{table}
+\begin{center}
+\begin{tabular}{| l | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] |}
+\cline{2-4}
+\multicolumn{1}{c |}{} & \multicolumn{1}{c |}{ Median } &\multicolumn{1}{c |}{ Average } & \multicolumn{1}{c |}{ Standard Deviation} \\
+\hline
+\uC \code{Accept}					& 350		& 350.61	& 3.11  \\
+\CFA \code{waitfor}, 1 \code{monitor}	& 358.5	& 358.36	& 3.82  \\
+\CFA \code{waitfor}, 2 \code{monitor}s	& 422		& 426.79	& 7.95  \\
+\CFA \code{waitfor}, 4 \code{monitor}s	& 579.5	& 585.46	& 11.25 \\
+\hline
+\end{tabular}
+\end{center}
+\caption{External scheduling comparison. All numbers are in nanoseconds (\si{\nano\second}).}
+\label{tab:ext-sched}
+\end{table}
+
+\subsection{Object Creation}
+Finally, the last benchmark measures the cost of creating concurrent objects. Listing \ref{lst:creation} shows the code for \texttt{pthread}s and \CFA threads, with results shown in table \ref{tab:creation}. As with all other benchmarks, all omitted tests are functionally identical to one of these tests. The only note here is that the call stacks of \CFA coroutines are lazily created; therefore, without priming the coroutine, the creation cost is very low.
+
+\begin{figure}
+\begin{center}
+\texttt{pthread}
+\begin{ccode}
+int main() {
+	BENCH(
+		for(size_t i=0; i<n; i++) {
+			pthread_t thread;
+			if(pthread_create(&thread,NULL,foo,NULL)<0) {
+				perror( "failure" );
+				return 1;
+			}
+
+			if(pthread_join(thread, NULL)<0) {
+				perror( "failure" );
+				return 1;
+			}
+		},
+		result
+	)
+	printf("%llu\n", result);
+}
+\end{ccode}
+
+
+
+\CFA Threads
+\begin{cfacode}
+int main() {
+	BENCH(
+		for(size_t i=0; i<n; i++) {
+			MyThread m;
+		},
+		result
+	)
+	printf("%llu\n", result);
+}
+\end{cfacode}
+\end{center}
+\begin{cfacode}[caption={Benchmark code for \texttt{pthread}s and \CFA to measure object creation},label={lst:creation}]
+\end{cfacode}
+\end{figure}
+
+\begin{table}
+\begin{center}
+\begin{tabular}{| l | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] | S[table-format=5.2,table-number-alignment=right] |}
+\cline{2-4}
+\multicolumn{1}{c |}{} & \multicolumn{1}{c |}{ Median } &\multicolumn{1}{c |}{ Average } & \multicolumn{1}{c |}{ Standard Deviation} \\
+\hline
+Pthreads			& 26996	& 26984.71	& 156.6  \\
+\CFA Coroutine Lazy	& 6		& 5.71	& 0.45   \\
+\CFA Coroutine Eager	& 708		& 706.68	& 4.82   \\
+\CFA Thread			& 1173.5	& 1176.18	& 15.18  \\
+\uC Coroutine		& 109		& 107.46	& 1.74   \\
+\uC Thread			& 526		& 530.89	& 9.73   \\
+Goroutine			& 2520.5	& 2530.93	& 61.56  \\
+Java Thread			& 91114.5	& 92272.79	& 961.58 \\
+\hline
+\end{tabular}
+\end{center}
+\caption{Creation comparison. All numbers are in nanoseconds (\si{\nano\second}).}
+\label{tab:creation}
+\end{table}
+
+
+
+\section{Conclusion}
+This thesis has achieved a minimal concurrency \textbf{api} that is simple, efficient and usable as the basis for higher-level features. The approach presented is based on a lightweight thread system for parallelism, which sits on top of clusters of processors. This M:N model is judged to be both more efficient and more flexible for users. Furthermore, this document introduces monitors as the main concurrency tool for users. This thesis also offers a novel approach allowing multiple monitors to be accessed simultaneously without running into the Nested Monitor Problem~\cite{Lister77}. It also offers a full implementation of the concurrency runtime written entirely in \CFA, effectively the largest \CFA code base to date.
+
+
+% ======================================================================
+% ======================================================================
+\section{Future Work}
+% ======================================================================
+% ======================================================================
+
+\subsection{Performance} \label{futur:perf}
+This thesis presents a first implementation of the \CFA concurrency runtime. Therefore, there is still significant work to improve performance. Many of the data structures and algorithms may change in the future to more efficient versions. For example, the number of monitors in a single \textbf{bulk-acq} is bound only by the stack size, which is probably unnecessarily generous; limiting the number might improve performance. However, it is not obvious that the benefit would be significant.
+
+\subsection{Flexible Scheduling} \label{futur:sched}
+An important part of concurrency is scheduling. Different scheduling algorithms can affect performance (both in terms of average and variation). However, no single scheduler is optimal for all workloads and therefore there is value in being able to change the scheduler for given programs. One solution is to offer various tweaking options to users, allowing the scheduler to be adjusted to the requirements of the workload. However, in order to be truly flexible, it would be interesting to allow users to add arbitrary data and arbitrary scheduling algorithms. For example, a web server could attach Type-of-Service information to threads and have a ``ToS aware'' scheduling algorithm tailored to this specific web server. This path of flexible schedulers will be explored for \CFA.
+
+\subsection{Non-Blocking I/O} \label{futur:nbio}
+While most of the parallelism tools are aimed at data parallelism and control-flow parallelism, many modern workloads are bound not on computation but on I/O operations, common cases being web servers and XaaS (anything as a service). These types of workloads often require significant engineering to amortize the costs of blocking I/O operations. At its core, non-blocking I/O is an operating-system-level feature that allows queuing I/O operations (e.g., network operations) and registering for notifications instead of waiting for requests to complete. In this context, the role of the language is to make non-blocking I/O easily available with low overhead. The current trend is to write asynchronous programs using tools like callbacks and/or futures and promises, which can be seen in frameworks like Node.js~\cite{NodeJs} for JavaScript, Spring MVC~\cite{SpringMVC} for Java and Django~\cite{Django} for Python. However, while these are valid solutions, they lead to code that is harder to read and maintain because it is much less linear.
+
+\subsection{Other Concurrency Tools} \label{futur:tools}
+While monitors offer a flexible and powerful concurrent core for \CFA, other concurrency tools are also necessary for a complete multi-paradigm concurrency package. Examples of such tools can include simple locks and condition variables, futures and promises~\cite{promises}, executors and actors. These additional features are useful when monitors offer a level of abstraction that is inadequate for certain tasks.
+
+\subsection{Implicit Threading} \label{futur:implcit}
+Simpler applications can benefit greatly from implicit parallelism, that is, parallelism that does not rely on the user to write concurrency. This type of parallelism can be achieved both at the language level and at the library level. The canonical example of implicit parallelism is the parallel for loop, the simplest example of a divide-and-conquer algorithm~\cite{uC++book}. Table \ref{lst:parfor} shows three different code examples that accomplish point-wise sums of large arrays. Note that none of these examples explicitly declare any concurrency or parallelism objects.
+
+\begin{table}
+\begin{center}
+\begin{tabular}[t]{|c|c|c|}
+Sequential & Library Parallel & Language Parallel \\
+\begin{cfacode}[tabsize=3]
+void big_sum(
+	int* a, int* b,
+	int* o,
+	size_t len)
+{
+	for(
+		int i = 0;
+		i < len;
+		++i )
+	{
+		o[i]=a[i]+b[i];
+	}
+}
+
+
+
+
+
+int* a[10000];
+int* b[10000];
+int* c[10000];
+//... fill in a & b
+big_sum(a,b,c,10000);
+\end{cfacode} &\begin{cfacode}[tabsize=3]
+void big_sum(
+	int* a, int* b,
+	int* o,
+	size_t len)
+{
+	range ar(a, a+len);
+	range br(b, b+len);
+	range or(o, o+len);
+	parfor( ai, bi, oi,
+	[](	int* ai,
+		int* bi,
+		int* oi)
+	{
+		oi=ai+bi;
+	});
+}
+
+
+int* a[10000];
+int* b[10000];
+int* c[10000];
+//... fill in a & b
+big_sum(a,b,c,10000);
+\end{cfacode}&\begin{cfacode}[tabsize=3]
+void big_sum(
+	int* a, int* b,
+	int* o,
+	size_t len)
+{
+	parfor (ai,bi,oi)
+	    in (a, b, o )
+	{
+		oi = ai + bi;
+	}
+}
+
+
+
+
+
+
+
+int* a[10000];
+int* b[10000];
+int* c[10000];
+//... fill in a & b
+big_sum(a,b,c,10000);
+\end{cfacode}
+\end{tabular}
+\end{center}
+\caption{For loop to sum numbers: Sequential, using library parallelism and language parallelism.}
+\label{lst:parfor}
+\end{table}
+
+Implicit parallelism is a restrictive solution and therefore has its limitations. However, it is a quick and simple approach to parallelism, which may very well be sufficient for smaller applications and reduces the amount of boilerplate needed to start benefiting from parallelism in modern CPUs.
+
+
+% A C K N O W L E D G E M E N T S
+% -------------------------------
+\section{Acknowledgements}
+
+I would like to thank my supervisor, Professor Peter Buhr, for his guidance through my degree as well as the editing of this document.
+
+I would like to thank Professors Martin Karsten and Gregor Richards, for reading my thesis and providing helpful feedback.
+
+Thanks to Aaron Moss, Rob Schluntz and Andrew Beach for their work on the \CFA project as well as all the discussions which have helped me concretize the ideas in this thesis.
+
+Finally, I acknowledge that this has been possible thanks to the financial help offered by the David R. Cheriton School of Computer Science and the corporate partnership with Huawei Ltd.
+
+
+% B I B L I O G R A P H Y
+% -----------------------------
+\bibliographystyle{plain}
+\bibliography{pl,local}
+
+\end{document}
Index: doc/papers/concurrency/annex/local.bib
===================================================================
--- doc/papers/concurrency/annex/local.bib	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/concurrency/annex/local.bib	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,150 @@
+%    Predefined journal names:
+%  acmcs: Computing Surveys		acta: Acta Infomatica
+%  cacm: Communications of the ACM
+%  ibmjrd: IBM J. Research & Development ibmsj: IBM Systems Journal
+%  ieeese: IEEE Trans. on Soft. Eng.	ieeetc: IEEE Trans. on Computers
+%  ieeetcad: IEEE Trans. on Computer-Aided Design of Integrated Circuits
+%  ipl: Information Processing Letters	jacm: Journal of the ACM
+%  jcss: J. Computer & System Sciences	scp: Science of Comp. Programming
+%  sicomp: SIAM J. on Computing		tocs: ACM Trans. on Comp. Systems
+%  tods: ACM Trans. on Database Sys.	tog: ACM Trans. on Graphics
+%  toms: ACM Trans. on Math. Software	toois: ACM Trans. on Office Info. Sys.
+%  toplas: ACM Trans. on Prog. Lang. & Sys.
+%  tcs: Theoretical Computer Science
+@string{ieeepds="IEEE Transactions on Parallel and Distributed Systems"}
+@string{ieeese="IEEE Transactions on Software Engineering"}
+@string{spe="Software---\-Practice and Experience"}
+@string{sigplan="SIGPLAN Notices"}
+@string{joop="Journal of Object-Oriented Programming"}
+@string{popl="Conference Record of the ACM Symposium on Principles of Programming Languages"}
+@string{osr="Operating Systems Review"}
+@string{pldi="Programming Language Design and Implementation"}
+
+
+@article{HPP:Study,
+	keywords 	= {Parallel, Productivity},
+	author 	= {Lorin Hochstein and Jeff Carver and Forrest Shull and Sima Asgari and Victor Basili and Jeffrey K. Hollingsworth and Marvin V. Zelkowitz },
+	title 	= {Parallel Programmer Productivity: A Case Study of Novice Parallel Programmers},
+}
+
+@article{Chicken,
+	keywords	= {Chicken},
+	author	= {Doug Zongker},
+	title		= {Chicken Chicken Chicken: Chicken Chicken},
+	year		= 2006
+}
+
+@article{TBB,
+	key	= {TBB},
+	keywords 	= {Intel, TBB},
+	title 	= {Intel Thread Building Blocks},
+	note		= "\url{https://www.threadingbuildingblocks.org/}"
+}
+
+@manual{www-cfa,
+	key	= {CFA},
+	keywords 	= {Cforall},
+	author	= {C$\forall$},
+	title 	= {C$\forall$ Programming Language},
+	note	= {\url{https://plg.uwaterloo.ca/~cforall}},
+}
+
+@mastersthesis{rob-thesis,
+	keywords 	= {Constructors, Destructors, Tuples},
+	author	= {Rob Schluntz},
+	title 	= {Resource Management and Tuples in Cforall},
+	year		= 2017,
+	school	= {University of Waterloo},
+	note	= {\url{https://uwspace.uwaterloo.ca/handle/10012/11830}},
+}
+
+@manual{Cpp-Transactions,
+	keywords	= {C++, Transactional Memory},
+	title		= {Technical Specification for C++ Extensions for Transactional Memory},
+	organization= {International Standard ISO/IEC TS 19841:2015 },
+	publisher   = {American National Standards Institute},
+	address	= {http://www.iso.org},
+	year		= 2015,
+}
+
+@article{BankTransfer,
+	key	= {Bank Transfer},
+	keywords 	= {Bank Transfer},
+	title 	= {Bank Account Transfer Problem},
+	publisher	= {Wiki Wiki Web},
+	address	= {http://wiki.c2.com},
+	year		= 2010
+}
+
+@misc{2FTwoHardThings,
+	keywords 	= {Hard Problem},
+	title 	= {TwoHardThings},
+	author	= {Martin Fowler},
+	howpublished= "\url{https://martinfowler.com/bliki/TwoHardThings.html}",
+	year		= 2009
+}
+
+@article{IntrusiveData,
+	title		= {Intrusive Data Structures},
+	author	= {Jiri Soukup},
+	journal	= {CppReport},
+	year		= 1998,
+	month		= may,
+	volume	= {10}, number = {5},
+	pages		= {22}
+}
+
+@article{Herlihy93,
+	author	= {Herlihy, Maurice and Moss, J. Eliot B.},
+	title	= {Transactional memory: architectural support for lock-free data structures},
+	journal	= {SIGARCH Comput. Archit. News},
+	issue_date	= {May 1993},
+	volume	= {21},
+	number	= {2},
+	month	= may,
+	year	= {1993},
+	pages	= {289--300},
+	numpages	= {12},
+	publisher	= {ACM},
+	address	= {New York, NY, USA},
+}
+
+@manual{affinityLinux,
+	key	= {affinity},
+	title		= "{Linux man page - sched\_setaffinity(2)}"
+}
+
+@manual{affinityWindows,
+	title		= "{Windows (vs.85) - SetThreadAffinityMask function}"
+}
+
+@manual{switchToWindows,
+	title		= "{Windows (vs.85) - SwitchToFiber function}"
+}
+
+@manual{affinityFreebsd,
+	title		= "{FreeBSD General Commands Manual - CPUSET(1)}"
+}
+
+@manual{affinityNetbsd,
+	title		= "{NetBSD Library Functions Manual - AFFINITY(3)}"
+}
+
+@manual{affinityMacosx,
+	title		= "{Affinity API Release Notes for OS X v10.5}"
+}
+
+@misc{NodeJs,
+	title		= "{Node.js}",
+	howpublished= "\url{https://nodejs.org/en/}",
+}
+
+@misc{SpringMVC,
+	title		= "{Spring Web MVC}",
+	howpublished= "\url{https://docs.spring.io/spring/docs/current/spring-framework-reference/web.html}",
+}
+
+@misc{Django,
+	title		= "{Django}",
+	howpublished= "\url{https://www.djangoproject.com/}",
+}
Index: doc/papers/concurrency/build/bump_ver.sh
===================================================================
--- doc/papers/concurrency/build/bump_ver.sh	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/concurrency/build/bump_ver.sh	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,7 @@
+#!/bin/bash
+if [ ! -f version ]; then
+    echo "0.0.0" > version
+fi
+
+# Bump the patch number in place: GNU sed's 'e' flag executes the generated echo command.
+sed -r 's/([0-9]+\.[0-9]+\.)([0-9]+)/echo "\1\$((\2+1))" > version/ge' version > /dev/null
Index: doc/papers/concurrency/figures/dependency.fig
===================================================================
--- doc/papers/concurrency/figures/dependency.fig	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/concurrency/figures/dependency.fig	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,119 @@
+#FIG 3.2  Produced by xfig version 3.2.5c
+Landscape
+Center
+Inches
+Letter  
+100.00
+Single
+-2
+1200 2
+6 750 2250 2250 2850
+1 4 0 1 0 7 50 -1 -1 0.000 1 0.0000 1050 2550 300 300 750 2550 1350 2550
+4 0 0 50 -1 0 20 0.0000 2 315 1305 900 2700 $\\alpha$3\001
+-6
+6 750 1350 2250 1950
+1 4 0 1 0 7 50 -1 -1 0.000 1 0.0000 1050 1650 300 300 750 1650 1350 1650
+4 0 0 50 -1 0 20 0.0000 2 315 1305 900 1800 $\\alpha$2\001
+-6
+6 750 450 2250 1050
+1 4 0 1 0 7 50 -1 -1 0.000 1 0.0000 1050 750 300 300 750 750 1350 750
+4 0 0 50 -1 0 20 0.0000 2 315 1305 900 900 $\\alpha$1\001
+-6
+6 750 3150 2250 3750
+1 4 0 1 0 7 50 -1 -1 0.000 1 0.0000 1050 3450 300 300 750 3450 1350 3450
+4 0 0 50 -1 0 20 0.0000 2 315 1305 900 3600 $\\alpha$4\001
+-6
+6 750 4050 2250 4650
+1 4 0 1 0 7 50 -1 -1 0.000 1 0.0000 1050 4350 300 300 750 4350 1350 4350
+4 0 0 50 -1 0 20 0.0000 2 315 1305 900 4500 $\\alpha$5\001
+-6
+6 3000 1350 4800 1950
+1 4 0 1 0 7 50 -1 -1 0.000 1 0.0000 3300 1650 300 300 3000 1650 3600 1650
+4 0 0 50 -1 0 20 0.0000 2 315 1560 3150 1800 $\\gamma$2\001
+-6
+6 3000 450 4800 1050
+1 4 0 1 0 7 50 -1 -1 0.000 1 0.0000 3300 750 300 300 3000 750 3600 750
+4 0 0 50 -1 0 20 0.0000 2 315 1560 3150 900 $\\gamma$1\001
+-6
+6 3000 2250 4800 2850
+6 3000 2250 3600 2850
+6 3000 2250 3600 2850
+1 4 0 1 0 7 50 -1 -1 0.000 1 0.0000 3300 2550 300 300 3000 2550 3600 2550
+-6
+-6
+4 0 0 50 -1 0 20 0.0000 2 315 1560 3150 2700 $\\gamma$3\001
+-6
+6 3000 3150 4800 3750
+1 4 0 1 0 7 50 -1 -1 0.000 1 0.0000 3300 3450 300 300 3000 3450 3600 3450
+4 0 0 50 -1 0 20 0.0000 2 315 1560 3150 3600 $\\gamma$4\001
+-6
+6 3000 4050 4800 4650
+1 4 0 1 0 7 50 -1 -1 0.000 1 0.0000 3300 4350 300 300 3000 4350 3600 4350
+4 0 0 50 -1 0 20 0.0000 2 315 1560 3150 4500 $\\gamma$5\001
+-6
+6 3000 4950 4800 5550
+1 4 0 1 0 7 50 -1 -1 0.000 1 0.0000 3300 5250 300 300 3000 5250 3600 5250
+4 0 0 50 -1 0 20 0.0000 2 315 1560 3150 5400 $\\gamma$6\001
+-6
+6 5400 1800 6750 4200
+6 5400 1800 6750 2400
+1 4 0 1 0 7 50 -1 -1 0.000 1 0.0000 5700 2100 300 300 5400 2100 6000 2100
+4 0 0 50 -1 0 20 0.0000 2 270 1140 5550 2250 $\\beta$1\001
+-6
+6 5400 2700 6750 3300
+1 4 0 1 0 7 50 -1 -1 0.000 1 0.0000 5700 3000 300 300 5400 3000 6000 3000
+4 0 0 50 -1 0 20 0.0000 2 270 1140 5550 3150 $\\beta$2\001
+-6
+6 5400 3600 6750 4200
+1 4 0 1 0 7 50 -1 -1 0.000 1 0.0000 5700 3900 300 300 5400 3900 6000 3900
+4 0 0 50 -1 0 20 0.0000 2 270 1140 5550 4050 $\\beta$3\001
+-6
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 0 2
+	1 1 1.00 60.00 120.00
+	 5700 2700 5700 2400
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 0 2
+	1 1 1.00 60.00 120.00
+	 5700 3600 5700 3300
+-6
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 0 2
+	1 1 1.00 60.00 120.00
+	 1050 1350 1050 1050
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 0 2
+	1 1 1.00 60.00 120.00
+	 3300 1350 3300 1050
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 0 2
+	1 1 1.00 60.00 120.00
+	 3300 2250 3300 1950
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 0 2
+	1 1 1.00 60.00 120.00
+	 1050 2250 1050 1950
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 0 2
+	1 1 1.00 60.00 120.00
+	 1050 3150 1050 2850
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 0 2
+	1 1 1.00 60.00 120.00
+	 3300 3150 3300 2850
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 0 2
+	1 1 1.00 60.00 120.00
+	 1050 4050 1050 3750
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 0 2
+	1 1 1.00 60.00 120.00
+	 3300 4050 3300 3750
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 0 2
+	1 1 1.00 60.00 120.00
+	 3300 4950 3300 4650
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 0 2
+	1 1 1.00 60.00 120.00
+	 1350 2550 3000 2550
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 1 2
+	1 1 1.00 60.00 120.00
+	 1350 3450 3000 3450
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 1 2
+	1 1 1.00 60.00 120.00
+	 3000 5175 1350 4500
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 1 2
+	1 1 1.00 60.00 120.00
+	 5462 4060 3582 5156
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 1 2
+	1 1 1.00 60.00 120.00
+	 3564 4198 5438 3144
Index: doc/papers/concurrency/figures/ext_monitor.fig
===================================================================
--- doc/papers/concurrency/figures/ext_monitor.fig	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/concurrency/figures/ext_monitor.fig	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,96 @@
+#FIG 3.2  Produced by xfig version 3.2.5c
+Landscape
+Center
+Inches
+Letter  
+100.00
+Single
+-2
+1200 2
+5 1 0 1 -1 -1 0 0 -1 0.000 0 1 0 0 3150.000 3450.000 3150 3150 2850 3450 3150 3750
+5 1 0 1 -1 -1 0 0 -1 0.000 0 1 0 0 3150.000 4350.000 3150 4050 2850 4350 3150 4650
+6 5850 1950 6150 2250
+1 3 0 1 -1 -1 0 0 -1 0.000 1 0.0000 6000 2100 105 105 6000 2100 6105 2205
+4 1 -1 0 0 0 10 0.0000 2 105 90 6000 2160 d\001
+-6
+6 5100 2100 5400 2400
+1 3 0 1 -1 -1 1 0 4 0.000 1 0.0000 5250 2250 105 105 5250 2250 5355 2250
+4 1 -1 0 0 0 10 0.0000 2 105 120 5250 2295 X\001
+-6
+6 5100 1800 5400 2100
+1 3 0 1 -1 -1 1 0 4 0.000 1 0.0000 5250 1950 105 105 5250 1950 5355 1950
+4 1 -1 0 0 0 10 0.0000 2 105 120 5250 2010 Y\001
+-6
+6 5850 1650 6150 1950
+1 3 0 1 -1 -1 0 0 -1 0.000 1 0.0000 6000 1800 105 105 6000 1800 6105 1905
+4 1 -1 0 0 0 10 0.0000 2 105 90 6000 1860 b\001
+-6
+6 3070 5445 7275 5655
+1 3 0 1 -1 -1 0 0 20 0.000 1 0.0000 3150 5550 80 80 3150 5550 3230 5630
+1 3 0 1 -1 -1 0 0 -1 0.000 1 0.0000 4500 5550 105 105 4500 5550 4605 5655
+1 3 0 1 -1 -1 0 0 4 0.000 1 0.0000 6000 5550 105 105 6000 5550 6105 5655
+4 0 -1 0 0 0 12 0.0000 2 135 1035 4725 5625 blocked task\001
+4 0 -1 0 0 0 12 0.0000 2 135 870 3300 5625 active task\001
+4 0 -1 0 0 0 12 0.0000 2 135 1050 6225 5625 routine mask\001
+-6
+1 3 0 1 -1 -1 0 0 -1 0.000 1 0.0000 3300 3600 105 105 3300 3600 3405 3705
+1 3 0 1 -1 -1 0 0 -1 0.000 1 0.0000 3600 3600 105 105 3600 3600 3705 3705
+1 3 0 1 -1 -1 0 0 -1 0.000 1 0.0000 6600 3900 105 105 6600 3900 6705 4005
+1 3 0 1 -1 -1 0 0 -1 0.000 1 0.0000 6900 3900 105 105 6900 3900 7005 4005
+1 3 0 1 -1 -1 0 0 -1 0.000 1 0.0000 6000 2700 105 105 6000 2700 6105 2805
+1 3 0 1 -1 -1 0 0 -1 0.000 1 0.0000 6000 2400 105 105 6000 2400 6105 2505
+1 3 0 1 -1 -1 0 0 20 0.000 1 0.0000 5100 4575 80 80 5100 4575 5180 4655
+2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+	 4050 2925 5475 2925 5475 3225 4050 3225 4050 2925
+2 1 0 1 -1 -1 0 0 -1 0.000 0 0 -1 0 0 4
+	 3150 3750 3750 3750 3750 4050 3150 4050
+2 1 0 1 -1 -1 0 0 -1 0.000 0 0 -1 0 0 3
+	 3150 3450 3750 3450 3900 3675
+2 1 0 1 -1 -1 0 0 -1 0.000 0 0 -1 0 0 2
+	 3750 3150 3600 3375
+2 1 0 1 -1 -1 0 0 -1 0.000 0 0 -1 0 0 3
+	 3150 4350 3750 4350 3900 4575
+2 1 0 1 -1 -1 0 0 -1 0.000 0 0 -1 0 0 2
+	 3750 4050 3600 4275
+2 1 0 1 -1 -1 0 0 -1 0.000 0 0 -1 0 0 4
+	 3150 4650 3750 4650 3750 4950 4950 4950
+2 1 0 1 -1 -1 0 0 -1 0.000 0 0 -1 0 0 2
+	 6450 3750 6300 3975
+2 1 0 1 -1 -1 0 0 -1 0.000 0 0 -1 0 0 2
+	 4950 4950 5175 5100
+2 1 0 1 -1 -1 0 0 -1 0.000 0 0 -1 0 0 9
+	 5250 4950 6450 4950 6450 4050 7050 4050 7050 3750 6450 3750
+	 6450 2850 6150 2850 6150 1650
+2 2 1 1 -1 -1 0 0 -1 4.000 0 0 0 0 0 5
+	 5850 4200 5850 3300 4350 3300 4350 4200 5850 4200
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+	1 1 1.00 60.00 120.00
+	7 1 1.00 60.00 120.00
+	 5250 3150 5250 2400
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+	 3150 3150 3750 3150 3750 2850 5700 2850 5700 1650
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
+	 5700 2850 6150 3000
+2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+	 5100 1800 5400 1800 5400 2400 5100 2400 5100 1800
+4 1 -1 0 0 0 10 0.0000 2 75 75 6000 2745 a\001
+4 1 -1 0 0 0 10 0.0000 2 75 75 6000 2445 c\001
+4 1 -1 0 0 0 12 0.0000 2 135 315 5100 5325 exit\001
+4 1 -1 0 0 0 12 0.0000 2 135 135 3300 3075 A\001
+4 1 -1 0 0 0 12 0.0000 2 135 795 3300 4875 condition\001
+4 1 -1 0 0 0 12 0.0000 2 135 135 3300 5100 B\001
+4 0 -1 0 0 0 12 0.0000 2 135 420 6600 3675 stack\001
+4 0 -1 0 0 0 12 0.0000 2 180 750 6600 3225 acceptor/\001
+4 0 -1 0 0 0 12 0.0000 2 180 750 6600 3450 signalled\001
+4 1 -1 0 0 0 12 0.0000 2 135 795 3300 2850 condition\001
+4 1 -1 0 0 0 12 0.0000 2 165 420 6000 1350 entry\001
+4 1 -1 0 0 0 12 0.0000 2 135 495 6000 1575 queue\001
+4 0 -1 0 0 0 12 0.0000 2 135 525 6300 2400 arrival\001
+4 0 -1 0 0 0 12 0.0000 2 135 630 6300 2175 order of\001
+4 1 -1 0 0 0 12 0.0000 2 135 525 5100 3675 shared\001
+4 1 -1 0 0 0 12 0.0000 2 135 735 5100 3975 variables\001
+4 0 0 50 -1 0 11 0.0000 2 165 855 4275 3150 Acceptables\001
+4 0 0 50 -1 0 11 0.0000 2 120 165 5775 2700 W\001
+4 0 0 50 -1 0 11 0.0000 2 120 135 5775 2400 X\001
+4 0 0 50 -1 0 11 0.0000 2 120 105 5775 2100 Z\001
+4 0 0 50 -1 0 11 0.0000 2 120 135 5775 1800 Y\001
Index: doc/papers/concurrency/figures/int_monitor.fig
===================================================================
--- doc/papers/concurrency/figures/int_monitor.fig	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/concurrency/figures/int_monitor.fig	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,109 @@
+#FIG 3.2  Produced by xfig version 3.2.5c
+Landscape
+Center
+Inches
+Letter  
+100.00
+Single
+-2
+1200 2
+5 1 0 1 0 7 50 -1 -1 0.000 0 1 0 0 675.000 2700.000 675 2400 375 2700 675 3000
+6 4533 2866 4655 3129
+5 1 0 1 0 7 50 -1 -1 0.000 0 1 0 0 4657.017 2997.000 4655 2873 4533 2997 4655 3121
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
+	 4655 2866 4655 3129
+-6
+6 4725 2866 4847 3129
+5 1 0 1 0 7 50 -1 -1 0.000 0 1 0 0 4849.017 2997.000 4847 2873 4725 2997 4847 3121
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
+	 4847 2866 4847 3129
+-6
+6 4911 2866 5033 3129
+5 1 0 1 0 7 50 -1 -1 0.000 0 1 0 0 5035.017 2997.000 5033 2873 4911 2997 5033 3121
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
+	 5033 2866 5033 3129
+-6
+6 9027 2866 9149 3129
+5 1 0 1 0 7 50 -1 -1 0.000 0 0 0 0 9024.983 2997.000 9027 2873 9149 2997 9027 3121
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
+	 9027 2866 9027 3129
+-6
+6 9253 2866 9375 3129
+5 1 0 1 0 7 50 -1 -1 0.000 0 0 0 0 9250.983 2997.000 9253 2873 9375 2997 9253 3121
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
+	 9253 2866 9253 3129
+-6
+6 9478 2866 9600 3129
+5 1 0 1 0 7 50 -1 -1 0.000 0 0 0 0 9475.983 2997.000 9478 2873 9600 2997 9478 3121
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
+	 9478 2866 9478 3129
+-6
+1 3 0 1 -1 -1 0 0 20 0.000 1 0.0000 7650 3675 80 80 7650 3675 7730 3755
+1 3 0 1 -1 -1 0 0 20 0.000 1 0.0000 3150 3675 80 80 3150 3675 3230 3755
+1 3 0 1 0 7 50 -1 -1 0.000 1 0.0000 4047 1793 125 125 4047 1793 3929 1752
+1 3 0 1 0 7 50 -1 -1 0.000 1 0.0000 4050 1500 125 125 4050 1500 3932 1459
+1 3 0 1 0 7 50 -1 -1 0.000 1 0.0000 8550 1500 125 125 8550 1500 8432 1459
+1 3 0 1 0 7 50 -1 -1 0.000 1 0.0000 8550 1800 125 125 8550 1800 8432 1759
+1 3 0 1 0 7 50 -1 -1 0.000 1 0.0000 1200 2850 125 125 1200 2850 1082 2809
+1 3 0 1 0 7 50 -1 -1 0.000 1 0.0000 900 2850 125 125 900 2850 782 2809
+1 3 0 1 -1 -1 0 0 -1 0.000 1 0.0000 6000 4650 105 105 6000 4650 6105 4755
+1 3 0 1 -1 -1 0 0 20 0.000 1 0.0000 3900 4650 80 80 3900 4650 3980 4730
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
+	 3900 1950 4200 2100
+2 1 0 1 -1 -1 0 0 -1 0.000 0 0 -1 0 0 5
+	 3000 4050 1800 4050 1800 1950 3900 1950 3900 1350
+2 1 0 1 -1 -1 0 0 -1 0.000 0 0 -1 0 0 9
+	 7800 4050 9000 4050 9000 3150 9600 3150 9600 2850 9000 2850
+	 9000 1950 8700 1950 8700 1350
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
+	 8400 1950 8700 2100
+2 1 0 1 -1 -1 0 0 -1 0.000 0 0 -1 0 0 9
+	 3300 4050 4500 4050 4500 3150 5100 3150 5100 2850 4500 2850
+	 4500 1950 4200 1950 4200 1350
+2 1 0 1 -1 -1 0 0 -1 0.000 0 0 -1 0 0 5
+	 7500 4050 6300 4050 6300 1950 8400 1950 8400 1350
+2 2 1 1 -1 -1 0 0 -1 4.000 0 0 0 0 0 5
+	 8400 3300 8400 2400 6900 2400 6900 3300 8400 3300
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
+	 9000 2850 8850 3150
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
+	 7500 4050 7800 4200
+2 2 1 1 -1 -1 0 0 -1 4.000 0 0 0 0 0 5
+	 3900 3300 3900 2400 2400 2400 2400 3300 3900 3300
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
+	 4500 2850 4350 3150
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
+	 3000 4050 3300 4200
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
+	 675 3000 1425 3000
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
+	 675 2400 1425 2400
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
+	 1425 2700 1500 2925
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
+	 1425 2400 1350 2625
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 2
+	 675 2700 1425 2700
+4 1 -1 0 0 0 12 0.0000 2 135 315 2850 4275 exit\001
+4 1 -1 0 0 0 12 0.0000 2 135 315 7350 4275 exit\001
+4 0 -1 0 0 0 12 0.0000 2 180 750 9150 2325 acceptor/\001
+4 0 -1 0 0 0 12 0.0000 2 180 750 9150 2550 signalled\001
+4 0 -1 0 0 0 12 0.0000 2 135 420 9150 2775 stack\001
+4 1 -1 0 0 0 12 0.0000 2 135 525 7650 2775 shared\001
+4 1 -1 0 0 0 12 0.0000 2 135 735 7650 3075 variables\001
+4 1 -1 0 0 0 12 0.0000 2 135 495 8550 1275 queue\001
+4 1 -1 0 0 0 12 0.0000 2 165 420 8550 1125 entry\001
+4 0 -1 0 0 0 12 0.0000 2 135 630 8850 1575 order of\001
+4 0 -1 0 0 0 12 0.0000 2 135 525 8850 1725 arrival\001
+4 0 -1 0 0 0 12 0.0000 2 180 750 4650 2325 acceptor/\001
+4 0 -1 0 0 0 12 0.0000 2 180 750 4650 2550 signalled\001
+4 0 -1 0 0 0 12 0.0000 2 135 420 4650 2775 stack\001
+4 1 -1 0 0 0 12 0.0000 2 135 525 3150 2775 shared\001
+4 1 -1 0 0 0 12 0.0000 2 135 735 3150 3075 variables\001
+4 0 -1 0 0 0 12 0.0000 2 135 525 4350 1725 arrival\001
+4 0 -1 0 0 0 12 0.0000 2 135 630 4350 1500 order of\001
+4 1 -1 0 0 0 12 0.0000 2 135 495 4050 1275 queue\001
+4 1 -1 0 0 0 12 0.0000 2 165 420 4050 1050 entry\001
+4 0 0 50 -1 0 11 0.0000 2 120 705 600 2325 Condition\001
+4 0 -1 0 0 0 12 0.0000 2 135 1215 6150 4725 blocked thread\001
+4 0 -1 0 0 0 12 0.0000 2 135 1050 4050 4725 active thread\001
Index: doc/papers/concurrency/figures/monitor.fig
===================================================================
--- doc/papers/concurrency/figures/monitor.fig	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/concurrency/figures/monitor.fig	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,101 @@
+#FIG 3.2  Produced by xfig version 3.2.5c
+Landscape
+Center
+Inches
+Letter  
+100.00
+Single
+-2
+1200 2
+5 1 0 1 -1 -1 0 0 -1 0.000 0 1 0 0 1500.000 2700.000 1500 2400 1200 2700 1500 3000
+5 1 0 1 -1 -1 0 0 -1 0.000 0 1 0 0 1500.000 3600.000 1500 3300 1200 3600 1500 3900
+6 4200 1200 4500 1500
+1 3 0 1 -1 -1 0 0 -1 0.000 1 0.0000 4350 1350 105 105 4350 1350 4455 1455
+4 1 -1 0 0 0 10 0.0000 2 105 90 4350 1410 d\001
+-6
+6 4200 900 4500 1200
+1 3 0 1 -1 -1 0 0 -1 0.000 1 0.0000 4350 1050 105 105 4350 1050 4455 1155
+4 1 -1 0 0 0 10 0.0000 2 105 90 4350 1110 b\001
+-6
+6 2400 1500 2700 1800
+1 3 0 1 -1 -1 1 0 4 0.000 1 0.0000 2550 1650 105 105 2550 1650 2655 1650
+4 1 -1 0 0 0 10 0.0000 2 105 90 2550 1710 b\001
+-6
+6 2400 1800 2700 2100
+1 3 0 1 -1 -1 1 0 4 0.000 1 0.0000 2550 1950 105 105 2550 1950 2655 1950
+4 1 -1 0 0 0 10 0.0000 2 75 75 2550 1995 a\001
+-6
+6 3300 1500 3600 1800
+1 3 0 1 -1 -1 1 0 4 0.000 1 0.0000 3450 1650 105 105 3450 1650 3555 1650
+4 1 -1 0 0 0 10 0.0000 2 105 90 3450 1710 d\001
+-6
+6 1350 4650 5325 4950
+1 3 0 1 -1 -1 0 0 20 0.000 1 0.0000 1500 4800 80 80 1500 4800 1580 4880
+1 3 0 1 -1 -1 0 0 -1 0.000 1 0.0000 2850 4800 105 105 2850 4800 2955 4905
+1 3 0 1 -1 -1 0 0 4 0.000 1 0.0000 4350 4800 105 105 4350 4800 4455 4905
+4 0 -1 0 0 0 12 0.0000 2 180 765 4575 4875 duplicate\001
+4 0 -1 0 0 0 12 0.0000 2 135 1035 3075 4875 blocked task\001
+4 0 -1 0 0 0 12 0.0000 2 135 870 1650 4875 active task\001
+-6
+1 3 0 1 -1 -1 0 0 -1 0.000 1 0.0000 1650 2850 105 105 1650 2850 1755 2955
+1 3 0 1 -1 -1 0 0 -1 0.000 1 0.0000 1950 2850 105 105 1950 2850 2055 2955
+1 3 0 1 -1 -1 0 0 -1 0.000 1 0.0000 4950 3150 105 105 4950 3150 5055 3255
+1 3 0 1 -1 -1 0 0 -1 0.000 1 0.0000 5250 3150 105 105 5250 3150 5355 3255
+1 3 0 1 -1 -1 0 0 -1 0.000 1 0.0000 4350 1950 105 105 4350 1950 4455 2055
+1 3 0 1 -1 -1 0 0 -1 0.000 1 0.0000 4350 1650 105 105 4350 1650 4455 1755
+1 3 0 1 -1 -1 0 0 20 0.000 1 0.0000 3450 3825 80 80 3450 3825 3530 3905
+1 3 0 1 -1 -1 1 0 4 0.000 1 0.0000 3450 1950 105 105 3450 1950 3555 1950
+2 1 0 1 -1 -1 0 0 -1 0.000 0 0 -1 0 0 2
+	 2400 2100 2625 2250
+2 1 0 1 -1 -1 0 0 -1 0.000 0 0 -1 0 0 2
+	 3300 2100 3525 2250
+2 1 0 1 -1 -1 0 0 -1 0.000 0 0 -1 0 0 2
+	 4200 2100 4425 2250
+2 1 0 1 -1 -1 0 0 -1 0.000 0 0 -1 0 0 5
+	 1500 2400 2100 2400 2100 2100 2400 2100 2400 1500
+2 1 0 1 -1 -1 0 0 -1 0.000 0 0 -1 0 0 4
+	 1500 3000 2100 3000 2100 3300 1500 3300
+2 1 0 1 -1 -1 0 0 -1 0.000 0 0 -1 0 0 3
+	 1500 2700 2100 2700 2250 2925
+2 1 0 1 -1 -1 0 0 -1 0.000 0 0 -1 0 0 2
+	 2100 2400 1950 2625
+2 1 0 1 -1 -1 0 0 -1 0.000 0 0 -1 0 0 3
+	 1500 3600 2100 3600 2250 3825
+2 1 0 1 -1 -1 0 0 -1 0.000 0 0 -1 0 0 2
+	 2100 3300 1950 3525
+2 1 0 1 -1 -1 0 0 -1 0.000 0 0 -1 0 0 4
+	 1500 3900 2100 3900 2100 4200 3300 4200
+2 1 0 1 -1 -1 0 0 -1 0.000 0 0 -1 0 0 2
+	 4800 3000 4650 3225
+2 1 0 1 -1 -1 0 0 -1 0.000 0 0 -1 0 0 2
+	 3300 4200 3525 4350
+2 1 0 1 -1 -1 0 0 -1 0.000 0 0 -1 0 0 4
+	 3600 1500 3600 2100 4200 2100 4200 900
+2 1 0 1 -1 -1 0 0 -1 0.000 0 0 -1 0 0 4
+	 2700 1500 2700 2100 3300 2100 3300 1500
+2 1 0 1 -1 -1 0 0 -1 0.000 0 0 -1 0 0 9
+	 3600 4200 4800 4200 4800 3300 5400 3300 5400 3000 4800 3000
+	 4800 2100 4500 2100 4500 900
+2 2 1 1 -1 -1 0 0 -1 4.000 0 0 0 0 0 5
+	 4200 3450 4200 2550 2700 2550 2700 3450 4200 3450
+4 1 -1 0 0 0 10 0.0000 2 75 75 4350 1995 a\001
+4 1 -1 0 0 0 10 0.0000 2 75 75 4350 1695 c\001
+4 1 -1 0 0 0 12 0.0000 2 135 315 3450 4575 exit\001
+4 1 -1 0 0 0 12 0.0000 2 135 135 1650 2325 A\001
+4 1 -1 0 0 0 12 0.0000 2 135 795 1650 4125 condition\001
+4 1 -1 0 0 0 12 0.0000 2 135 135 1650 4350 B\001
+4 0 -1 0 0 0 12 0.0000 2 135 420 4950 2925 stack\001
+4 0 -1 0 0 0 12 0.0000 2 180 750 4950 2475 acceptor/\001
+4 0 -1 0 0 0 12 0.0000 2 180 750 4950 2700 signalled\001
+4 1 -1 0 0 0 12 0.0000 2 135 795 1650 2100 condition\001
+4 1 -1 0 0 0 12 0.0000 2 135 135 2550 1425 X\001
+4 1 -1 0 0 0 12 0.0000 2 135 135 3450 1425 Y\001
+4 1 -1 0 0 0 12 0.0000 2 165 420 4350 600 entry\001
+4 1 -1 0 0 0 12 0.0000 2 135 495 4350 825 queue\001
+4 0 -1 0 0 0 12 0.0000 2 135 525 4650 1650 arrival\001
+4 0 -1 0 0 0 12 0.0000 2 135 630 4650 1425 order of\001
+4 1 -1 0 0 0 12 0.0000 2 135 525 3450 2925 shared\001
+4 1 -1 0 0 0 12 0.0000 2 135 735 3450 3225 variables\001
+4 1 -1 0 0 0 12 0.0000 2 120 510 3000 975 mutex\001
+4 1 -1 0 0 0 10 0.0000 2 75 75 3450 1995 c\001
+4 1 -1 0 0 0 12 0.0000 2 135 570 3000 1200 queues\001
Index: doc/papers/concurrency/figures/monitor_structs.fig
===================================================================
--- doc/papers/concurrency/figures/monitor_structs.fig	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/concurrency/figures/monitor_structs.fig	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,71 @@
+#FIG 3.2  Produced by xfig version 3.2.5c
+Landscape
+Center
+Inches
+Letter  
+100.00
+Single
+-2
+1200 2
+2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+	 1500 1200 2100 1200 2100 1500 1500 1500 1500 1200
+2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+	 1500 1500 2100 1500 2100 1800 1500 1800 1500 1500
+2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+	 3000 1200 3300 1200 3300 1500 3000 1500 3000 1200
+2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+	 3000 1500 3300 1500 3300 1800 3000 1800 3000 1500
+2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+	 3000 1800 3300 1800 3300 2100 3000 2100 3000 1800
+2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+	 3000 2100 3300 2100 3300 2400 3000 2400 3000 2100
+2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+	 1500 900 2100 900 2100 1200 1500 1200 1500 900
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+	1 1 1.00 90.00 120.00
+	5 1 1.00 45.00 90.00
+	 1800 1050 4050 1050
+2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+	 5100 900 5700 900 5700 1800 5100 1800 5100 900
+2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+	 6900 1500 7500 1500 7500 2400 6900 2400 6900 1500
+2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+	 6000 1200 6600 1200 6600 2100 6000 2100 6000 1200
+2 2 0 1 0 7 50 -1 -1 0.000 0 0 -1 0 0 5
+	 7800 1800 8400 1800 8400 2700 7800 2700 7800 1800
+2 1 0 1 0 7 50 -1 -1 0.000 0 0 -1 1 1 2
+	1 1 1.00 90.00 120.00
+	5 1 1.00 45.00 90.00
+	 1800 1350 3000 1350
+3 2 0 3 0 7 50 -1 -1 0.000 1 0 0 10
+	 4275 900 4050 975 4350 1050 4050 1125 4350 1200 4050 1275
+	 4350 1350 4050 1425 4350 1500 4125 1575
+	 0.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
+	 -1.000 0.000
+3 2 0 1 0 7 50 -1 -1 0.000 0 1 1 3
+	1 1 1.00 90.00 120.00
+	5 1 1.00 30.00 90.00
+	 3150 1950 4875 2400 6900 1650
+	 0.000 -1.000 0.000
+3 2 0 1 0 7 50 -1 -1 0.000 0 1 1 3
+	1 1 1.00 90.00 120.00
+	5 1 1.00 60.00 90.00
+	 3150 1350 4200 1650 5100 1050
+	 0.000 -1.000 0.000
+3 2 0 1 0 7 50 -1 -1 0.000 0 1 1 3
+	1 1 1.00 90.00 120.00
+	5 1 1.00 60.00 90.00
+	 3150 1650 4575 2025 6000 1350
+	 0.000 -1.000 0.000
+3 2 0 1 0 7 50 -1 -1 0.000 0 1 1 3
+	1 1 1.00 90.00 120.00
+	5 1 1.00 60.00 90.00
+	 3150 2250 5175 2775 7800 1950
+	 0.000 -1.000 0.000
+4 0 0 50 -1 0 11 0.0000 2 120 705 3000 675 Condition\001
+4 0 0 50 -1 0 11 0.0000 2 120 630 3000 885 Criterion\001
+4 0 0 50 -1 0 11 0.0000 2 120 705 1425 675 Condition\001
+4 0 0 50 -1 0 11 0.0000 2 120 390 1425 825 Node\001
+4 0 0 50 -1 0 11 0.0000 2 120 660 6225 675 Monitors\001
+4 0 0 50 -1 0 11 0.0000 2 165 555 3900 675 Waiting\001
+4 0 0 50 -1 0 11 0.0000 2 120 495 3900 825 Thread\001
Index: doc/papers/concurrency/figures/system.fig
===================================================================
--- doc/papers/concurrency/figures/system.fig	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/concurrency/figures/system.fig	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,166 @@
+#FIG 3.2  Produced by xfig version 3.2.5c
+Landscape
+Center
+Inches
+Letter  
+100.00
+Single
+-2
+1200 2
+6 5175 2700 6150 3737
+3 2 0 4 0 7 49 -1 -1 0.000 1 0 0 10
+	 5475 2702 5625 2777 5325 2852 5625 2927 5325 3002 5625 3077
+	 5325 3152 5625 3227 5325 3302 5475 3377
+	 0.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
+	 -1.000 0.000
+4 0 0 50 -1 0 11 0.0000 2 120 885 5175 3737 Processor N\001
+4 0 0 50 -1 0 11 0.0000 2 120 975 5175 3527 PThread N+2\001
+-6
+6 3300 2700 4140 3737
+3 2 0 4 0 7 49 -1 -1 0.000 1 0 0 10
+	 3600 2702 3750 2777 3450 2852 3750 2927 3450 3002 3750 3077
+	 3450 3152 3750 3227 3450 3302 3600 3377
+	 0.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
+	 -1.000 0.000
+4 0 0 50 -1 0 11 0.0000 2 120 840 3300 3737 Processor 0\001
+4 0 0 50 -1 0 11 0.0000 2 120 735 3300 3527 PThread 2\001
+-6
+6 600 2700 1725 3737
+3 2 0 4 0 7 49 -1 -1 0.000 1 0 0 10
+	 900 2702 1050 2777 750 2852 1050 2927 750 3002 1050 3077
+	 750 3152 1050 3227 750 3302 900 3377
+	 0.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
+	 -1.000 0.000
+4 0 0 50 -1 0 11 0.0000 2 120 1125 600 3737 Main Processor\001
+4 0 0 50 -1 0 11 0.0000 2 120 735 600 3527 PThread 0\001
+-6
+6 2100 2700 2835 3737
+3 2 0 4 0 7 49 -1 -1 0.000 1 0 0 10
+	 2400 2702 2550 2777 2250 2852 2550 2927 2250 3002 2550 3077
+	 2250 3152 2550 3227 2250 3302 2400 3377
+	 0.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
+	 -1.000 0.000
+4 0 0 50 -1 0 11 0.0000 2 120 450 2100 3737 Alarm\001
+4 0 0 50 -1 0 11 0.0000 2 120 735 2100 3527 PThread 1\001
+-6
+6 600 6301 1290 7367
+3 2 0 2 0 7 49 -1 -1 0.000 1 0 0 10
+	 900 6302 1050 6377 750 6452 1050 6527 750 6602 1050 6677
+	 750 6752 1050 6827 750 6902 900 6977
+	 0.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
+	 -1.000 0.000
+4 0 0 50 -1 0 11 0.0000 2 150 690 600 7337 int main()\001
+4 0 0 50 -1 0 11 0.0000 2 120 570 600 7127 thread 0\001
+-6
+6 1635 6300 2205 7336
+3 2 0 2 0 7 49 -1 -1 0.000 1 0 0 10
+	 1935 6301 2085 6376 1785 6451 2085 6526 1785 6601 2085 6676
+	 1785 6751 2085 6826 1785 6901 1935 6976
+	 0.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
+	 -1.000 0.000
+4 0 0 50 -1 0 11 0.0000 2 120 570 1635 7126 thread 1\001
+-6
+6 2475 6300 3045 7336
+3 2 0 2 0 7 49 -1 -1 0.000 1 0 0 10
+	 2775 6301 2925 6376 2625 6451 2925 6526 2625 6601 2925 6676
+	 2625 6751 2925 6826 2625 6901 2775 6976
+	 0.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
+	 -1.000 0.000
+4 0 0 50 -1 0 11 0.0000 2 120 570 2475 7126 thread 2\001
+-6
+6 3300 6300 3870 7336
+3 2 0 2 0 7 49 -1 -1 0.000 1 0 0 10
+	 3600 6301 3750 6376 3450 6451 3750 6526 3450 6601 3750 6676
+	 3450 6751 3750 6826 3450 6901 3600 6976
+	 0.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
+	 -1.000 0.000
+4 0 0 50 -1 0 11 0.0000 2 120 570 3300 7126 thread 3\001
+-6
+6 5325 6300 5970 7336
+3 2 0 2 0 7 49 -1 -1 0.000 1 0 0 10
+	 5625 6301 5775 6376 5475 6451 5775 6526 5475 6601 5775 6676
+	 5475 6751 5775 6826 5475 6901 5625 6976
+	 0.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
+	 -1.000 0.000
+4 0 0 50 -1 0 11 0.0000 2 120 645 5325 7126 thread M\001
+-6
+6 4125 6300 4695 7336
+3 2 0 2 0 7 49 -1 -1 0.000 1 0 0 10
+	 4425 6301 4575 6376 4275 6451 4575 6526 4275 6601 4575 6676
+	 4275 6751 4575 6826 4275 6901 4425 6976
+	 0.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
+	 -1.000 0.000
+4 0 0 50 -1 0 11 0.0000 2 120 570 4125 7126 thread 4\001
+-6
+6 6975 4050 9525 7875
+2 2 0 1 0 7 50 -1 -1 0.000 0 1 -1 0 0 5
+	 7125 5400 7575 5400 7575 5850 7125 5850 7125 5400
+2 2 0 1 0 7 50 -1 18 0.000 0 1 -1 0 0 5
+	 7125 4200 7575 4200 7575 4650 7125 4650 7125 4200
+2 2 0 1 0 7 50 -1 45 0.000 0 1 -1 0 0 5
+	 7125 4800 7575 4800 7575 5250 7125 5250 7125 4800
+2 2 0 1 0 7 50 -1 -1 0.000 0 1 -1 0 0 5
+	 6975 4050 9525 4050 9525 7875 6975 7875 6975 4050
+3 2 0 2 0 7 49 -1 -1 0.000 1 0 0 10
+	 7350 6900 7500 6975 7200 7050 7500 7125 7200 7200 7500 7275
+	 7200 7350 7500 7425 7200 7500 7350 7575
+	 0.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
+	 -1.000 0.000
+3 2 0 4 0 7 49 -1 -1 0.000 1 0 0 10
+	 7350 6000 7500 6075 7200 6150 7500 6225 7200 6300 7500 6375
+	 7200 6450 7500 6525 7200 6600 7350 6675
+	 0.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000 -1.000
+	 -1.000 0.000
+4 0 0 50 -1 0 11 0.0000 2 120 945 7725 4500 Pthread stack\001
+4 0 0 50 -1 0 11 0.0000 2 150 1530 7725 5100 Pthread stack (stolen)\001
+4 0 0 50 -1 0 11 0.0000 2 120 540 7725 6375 Pthread\001
+4 0 0 50 -1 0 11 0.0000 2 150 1065 7725 7275 $\\CFA$ thread\001
+4 0 0 50 -1 0 11 0.0000 2 150 990 7725 5700 $\\CFA$ stack\001
+-6
+1 2 0 1 0 7 50 -1 -1 0.000 1 3.1416 3150 5250 750 450 2400 4800 3900 5700
+2 1 0 1 0 7 50 -1 -1 0.000 0 1 -1 0 0 2
+	 1200 3900 2475 5025
+2 1 0 1 0 7 50 -1 -1 0.000 0 1 -1 0 0 2
+	 3600 3900 3450 4800
+2 1 0 1 0 7 50 -1 -1 0.000 0 1 -1 0 0 2
+	 5550 3900 3825 5025
+2 1 0 1 0 7 50 -1 -1 0.000 0 1 -1 0 0 2
+	 900 6225 2400 5400
+2 1 0 1 0 7 50 -1 -1 0.000 0 1 -1 0 0 2
+	 2100 6225 2625 5550
+2 1 0 1 0 7 50 -1 -1 0.000 0 1 -1 0 0 2
+	 2850 6225 3000 5700
+2 1 0 1 0 7 50 -1 -1 0.000 0 1 -1 0 0 2
+	 3600 6225 3375 5700
+2 1 0 1 0 7 50 -1 -1 0.000 0 1 -1 0 0 2
+	 4350 6300 3675 5625
+2 1 0 1 0 7 50 -1 -1 0.000 0 1 -1 0 0 2
+	 5625 6225 3900 5400
+2 2 0 1 0 7 50 -1 -1 0.000 0 1 -1 0 0 5
+	 525 975 1275 975 1275 2625 525 2625 525 975
+2 2 0 1 0 7 50 -1 45 0.000 0 1 -1 0 0 5
+	 3225 975 3975 975 3975 2625 3225 2625 3225 975
+2 2 0 1 0 7 50 -1 45 0.000 0 1 -1 0 0 5
+	 5100 975 5850 975 5850 2625 5100 2625 5100 975
+2 2 0 1 0 7 50 -1 45 0.000 0 1 -1 0 0 5
+	 525 7425 1275 7425 1275 9075 525 9075 525 7425
+2 2 0 1 0 7 50 -1 -1 0.000 0 1 -1 0 0 5
+	 1575 7425 2325 7425 2325 9075 1575 9075 1575 7425
+2 2 0 1 0 7 50 -1 -1 0.000 0 1 -1 0 0 5
+	 2400 7425 3150 7425 3150 9075 2400 9075 2400 7425
+2 2 0 1 0 7 50 -1 -1 0.000 0 1 -1 0 0 5
+	 3225 7425 3975 7425 3975 9075 3225 9075 3225 7425
+2 2 0 1 0 7 50 -1 -1 0.000 0 1 -1 0 0 5
+	 4050 7425 4800 7425 4800 9075 4050 9075 4050 7425
+2 2 0 1 0 7 50 -1 -1 0.000 0 1 -1 0 0 5
+	 5250 7425 6000 7425 6000 9075 5250 9075 5250 7425
+2 1 1 8 0 7 50 -1 -1 4.000 0 0 -1 1 0 2
+	1 1 2.00 180.00 75.00
+	 2400 3900 2775 4800
+2 2 0 1 0 7 50 -1 18 0.000 0 1 -1 0 0 5
+	 2025 2625 2775 2625 2775 975 2025 975 2025 2625
+4 0 0 50 -1 0 18 0.0000 2 30 225 4500 3150 ...\001
+4 0 0 50 -1 0 18 0.0000 2 30 225 3750 4500 ...\001
+4 0 0 50 -1 0 11 0.0000 2 120 705 2775 5325 Scheduler\001
+4 0 0 50 -1 0 18 0.0000 2 30 225 4950 6600 ...\001
+4 0 0 50 -1 0 18 0.0000 2 30 225 4200 5850 ...\001
Index: doc/papers/concurrency/notes/cor-thread-traits.c
===================================================================
--- doc/papers/concurrency/notes/cor-thread-traits.c	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/concurrency/notes/cor-thread-traits.c	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,90 @@
+//-----------------------------------------------------------------------------
+// Coroutine trait
+// Anything that implements this trait can be resumed.
+// Anything that is resumed is a coroutine.
+trait is_coroutine(dtype T) {
+      void main(T* this);
+      coroutine_handle* get_handle(T* this);
+}
+
+//-----------------------------------------------------------------------------
+forall(dtype T | {coroutine_handle* T.c})
+coroutine_handle* get_handle(T* this) {
+	return this->c;
+}
+
+//-----------------------------------------------------------------------------
+struct myCoroutine {
+	int bla;
+	coroutine_handle c;
+};
+
+void main(myCoroutine* this) {
+	sout | this->bla | endl;
+}
+
+void foo() {
+	//Run the coroutine
+	myCoroutine myc;
+	resume(myc);
+}
+
+//-----------------------------------------------------------------------------
+// Thread trait
+// Alternative 1
+trait is_thread(dtype T) {
+	void main(T* this);
+	thread_handle* get_handle(T* this);
+	thread T;
+};
+
+//-----------------------------------------------------------------------------
+forall(dtype T | {thread_handle* T.t})
+thread_handle* get_handle(T* this) {
+	return this->t;
+}
+
+//-----------------------------------------------------------------------------
+thread myThread {
+	int bla;
+	thread_handle t;
+};
+
+void main(myThread* this) {
+	sout | this->bla | endl;
+}
+
+void foo() {
+	//Run the thread
+	myThread myc;
+}
+
+//-----------------------------------------------------------------------------
+// Thread trait
+// Alternative 2
+trait is_thread(dtype T) {
+	void main(T* this);
+	thread_handle* get_handle(T* this);
+
+};
+
+//-----------------------------------------------------------------------------
+forall(dtype T | {thread_handle* T.t})
+thread_handle* get_handle(T* this) {
+	return this->t;
+}
+
+//-----------------------------------------------------------------------------
+struct myThread {
+	int bla;
+	thread_handle t;
+};
+
+void main(myThread* this) {
+	sout | this->bla | endl;
+}
+
+void foo() {
+	//Run the thread
+	thread(myThread) myc;
+}
Index: doc/papers/concurrency/notes/lit-review.md
===================================================================
--- doc/papers/concurrency/notes/lit-review.md	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/concurrency/notes/lit-review.md	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,25 @@
+Lit review:
+
+Lister77 : nested monitor calls
+	- explains the problem
+	- no solution
+	- Lister : An implementation of monitors.
+	- Lister : Hierarchical monitors.
+
+Haddon77 : Nested monitor calls
+	- monitors should be released before acquiring a new one.
+
+Horst Wettstein : The problem of nested monitor calls revisited
+	- Solves nested monitor by allowing barging
+
+David L. Parnas : The non-problem of nested monitor calls
+	- not an actual problem in real life
+
+M. Joseph and V. R. Prasad : More on nested monitor calls
+	- WTF... don't use monitors, use pure classes instead, whatever that is
+
+Joseph et al., 1978
+
+Toby Bloom : Evaluating Synchronization Mechanisms
+	- Methods to evaluate concurrency
+
Index: doc/papers/concurrency/notes/notes.md
===================================================================
--- doc/papers/concurrency/notes/notes.md	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/concurrency/notes/notes.md	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,14 @@
+Internal scheduling notes.
+
+Internal scheduling requires a stack or queue to make sense.
+We also need a stack of "monitor contexts" to be able to restore state.
+
+Multi scheduling try 1 
+ - adding threads to many monitors and synching the monitors
+ - Too hard
+
+Multi scheduling try 2
+ - using a leader when in a group
+ - it's hard but doable to manage who is the leader and keep the current context
+ - basically __monitor_guard_t always saves and restores the leader and the current context
+ 
Index: doc/papers/concurrency/style/cfa-format.tex
===================================================================
--- doc/papers/concurrency/style/cfa-format.tex	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/concurrency/style/cfa-format.tex	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,279 @@
+\usepackage[usenames,dvipsnames]{xcolor}
+\usepackage{listings}
+\usepackage{inconsolata}
+
+\definecolor{basicCol}{HTML}{000000}
+\definecolor{commentCol}{HTML}{000000}
+\definecolor{stringCol}{HTML}{000000}
+\definecolor{keywordCol}{HTML}{000000}
+\definecolor{identifierCol}{HTML}{000000}
+
+% from https://gist.github.com/nikolajquorning/92bbbeef32e1dd80105c9bf2daceb89a
+\lstdefinelanguage{sml} {
+  morekeywords= {
+    EQUAL, GREATER, LESS, NONE, SOME, abstraction, abstype, and, andalso, array, as, before, bool, case, char, datatype, do, else, end, eqtype, exception, exn, false, fn, fun, functor, handle, if, in, include, infix, infixr, int, let, list, local, nil, nonfix, not, o, of, op, open, option, orelse, overload, print, raise, real, rec, ref, sharing, sig, signature, string, struct, structure, substring, then, true, type, unit, val, vector, where, while, with, withtype, word
+  },
+  morestring=[b]",
+  morecomment=[s]{(*}{*)},
+}
+
+\lstdefinelanguage{D}{
+  % Keywords
+  morekeywords=[1]{
+    abstract, alias, align, auto, body, break, cast, catch, class, const,
+    continue, debug, delegate, delete, deprecated, do, else, enum, export,
+    false, final, finally, for, foreach, foreach_reverse, function, goto, if,
+    immutable, import, in, inout, interface, invariant, is, lazy, macro, mixin,
+    module, new, nothrow, null, out, override, package, pragma, private,
+    protected, public, pure, ref, return, shared, static, struct, super,
+    switch, synchronized, template, this, throw, true, try, typedef, typeid,
+    typeof, union, unittest, volatile, while, with
+  },
+  % Special identifiers, common functions
+  morekeywords=[2]{enforce},
+  % Ugly identifiers
+  morekeywords=[3]{
+    __DATE__, __EOF__, __FILE__, __LINE__, __TIMESTAMP__, __TIME__, __VENDOR__,
+    __VERSION__, __ctfe, __gshared, __monitor, __thread, __vptr, _argptr,
+    _arguments, _ctor, _dtor
+  },
+  % Basic types
+  morekeywords=[4]{
+     byte, ubyte, short, ushort, int, uint, long, ulong, cent, ucent, void,
+     bool, bit, float, double, real, ushort, int, uint, long, ulong, float,
+     char, wchar, dchar, string, wstring, dstring, ireal, ifloat, idouble,
+     creal, cfloat, cdouble, size_t, ptrdiff_t, sizediff_t, equals_t, hash_t
+  },
+  % Strings
+  morestring=[b]{"},
+  morestring=[b]{'},
+  morestring=[b]{`},
+  % Comments
+  comment=[l]{//},
+  morecomment=[s]{/*}{*/},
+  morecomment=[s][\color{blue}]{/**}{*/},
+  morecomment=[n]{/+}{+/},
+  morecomment=[n][\color{blue}]{/++}{+/},
+  % Options
+  sensitive=true
+}
+
+\lstdefinelanguage{rust}{
+  % Keywords
+  morekeywords=[1]{
+    abstract, alignof, as, become, box,
+    break, const, continue, crate, do,
+    else, enum, extern, false, final,
+    fn, for, if, impl, in,
+    let, loop, macro, match, mod,
+    move, mut, offsetof, override, priv,
+    proc, pub, pure, ref, return,
+    Self, self, sizeof, static, struct,
+    super, trait, true,  type, typeof,
+    unsafe, unsized, use, virtual, where,
+    while, yield
+  },
+  % Strings
+  morestring=[b]{"},
+  % Comments
+  comment=[l]{//},
+  morecomment=[s]{/*}{*/},
+  % Options
+  sensitive=true
+}
+
+\lstdefinelanguage{pseudo}{
+	morekeywords={string,uint,int,bool,float},%
+	sensitive=true,%
+	morecomment=[l]{//},%
+	morecomment=[s]{/*}{*/},%
+	morestring=[b]',%
+	morestring=[b]",%
+	morestring=[s]{`}{`},%
+}%
+
+\newcommand{\KWC}{K-W C\xspace}
+
+\lstdefinestyle{pseudoStyle}{
+  escapeinside={@@},
+  basicstyle=\linespread{0.9}\sf\footnotesize,		% reduce line spacing and use sans-serif font
+  keywordstyle=\bfseries\color{blue},
+  keywordstyle=[2]\bfseries\color{Plum},
+  commentstyle=\itshape\color{OliveGreen},		    % green and italic comments
+  identifierstyle=\color{identifierCol},
+  stringstyle=\sf\color{Mahogany},			          % use sans-serif font
+  mathescape=true,
+  columns=fixed,
+  aboveskip=4pt,                                  % spacing above/below code block
+  belowskip=3pt,
+  keepspaces=true,
+  tabsize=4,
+  % frame=lines,
+  literate=,
+  showlines=true,                                 % show blank lines at end of code
+  showspaces=false,
+  showstringspaces=false,
+  escapechar=\$,
+  xleftmargin=\parindentlnth,                     % indent code to paragraph indentation
+  moredelim=[is][\color{red}\bfseries]{**R**}{**R**},    % red highlighting
+  % moredelim=* detects keywords, comments, strings, and other delimiters and applies their formatting
+  % moredelim=** allows cumulative application
+}
+
+\lstdefinestyle{defaultStyle}{
+  escapeinside={@@},
+  basicstyle=\linespread{0.9}\tt\footnotesize,		% reduce line spacing and use typewriter font
+  keywordstyle=\bfseries\color{blue},
+  keywordstyle=[2]\bfseries\color{Plum},
+  commentstyle=\itshape\color{OliveGreen},		    % green and italic comments
+  identifierstyle=\color{identifierCol},
+  stringstyle=\sf\color{Mahogany},			          % use sans-serif font
+  mathescape=true,
+  columns=fixed,
+  aboveskip=4pt,                                  % spacing above/below code block
+  belowskip=3pt,
+  keepspaces=true,
+  tabsize=4,
+  % frame=lines,
+  literate=,
+  showlines=true,                                 % show blank lines at end of code
+  showspaces=false,
+  showstringspaces=false,
+  escapechar=\$,
+  xleftmargin=\parindentlnth,                     % indent code to paragraph indentation
+  moredelim=[is][\color{red}\bfseries]{**R**}{**R**},    % red highlighting
+  % moredelim=* detects keywords, comments, strings, and other delimiters and applies their formatting
+  % moredelim=** allows cumulative application
+}
+
+\lstdefinestyle{cfaStyle}{
+  escapeinside={@@},
+  basicstyle=\linespread{0.9}\tt\footnotesize,		% reduce line spacing and use typewriter font
+  keywordstyle=\bfseries\color{blue},
+  keywordstyle=[2]\bfseries\color{Plum},
+  commentstyle=\sf\itshape\color{OliveGreen},		  % green and italic comments
+  identifierstyle=\color{identifierCol},
+  stringstyle=\sf\color{Mahogany},			          % use sans-serif font
+  mathescape=true,
+  columns=fixed,
+  aboveskip=4pt,                                  % spacing above/below code block
+  belowskip=3pt,
+  keepspaces=true,
+  tabsize=4,
+  % frame=lines,
+  literate=,
+  showlines=true,                                 % show blank lines at end of code
+  showspaces=false,
+  showstringspaces=false,
+  escapechar=\$,
+  xleftmargin=\parindentlnth,                     % indent code to paragraph indentation
+  moredelim=[is][\color{red}\bfseries]{**R**}{**R**},    % red highlighting
+  morekeywords=[2]{accept, signal, signal_block, wait, waitfor},
+}
+
+\lstMakeShortInline[basewidth=0.5em,breaklines=true,basicstyle=\normalsize\ttfamily\color{basicCol}]@  % single-character for \lstinline
+
+\lstnewenvironment{ccode}[1][]{
+  \lstset{
+    language = C,
+    style=defaultStyle,
+    captionpos=b,
+    #1
+  }
+}{}
+
+\lstnewenvironment{cfacode}[1][]{
+  \lstset{
+    language = CFA,
+    style=cfaStyle,
+    captionpos=b,
+    #1
+  }
+}{}
+
+\lstnewenvironment{pseudo}[1][]{
+  \lstset{
+    language = pseudo,
+    style=pseudoStyle,
+    captionpos=b,
+    #1
+  }
+}{}
+
+\lstnewenvironment{cppcode}[1][]{
+  \lstset{
+    language = c++,
+    style=defaultStyle,
+    captionpos=b,
+    #1
+  }
+}{}
+
+\lstnewenvironment{ucppcode}[1][]{
+  \lstset{
+    language = c++,
+    style=defaultStyle,
+    captionpos=b,
+    #1
+  }
+}{}
+
+\lstnewenvironment{javacode}[1][]{
+  \lstset{
+    language = java,
+    style=defaultStyle,
+    captionpos=b,
+    #1
+  }
+}{}
+
+\lstnewenvironment{scalacode}[1][]{
+  \lstset{
+    language = scala,
+    style=defaultStyle,
+    captionpos=b,
+    #1
+  }
+}{}
+
+\lstnewenvironment{smlcode}[1][]{
+  \lstset{
+    language = sml,
+    style=defaultStyle,
+    captionpos=b,
+    #1
+  }
+}{}
+
+\lstnewenvironment{dcode}[1][]{
+  \lstset{
+    language = D,
+    style=defaultStyle,
+    captionpos=b,
+    #1
+  }
+}{}
+
+\lstnewenvironment{rustcode}[1][]{
+  \lstset{
+    language = rust,
+    style=defaultStyle,
+    captionpos=b,
+    #1
+  }
+}{}
+
+\lstnewenvironment{gocode}[1][]{
+  \lstset{
+    language = Golang,
+    style=defaultStyle,
+    captionpos=b,
+    #1
+  }
+}{}
+
+\newcommand{\zero}{\lstinline{zero_t}\xspace}
+\newcommand{\one}{\lstinline{one_t}\xspace}
+\newcommand{\ateq}{\lstinline{\@=}\xspace}
+\newcommand{\code}[1]{\lstinline[language=CFA,style=cfaStyle]{#1}}
+\newcommand{\pscode}[1]{\lstinline[language=pseudo,style=pseudoStyle]{#1}}
Index: doc/papers/concurrency/style/style.tex
===================================================================
--- doc/papers/concurrency/style/style.tex	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/concurrency/style/style.tex	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,12 @@
+\input{common}                                          % bespoke macros used in the document
+\input{cfa-format}
+
+% \CFADefaultStyle
+
+% \lstset{
+% morekeywords=[2]{nomutex,mutex,thread,wait,wait_release,signal,signal_block,accept,monitor,suspend,resume,coroutine},
+% keywordstyle=[2]\color{blue},				% second set of keywords for concurency
+% basicstyle=\linespread{0.9}\tt\small,		% reduce line spacing and use typewriter font
+% stringstyle=\sf\color{Mahogany},			% use sanserif font
+% commentstyle=\itshape\color{OliveGreen},		% green and italic comments
+% }%
Index: doc/papers/concurrency/version
===================================================================
--- doc/papers/concurrency/version	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/concurrency/version	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,1 @@
+0.11.420
Index: doc/papers/general/.gitignore
===================================================================
--- doc/papers/general/.gitignore	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/general/.gitignore	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,19 @@
+# generated by latex
+*.aux
+*.bbl
+*.blg
+*.brf
+*.dvi
+*.idx
+*.ilg
+*.ind
+*.log
+*.out
+*.pdf
+*.ps
+*.toc
+*.lof
+*.lot
+*.synctex.gz
+comment.cut
+timing.tex
Index: doc/papers/general/Paper.tex
===================================================================
--- doc/papers/general/Paper.tex	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/general/Paper.tex	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,1400 @@
+\documentclass{article}
+
+\usepackage{fullpage}
+\usepackage{xspace,calc,comment}
+\usepackage{upquote}									% switch curled `'" to straight
+\usepackage{listings}									% format program code
+\usepackage{rotating}
+\usepackage[usenames]{color}
+\usepackage{pslatex}					% reduce size of sans-serif font
+\usepackage[plainpages=false,pdfpagelabels,pdfpagemode=UseNone,pagebackref=true,breaklinks=true,colorlinks=true,linkcolor=blue,citecolor=blue,urlcolor=blue]{hyperref}
+
+\setlength{\textheight}{9in}
+%\oddsidemargin 0.0in
+\renewcommand{\topfraction}{0.8}		% float must be greater than X of the page before it is forced onto its own page
+\renewcommand{\bottomfraction}{0.8}		% float must be greater than X of the page before it is forced onto its own page
+\renewcommand{\floatpagefraction}{0.8}	% float must be greater than X of the page before it is forced onto its own page
+\renewcommand{\textfraction}{0.0}		% the entire page may be devoted to floats with no text on the page at all
+
+\lefthyphenmin=4						% hyphen only after 4 characters
+\righthyphenmin=4
+
+% Names used in the document.
+
+\newcommand{\CFAIcon}{\textsf{C}\raisebox{\depth}{\rotatebox{180}{\textsf{A}}}\xspace} % Cforall symbolic name
+\newcommand{\CFA}{\protect\CFAIcon} % safe for section/caption
+\newcommand{\CFL}{\textrm{Cforall}\xspace} % Cforall symbolic name
+\newcommand{\Celeven}{\textrm{C11}\xspace} % C11 symbolic name
+\newcommand{\CC}{\textrm{C}\kern-.1em\hbox{+\kern-.25em+}\xspace} % C++ symbolic name
+\newcommand{\CCeleven}{\textrm{C}\kern-.1em\hbox{+\kern-.25em+}11\xspace} % C++11 symbolic name
+\newcommand{\CCfourteen}{\textrm{C}\kern-.1em\hbox{+\kern-.25em+}14\xspace} % C++14 symbolic name
+\newcommand{\CCseventeen}{\textrm{C}\kern-.1em\hbox{+\kern-.25em+}17\xspace} % C++17 symbolic name
+\newcommand{\CCtwenty}{\textrm{C}\kern-.1em\hbox{+\kern-.25em+}20\xspace} % C++20 symbolic name
+\newcommand{\CCV}{\rm C\kern-.1em\hbox{+\kern-.25em+}obj\xspace} % C++ virtual symbolic name
+\newcommand{\Csharp}{C\raisebox{-0.7ex}{\Large$^\sharp$}\xspace} % C# symbolic name
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\newcommand{\Textbf}[1]{{\color{red}\textbf{#1}}}
+\newcommand{\TODO}[1]{\textbf{TODO}: {\itshape #1}} % TODO included
+%\newcommand{\TODO}[1]{} % TODO elided
+% Default underscore is too low and wide. Cannot use lstlisting "literate" as replacing underscore
+% removes it as a variable-name character so keywords in variables are highlighted
+\DeclareTextCommandDefault{\textunderscore}{\leavevmode\makebox[1.2ex][c]{\rule{1ex}{0.1ex}}}
+
+\makeatletter
+% parindent is relative, i.e., toggled on/off in environments like itemize, so store the value for
+% use rather than use \parindent directly.
+\newlength{\parindentlnth}
+\setlength{\parindentlnth}{\parindent}
+
+\newlength{\gcolumnposn}				% temporary hack because lstlisting does not handle tabs correctly
+\newlength{\columnposn}
+\setlength{\gcolumnposn}{2.75in}
+\setlength{\columnposn}{\gcolumnposn}
+\newcommand{\C}[2][\@empty]{\ifx#1\@empty\else\global\setlength{\columnposn}{#1}\global\columnposn=\columnposn\fi\hfill\makebox[\textwidth-\columnposn][l]{\lst@commentstyle{#2}}}
+\newcommand{\CRT}{\global\columnposn=\gcolumnposn}
+
+% Latin abbreviation
+\newcommand{\abbrevFont}{\textit}	% set empty for no italics
+\newcommand*{\eg}{%
+	\@ifnextchar{,}{\abbrevFont{e}.\abbrevFont{g}.}%
+		{\@ifnextchar{:}{\abbrevFont{e}.\abbrevFont{g}.}%
+			{\abbrevFont{e}.\abbrevFont{g}.,\xspace}}%
+}%
+\newcommand*{\ie}{%
+	\@ifnextchar{,}{\abbrevFont{i}.\abbrevFont{e}.}%
+		{\@ifnextchar{:}{\abbrevFont{i}.\abbrevFont{e}.}%
+			{\abbrevFont{i}.\abbrevFont{e}.,\xspace}}%
+}%
+\newcommand*{\etc}{%
+	\@ifnextchar{.}{\abbrevFont{etc}}%
+        {\abbrevFont{etc}.\xspace}%
+}%
+\newcommand{\etal}{%
+	\@ifnextchar{.}{\abbrevFont{et~al}}%
+	        {\abbrevFont{et al}.\xspace}%
+}%
+\makeatother
+
+% CFA programming language, based on ANSI C (with some gcc additions)
+\lstdefinelanguage{CFA}[ANSI]{C}{
+	morekeywords={_Alignas,_Alignof,__alignof,__alignof__,asm,__asm,__asm__,_At,_Atomic,__attribute,__attribute__,auto,
+		_Bool,catch,catchResume,choose,_Complex,__complex,__complex__,__const,__const__,disable,dtype,enable,__extension__,
+		fallthrough,fallthru,finally,forall,ftype,_Generic,_Imaginary,inline,__label__,lvalue,_Noreturn,one_t,otype,restrict,_Static_assert,
+		_Thread_local,throw,throwResume,trait,try,ttype,typeof,__typeof,__typeof__,zero_t},
+}%
+
+\lstset{
+language=CFA,
+columns=fullflexible,
+basicstyle=\linespread{0.9}\sf,							% reduce line spacing and use sanserif font
+stringstyle=\tt,										% use typewriter font
+tabsize=4,												% 4 space tabbing
+xleftmargin=\parindentlnth,								% indent code to paragraph indentation
+%mathescape=true,										% LaTeX math escape in CFA code $...$
+escapechar=\$,											% LaTeX escape in CFA code
+keepspaces=true,										%
+showstringspaces=false,									% do not show spaces with cup
+showlines=true,											% show blank lines at end of code
+aboveskip=4pt,											% spacing above/below code block
+belowskip=3pt,
+% replace/adjust listing characters that look bad in sanserif
+literate={-}{\makebox[1.4ex][c]{\raisebox{0.5ex}{\rule{1.2ex}{0.06ex}}}}1 {^}{\raisebox{0.6ex}{$\scriptscriptstyle\land\,$}}1
+	{~}{\raisebox{0.3ex}{$\scriptstyle\sim\,$}}1 % {`}{\ttfamily\upshape\hspace*{-0.1ex}`}1
+	{<-}{$\leftarrow$}2 {=>}{$\Rightarrow$}2 {->}{\makebox[1.4ex][c]{\raisebox{0.5ex}{\rule{1.2ex}{0.06ex}}}\kern-0.3ex\textgreater}2,
+moredelim=**[is][\color{red}]{`}{`},
+}% lstset
+
+% inline code @...@
+\lstMakeShortInline@%
+
+\title{Generic and Tuple Types with Efficient Dynamic Layout in \protect\CFA}
+
+\author{Aaron Moss, Robert Schluntz, Peter Buhr}
+% \email{a3moss@uwaterloo.ca}
+% \email{rschlunt@uwaterloo.ca}
+% \email{pabuhr@uwaterloo.ca}
+% \affiliation{%
+% 	\institution{University of Waterloo}
+% 	\department{David R. Cheriton School of Computer Science}
+% 	\streetaddress{Davis Centre, University of Waterloo}
+% 	\city{Waterloo}
+% 	\state{ON}
+% 	\postcode{N2L 3G1}
+% 	\country{Canada}
+% }
+
+%\terms{generic, tuple, variadic, types}
+%\keywords{generic types, tuple types, variadic types, polymorphic functions, C, Cforall}
+
+\begin{document}
+\maketitle
+
+
+\begin{abstract}
+The C programming language is a foundational technology for modern computing with millions of lines of code implementing everything from commercial operating-systems to hobby projects.
+This installation base and the programmers producing it represent a massive software-engineering investment spanning decades and likely to continue for decades more.
+Nonetheless, C, first standardized over thirty years ago, lacks many features that make programming in more modern languages safer and more productive.
+The goal of the \CFA project is to create an extension of C that provides modern safety and productivity features while still ensuring strong backwards compatibility with C and its programmers.
+Prior projects have attempted similar goals but failed to honour C programming-style; for instance, adding object-oriented or functional programming with garbage collection is a non-starter for many C developers.
+Specifically, \CFA is designed to have an orthogonal feature-set based closely on the C programming paradigm, so that \CFA features can be added \emph{incrementally} to existing C code-bases, and C programmers can learn \CFA extensions on an as-needed basis, preserving investment in existing code and engineers.
+This paper describes two \CFA extensions, generic and tuple types, details how their design avoids shortcomings of similar features in C and other C-like languages, and presents experimental results validating the design.
+\end{abstract}
+
+
+\section{Introduction and Background}
+
+The C programming language is a foundational technology for modern computing with millions of lines of code implementing everything from commercial operating-systems to hobby projects.
+This installation base and the programmers producing it represent a massive software-engineering investment spanning decades and likely to continue for decades more.
+The TIOBE index~\cite{TIOBE} ranks the top 5 most popular programming languages as: Java 16\%, \Textbf{C 7\%}, \Textbf{\CC 5\%}, \Csharp 4\%, Python 4\% = 36\%, where the next 50 languages are less than 3\% each with a long tail.
+The top 3 rankings over the past 30 years are:
+\lstDeleteShortInline@%
+\begin{center}
+\setlength{\tabcolsep}{10pt}
+\begin{tabular}{@{}rccccccc@{}}
+		& 2017	& 2012	& 2007	& 2002	& 1997	& 1992	& 1987		\\ \hline
+Java	& 1		& 1		& 1		& 1		& 12	& -		& -			\\
+\Textbf{C}	& \Textbf{2}& \Textbf{2}& \Textbf{2}& \Textbf{2}& \Textbf{1}& \Textbf{1}& \Textbf{1}	\\
+\CC		& 3		& 3		& 3		& 3		& 2		& 2		& 4			\\
+\end{tabular}
+\end{center}
+\lstMakeShortInline@%
+Love it or hate it, C is extremely popular, highly used, and one of the few systems languages.
+In many cases, \CC is used solely as a better C.
+Nonetheless, C, first standardized over thirty years ago, lacks many features that make programming in more modern languages safer and more productive.
+
+\CFA (pronounced ``C-for-all'', and written \CFA or Cforall) is an evolutionary extension of the C programming language that aims to add modern language features to C while maintaining both source compatibility with C and a familiar programming model for programmers.
+The four key design goals for \CFA~\cite{Bilson03} are:
+(1) The behaviour of standard C code must remain the same when translated by a \CFA compiler as when translated by a C compiler;
+(2) Standard C code must be as fast and as small when translated by a \CFA compiler as when translated by a C compiler;
+(3) \CFA code must be at least as portable as standard C code;
+(4) Extensions introduced by \CFA must be translated in the most efficient way possible.
+These goals ensure existing C code-bases can be converted to \CFA incrementally with minimal effort, and C programmers can productively generate \CFA code without training beyond the features being used.
+\CC is used similarly, but has the disadvantages of multiple legacy design-choices that cannot be updated and active divergence of the language model from C, requiring significant effort and training to incrementally add \CC to a C-based project.
+
+\CFA is currently implemented as a source-to-source translator from \CFA to the GCC-dialect of C~\cite{GCCExtensions}, allowing it to leverage the portability and code optimizations provided by GCC, meeting goals (1)--(3).
+Ultimately, a compiler is necessary for advanced features and optimal performance.
+
+This paper identifies shortcomings in existing approaches to generic and variadic data types in C-like languages and presents a design for generic and variadic types avoiding those shortcomings.
+Specifically, the solution is both reusable and type-checked, as well as conforming to the design goals of \CFA with ergonomic use of existing C abstractions.
+The new constructs are empirically compared with both standard C and \CC; the results show the new design is comparable in performance.
+
+
+\subsection{Polymorphic Functions}
+\label{sec:poly-fns}
+
+\CFA{}\hspace{1pt}'s polymorphism was originally formalized by \cite{Ditchfield92}, and first implemented by \cite{Bilson03}.
+The signature feature of \CFA is parametric-polymorphic functions~\cite{forceone:impl,Cormack90,Duggan96} with functions generalized using a @forall@ clause (giving the language its name):
+\begin{lstlisting}
+`forall( otype T )` T identity( T val ) { return val; }
+int forty_two = identity( 42 );				$\C{// T is bound to int, forty\_two == 42}$
+\end{lstlisting}
+The @identity@ function above can be applied to any complete \emph{object type} (or @otype@).
+The type variable @T@ is transformed into a set of additional implicit parameters encoding sufficient information about @T@ to create and return a variable of that type.
+The \CFA implementation passes the size and alignment of the type represented by an @otype@ parameter, as well as an assignment operator, constructor, copy constructor and destructor.
+If this extra information is not needed, \eg for a pointer, the type parameter can be declared as a \emph{data type} (or @dtype@).
+
+In \CFA, the polymorphism runtime-cost is spread over each polymorphic call, due to passing more arguments to polymorphic functions;
+the experiments in Section~\ref{sec:eval} show this overhead is similar to \CC virtual-function calls.
+A design advantage is that, unlike \CC template-functions, \CFA polymorphic-functions are compatible with C \emph{separate compilation}, preventing compilation and code bloat.
+
+Since bare polymorphic-types provide a restricted set of available operations, \CFA provides a \emph{type assertion}~\cite[pp.~37-44]{Alphard} mechanism to provide further type information, where type assertions may be variable or function declarations that depend on a polymorphic type-variable.
+For example, the function @twice@ can be defined using the \CFA syntax for operator overloading:
+\begin{lstlisting}
+forall( otype T `| { T ?+?(T, T); }` ) T twice( T x ) { return x + x; }	$\C{// ? denotes operands}$
+int val = twice( twice( 3.7 ) );
+\end{lstlisting}
+which works for any type @T@ with a matching addition operator.
+The polymorphism is achieved by creating a wrapper function for calling @+@ with @T@ bound to @double@, then passing this function to the first call of @twice@.
+There is now the option of using the same @twice@ and converting the result to @int@ on assignment, or creating another @twice@ with type parameter @T@ bound to @int@ because \CFA uses the return type~\cite{Cormack81,Baker82,Ada} in its type analysis.
+The first approach has a late conversion from @double@ to @int@ on the final assignment, while the second has an eager conversion to @int@.
+\CFA minimizes the number of conversions and their potential to lose information, so it selects the first approach, which corresponds with C-programmer intuition.
+
+Crucial to the design of a new programming language are the libraries to access thousands of external software features.
+Like \CC, \CFA inherits a massive compatible library-base, where other programming languages must rewrite or provide fragile inter-language communication with C.
+A simple example is leveraging the existing type-unsafe (@void *@) C @bsearch@ to binary search a sorted floating-point array:
+\begin{lstlisting}
+void * bsearch( const void * key, const void * base, size_t nmemb, size_t size,
+				int (* compar)( const void *, const void * ));
+int comp( const void * t1, const void * t2 ) { return *(double *)t1 < *(double *)t2 ? -1 :
+				*(double *)t2 < *(double *)t1 ? 1 : 0; }
+double key = 5.0, vals[10] = { /* 10 sorted floating-point values */ };
+double * val = (double *)bsearch( &key, vals, 10, sizeof(vals[0]), comp );	$\C{// search sorted array}$
+\end{lstlisting}
+which can be augmented simply with generalized, type-safe, \CFA-overloaded wrappers:
+\begin{lstlisting}
+forall( otype T | { int ?<?( T, T ); } ) T * bsearch( T key, const T * arr, size_t size ) {
+	int comp( const void * t1, const void * t2 ) { /* as above with double changed to T */ }
+	return (T *)bsearch( &key, arr, size, sizeof(T), comp ); }
+forall( otype T | { int ?<?( T, T ); } ) unsigned int bsearch( T key, const T * arr, size_t size ) {
+	T * result = bsearch( key, arr, size );	$\C{// call first version}$
+	return result ? result - arr : size; }	$\C{// pointer subtraction includes sizeof(T)}$
+double * val = bsearch( 5.0, vals, 10 );	$\C{// selection based on return type}$
+int posn = bsearch( 5.0, vals, 10 );
+\end{lstlisting}
+The nested function @comp@ provides the hidden interface from typed \CFA to untyped (@void *@) C, plus the cast of the result.
+Providing a hidden @comp@ function in \CC is awkward as lambdas do not use C calling-conventions and template declarations cannot appear at block scope.
+As well, an alternate kind of return is made available: position versus pointer to found element.
+\CC's type-system cannot disambiguate between the two versions of @bsearch@ because it does not use the return type in overload resolution, nor can \CC separately compile a templated @bsearch@.
+
+\CFA has replacement libraries condensing hundreds of existing C functions into tens of \CFA overloaded functions, all without rewriting the actual computations.
+For example, it is possible to write a type-safe \CFA wrapper @malloc@ based on the C @malloc@:
+\begin{lstlisting}
+forall( dtype T | sized(T) ) T * malloc( void ) { return (T *)malloc( sizeof(T) ); }
+int * ip = malloc();						$\C{// select type and size from left-hand side}$
+double * dp = malloc();
+struct S {...} * sp = malloc();
+\end{lstlisting}
+where the return type supplies the type/size of the allocation, which is impossible in most type systems.
+
+Call-site inferencing and nested functions provide a localized form of inheritance.
+For example, the \CFA @qsort@ only sorts in ascending order using @<@.
+However, it is trivial to locally change this behaviour:
+\begin{lstlisting}
+forall( otype T | { int ?<?( T, T ); } ) void qsort( const T * arr, size_t size ) { /* use C qsort */ }
+{	int ?<?( double x, double y ) { return x `>` y; }	$\C{// locally override behaviour}$
+	qsort( vals, size );					$\C{// descending sort}$
+}
+\end{lstlisting}
+Within the block, the nested version of @<@ performs @>@ and this local version overrides the built-in @<@ so it is passed to @qsort@.
+Hence, programmers can easily form local environments, adding and modifying appropriate functions, to maximize reuse of other existing functions and types.
+
+Finally, \CFA allows variable overloading:
+\begin{lstlisting}
+short int MAX = ...;   int MAX = ...;  double MAX = ...;
+short int s = MAX;    int i = MAX;    double d = MAX;   $\C{// select correct MAX}$
+\end{lstlisting}
+Here, the single name @MAX@ replaces all the C type-specific names: @SHRT_MAX@, @INT_MAX@, @DBL_MAX@.
+As well, restricted constant overloading is allowed for the values @0@ and @1@, which have special status in C, \eg the value @0@ is both an integer and a pointer literal, so its meaning depends on context.
+In addition, several operations are defined in terms of the values @0@ and @1@, \eg:
+\begin{lstlisting}
+int x;
+if (x) x++;									$\C{// if (x != 0) x += 1;}$
+\end{lstlisting}
+Every @if@ and iteration statement in C compares the condition with @0@, and every increment and decrement operator is semantically equivalent to adding or subtracting the value @1@ and storing the result.
+Due to these rewrite rules, the values @0@ and @1@ have the types @zero_t@ and @one_t@ in \CFA, which allows overloading various operations for new types that seamlessly connect to all special @0@ and @1@ contexts.
+The types @zero_t@ and @one_t@ have special built-in implicit conversions to the various integral types, and a conversion to pointer types for @0@, which allows standard C code involving @0@ and @1@ to work as normal.
+
+
+\subsection{Traits}
+
+\CFA provides \emph{traits} to name a group of type assertions, where the trait name allows specifying the same set of assertions in multiple locations, preventing repetition mistakes at each function declaration:
+\begin{lstlisting}
+trait summable( otype T ) {
+	void ?{}( T *, zero_t );				$\C{// constructor from 0 literal}$
+	T ?+?( T, T );							$\C{// assortment of additions}$
+	T ?+=?( T *, T );
+	T ++?( T * );
+	T ?++( T * ); };
+forall( otype T `| summable( T )` ) T sum( T a[$\,$], size_t size ) {	$\C{// use trait}$
+	`T` total = { `0` };					$\C{// instantiate T from 0 by calling its constructor}$
+	for ( unsigned int i = 0; i < size; i += 1 ) total `+=` a[i]; $\C{// select appropriate +}$
+	return total; }
+\end{lstlisting}
+
+In fact, the set of @summable@ trait operators is incomplete, as it is missing assignment for type @T@, but @otype@ is syntactic sugar for the following implicit trait:
+\begin{lstlisting}
+trait otype( dtype T | sized(T) ) {  // sized is a pseudo-trait for types with known size and alignment
+	void ?{}( T * );						$\C{// default constructor}$
+	void ?{}( T *, T );						$\C{// copy constructor}$
+	void ?=?( T *, T );						$\C{// assignment operator}$
+	void ^?{}( T * ); };					$\C{// destructor}$
+\end{lstlisting}
+Given the information provided for an @otype@, variables of polymorphic type can be treated as if they were a complete type: stack-allocatable, default or copy-initialized, assigned, and deleted.
+
+In summation, the \CFA type-system uses \emph{nominal typing} for concrete types, matching with the C type-system, and \emph{structural typing} for polymorphic types.
+Hence, trait names play no part in type equivalence;
+the names are simply macros for a list of polymorphic assertions, which are expanded at usage sites.
+Nevertheless, trait names form a logical subtype-hierarchy with @dtype@ at the top, where traits often contain overlapping assertions, \eg operator @+@.
+Traits are used like interfaces in Java or abstract base-classes in \CC, but without the nominal inheritance-relationships.
+Instead, each polymorphic function (or generic type) defines the structural type needed for its execution (polymorphic type-key), and this key is fulfilled at each call site from the lexical environment, which is similar to Go~\cite{Go} interfaces.
+Hence, new lexical scopes and nested functions are used extensively to create local subtypes, as in the @qsort@ example, without having to manage a nominal-inheritance hierarchy.
+(Nominal inheritance can be approximated with traits using marker variables or functions, as is done in Go.)
+
+% Nominal inheritance can be simulated with traits using marker variables or functions:
+% \begin{lstlisting}
+% trait nominal(otype T) {
+%     T is_nominal;
+% };
+% int is_nominal;								$\C{// int now satisfies the nominal trait}$
+% \end{lstlisting}
+%
+% Traits, however, are significantly more powerful than nominal-inheritance interfaces; most notably, traits may be used to declare a relationship \emph{among} multiple types, a property that may be difficult or impossible to represent in nominal-inheritance type systems:
+% \begin{lstlisting}
+% trait pointer_like(otype Ptr, otype El) {
+%     lvalue El *?(Ptr);						$\C{// Ptr can be dereferenced into a modifiable value of type El}$
+% }
+% struct list {
+%     int value;
+%     list * next;								$\C{// may omit "struct" on type names as in \CC}$
+% };
+% typedef list * list_iterator;
+%
+% lvalue int *?( list_iterator it ) { return it->value; }
+% \end{lstlisting}
+% In the example above, @(list_iterator, int)@ satisfies @pointer_like@ by the user-defined dereference function, and @(list_iterator, list)@ also satisfies @pointer_like@ by the built-in dereference operator for pointers. Given a declaration @list_iterator it@, @*it@ can be either an @int@ or a @list@, with the meaning disambiguated by context (\eg @int x = *it;@ interprets @*it@ as an @int@, while @(*it).value = 42;@ interprets @*it@ as a @list@).
+% While a nominal-inheritance system with associated types could model one of those two relationships by making @El@ an associated type of @Ptr@ in the @pointer_like@ implementation, few such systems could model both relationships simultaneously.
+
+
+\section{Generic Types}
+
+One of the known shortcomings of standard C is that it does not provide reusable type-safe abstractions for generic data structures and algorithms.
+Broadly speaking, there are three approaches to implement abstract data-structures in C.
+One approach is to write bespoke data-structures for each context in which they are needed.
+While this approach is flexible and supports integration with the C type-checker and tooling, it is also tedious and error-prone, especially for more complex data structures.
+A second approach is to use @void *@--based polymorphism, \eg the C standard-library functions @bsearch@ and @qsort@; an approach which does allow reuse of code for common functionality.
+However, basing all polymorphism on @void *@ eliminates the type-checker's ability to ensure that argument types are properly matched, often requiring a number of extra function parameters, pointer indirection, and dynamic allocation that would not otherwise be needed.
+A third approach to generic code is to use preprocessor macros, which does allow the generated code to be both generic and type-checked, but errors may be difficult to interpret.
+Furthermore, writing and using preprocessor macros can be unnatural and inflexible.
+
+\CC, Java, and other languages use \emph{generic types} to produce type-safe abstract data-types.
+\CFA also implements generic types that integrate efficiently and naturally with the existing polymorphic functions, while retaining backwards compatibility with C and providing separate compilation.
+However, for known concrete parameters, the generic-type definition can be inlined, like \CC templates.
+
+A generic type can be declared by placing a @forall@ specifier on a @struct@ or @union@ declaration, and instantiated using a parenthesized list of types after the type name:
+\begin{lstlisting}
+forall( otype R, otype S ) struct pair {
+	R first;
+	S second;
+};
+forall( otype T ) T value( pair( const char *, T ) p ) { return p.second; }
+forall( dtype F, otype T ) T value_p( pair( F *, T * ) p ) { return * p.second; }
+pair( const char *, int ) p = { "magic", 42 };
+int magic = value( p );
+pair( void *, int * ) q = { 0, &p.second };
+magic = value_p( q );
+double d = 1.0;
+pair( double *, double * ) r = { &d, &d };
+d = value_p( r );
+\end{lstlisting}
+
+\CFA classifies generic types as either \emph{concrete} or \emph{dynamic}.
+Concrete types have a fixed memory layout regardless of type parameters, while dynamic types vary in memory layout depending on their type parameters.
+A type may have polymorphic parameters but still be concrete, called \emph{dtype-static}.
+Polymorphic pointers are an example of dtype-static types, \eg @forall(dtype T) T *@ is a polymorphic type, but for any @T@, @T *@ is a fixed-size pointer and can therefore be represented by a @void *@ in code generation.
+
+\CFA generic types also allow checked argument-constraints.
+For example, the following declaration of a sorted set-type ensures the set key supports equality and relational comparison:
+\begin{lstlisting}
+forall( otype Key | { _Bool ?==?(Key, Key); _Bool ?<?(Key, Key); } ) struct sorted_set;
+\end{lstlisting}
+
+
+\subsection{Concrete Generic-Types}
+
+The \CFA translator template-expands concrete generic-types into new structure types, affording maximal inlining.
+To enable inter-operation among equivalent instantiations of a generic type, the translator saves the set of instantiations currently in scope and reuses the generated structure declarations where appropriate.
+A function declaration that accepts or returns a concrete generic-type produces a declaration for the instantiated structure in the same scope, which all callers may reuse.
+For example, the concrete instantiation for @pair( const char *, int )@ is:
+\begin{lstlisting}
+struct _pair_conc1 {
+	const char * first;
+	int second;
+};
+\end{lstlisting}
+
+A concrete generic-type with dtype-static parameters is also expanded to a structure type, but this type is used for all matching instantiations.
+In the above example, the @pair( F *, T * )@ parameter to @value_p@ is such a type; its expansion is below and it is used as the type of the variables @q@ and @r@ as well, with casts for member access where appropriate:
+\begin{lstlisting}
+struct _pair_conc0 {
+	void * first;
+	void * second;
+};
+\end{lstlisting}
+
+
+\subsection{Dynamic Generic-Types}
+
+Though \CFA implements concrete generic-types efficiently, it also has a fully general system for dynamic generic types.
+As mentioned in Section~\ref{sec:poly-fns}, @otype@ function parameters (in fact all @sized@ polymorphic parameters) come with implicit size and alignment parameters provided by the caller.
+Dynamic generic-types also have an \emph{offset array} containing structure-member offsets.
+A dynamic generic-union needs no such offset array, as all members are at offset 0, but size and alignment are still necessary.
+Access to members of a dynamic structure is provided at runtime via base-displacement addressing with the structure pointer and the member offset (similar to the @offsetof@ macro), moving a compile-time offset calculation to runtime.
+
+The offset arrays are statically generated where possible.
+If a dynamic generic-type is declared to be passed or returned by value from a polymorphic function, the translator can safely assume the generic type is complete (\ie has a known layout) at any call-site, and the offset array is passed from the caller;
+if the generic type is concrete at the call site, the elements of this offset array can even be statically generated using the C @offsetof@ macro.
+As an example, @p.second@ in the @value@ function above is implemented as @*(p + _offsetof_pair[1])@, where @p@ is a @void *@, and @_offsetof_pair@ is the offset array passed into @value@ for @pair( const char *, T )@.
+The offset array @_offsetof_pair@ is generated at the call site as @size_t _offsetof_pair[] = { offsetof(_pair_conc1, first), offsetof(_pair_conc1, second) }@.
+
+In some cases the offset arrays cannot be statically generated.
+For instance, modularity is generally provided in C by including an opaque forward-declaration of a structure and associated accessor and mutator functions in a header file, with the actual implementations in a separately-compiled @.c@ file.
+\CFA supports this pattern for generic types, but the caller does not know the actual layout or size of the dynamic generic-type, and only holds it by a pointer.
+The \CFA translator automatically generates \emph{layout functions} for cases where the size, alignment, and offset array of a generic struct cannot be passed into a function from that function's caller.
+These layout functions take as arguments pointers to size and alignment variables and a caller-allocated array of member offsets, as well as the size and alignment of all @sized@ parameters to the generic structure (un@sized@ parameters are forbidden from being used in a context that affects layout).
+Results of these layout functions are cached so that they are only computed once per type per function. %, as in the example below for @pair@.
+Layout functions also allow generic types to be used in a function definition without reflecting them in the function signature.
+For instance, a function that strips duplicate values from an unsorted @vector(T)@ would likely have a pointer to the vector as its only explicit parameter, but use some sort of @set(T)@ internally to test for duplicate values.
+This function could acquire the layout for @set(T)@ by calling its layout function with the layout of @T@ implicitly passed into the function.
+
+Whether a type is concrete, dtype-static, or dynamic is decided solely on the @forall@'s type parameters.
+This design allows opaque forward declarations of generic types, \eg @forall(otype T)@ @struct Box@ -- as in C, all uses of @Box(T)@ can be separately compiled, and callers from other translation units know the proper calling conventions to use.
+If the definition of a structure type were included in deciding whether a generic type is dynamic or concrete, some further types could be recognized as dtype-static (\eg @forall(otype T)@ @struct unique_ptr { T * p; }@ does not depend on @T@ for its layout, but the existence of an @otype@ parameter means that it \emph{could}), but preserving separate compilation (and the associated C compatibility) in the existing design is judged to be an appropriate trade-off.
+
+
+\subsection{Applications}
+\label{sec:generic-apps}
+
+The reuse of dtype-static structure instantiations enables useful programming patterns at zero runtime cost.
+The most important such pattern is using @forall(dtype T) T *@ as a type-checked replacement for @void *@, \eg creating a lexicographic comparison for pairs of pointers used by @bsearch@ or @qsort@:
+\begin{lstlisting}
+forall(dtype T) int lexcmp( pair( T *, T * ) * a, pair( T *, T * ) * b, int (* cmp)( T *, T * ) ) {
+	return cmp( a->first, b->first ) ? : cmp( a->second, b->second );
+}
+\end{lstlisting}
+Since @pair(T *, T * )@ is a concrete type, there are no implicit parameters passed to @lexcmp@, so the generated code is identical to a function written in standard C using @void *@, yet the \CFA version is type-checked to ensure the fields of both pairs and the arguments to the comparison function match in type.
+
+Another useful pattern enabled by reused dtype-static type instantiations is zero-cost \emph{tag-structures}.
+Sometimes information is only used for type-checking and can be omitted at runtime, \eg:
+\begin{lstlisting}
+forall(dtype Unit) struct scalar { unsigned long value; };
+struct metres {};
+struct litres {};
+
+forall(dtype U) scalar(U) ?+?( scalar(U) a, scalar(U) b ) {
+	return (scalar(U)){ a.value + b.value };
+}
+scalar(metres) half_marathon = { 21093 };
+scalar(litres) swimming_pool = { 2500000 };
+scalar(metres) marathon = half_marathon + half_marathon;
+scalar(litres) two_pools = swimming_pool + swimming_pool;
+marathon + swimming_pool;					$\C{// compilation ERROR}$
+\end{lstlisting}
+@scalar@ is a dtype-static type, so all uses have a single structure definition, containing @unsigned long@, and can share the same implementations of common functions like @?+?@.
+These implementations may even be separately compiled, unlike \CC template functions.
+However, the \CFA type-checker ensures matching types are used by all calls to @?+?@, preventing nonsensical computations like adding a length to a volume.
+
+
+\section{Tuples}
+\label{sec:tuples}
+
+In many languages, functions can return at most one value;
+however, many operations have multiple outcomes, some exceptional.
+Consider C's @div@ and @remquo@ functions, which return the quotient and remainder for a division of integer and floating-point values, respectively.
+\begin{lstlisting}
+typedef struct { int quo, rem; } div_t;		$\C{// from stdlib.h}$
+div_t div( int num, int den );
+double remquo( double num, double den, int * quo );
+div_t qr = div( 13, 5 );					$\C{// return quotient/remainder aggregate}$
+int q;
+double r = remquo( 13.5, 5.2, &q );			$\C{// return remainder, alias quotient}$
+\end{lstlisting}
+@div@ aggregates the quotient/remainder in a structure, while @remquo@ aliases a parameter to an argument.
+Both approaches are awkward.
+Alternatively, a programming language can directly support returning multiple values, \eg in \CFA:
+\begin{lstlisting}
+[ int, int ] div( int num, int den );		$\C{// return two integers}$
+[ double, double ] div( double num, double den ); $\C{// return two doubles}$
+int q, r;									$\C{// overloaded variable names}$
+double q, r;
+[ q, r ] = div( 13, 5 );					$\C{// select appropriate div and q, r}$
+[ q, r ] = div( 13.5, 5.2 );				$\C{// assign into tuple}$
+\end{lstlisting}
+Clearly, this approach is straightforward to understand and use;
+therefore, why do so few programming languages support this obvious feature, or provide it only awkwardly?
+The answer is that there are complex consequences that cascade through multiple aspects of the language, especially the type-system.
+This section shows these consequences and how \CFA handles them.
+
+
+\subsection{Tuple Expressions}
+
+The addition of multiple-return-value functions (MRVF) is useless without a syntax for accepting multiple values at the call-site.
+The simplest mechanism for capturing the return values is variable assignment, allowing the values to be retrieved directly.
+As such, \CFA allows assigning multiple values from a function into multiple variables, using a square-bracketed list of lvalue expressions (as above), called a \emph{tuple}.
+
+However, functions also use \emph{composition} (nested calls), with the direct consequence that MRVFs must also support composition to be orthogonal with single-returning-value functions (SRVF), \eg:
+\begin{lstlisting}
+printf( "%d %d\n", div( 13, 5 ) );			$\C{// return values separated into arguments}$
+\end{lstlisting}
+Here, the values returned by @div@ are composed with the call to @printf@ by flattening the tuple into separate arguments.
+However, the \CFA type-system must support significantly more complex composition:
+\begin{lstlisting}
+[ int, int ] foo$\(_1\)$( int );			$\C{// overloaded foo functions}$
+[ double ] foo$\(_2\)$( int );
+void bar( int, double, double );
+bar( foo( 3 ), foo( 3 ) );
+\end{lstlisting}
+The type-resolver only has the tuple return-types to resolve the call to @bar@ as the @foo@ parameters are identical, which involves unifying the possible @foo@ functions with @bar@'s parameter list.
+No combination of @foo@s is an exact match with @bar@'s parameters, so the resolver applies C conversions.
+The minimal cost is @bar( foo@$_1$@( 3 ), foo@$_2$@( 3 ) )@, giving (@int@, {\color{ForestGreen}@int@}, @double@) to (@int@, {\color{ForestGreen}@double@}, @double@) with one {\color{ForestGreen}safe} (widening) conversion from @int@ to @double@ versus ({\color{red}@double@}, {\color{ForestGreen}@int@}, {\color{ForestGreen}@int@}) to ({\color{red}@int@}, {\color{ForestGreen}@double@}, {\color{ForestGreen}@double@}) with one {\color{red}unsafe} (narrowing) conversion from @double@ to @int@ and two safe conversions.
+
+
+\subsection{Tuple Variables}
+
+An important observation from function composition is that new variable names are not required to initialize parameters from an MRVF.
+\CFA also allows declaration of tuple variables that can be initialized from an MRVF, since it can be awkward to declare multiple variables of different types, \eg:
+\begin{lstlisting}
+[ int, int ] qr = div( 13, 5 );				$\C{// tuple-variable declaration and initialization}$
+[ double, double ] qr = div( 13.5, 5.2 );
+\end{lstlisting}
+where the tuple variable-name serves the same purpose as the parameter name(s).
+Tuple variables can be composed of any types, except for array types, since array sizes are generally unknown in C.
+
+One way to access the tuple-variable components is with assignment or composition:
+\begin{lstlisting}
+[ q, r ] = qr;								$\C{// access tuple-variable components}$
+printf( "%d %d\n", qr );
+\end{lstlisting}
+\CFA also supports \emph{tuple indexing} to access single components of a tuple expression:
+\begin{lstlisting}
+[int, int] * p = &qr;						$\C{// tuple pointer}$
+int rem = qr`.1`;							$\C{// access remainder}$
+int quo = div( 13, 5 )`.0`;					$\C{// access quotient}$
+p`->0` = 5;									$\C{// change quotient}$
+bar( qr`.1`, qr );							$\C{// pass remainder and quotient/remainder}$
+rem = [42, div( 13, 5 )]`.0.1`;				$\C{// access 2nd component of 1st component of tuple expression}$
+\end{lstlisting}
+
+
+\subsection{Flattening and Restructuring}
+
+In function call contexts, tuples support implicit flattening and restructuring conversions.
+Tuple flattening recursively expands a tuple into the list of its basic components.
+Tuple structuring packages a list of expressions into a value of tuple type, \eg:
+%\lstDeleteShortInline@%
+%\par\smallskip
+%\begin{tabular}{@{}l@{\hspace{1.5\parindent}}||@{\hspace{1.5\parindent}}l@{}}
+\begin{lstlisting}
+int f( int, int );
+int g( [int, int] );
+int h( int, [int, int] );
+[int, int] x;
+int y;
+f( x );			$\C{// flatten}$
+g( y, 10 );		$\C{// structure}$
+h( x, y );		$\C{// flatten and structure}$
+\end{lstlisting}
+%\end{lstlisting}
+%&
+%\begin{lstlisting}
+%\end{tabular}
+%\smallskip\par\noindent
+%\lstMakeShortInline@%
+In the call to @f@, @x@ is implicitly flattened so the components of @x@ are passed as the two arguments.
+In the call to @g@, the values @y@ and @10@ are structured into a single argument of type @[int, int]@ to match the parameter type of @g@.
+Finally, in the call to @h@, @x@ is flattened to yield an argument list of length 3, of which the first component of @x@ is passed as the first parameter of @h@, and the second component of @x@ and @y@ are structured into the second argument of type @[int, int]@.
+The flexible structure of tuples permits a simple and expressive function call syntax to work seamlessly with both SRVF and MRVF, and with any number of arguments of arbitrarily complex structure.
+
+
+\subsection{Tuple Assignment}
+
+An assignment where the left side is a tuple type is called \emph{tuple assignment}.
+There are two kinds of tuple assignment depending on whether the right side of the assignment operator has a tuple type or a non-tuple type, called \emph{multiple} and \emph{mass assignment}, respectively.
+%\lstDeleteShortInline@%
+%\par\smallskip
+%\begin{tabular}{@{}l@{\hspace{1.5\parindent}}||@{\hspace{1.5\parindent}}l@{}}
+\begin{lstlisting}
+int x = 10;
+double y = 3.5;
+[int, double] z;
+z = [x, y];									$\C{// multiple assignment}$
+[x, y] = z;									$\C{// multiple assignment}$
+z = 10;										$\C{// mass assignment}$
+[y, x] = 3.14;								$\C{// mass assignment}$
+\end{lstlisting}
+%\end{lstlisting}
+%&
+%\begin{lstlisting}
+%\end{tabular}
+%\smallskip\par\noindent
+%\lstMakeShortInline@%
+Both kinds of tuple assignment have parallel semantics, so that each value on the left and right side is evaluated before any assignments occur.
+As a result, it is possible to swap the values in two variables without explicitly creating any temporary variables or calling a function, \eg, @[x, y] = [y, x]@.
+This semantics means mass assignment differs from C cascading assignment (\eg @a = b = c@) in that conversions are applied in each individual assignment, which prevents data loss from the chain of conversions that can happen during a cascading assignment.
+For example, @[y, x] = 3.14@ performs the assignments @y = 3.14@ and @x = 3.14@, yielding @y == 3.14@ and @x == 3@;
+whereas, C cascading assignment @y = x = 3.14@ performs the assignments @x = 3.14@ and @y = x@, yielding @3@ in @y@ and @x@.
+Finally, tuple assignment is an expression where the result type is the type of the left-hand side of the assignment, just like all other assignment expressions in C.
+This example shows mass, multiple, and cascading assignment used in one expression:
+\begin{lstlisting}
+void f( [int, int] );
+f( [x, y] = z = 1.5 );						$\C{// assignments in parameter list}$
+\end{lstlisting}
+
+
+\subsection{Member Access}
+
+It is also possible to access multiple fields from a single expression using a \emph{member-access}.
+The result is a single tuple-valued expression whose type is the tuple of the types of the members, \eg:
+\begin{lstlisting}
+struct S { int x; double y; char * z; } s;
+s.[x, y, z] = 0;
+\end{lstlisting}
+Here, the mass assignment sets all members of @s@ to zero.
+Since tuple-index expressions are a form of member-access expression, it is possible to use tuple-index expressions in conjunction with member tuple expressions to manually restructure a tuple (\eg rearrange, drop, and duplicate components).
+%\lstDeleteShortInline@%
+%\par\smallskip
+%\begin{tabular}{@{}l@{\hspace{1.5\parindent}}||@{\hspace{1.5\parindent}}l@{}}
+\begin{lstlisting}
+[int, int, long, double] x;
+void f( double, long );
+x.[0, 1] = x.[1, 0];						$\C{// rearrange: [x.0, x.1] = [x.1, x.0]}$
+f( x.[0, 3] );								$\C{// drop: f(x.0, x.3)}$
+[int, int, int] y = x.[2, 0, 2];			$\C{// duplicate: [y.0, y.1, y.2] = [x.2, x.0, x.2]}$
+\end{lstlisting}
+%\end{lstlisting}
+%&
+%\begin{lstlisting}
+%\end{tabular}
+%\smallskip\par\noindent
+%\lstMakeShortInline@%
+It is also possible for a member access to contain other member accesses, \eg:
+\begin{lstlisting}
+struct A { double i; int j; };
+struct B { int * k; short l; };
+struct C { int x; A y; B z; } v;
+v.[x, y.[i, j], z.k];						$\C{// [v.x, [v.y.i, v.y.j], v.z.k]}$
+\end{lstlisting}
+
+
+\begin{comment}
+\subsection{Casting}
+
+In C, the cast operator is used to explicitly convert between types.
+In \CFA, the cast operator has a secondary use as type ascription.
+That is, a cast can be used to select the type of an expression when it is ambiguous, as in the call to an overloaded function:
+\begin{lstlisting}
+int f();     // (1)
+double f();  // (2)
+
+f();       // ambiguous - (1),(2) both equally viable
+(int)f();  // choose (2)
+\end{lstlisting}
+
+Since casting is a fundamental operation in \CFA, casts should be given a meaningful interpretation in the context of tuples.
+Taking a look at standard C provides some guidance with respect to the way casts should work with tuples:
+\begin{lstlisting}
+int f();
+void g();
+
+(void)f();  // (1)
+(int)g();  // (2)
+\end{lstlisting}
+In C, (1) is a valid cast, which calls @f@ and discards its result.
+On the other hand, (2) is invalid, because @g@ does not produce a result, so requesting an @int@ to materialize from nothing is nonsensical.
+Generalizing these principles, any cast wherein the number of components increases as a result of the cast is invalid, while casts that have the same or fewer number of components may be valid.
+
+Formally, a cast to tuple type is valid when $T_n \leq S_m$, where $T_n$ is the number of components in the target type and $S_m$ is the number of components in the source type, and for each $i$ in $[0, n)$, $S_i$ can be cast to $T_i$.
+Excess elements ($S_j$ for all $j$ in $[n, m)$) are evaluated, but their values are discarded so that they are not included in the result expression.
+This approach follows naturally from the way that a cast to @void@ works in C.
+
+For example, in
+\begin{lstlisting}
+[int, int, int] f();
+[int, [int, int], int] g();
+
+([int, double])f();           $\C{// (1)}$
+([int, int, int])g();         $\C{// (2)}$
+([void, [int, int]])g();      $\C{// (3)}$
+([int, int, int, int])g();    $\C{// (4)}$
+([int, [int, int, int]])g();  $\C{// (5)}$
+\end{lstlisting}
+
+(1) discards the last element of the return value and converts the second element to @double@.
+Since @int@ is effectively a 1-element tuple, (2) discards the second component of the second element of the return value of @g@.
+If @g@ is free of side effects, this expression is equivalent to @[(int)(g().0), (int)(g().1.0), (int)(g().2)]@.
+Since @void@ is effectively a 0-element tuple, (3) discards the first and third return values, which is effectively equivalent to @[(int)(g().1.0), (int)(g().1.1)]@).
+
+Note that a cast is not a function call in \CFA, so flattening and structuring conversions do not occur for cast expressions\footnote{User-defined conversions have been considered, but for compatibility with C and the existing use of casts as type ascription, any future design for such conversions would require more precise matching of types than allowed for function arguments and parameters.}.
+As such, (4) is invalid because the cast target type contains 4 components, while the source type contains only 3.
+Similarly, (5) is invalid because the cast @([int, int, int])(g().1)@ is invalid.
+That is, it is invalid to cast @[int, int]@ to @[int, int, int]@.
+\end{comment}
+
+
+\subsection{Polymorphism}
+
+Tuples also integrate with \CFA polymorphism as a kind of generic type.
+Due to the implicit flattening and structuring conversions involved in argument passing, @otype@ and @dtype@ parameters are restricted to matching only with non-tuple types, \eg:
+\begin{lstlisting}
+forall(otype T, dtype U) void f( T x, U * y );
+f( [5, "hello"] );
+\end{lstlisting}
+where @[5, "hello"]@ is flattened, giving argument list @5, "hello"@, and @T@ binds to @int@ and @U@ binds to @const char@.
+Tuples, however, may contain polymorphic components.
+For example, a plus operator can be written to add two triples together.
+\begin{lstlisting}
+forall(otype T | { T ?+?( T, T ); }) [T, T, T] ?+?( [T, T, T] x, [T, T, T] y ) {
+	return [x.0 + y.0, x.1 + y.1, x.2 + y.2];
+}
+[int, int, int] x;
+int i1, i2, i3;
+[i1, i2, i3] = x + ([10, 20, 30]);
+\end{lstlisting}
+
+Flattening and restructuring conversions are also applied to tuple types in polymorphic type assertions.
+\begin{lstlisting}
+int f( [int, double], double );
+forall(otype T, otype U | { T f( T, U, U ); }) void g( T, U );
+g( 5, 10.21 );
+\end{lstlisting}
+Hence, function parameter and return lists are flattened for the purposes of type unification, allowing the example to pass expression resolution.
+This relaxation is possible by extending the thunk scheme described by~\cite{Bilson03}.
+Whenever a candidate's parameter structure does not exactly match the formal parameter's structure, a thunk is generated to specialize calls to the actual function:
+\begin{lstlisting}
+int _thunk( int _p0, double _p1, double _p2 ) { return f( [_p0, _p1], _p2 ); }
+\end{lstlisting}
+so the thunk provides flattening and structuring conversions to inferred functions, improving the compatibility of tuples and polymorphism.
+These thunks take advantage of GCC C nested-functions to produce closures that have the usual function-pointer signature.
+
+
+\subsection{Variadic Tuples}
+\label{sec:variadic-tuples}
+
+To define variadic functions, \CFA adds a new kind of type parameter, @ttype@ (tuple type).
+Matching against a @ttype@ parameter consumes all remaining argument components and packages them into a tuple, binding to the resulting tuple of types.
+In a given parameter list, there must be at most one @ttype@ parameter, and it must occur last; this matches normal variadic semantics and is strongly reminiscent of \CCeleven variadic templates.
+As such, @ttype@ variables are also called \emph{argument packs}.
+
+Like variadic templates, the main way to manipulate @ttype@ polymorphic functions is via recursion.
+Since nothing is known about a parameter pack by default, assertion parameters are key to doing anything meaningful.
+Unlike variadic templates, @ttype@ polymorphic functions can be separately compiled.
+For example, a generalized @sum@ function written using @ttype@:
+\begin{lstlisting}
+int sum$\(_0\)$() { return 0; }
+forall(ttype Params | { int sum( Params ); } ) int sum$\(_1\)$( int x, Params rest ) {
+	return x + sum( rest );
+}
+sum( 10, 20, 30 );
+\end{lstlisting}
+Since @sum@\(_0\) does not accept any arguments, it is not a valid candidate function for the call @sum(10, 20, 30)@.
+To call @sum@\(_1\), @10@ is matched with @x@, and argument resolution moves on to the argument pack @rest@, which consumes the remainder of the argument list, binding @Params@ to @[20, 30]@.
+The process continues until @Params@ is bound to @[]@, requiring an assertion @int sum()@, which matches @sum@\(_0\) and terminates the recursion.
+Effectively, this algorithm traces as @sum(10, 20, 30)@ $\rightarrow$ @10 + sum(20, 30)@ $\rightarrow$ @10 + (20 + sum(30))@ $\rightarrow$ @10 + (20 + (30 + sum()))@ $\rightarrow$ @10 + (20 + (30 + 0))@.
+
+It is reasonable to take the @sum@ function a step further to enforce a minimum number of arguments:
+\begin{lstlisting}
+int sum( int x, int y ) { return x + y; }
+forall(ttype Params | { int sum( int, Params ); } ) int sum( int x, int y, Params rest ) {
+	return sum( x + y, rest );
+}
+\end{lstlisting}
+One more step permits the summation of any summable type with all arguments of the same type:
+\begin{lstlisting}
+trait summable(otype T) {
+	T ?+?( T, T );
+};
+forall(otype R | summable( R ) ) R sum( R x, R y ) {
+	return x + y;
+}
+forall(otype R, ttype Params | summable(R) | { R sum(R, Params); } ) R sum(R x, R y, Params rest) {
+	return sum( x + y, rest );
+}
+\end{lstlisting}
+Unlike C variadic functions, it is unnecessary to hard-code the number and expected types of the arguments.
+Furthermore, this code is extendable for any user-defined type with a @?+?@ operator.
+Summing arbitrary heterogeneous lists is possible with similar code by adding the appropriate type variables and addition operators.
+
+It is also possible to write a type-safe variadic print function to replace @printf@:
+\begin{lstlisting}
+struct S { int x, y; };
+forall(otype T, ttype Params | { void print(T); void print(Params); }) void print(T arg, Params rest) {
+	print(arg);  print(rest);
+}
+void print( char * x ) { printf( "%s", x ); }
+void print( int x ) { printf( "%d", x ); }
+void print( S s ) { print( "{ ", s.x, ",", s.y, " }" ); }
+print( "s = ", (S){ 1, 2 }, "\n" );
+\end{lstlisting}
+This example showcases a variadic-template-like decomposition of the provided argument list.
+Each non-polymorphic @print@ function prints a single element of a particular type.
+The polymorphic @print@ allows printing any list of types, provided each individual type has a @print@ function.
+The individual print functions can be used to build up more complicated @print@ functions, such as the one for @S@, which cannot be done with @printf@ in C.
+
+Finally, it is possible to use @ttype@ polymorphism to provide arbitrary argument forwarding functions.
+For example, it is possible to write @new@ as a library function:
+\begin{lstlisting}
+forall( otype R, otype S ) void ?{}( pair(R, S) *, R, S );
+forall( dtype T, ttype Params | sized(T) | { void ?{}( T *, Params ); } ) T * new( Params p ) {
+	return ((T *)malloc()){ p };			$\C{// construct into result of malloc}$
+}
+pair( int, char ) * x = new( 42, '!' );
+\end{lstlisting}
+The @new@ function provides the combination of type-safe @malloc@ with a \CFA constructor call, making it impossible to forget constructing dynamically allocated objects.
+This function provides the type-safety of @new@ in \CC, without the need to specify the allocated type again, thanks to return-type inference.
+
+
+\subsection{Implementation}
+
+Tuples are implemented in the \CFA translator via a transformation into \emph{generic types}.
+For each $N$, the first time an $N$-tuple is seen in a scope, a generic type with $N$ type parameters is generated, \eg:
+\begin{lstlisting}
+[int, int] f() {
+	[double, double] x;
+	[int, double, int] y;
+}
+\end{lstlisting}
+is transformed into:
+\begin{lstlisting}
+forall(dtype T0, dtype T1 | sized(T0) | sized(T1)) struct _tuple2 {
+	T0 field_0;								$\C{// generated before the first 2-tuple}$
+	T1 field_1;
+};
+_tuple2(int, int) f() {
+	_tuple2(double, double) x;
+	forall(dtype T0, dtype T1, dtype T2 | sized(T0) | sized(T1) | sized(T2)) struct _tuple3 {
+		T0 field_0;							$\C{// generated before the first 3-tuple}$
+		T1 field_1;
+		T2 field_2;
+	};
+	_tuple3(int, double, int) y;
+}
+\end{lstlisting}
+\begin{sloppypar}
+Tuple expressions are then simply converted directly into compound literals, \eg @[5, 'x', 1.24]@ becomes @(_tuple3(int, char, double)){ 5, 'x', 1.24 }@.
+\end{sloppypar}
+
+\begin{comment}
+Since tuples are essentially structures, tuple indexing expressions are just field accesses:
+\begin{lstlisting}
+void f(int, [double, char]);
+[int, double] x;
+
+x.0+x.1;
+printf("%d %g\n", x);
+f(x, 'z');
+\end{lstlisting}
+Is transformed into:
+\begin{lstlisting}
+void f(int, _tuple2(double, char));
+_tuple2(int, double) x;
+
+x.field_0+x.field_1;
+printf("%d %g\n", x.field_0, x.field_1);
+f(x.field_0, (_tuple2){ x.field_1, 'z' });
+\end{lstlisting}
+Note that due to flattening, @x@ used in the argument position is converted into the list of its fields.
+In the call to @f@, the second and third argument components are structured into a tuple argument.
+Similarly, tuple member expressions are recursively expanded into a list of member access expressions.
+
+Expressions that may contain side effects are made into \emph{unique expressions} before being expanded by the flattening conversion.
+Each unique expression is assigned an identifier and is guaranteed to be executed exactly once:
+\begin{lstlisting}
+void g(int, double);
+[int, double] h();
+g(h());
+\end{lstlisting}
+Internally, this expression is converted to two variables and an expression:
+\begin{lstlisting}
+void g(int, double);
+[int, double] h();
+
+_Bool _unq0_finished_ = 0;
+[int, double] _unq0;
+g(
+	(_unq0_finished_ ? _unq0 : (_unq0 = f(), _unq0_finished_ = 1, _unq0)).0,
+	(_unq0_finished_ ? _unq0 : (_unq0 = f(), _unq0_finished_ = 1, _unq0)).1,
+);
+\end{lstlisting}
+Since argument evaluation order is not specified by the C programming language, this scheme is built to work regardless of evaluation order.
+The first time a unique expression is executed, the actual expression is evaluated and the accompanying boolean is set to true.
+Every subsequent evaluation of the unique expression then results in an access to the stored result of the actual expression.
+Tuple member expressions also take advantage of unique expressions in the case of possible impurity.
+
+Currently, the \CFA translator has a very broad, imprecise definition of impurity, where any function call is assumed to be impure.
+This notion could be made more precise for certain intrinsic, auto-generated, and builtin functions, and could analyze function bodies when they are available to recursively detect impurity, to eliminate some unique expressions.
+
+The various kinds of tuple assignment, constructors, and destructors generate GNU C statement expressions.
+A variable is generated to store the value produced by a statement expression, since its fields may need to be constructed with a non-trivial constructor and it may need to be referred to multiple times, \eg in a unique expression.
+The use of statement expressions allows the translator to arbitrarily generate additional temporary variables as needed, but binds the implementation to a non-standard extension of the C language.
+However, there are other places where the \CFA translator makes use of GNU C extensions, such as its use of nested functions, so this restriction is not new.
+\end{comment}
+
+
+\section{Evaluation}
+\label{sec:eval}
+
+Though \CFA provides significant added functionality over C, these features have a low runtime penalty.
+In fact, \CFA's features for generic programming can enable faster runtime execution than idiomatic @void *@-based C code.
+This claim is demonstrated through a set of generic-code-based micro-benchmarks in C, \CFA, and \CC (see stack implementations in Appendix~\ref{sec:BenchmarkStackImplementation}).
+Since all these languages share a subset essentially comprising standard C, maximal-performance benchmarks would show little runtime variance, other than in length and clarity of source code.
+A more illustrative benchmark measures the costs of idiomatic usage of each language's features.
+Figure~\ref{fig:BenchmarkTest} shows the \CFA benchmark tests for a generic stack based on a singly linked-list, a generic pair-data-structure, and a variadic @print@ routine similar to that in Section~\ref{sec:variadic-tuples}.
+The benchmark test is similar for C and \CC.
+The experiment uses element types @int@ and @pair(_Bool, char)@, and pushes $N=40M$ elements on a generic stack, copies the stack, clears one of the stacks, finds the maximum value in the other stack, and prints $N/2$ (to reduce graph height) constants.
+
+\begin{figure}
+\begin{lstlisting}[xleftmargin=3\parindentlnth,aboveskip=0pt,belowskip=0pt]
+int main( int argc, char * argv[] ) {
+	FILE * out = fopen( "cfa-out.txt", "w" );
+	int maxi = 0, vali = 42;
+	stack(int) si, ti;
+
+	REPEAT_TIMED( "push_int", N, push( &si, vali ); )
+	TIMED( "copy_int", ti = si; )
+	TIMED( "clear_int", clear( &si ); )
+	REPEAT_TIMED( "pop_int", N, 
+		int xi = pop( &ti ); if ( xi > maxi ) { maxi = xi; } )
+	REPEAT_TIMED( "print_int", N/2, print( out, vali, ":", vali, "\n" ); )
+
+	pair(_Bool, char) maxp = { (_Bool)0, '\0' }, valp = { (_Bool)1, 'a' };
+	stack(pair(_Bool, char)) sp, tp;
+
+	REPEAT_TIMED( "push_pair", N, push( &sp, valp ); )
+	TIMED( "copy_pair", tp = sp; )
+	TIMED( "clear_pair", clear( &sp ); )
+	REPEAT_TIMED( "pop_pair", N, 
+		pair(_Bool, char) xp = pop( &tp ); if ( xp > maxp ) { maxp = xp; } )
+	REPEAT_TIMED( "print_pair", N/2, print( out, valp, ":", valp, "\n" ); )
+	fclose(out);
+}
+\end{lstlisting}
+\caption{\protect\CFA Benchmark Test}
+\label{fig:BenchmarkTest}
+\end{figure}
+
+Four versions of the benchmark are implemented: C with @void *@-based polymorphism, \CFA with the presented features, \CC with templates, and \CC using only class inheritance for polymorphism, called \CCV.
+The \CCV variant illustrates an alternative object-oriented idiom where all objects inherit from a base @object@ class, mimicking a Java-like interface;
+hence runtime checks are necessary to safely down-cast objects.
+The most notable difference among the implementations is in memory layout of generic types: \CFA and \CC inline the stack and pair elements into corresponding list and pair nodes, while C and \CCV lack such a capability and instead must store generic objects via pointers to separately-allocated objects.
+For the print benchmark, idiomatic printing is used: the C and \CFA variants use @stdio.h@, while the \CC and \CCV variants use @iostream@; preliminary tests show this distinction has negligible runtime impact.
+Note, the C benchmark uses unchecked casts as there is no runtime mechanism to perform such checks, while \CFA and \CC provide type-safety statically.
+
+Figure~\ref{fig:eval} and Table~\ref{tab:eval} show the results of running the benchmark in Figure~\ref{fig:BenchmarkTest} and its C, \CC, and \CCV equivalents. 
+The graph plots the median of 5 consecutive runs of each program, with an initial warm-up run omitted.
+All code is compiled at \texttt{-O2} by GCC or G++ 6.2.0, with all \CC code compiled as \CCfourteen.
+The benchmarks are run on an Ubuntu 16.04 workstation with 16 GB of RAM and a 6-core AMD FX-6300 CPU with 3.5 GHz maximum clock frequency.
+
+\begin{figure}
+\centering
+\input{timing}
+\caption{Benchmark Timing Results (smaller is better)}
+\label{fig:eval}
+\end{figure}
+
+\begin{table}
+\caption{Properties of benchmark code}
+\label{tab:eval}
+\newcommand{\CT}[1]{\multicolumn{1}{c}{#1}}
+\begin{tabular}{rrrrr}
+									& \CT{C}	& \CT{\CFA}	& \CT{\CC}	& \CT{\CCV}		\\ \hline
+maximum memory usage (MB)			& 10001		& 2502		& 2503		& 11253			\\
+source code size (lines)			& 247		& 222		& 165		& 339			\\
+redundant type annotations (lines)	& 39		& 2			& 2			& 15			\\
+binary size (KB)					& 14		& 229		& 18		& 38			\\
+\end{tabular}
+\end{table}
+
+The C and \CCV variants are generally the slowest with the largest memory footprint, because of their less-efficient memory layout and the pointer-indirection necessary to implement generic types;
+this inefficiency is exacerbated by the second level of generic types in the pair-based benchmarks.
+By contrast, the \CFA and \CC variants run in roughly equivalent time for both the integer and pair of @_Bool@ and @char@ because the storage layout is equivalent, with the inlined libraries (\ie no separate compilation) and greater maturity of the \CC compiler contributing to its lead.
+\CCV is slower than C largely due to the cost of runtime type-checking of down-casts (implemented with @dynamic_cast@).
+There are two outliers in the graph for \CFA: all prints and pop of @pair@.
+Both of these cases result from the complexity of the C-generated polymorphic code, such that the GCC compiler is unable to optimize some dead code and condense nested calls.
+A compiler designed for \CFA could easily perform these optimizations.
+Finally, the binary size for \CFA is larger because of static linking with the \CFA libraries.
+
+\CFA is also competitive in terms of source code size, measured as a proxy for programmer effort.
+The line counts in Table~\ref{tab:eval} include implementations of @pair@ and @stack@ types for all four languages for purposes of direct comparison, although \CFA and \CC have pre-written data structures in their standard libraries that programmers would generally use instead.
+Use of these standard library types has minimal impact on the performance benchmarks, but shrinks the \CFA and \CC benchmarks to 73 and 54 lines, respectively.
+On the other hand, C does not have a generic collections library in its standard distribution, resulting in frequent reimplementation of such collection types by C programmers.
+\CCV does not use the \CC standard template library by construction, and in fact includes the definition of @object@ and wrapper classes for @bool@, @char@, @int@, and @const char *@ in its line count, which inflates this count somewhat, as an actual object-oriented language would include these in the standard library; 
+with their omission, the \CCV line count is similar to C.
+We justify the given line count by noting that many object-oriented languages do not allow implementing new interfaces on library types without subclassing or wrapper types, which may be similarly verbose.
+
+Raw line-count, however, is a fairly rough measure of code complexity;
+another important factor is how much type information the programmer must manually specify, especially where that information is not checked by the compiler.
+Such unchecked type information produces a heavier documentation burden and increased potential for runtime bugs, and is much less common in \CFA than in C, with its manually specified function-pointer arguments and format codes, or \CCV, with its extensive use of un-type-checked downcasts (\eg @object@ to @integer@ when popping a stack, or @object@ to @printable@ when printing the elements of a @pair@).
+To quantify this, the ``redundant type annotations'' line in Table~\ref{tab:eval} counts the number of lines on which the type of a known variable is re-specified, either as a format specifier, explicit downcast, type-specific function, or by name in a @sizeof@, struct literal, or @new@ expression.
+The \CC benchmark uses two redundant type annotations to create new stack nodes, while the C and \CCV benchmarks have several such annotations spread throughout their code.
+The two instances in which the \CFA benchmark still uses redundant type specifiers are to cast the result of a polymorphic @malloc@ call (the @sizeof@ argument is inferred by the compiler).
+These uses are similar to the @new@ expressions in \CC, though the \CFA compiler's type resolver should shortly render even these type casts superfluous.
+
+
+\section{Related Work}
+
+
+\subsection{Polymorphism}
+
+\CC is the most similar language to \CFA;
+both are extensions to C with source and runtime backwards compatibility.
+The fundamental difference is in their engineering approach to C compatibility and programmer expectation.
+While \CC provides good backwards compatibility with C, it has a steep learning curve for many of its extensions.
+For example, polymorphism is provided via three disjoint mechanisms: overloading, inheritance, and templates.
+The overloading is restricted because resolution does not use the return type, inheritance requires learning object-oriented programming and coping with a restricted nominal-inheritance hierarchy, templates cannot be separately compiled resulting in compilation/code bloat and poor error messages, and determining how these mechanisms interact and which to use is confusing.
+In contrast, \CFA has a single facility for polymorphic code supporting type-safe separate-compilation of polymorphic functions and generic (opaque) types, which uniformly leverage the C procedural paradigm.
+The key mechanism to support separate compilation is \CFA's \emph{explicit} use of assumed properties for a type.
+Until \CC concepts~\cite{C++Concepts} are standardized (anticipated for \CCtwenty), \CC provides no way to specify the requirements of a generic function in code beyond compilation errors during template expansion;
+furthermore, \CC concepts are restricted to template polymorphism.
+
+Cyclone~\cite{Grossman06} also provides capabilities for polymorphic functions and existential types, similar to \CFA's @forall@ functions and generic types.
+Cyclone existential types can include function pointers in a construct similar to a virtual function-table, but these pointers must be explicitly initialized at some point in the code, a tedious and potentially error-prone process.
+Furthermore, Cyclone's polymorphic functions and types are restricted to abstraction over types with the same layout and calling convention as @void *@, \ie only pointer types and @int@.
+In \CFA terms, all Cyclone polymorphism must be dtype-static.
+While the Cyclone design provides the efficiency benefits discussed in Section~\ref{sec:generic-apps} for dtype-static polymorphism, it is more restrictive than \CFA's general model.
+\cite{Smith98} present Polymorphic C, an ML dialect with polymorphic functions and C-like syntax and pointer types; it lacks many of C's features, however, most notably structure types, and so is not a practical C replacement.
+
+Objective-C~\cite{obj-c-book} is an industrially successful extension to C.
+However, Objective-C is a radical departure from C, using an object-oriented model with message-passing.
+Objective-C did not support type-checked generics until recently \cite{xcode7}, historically using less-efficient runtime checking of object types.
+The GObject~\cite{GObject} framework also adds object-oriented programming with runtime type-checking and reference-counting garbage-collection to C;
+these features are more intrusive additions than those provided by \CFA, in addition to the runtime overhead of reference-counting.
+Vala~\cite{Vala} compiles to GObject-based C, adding the burden of learning a separate language syntax to the aforementioned demerits of GObject as a modernization path for existing C code-bases.
+Java~\cite{Java8} included generic types in Java~5, which are type-checked at compilation and type-erased at runtime, similar to \CFA's.
+However, in Java, each object carries its own table of method pointers, while \CFA passes the method pointers separately to maintain a C-compatible layout.
+Java is also a garbage-collected, object-oriented language, with the associated resource usage and C-interoperability burdens.
+
+D~\cite{D}, Go, and Rust~\cite{Rust} are modern, compiled languages with abstraction features similar to \CFA traits: \emph{interfaces} in D and Go and \emph{traits} in Rust.
+However, each language represents a significant departure from C in terms of language model, and none has the same level of compatibility with C as \CFA.
+D and Go are garbage-collected languages, imposing the associated runtime overhead.
+The necessity of accounting for data transfer between managed runtimes and the unmanaged C runtime complicates foreign-function interfaces to C.
+Furthermore, while generic types and functions are available in Go, they are limited to a small fixed set provided by the compiler, with no language facility to define more.
+D restricts garbage collection to its own heap by default, while Rust is not garbage-collected, and thus has a lighter-weight runtime more interoperable with C.
+Rust also possesses much more powerful abstraction capabilities for writing generic code than Go.
+On the other hand, Rust's borrow-checker provides strong safety guarantees but is complex and difficult to learn and imposes a distinctly idiomatic programming style.
+\CFA, with its more modest safety features, allows direct ports of C code while maintaining the idiomatic style of the original source.
+
+
+\subsection{Tuples/Variadics}
+
+Many programming languages have some form of tuple construct and/or variadic functions, \eg SETL, C, KW-C, \CC, D, Go, Java, ML, and Scala.
+SETL~\cite{SETL} is a high-level mathematical programming language, with tuples being one of the primary data types.
+Tuples in SETL allow subscripting, dynamic expansion, and multiple assignment.
+C provides variadic functions through @va_list@ objects, but the programmer is responsible for managing the number of arguments and their types, so the mechanism is type unsafe.
+KW-C~\cite{Buhr94a}, a predecessor of \CFA, introduced tuples to C as an extension of the C syntax, taking much of its inspiration from SETL.
+The main contributions of that work were adding MRVF, tuple mass and multiple assignment, and record-field access.
+\CCeleven introduced @std::tuple@ as a library variadic template structure.
+Tuples are a generalization of @std::pair@, in that they allow for arbitrary length, fixed-size aggregation of heterogeneous values.
+Operations include @std::get<N>@ to extract values, @std::tie@ to create a tuple of references used for assignment, and lexicographic comparisons.
+\CCseventeen proposes \emph{structured bindings}~\cite{Sutter15} to eliminate pre-declaring variables and use of @std::tie@ for binding the results.
+This extension requires the use of @auto@ to infer the types of the new variables, so complicated expressions with a non-obvious type must be documented with some other mechanism.
+Furthermore, structured bindings are not a full replacement for @std::tie@, as they always declare new variables.
+Like \CC, D provides tuples through a library variadic-template structure.
+Go does not have tuples but supports MRVF.
+Java's variadic functions appear similar to C's but are type-safe using homogeneous arrays, which are less useful than \CFA's heterogeneously-typed variadic functions.
+Tuples are a fundamental abstraction in most functional programming languages, such as Standard ML~\cite{sml} and Scala~\cite{Scala}, which decompose tuples using pattern matching.
+
+
+\section{Conclusion and Future Work}
+
+The goal of \CFA is to provide an evolutionary pathway for large C development-environments to be more productive and safer, while respecting the talent and skill of C programmers.
+While other programming languages purport to be a better C, they are in fact new and interesting languages in their own right, but not C extensions.
+The purpose of this paper is to introduce \CFA, and showcase two language features that illustrate the \CFA type-system and approaches taken to achieve the goal of evolutionary C extension.
+The contributions are a powerful type-system using parametric polymorphism and overloading, generic types, and tuples, which all have complex interactions.
+The work is a challenging design, engineering, and implementation exercise.
+On the surface, the project may appear as a rehash of similar mechanisms in \CC.
+However, every \CFA feature is different than its \CC counterpart, often with extended functionality, better integration with C and its programmers, and always supporting separate compilation.
+All of these new features are being used by the \CFA development-team to build the \CFA runtime-system.
+Finally, we demonstrate that \CFA performance for some idiomatic cases is better than C and close to \CC, showing the design is practically applicable.
+
+There is ongoing work on a wide range of \CFA feature extensions, including reference types, arrays with size, exceptions, concurrent primitives and modules.
+(While all examples in the paper compile and run, a public beta-release of \CFA will take another 8--12 months to finalize these additional extensions.)
+In addition, there are interesting future directions for the polymorphism design.
+Notably, \CC template functions trade compile time and code bloat for optimal runtime of individual instantiations of polymorphic functions.
+\CFA polymorphic functions use dynamic virtual-dispatch; 
+the runtime overhead of this approach is low, but not as low as inlining, and it may be beneficial to provide a mechanism for performance-sensitive code.
+Two promising approaches are an @inline@ annotation at polymorphic function call sites to create a template-specialization of the function (provided the code is visible) or placing an @inline@ annotation on polymorphic function-definitions to instantiate a specialized version for some set of types (\CC template specialization).
+These approaches are not mutually exclusive and allow performance optimizations to be applied only when necessary, without suffering global code-bloat.
+In general, we believe separate compilation, producing smaller code, works well with loaded hardware-caches, which may offset the benefit of larger inlined-code.
+
+
+\section{Acknowledgments}
+
+The authors would like to recognize the design assistance of Glen Ditchfield, Richard Bilson, and Thierry Delisle on the features described in this paper, and thank Magnus Madsen and the three anonymous reviewers for valuable feedback.
+%This work is supported in part by a corporate partnership with \grantsponsor{Huawei}{Huawei Ltd.}{http://www.huawei.com}, and Aaron Moss and Peter Buhr are funded by the \grantsponsor{Natural Sciences and Engineering Research Council} of Canada.
+% the first author's \grantsponsor{NSERC-PGS}{NSERC PGS D}{http://www.nserc-crsng.gc.ca/Students-Etudiants/PG-CS/BellandPostgrad-BelletSuperieures_eng.asp} scholarship.
+
+
+\bibliographystyle{plain}
+\bibliography{cfa}
+
+
+\appendix
+
+\section{Benchmark Stack Implementation}
+\label{sec:BenchmarkStackImplementation}
+
+\lstset{basicstyle=\linespread{0.9}\sf\small}
+
+Throughout, @/***/@ designates a counted redundant type annotation.
+
+\smallskip\noindent
+\CFA
+\begin{lstlisting}[xleftmargin=2\parindentlnth,aboveskip=0pt,belowskip=0pt]
+forall(otype T) struct stack_node {
+	T value;
+	stack_node(T) * next;
+};
+forall(otype T) void ?{}(stack(T) * s) { (&s->head){ 0 }; }
+forall(otype T) void ?{}(stack(T) * s, stack(T) t) {
+	stack_node(T) ** crnt = &s->head;
+	for ( stack_node(T) * next = t.head; next; next = next->next ) {
+		*crnt = ((stack_node(T) *)malloc()){ next->value }; /***/
+		stack_node(T) * acrnt = *crnt;
+		crnt = &acrnt->next;
+	}
+	*crnt = 0;
+}
+forall(otype T) stack(T) ?=?(stack(T) * s, stack(T) t) {
+	if ( s->head == t.head ) return *s;
+	clear(s);
+	s{ t };
+	return *s;
+}
+forall(otype T) void ^?{}(stack(T) * s) { clear(s); }
+forall(otype T) _Bool empty(const stack(T) * s) { return s->head == 0; }
+forall(otype T) void push(stack(T) * s, T value) {
+	s->head = ((stack_node(T) *)malloc()){ value, s->head }; /***/
+}
+forall(otype T) T pop(stack(T) * s) {
+	stack_node(T) * n = s->head;
+	s->head = n->next;
+	T x = n->value;
+	^n{};
+	free(n);
+	return x;
+}
+forall(otype T) void clear(stack(T) * s) {
+	for ( stack_node(T) * next = s->head; next; ) {
+		stack_node(T) * crnt = next;
+		next = crnt->next;
+		delete(crnt);
+	}
+	s->head = 0;
+}
+\end{lstlisting}
+
+\medskip\noindent
+\CC
+\begin{lstlisting}[xleftmargin=2\parindentlnth,aboveskip=0pt,belowskip=0pt]
+template<typename T> class stack {
+	struct node {
+		T value;
+		node * next;
+		node( const T & v, node * n = nullptr ) : value(v), next(n) {}
+	};
+	node * head;
+	void copy(const stack<T>& o) {
+		node ** crnt = &head;
+		for ( node * next = o.head; next; next = next->next ) {
+			*crnt = new node{ next->value }; /***/
+			crnt = &(*crnt)->next;
+		}
+		*crnt = nullptr;
+	}
+  public:
+	stack() : head(nullptr) {}
+	stack(const stack<T>& o) { copy(o); }
+	stack(stack<T> && o) : head(o.head) { o.head = nullptr; }
+	~stack() { clear(); }
+	stack & operator= (const stack<T>& o) {
+		if ( this == &o ) return *this;
+		clear();
+		copy(o);
+		return *this;
+	}
+	stack & operator= (stack<T> && o) {
+		if ( this == &o ) return *this;
+		head = o.head;
+		o.head = nullptr;
+		return *this;
+	}
+	bool empty() const { return head == nullptr; }
+	void push(const T & value) { head = new node{ value, head };  /***/ }
+	T pop() {
+		node * n = head;
+		head = n->next;
+		T x = std::move(n->value);
+		delete n;
+		return x;
+	}
+	void clear() {
+		for ( node * next = head; next; ) {
+			node * crnt = next;
+			next = crnt->next;
+			delete crnt;
+		}
+		head = nullptr;
+	}
+};
+\end{lstlisting}
+
+\medskip\noindent
+C
+\begin{lstlisting}[xleftmargin=2\parindentlnth,aboveskip=0pt,belowskip=0pt]
+struct stack_node {
+	void * value;
+	struct stack_node * next;
+};
+struct stack new_stack() { return (struct stack){ NULL }; /***/ }
+void copy_stack(struct stack * s, const struct stack * t, void * (*copy)(const void *)) {
+	struct stack_node ** crnt = &s->head;
+	for ( struct stack_node * next = t->head; next; next = next->next ) {
+		*crnt = malloc(sizeof(struct stack_node)); /***/
+		**crnt = (struct stack_node){ copy(next->value) }; /***/
+		crnt = &(*crnt)->next;
+	}
+	*crnt = 0;
+}
+_Bool stack_empty(const struct stack * s) { return s->head == NULL; }
+void push_stack(struct stack * s, void * value) {
+	struct stack_node * n = malloc(sizeof(struct stack_node)); /***/
+	*n = (struct stack_node){ value, s->head }; /***/
+	s->head = n;
+}
+void * pop_stack(struct stack * s) {
+	struct stack_node * n = s->head;
+	s->head = n->next;
+	void * x = n->value;
+	free(n);
+	return x;
+}
+void clear_stack(struct stack * s, void (*free_el)(void *)) {
+	for ( struct stack_node * next = s->head; next; ) {
+		struct stack_node * crnt = next;
+		next = crnt->next;
+		free_el(crnt->value);
+		free(crnt);
+	}
+	s->head = NULL;
+}
+\end{lstlisting}
+
+\medskip\noindent
+\CCV
+\begin{lstlisting}[xleftmargin=2\parindentlnth,aboveskip=0pt,belowskip=0pt]
+stack::node::node( const object & v, node * n ) : value( v.new_copy() ), next( n ) {}
+void stack::copy(const stack & o) {
+	node ** crnt = &head;
+	for ( node * next = o.head; next; next = next->next ) {
+		*crnt = new node{ *next->value };
+		crnt = &(*crnt)->next;
+	}
+	*crnt = nullptr;
+}
+stack::stack() : head(nullptr) {}
+stack::stack(const stack & o) { copy(o); }
+stack::stack(stack && o) : head(o.head) { o.head = nullptr; }
+stack::~stack() { clear(); }
+stack & stack::operator= (const stack & o) {
+	if ( this == &o ) return *this;
+	clear();
+	copy(o);
+	return *this;
+}
+stack & stack::operator= (stack && o) {
+	if ( this == &o ) return *this;
+	head = o.head;
+	o.head = nullptr;
+	return *this;
+}
+bool stack::empty() const { return head == nullptr; }
+void stack::push(const object & value) { head = new node{ value, head }; /***/ }
+ptr<object> stack::pop() {
+	node * n = head;
+	head = n->next;
+	ptr<object> x = std::move(n->value);
+	delete n;
+	return x;
+}
+void stack::clear() {
+	for ( node * next = head; next; ) {
+		node * crnt = next;
+		next = crnt->next;
+		delete crnt;
+	}
+	head = nullptr;
+}
+\end{lstlisting}
+
+
+\begin{comment}
+
+\subsubsection{bench.h}
+(\texttt{bench.hpp} is similar.)
+
+\lstinputlisting{evaluation/bench.h}
+
+\subsection{C}
+
+\subsubsection{c-stack.h} ~
+
+\lstinputlisting{evaluation/c-stack.h}
+
+\subsubsection{c-stack.c} ~
+
+\lstinputlisting{evaluation/c-stack.c}
+
+\subsubsection{c-pair.h} ~
+
+\lstinputlisting{evaluation/c-pair.h}
+
+\subsubsection{c-pair.c} ~
+
+\lstinputlisting{evaluation/c-pair.c}
+
+\subsubsection{c-print.h} ~
+
+\lstinputlisting{evaluation/c-print.h}
+
+\subsubsection{c-print.c} ~
+
+\lstinputlisting{evaluation/c-print.c}
+
+\subsubsection{c-bench.c} ~
+
+\lstinputlisting{evaluation/c-bench.c}
+
+\subsection{\CFA}
+
+\subsubsection{cfa-stack.h} ~
+
+\lstinputlisting{evaluation/cfa-stack.h}
+
+\subsubsection{cfa-stack.c} ~
+
+\lstinputlisting{evaluation/cfa-stack.c}
+
+\subsubsection{cfa-print.h} ~
+
+\lstinputlisting{evaluation/cfa-print.h}
+
+\subsubsection{cfa-print.c} ~
+
+\lstinputlisting{evaluation/cfa-print.c}
+
+\subsubsection{cfa-bench.c} ~
+
+\lstinputlisting{evaluation/cfa-bench.c}
+
+\subsection{\CC}
+
+\subsubsection{cpp-stack.hpp} ~
+
+\lstinputlisting[language=c++]{evaluation/cpp-stack.hpp}
+
+\subsubsection{cpp-print.hpp} ~
+
+\lstinputlisting[language=c++]{evaluation/cpp-print.hpp}
+
+\subsubsection{cpp-bench.cpp} ~
+
+\lstinputlisting[language=c++]{evaluation/cpp-bench.cpp}
+
+\subsection{\CCV}
+
+\subsubsection{object.hpp} ~
+
+\lstinputlisting[language=c++]{evaluation/object.hpp}
+
+\subsubsection{cpp-vstack.hpp} ~
+
+\lstinputlisting[language=c++]{evaluation/cpp-vstack.hpp}
+
+\subsubsection{cpp-vstack.cpp} ~
+
+\lstinputlisting[language=c++]{evaluation/cpp-vstack.cpp}
+
+\subsubsection{cpp-vprint.hpp} ~
+
+\lstinputlisting[language=c++]{evaluation/cpp-vprint.hpp}
+
+\subsubsection{cpp-vbench.cpp} ~
+
+\lstinputlisting[language=c++]{evaluation/cpp-vbench.cpp}
+\end{comment}
+
+\end{document}
+
+% Local Variables: %
+% tab-width: 4 %
+% compile-command: "make" %
+% End: %
Index: doc/papers/general/evaluation/.gitignore
===================================================================
--- doc/papers/general/evaluation/.gitignore	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/general/evaluation/.gitignore	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,6 @@
+c-bench
+cpp-bench
+cpp-vbench
+cfa-bench
+*.o
+*.d
Index: doc/papers/general/evaluation/bench.h
===================================================================
--- doc/papers/general/evaluation/bench.h	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/general/evaluation/bench.h	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,15 @@
+#pragma once
+#include <stdio.h>
+#include <time.h>
+
+long ms_between(clock_t start, clock_t end) { return (end - start) / (CLOCKS_PER_SEC / 1000); }
+
+#define N 40000000
+#define TIMED(name, code) { \
+	volatile clock_t _start, _end; \
+	_start = clock(); \
+	code \
+	_end = clock(); \
+	printf("%s:\t%8ld ms\n", name, ms_between(_start, _end)); \
+}
+#define REPEAT_TIMED(name, n, code) TIMED( name, for (int _i = 0; _i < n; ++_i) { code } )
Index: doc/papers/general/evaluation/bench.hpp
===================================================================
--- doc/papers/general/evaluation/bench.hpp	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/general/evaluation/bench.hpp	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,17 @@
+#pragma once
+#include <iomanip>
+#include <iostream>
+#include <time.h>
+
+long ms_between(clock_t start, clock_t end) { return (end - start) / (CLOCKS_PER_SEC / 1000); }
+
+static const int N = 40000000;
+#define TIMED(name, code) { \
+	volatile clock_t _start, _end; \
+	_start = clock(); \
+	code \
+	_end = clock(); \
+	std::cout << name << ":\t" << std::setw(8) << ms_between(_start, _end) \
+		<< std::setw(0) << " ms" << std::endl; \
+}
+#define REPEAT_TIMED(name, n, code) TIMED( name, for (int _i = 0; _i < n; ++_i) { code } )
Index: doc/papers/general/evaluation/c-bench.c
===================================================================
--- doc/papers/general/evaluation/c-bench.c	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/general/evaluation/c-bench.c	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,73 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include "bench.h"
+#include "c-pair.h"
+#include "c-stack.h"
+#include "c-print.h"
+
+_Bool* new_bool( _Bool b ) {
+	_Bool* q = malloc(sizeof(_Bool)); /***/
+	*q = b;
+	return q;
+}
+
+char* new_char( char c ) {
+	char* q = malloc(sizeof(char)); /***/
+	*q = c;
+	return q;
+}
+
+int* new_int( int i ) {
+	int* q = malloc(sizeof(int)); /***/
+	*q = i;
+	return q;
+}
+
+void* copy_bool( const void* p ) { return new_bool( *(const _Bool*)p ); } /***/
+void* copy_char( const void* p ) { return new_char( *(const char*)p ); } /***/
+void* copy_int( const void* p ) { return new_int( *(const int*)p ); } /***/
+void* copy_pair_bool_char( const void* p ) { return copy_pair( p, copy_bool, copy_char ); } /***/
+void free_pair_bool_char( void* p ) { free_pair( p, free, free ); } /***/
+
+int cmp_bool( const void* a, const void* b ) { /***/
+	return *(const _Bool*)a == *(const _Bool*)b ? 0 : *(const _Bool*)a < *(const _Bool*)b ? -1 : 1; 
+}
+
+int cmp_char( const void* a, const void* b ) { /***/
+	return *(const char*)a == *(const char*)b ? 0 : *(const char*)a < *(const char*)b ? -1 : 1;
+}
+
+int main(int argc, char** argv) {
+	FILE * out = fopen("/dev/null", "w");
+	int maxi = 0, vali = 42;
+	struct stack si = new_stack(), ti;
+
+	REPEAT_TIMED( "push_int", N, push_stack( &si, new_int( vali ) ); )
+	TIMED( "copy_int", 	copy_stack( &ti, &si, copy_int ); /***/ )
+	TIMED( "clear_int", clear_stack( &si, free ); /***/ )
+	REPEAT_TIMED( "pop_int", N, 
+		int* xi = pop_stack( &ti );
+		if ( *xi > maxi ) { maxi = *xi; }
+		free(xi); )
+	REPEAT_TIMED( "print_int", N/2, print( out, "dsds", vali, ":", vali, "\n" ); /***/ )
+
+	struct pair * maxp = new_pair( new_bool(0), new_char('\0') ),
+		* valp = new_pair( new_bool(1), new_char('a') );
+	struct stack sp = new_stack(), tp;
+
+	REPEAT_TIMED( "push_pair", N, push_stack( &sp, copy_pair_bool_char( valp ) ); )
+	TIMED( "copy_pair", copy_stack( &tp, &sp, copy_pair_bool_char ); /***/ )
+	TIMED( "clear_pair", clear_stack( &sp, free_pair_bool_char ); /***/ )
+	REPEAT_TIMED( "pop_pair", N, 
+		struct pair * xp = pop_stack( &tp );
+		if ( cmp_pair( xp, maxp, cmp_bool, cmp_char /***/ ) > 0 ) {
+			free_pair_bool_char( maxp ); /***/
+			maxp = xp;
+		} else {
+			free_pair_bool_char( xp ); /***/
+		} )
+	REPEAT_TIMED( "print_pair", N/2, print( out, "pbcspbcs", *valp, ":", *valp, "\n" ); /***/ )
+	free_pair_bool_char( maxp ); /***/
+	free_pair_bool_char( valp ); /***/
+	fclose(out);
+}
Index: doc/papers/general/evaluation/c-pair.c
===================================================================
--- doc/papers/general/evaluation/c-pair.c	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/general/evaluation/c-pair.c	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,26 @@
+#include <stdlib.h>
+#include "c-pair.h"
+
+struct pair* new_pair(void* first, void* second) {
+	struct pair* p = malloc(sizeof(struct pair)); /***/
+	*p = (struct pair){ first, second }; /***/
+	return p;
+}
+
+struct pair* copy_pair(const struct pair* src, 
+		void* (*copy_first)(const void*), void* (*copy_second)(const void*)) {
+	return new_pair( copy_first(src->first), copy_second(src->second) );
+}
+
+void free_pair(struct pair* p, void (*free_first)(void*), void (*free_second)(void*)) {
+	free_first(p->first);
+	free_second(p->second);
+	free(p);
+}
+
+int cmp_pair(const struct pair* a, const struct pair* b, 
+		int (*cmp_first)(const void*, const void*), int (*cmp_second)(const void*, const void*)) {
+	int c = cmp_first(a->first, b->first);
+	if ( c == 0 ) c = cmp_second(a->second, b->second);
+	return c;
+}
Index: doc/papers/general/evaluation/c-pair.h
===================================================================
--- doc/papers/general/evaluation/c-pair.h	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/general/evaluation/c-pair.h	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,16 @@
+#pragma once
+
+struct pair {
+	void* first;
+	void* second;
+};
+
+struct pair* new_pair(void* first, void* second);
+
+struct pair* copy_pair(const struct pair* src, 
+	void* (*copy_first)(const void*), void* (*copy_second)(const void*));
+
+void free_pair(struct pair* p, void (*free_first)(void*), void (*free_second)(void*));
+
+int cmp_pair(const struct pair* a, const struct pair* b, 
+	int (*cmp_first)(const void*, const void*), int (*cmp_second)(const void*, const void*));
Index: doc/papers/general/evaluation/c-print.c
===================================================================
--- doc/papers/general/evaluation/c-print.c	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/general/evaluation/c-print.c	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,47 @@
+#include <stdarg.h>
+#include <stdio.h>
+#include "c-pair.h"
+#include "c-print.h"
+
+void print_string(FILE* out, const char* x) { fprintf(out, "%s", x); }
+
+void print_bool(FILE* out, _Bool x) { fprintf(out, "%s", x ? "true" : "false"); }
+
+void print_char(FILE* out, char x) {
+	if ( 0x20 <= x && x <= 0x7E ) { fprintf(out, "'%c'", x); }
+	else { fprintf(out, "'\\%x'", x); }
+}
+
+void print_int(FILE* out, int x) { fprintf(out, "%d", x); }
+
+void print_fmt(FILE* out, char fmt, void* p) {
+	switch( fmt ) {
+	case 's': print_string(out, (const char*)p); break; /***/
+	case 'b': print_bool(out, *(_Bool*)p); break; /***/
+	case 'c': print_char(out, *(char*)p); break; /***/
+	case 'd': print_int(out, *(int*)p); break; /***/
+	}
+}
+
+void print(FILE* out, const char* fmt, ...) {
+	va_list args;
+	va_start(args, fmt);
+	for (const char* it = fmt; *it; ++it) {
+		switch( *it ) {
+		case 's': print_string(out, va_arg(args, const char*)); break; /***/
+		case 'b': print_bool(out, va_arg(args, int)); break; /***/
+		case 'c': print_char(out, va_arg(args, int)); break; /***/
+		case 'd': print_int(out, va_arg(args, int)); break; /***/
+		case 'p': {
+			const struct pair x = va_arg(args, const struct pair); /***/
+			fprintf(out, "[");
+			print_fmt(out, *++it, x.first); /***/
+			fprintf(out, ", ");
+			print_fmt(out, *++it, x.second); /***/
+			fprintf(out, "]");
+			break;
+		}
+		}
+	}
+	va_end(args);
+}
Index: doc/papers/general/evaluation/c-print.h
===================================================================
--- doc/papers/general/evaluation/c-print.h	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/general/evaluation/c-print.h	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,9 @@
+#pragma once
+#include <stdio.h>
+
+void print_string(FILE* out, const char* x);
+void print_bool(FILE* out, _Bool x);
+void print_char(FILE* out, char x);
+void print_int(FILE* out, int x);
+
+void print(FILE* out, const char* fmt, ...);
Index: doc/papers/general/evaluation/c-stack.c
===================================================================
--- doc/papers/general/evaluation/c-stack.c	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/general/evaluation/c-stack.c	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,45 @@
+#include <stdlib.h>
+#include "c-stack.h"
+
+struct stack_node {
+	void* value;
+	struct stack_node* next;
+};
+
+struct stack new_stack() { return (struct stack){ NULL }; /***/ }
+
+void copy_stack(struct stack* s, const struct stack* t, void* (*copy)(const void*)) {
+	struct stack_node** crnt = &s->head;
+	for ( struct stack_node* next = t->head; next; next = next->next ) {
+		*crnt = malloc(sizeof(struct stack_node)); /***/
+		**crnt = (struct stack_node){ copy(next->value) }; /***/
+		crnt = &(*crnt)->next;
+	}
+	*crnt = 0;
+}
+
+void clear_stack(struct stack* s, void (*free_el)(void*)) {
+	for ( struct stack_node* next = s->head; next; ) {
+		struct stack_node* crnt = next;
+		next = crnt->next;
+		free_el(crnt->value);
+		free(crnt);
+	}
+	s->head = NULL;
+}
+
+_Bool stack_empty(const struct stack* s) { return s->head == NULL; }
+
+void push_stack(struct stack* s, void* value) {
+	struct stack_node* n = malloc(sizeof(struct stack_node)); /***/
+	*n = (struct stack_node){ value, s->head }; /***/
+	s->head = n;
+}
+
+void* pop_stack(struct stack* s) {
+	struct stack_node* n = s->head;
+	s->head = n->next;
+	void* x = n->value;
+	free(n);
+	return x;
+}
Index: doc/papers/general/evaluation/c-stack.h
===================================================================
--- doc/papers/general/evaluation/c-stack.h	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/general/evaluation/c-stack.h	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,14 @@
+#pragma once
+
+struct stack_node;
+struct stack {
+	struct stack_node* head;
+};
+
+struct stack new_stack();
+void copy_stack(struct stack* dst, const struct stack* src, void* (*copy)(const void*));
+void clear_stack(struct stack* s, void (*free_el)(void*));
+
+_Bool stack_empty(const struct stack* s);
+void push_stack(struct stack* s, void* value);
+void* pop_stack(struct stack* s);
Index: doc/papers/general/evaluation/cfa-bench.c
===================================================================
--- doc/papers/general/evaluation/cfa-bench.c	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/general/evaluation/cfa-bench.c	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,31 @@
+#include <stdio.h>
+#include "bench.h"
+#include "cfa-stack.h"
+#include "cfa-pair.h"
+#include "cfa-print.h"
+
+int main( int argc, char *argv[] ) {
+	FILE * out = fopen( "/dev/null", "w" );
+	int maxi = 0, vali = 42;
+	stack(int) si, ti;
+
+	REPEAT_TIMED( "push_int", N, push( &si, vali ); )
+	TIMED( "copy_int", ti = si; )
+	TIMED( "clear_int", clear( &si ); )
+	REPEAT_TIMED( "pop_int", N, 
+		int xi = pop( &ti ); 
+		if ( xi > maxi ) { maxi = xi; } )
+	REPEAT_TIMED( "print_int", N/2, print( out, vali, ":", vali, "\n" ); )
+
+	pair(_Bool, char) maxp = { (_Bool)0, '\0' }, valp = { (_Bool)1, 'a' };
+	stack(pair(_Bool, char)) sp, tp;
+
+	REPEAT_TIMED( "push_pair", N, push( &sp, valp ); )
+	TIMED( "copy_pair", tp = sp; )
+	TIMED( "clear_pair", clear( &sp ); )
+	REPEAT_TIMED( "pop_pair", N, 
+		pair(_Bool, char) xp = pop( &tp ); 
+		if ( xp > maxp ) { maxp = xp; } )
+	REPEAT_TIMED( "print_pair", N/2, print( out, valp, ":", valp, "\n" ); )
+	fclose(out);
+}
Index: doc/papers/general/evaluation/cfa-pair.c
===================================================================
--- doc/papers/general/evaluation/cfa-pair.c	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/general/evaluation/cfa-pair.c	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,35 @@
+#include "cfa-pair.h"
+
+forall(otype R, otype S 
+	| { int ?==?(R, R); int ?<?(R, R); int ?<?(S, S); })
+int ?<?(pair(R, S) p, pair(R, S) q) {
+	return p.first < q.first || ( p.first == q.first && p.second < q.second );
+}
+
+forall(otype R, otype S 
+	| { int ?==?(R, R); int ?<?(R, R); int ?<=?(S, S); })
+int ?<=?(pair(R, S) p, pair(R, S) q) {
+	return p.first < q.first || ( p.first == q.first && p.second <= q.second );
+}
+
+forall(otype R, otype S | { int ?==?(R, R); int ?==?(S, S); })
+int ?==?(pair(R, S) p, pair(R, S) q) {
+	return p.first == q.first && p.second == q.second;
+}
+
+forall(otype R, otype S | { int ?!=?(R, R); int ?!=?(S, S); })
+int ?!=?(pair(R, S) p, pair(R, S) q) {
+	return p.first != q.first || p.second != q.second;
+}
+
+forall(otype R, otype S 
+	| { int ?==?(R, R); int ?>?(R, R); int ?>?(S, S); })
+int ?>?(pair(R, S) p, pair(R, S) q) {
+	return p.first > q.first || ( p.first == q.first && p.second > q.second );
+}
+
+forall(otype R, otype S 
+	| { int ?==?(R, R); int ?>?(R, R); int ?>=?(S, S); })
+int ?>=?(pair(R, S) p, pair(R, S) q) {
+	return p.first > q.first || ( p.first == q.first && p.second >= q.second );
+}
Index: doc/papers/general/evaluation/cfa-pair.h
===================================================================
--- doc/papers/general/evaluation/cfa-pair.h	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/general/evaluation/cfa-pair.h	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,28 @@
+#pragma once
+
+forall(otype R, otype S) struct pair {
+	R first;
+	S second;
+};
+
+forall(otype R, otype S 
+	| { int ?==?(R, R); int ?<?(R, R); int ?<?(S, S); })
+int ?<?(pair(R, S) p, pair(R, S) q);
+
+forall(otype R, otype S 
+	| { int ?==?(R, R); int ?<?(R, R); int ?<=?(S, S); })
+int ?<=?(pair(R, S) p, pair(R, S) q);
+
+forall(otype R, otype S | { int ?==?(R, R); int ?==?(S, S); })
+int ?==?(pair(R, S) p, pair(R, S) q);
+
+forall(otype R, otype S | { int ?!=?(R, R); int ?!=?(S, S); })
+int ?!=?(pair(R, S) p, pair(R, S) q);
+
+forall(otype R, otype S 
+	| { int ?==?(R, R); int ?>?(R, R); int ?>?(S, S); })
+int ?>?(pair(R, S) p, pair(R, S) q);
+
+forall(otype R, otype S 
+	| { int ?==?(R, R); int ?>?(R, R); int ?>=?(S, S); })
+int ?>=?(pair(R, S) p, pair(R, S) q);
Index: doc/papers/general/evaluation/cfa-print.c
===================================================================
--- doc/papers/general/evaluation/cfa-print.c	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/general/evaluation/cfa-print.c	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,29 @@
+#include <stdio.h>
+#include "cfa-pair.h"
+#include "cfa-print.h"
+
+forall(otype T, ttype Params | { void print(FILE*, T); void print(FILE*, Params); })
+void print(FILE* out, T arg, Params rest) {
+	print(out, arg);
+	print(out, rest);
+}
+
+void print(FILE* out, const char* x) { fprintf(out, "%s", x); }
+
+void print(FILE* out, _Bool x) { fprintf(out, "%s", x ? "true" : "false"); }
+
+void print(FILE* out, char x) {
+	if ( 0x20 <= x && x <= 0x7E ) { fprintf(out, "'%c'", x); }
+	else { fprintf(out, "'\\%x'", x); }
+}
+
+void print(FILE* out, int x) { fprintf(out, "%d", x); }
+
+forall(otype R, otype S | { void print(FILE*, R); void print(FILE*, S); })
+void print(FILE* out, pair(R, S) x) {
+	fprintf(out, "[");
+	print(out, x.first);
+	fprintf(out, ", ");
+	print(out, x.second);
+	fprintf(out, "]");
+}
Index: doc/papers/general/evaluation/cfa-print.h
===================================================================
--- doc/papers/general/evaluation/cfa-print.h	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/general/evaluation/cfa-print.h	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,14 @@
+#pragma once
+#include <stdio.h>
+#include "cfa-pair.h"
+
+forall(otype T, ttype Params | { void print(FILE*, T); void print(FILE*, Params); })
+void print(FILE* out, T arg, Params rest);
+
+void print(FILE* out, const char* x);
+void print(FILE* out, _Bool x);
+void print(FILE* out, char x);
+void print(FILE* out, int x);
+
+forall(otype R, otype S | { void print(FILE*, R); void print(FILE*, S); })
+void print(FILE* out, pair(R, S) x);
Index: doc/papers/general/evaluation/cfa-stack.c
===================================================================
--- doc/papers/general/evaluation/cfa-stack.c	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/general/evaluation/cfa-stack.c	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,52 @@
+#include <stdlib>
+#include "cfa-stack.h"
+
+forall(otype T) struct stack_node {
+	T value;
+	stack_node(T)* next;
+};
+
+forall(otype T) void ?{}(stack(T)* s) { (&s->head){ 0 }; }
+
+forall(otype T) void ?{}(stack(T)* s, stack(T) t) {
+	stack_node(T)** crnt = &s->head;
+	for ( stack_node(T)* next = t.head; next; next = next->next ) {
+		*crnt = ((stack_node(T)*)malloc()){ next->value }; /***/
+		stack_node(T)* acrnt = *crnt;
+		crnt = &acrnt->next;
+	}
+	*crnt = 0;
+}
+
+forall(otype T) stack(T) ?=?(stack(T)* s, stack(T) t) {
+	if ( s->head == t.head ) return *s;
+	clear(s);
+	s{ t };
+	return *s;
+}
+
+forall(otype T) void ^?{}(stack(T)* s) { clear(s); }
+
+forall(otype T) _Bool empty(const stack(T)* s) { return s->head == 0; }
+
+forall(otype T) void push(stack(T)* s, T value) {
+	s->head = ((stack_node(T)*)malloc()){ value, s->head }; /***/
+}
+
+forall(otype T) T pop(stack(T)* s) {
+	stack_node(T)* n = s->head;
+	s->head = n->next;
+	T x = n->value;
+	^n{};
+	free(n);
+	return x;
+}
+
+forall(otype T) void clear(stack(T)* s) {
+	for ( stack_node(T)* next = s->head; next; ) {
+		stack_node(T)* crnt = next;
+		next = crnt->next;
+		delete(crnt);
+	}
+	s->head = 0;
+}
Index: doc/papers/general/evaluation/cfa-stack.h
===================================================================
--- doc/papers/general/evaluation/cfa-stack.h	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/general/evaluation/cfa-stack.h	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,16 @@
+#pragma once
+
+forall(otype T) struct stack_node;
+forall(otype T) struct stack {
+	stack_node(T)* head;
+};
+
+forall(otype T) void ?{}(stack(T)* s);
+forall(otype T) void ?{}(stack(T)* s, stack(T) t);
+forall(otype T) stack(T) ?=?(stack(T)* s, stack(T) t);
+forall(otype T) void ^?{}(stack(T)* s);
+
+forall(otype T) _Bool empty(const stack(T)* s);
+forall(otype T) void push(stack(T)* s, T value);
+forall(otype T) T pop(stack(T)* s);
+forall(otype T) void clear(stack(T)* s);
Index: doc/papers/general/evaluation/cpp-bench.cpp
===================================================================
--- doc/papers/general/evaluation/cpp-bench.cpp	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/general/evaluation/cpp-bench.cpp	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,27 @@
+#include <algorithm>
+#include <fstream>
+#include "bench.hpp"
+#include "cpp-stack.hpp"
+#include "cpp-pair.hpp"
+#include "cpp-print.hpp"
+
+int main(int argc, char** argv) {
+	std::ofstream out{"/dev/null"};
+	int maxi = 0, vali = 42;
+	stack<int> si, ti;
+	
+	REPEAT_TIMED( "push_int", N, si.push( vali ); )
+	TIMED( "copy_int", ti = si; )
+	TIMED( "clear_int", si.clear(); )
+	REPEAT_TIMED( "pop_int", N, maxi = std::max( maxi, ti.pop() ); )
+	REPEAT_TIMED( "print_int", N/2, print( out, vali, ":", vali, "\n" ); )
+
+	pair<bool, char> maxp = { false, '\0' }, valp = { true, 'a' };
+	stack<pair<bool, char>> sp, tp;
+	
+	REPEAT_TIMED( "push_pair", N, sp.push( valp ); )
+	TIMED( "copy_pair", tp = sp; )
+	TIMED( "clear_pair", sp.clear(); )
+	REPEAT_TIMED( "pop_pair", N, maxp = std::max( maxp, tp.pop() ); )
+	REPEAT_TIMED( "print_pair", N/2, print( out, valp, ":", valp, "\n" ); )
+}
Index: doc/papers/general/evaluation/cpp-pair.hpp
===================================================================
--- doc/papers/general/evaluation/cpp-pair.hpp	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/general/evaluation/cpp-pair.hpp	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,30 @@
+#pragma once
+#include <utility>
+
+template<typename R, typename S> struct pair {
+	R first;
+	S second;
+
+	pair() = default;
+	pair( R&& x, S&& y ) : first( std::move(x) ), second( std::move(y) ) {}
+
+	bool operator< (const pair<R, S>& o) const {
+		return first < o.first || ( first == o.first && second < o.second );
+	}
+
+	bool operator<= (const pair<R, S>& o) const {
+		return first < o.first || ( first == o.first && second <= o.second );
+	}
+
+	bool operator== (const pair<R, S>& o) const { return first == o.first && second == o.second; }
+
+	bool operator!= (const pair<R, S>& o) const { return first != o.first || second != o.second; }
+
+	bool operator> (const pair<R, S>& o) const {
+		return first > o.first || ( first == o.first && second > o.second );
+	}
+
+	bool operator>= (const pair<R, S>& o) const {
+		return first > o.first || ( first == o.first && second >= o.second );
+	}
+};
Index: doc/papers/general/evaluation/cpp-print.hpp
===================================================================
--- doc/papers/general/evaluation/cpp-print.hpp	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/general/evaluation/cpp-print.hpp	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,28 @@
+#pragma once
+#include <iomanip>
+#include <iostream>
+#include "cpp-pair.hpp"
+
+template<typename T> void print(std::ostream& out, const T& x) { out << x; }
+
+template<> void print<bool>(std::ostream& out, const bool& x) { out << (x ? "true" : "false"); }
+
+template<> void print<char>(std::ostream& out, const char& x ) {
+	if ( 0x20 <= x && x <= 0x7E ) { out << "'" << x << "'"; }
+	else { out << "'\\" << std::hex << (unsigned int)x << std::setbase(0) << "'"; }
+}
+
+template<typename R, typename S> 
+std::ostream& operator<< (std::ostream& out, const pair<R, S>& x) {
+	out << "[";
+	print(out, x.first);
+	out << ", ";
+	print(out, x.second);
+	return out << "]";
+}
+
+template<typename T, typename... Args> 
+void print(std::ostream& out, const T& arg, const Args&... rest) {
+	out << arg;
+	print(out, rest...);
+}
Index: doc/papers/general/evaluation/cpp-stack.hpp
===================================================================
--- doc/papers/general/evaluation/cpp-stack.hpp	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/general/evaluation/cpp-stack.hpp	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,61 @@
+#pragma once
+#include <utility>
+
+template<typename T> class stack {
+	struct node {
+		T value;
+		node* next;
+
+		node( const T& v, node* n = nullptr ) : value(v), next(n) {}
+	};
+	node* head;
+
+	void copy(const stack<T>& o) {
+		node** crnt = &head;
+		for ( node* next = o.head; next; next = next->next ) {
+			*crnt = new node{ next->value }; /***/
+			crnt = &(*crnt)->next;
+		}
+		*crnt = nullptr;
+	}
+public:
+	void clear() {
+		for ( node* next = head; next; ) {
+			node* crnt = next;
+			next = crnt->next;
+			delete crnt;
+		}
+		head = nullptr;
+	}
+
+	stack() : head(nullptr) {}
+	stack(const stack<T>& o) { copy(o); }
+	stack(stack<T>&& o) : head(o.head) { o.head = nullptr; }
+	~stack() { clear(); }
+
+	stack& operator= (const stack<T>& o) {
+		if ( this == &o ) return *this;
+		clear();
+		copy(o);
+		return *this;
+	}
+
+	stack& operator= (stack<T>&& o) {
+		if ( this == &o ) return *this;
+		head = o.head;
+		o.head = nullptr;
+		return *this;
+	}
+
+	bool empty() const { return head == nullptr; }
+
+	void push(const T& value) { head = new node{ value, head };  /***/ }
+
+	T pop() {
+		node* n = head;
+		head = n->next;
+		T x = std::move(n->value);
+		delete n;
+		return x;
+	}
+};
Index: doc/papers/general/evaluation/cpp-vbench.cpp
===================================================================
--- doc/papers/general/evaluation/cpp-vbench.cpp	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/general/evaluation/cpp-vbench.cpp	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,30 @@
+#include <algorithm>
+#include <fstream>
+#include "bench.hpp"
+#include "cpp-vstack.hpp"
+#include "cpp-vprint.hpp"
+#include "object.hpp"
+
+int main(int argc, char** argv) {
+	std::ofstream out{"/dev/null"};
+	integer maxi{ 0 }, vali{ 42 };
+	stack si, ti;
+	
+	REPEAT_TIMED( "push_int", N, si.push( vali ); )
+	TIMED( "copy_int", ti = si; )
+	TIMED( "clear_int", si.clear(); )
+	REPEAT_TIMED( "pop_int", N, maxi = std::max( maxi, ti.pop()->as<integer>() ); /***/ )
+	REPEAT_TIMED( "print_int", N/2, print( out, vali, c_string{":"}, vali, c_string{"\n"} ); )
+
+	ptr<pair> maxp = make<pair>( make<boolean>(false), make<character>('\0') );
+	pair valp{ make<boolean>(true), make<character>('a') };
+	stack sp, tp;
+	
+	REPEAT_TIMED( "push_pair", N, sp.push( valp ); )
+	TIMED( "copy_pair", tp = sp; )
+	TIMED( "clear_pair", sp.clear(); )
+	REPEAT_TIMED( "pop_pair", N, 
+		ptr<pair> xp = as_ptr<pair>( tp.pop() ); /***/
+		if ( *xp > *maxp ) { maxp = std::move(xp); } )
+	REPEAT_TIMED( "print_pair", N/2, print( out, valp, c_string{":"}, valp, c_string{"\n"} ); )
+}
Index: doc/papers/general/evaluation/cpp-vprint.hpp
===================================================================
--- doc/papers/general/evaluation/cpp-vprint.hpp	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/general/evaluation/cpp-vprint.hpp	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,10 @@
+#pragma once
+#include <ostream>
+#include "object.hpp"
+
+void print(std::ostream& out, const printable& x) { x.print(out); }
+
+template<typename... Args> void print(std::ostream& out, const printable& x, const Args&... rest) {
+	x.print(out);
+	print(out, rest...);
+}
Index: doc/papers/general/evaluation/cpp-vstack.cpp
===================================================================
--- doc/papers/general/evaluation/cpp-vstack.cpp	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/general/evaluation/cpp-vstack.cpp	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,54 @@
+#include "cpp-vstack.hpp"
+#include <utility>
+
+stack::node::node( const object& v, node* n ) : value( v.new_copy() ), next( n ) {}
+
+void stack::copy(const stack& o) {
+	node** crnt = &head;
+	for ( node* next = o.head; next; next = next->next ) {
+		*crnt = new node{ *next->value };
+		crnt = &(*crnt)->next;
+	}
+	*crnt = nullptr;
+}
+
+stack::stack() : head(nullptr) {}
+stack::stack(const stack& o) { copy(o); }
+stack::stack(stack&& o) : head(o.head) { o.head = nullptr; }
+stack::~stack() { clear(); }
+
+stack& stack::operator= (const stack& o) {
+	if ( this == &o ) return *this;
+	clear();
+	copy(o);
+	return *this;
+}
+
+stack& stack::operator= (stack&& o) {
+	if ( this == &o ) return *this;
+	clear();						// free existing nodes to avoid a leak
+	std::swap( head, o.head );
+	return *this;
+}
+
+void stack::clear() {
+	for ( node* next = head; next; ) {
+		node* crnt = next;
+		next = crnt->next;
+		delete crnt;
+	}
+	head = nullptr;
+}
+
+
+bool stack::empty() const { return head == nullptr; }
+
+void stack::push(const object& value) { head = new node{ value, head }; /***/ }
+
+ptr<object> stack::pop() {
+	node* n = head;						// precondition: !empty()
+	head = n->next;
+	ptr<object> x = std::move(n->value);
+	delete n;
+	return x;
+}
Index: doc/papers/general/evaluation/cpp-vstack.hpp
===================================================================
--- doc/papers/general/evaluation/cpp-vstack.hpp	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/general/evaluation/cpp-vstack.hpp	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,26 @@
+#pragma once
+#include "object.hpp"
+
+class stack {
+	struct node {
+		ptr<object> value;
+		node* next;
+
+		node( const object& v, node* n = nullptr );
+	};
+	node* head;
+
+	void copy(const stack& o);
+public:
+	stack();
+	stack(const stack& o);
+	stack(stack&& o);
+	~stack();
+	stack& operator= (const stack& o);
+	stack& operator= (stack&& o);
+
+	void clear();
+	bool empty() const;
+	void push(const object& value);
+	ptr<object> pop();
+};
Index: doc/papers/general/evaluation/object.hpp
===================================================================
--- doc/papers/general/evaluation/object.hpp	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/general/evaluation/object.hpp	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,199 @@
+#pragma once
+#include <cstddef>
+#include <exception>
+#include <iomanip>
+#include <memory>
+#include <ostream>
+#include <string>
+#include <typeinfo>
+#include <typeindex>
+
+class bad_cast : public std::exception {
+	std::string why;
+public:
+	bad_cast( const std::type_index& f, const std::type_index& t ) : std::exception() {
+		why = std::string{"bad cast of "} + f.name() + " to " + t.name();
+	}
+	~bad_cast() override = default;
+	
+	const char* what() const noexcept override { return why.c_str(); }
+};
+
+template<typename T> std::type_index class_of() { return { typeid(T) }; }
+
+template<typename T> using ptr = std::unique_ptr<T>;
+
+struct object {
+	std::type_index get_class() const { return { typeid(*this) }; }	// typeid(*this) resolves the dynamic type; testing `this` for null is undefined behaviour
+
+	template<typename T> T& as() {
+		T* p = dynamic_cast<T*>(this);
+		if ( !p ) throw bad_cast{ get_class(), class_of<T>() };
+		return *p;
+	}
+
+	template<typename T> const T& as() const {
+		const T* p = dynamic_cast<const T*>(this);
+		if ( !p ) throw bad_cast{ get_class(), class_of<T>() };
+		return *p;
+	}
+
+	virtual ptr<object> new_inst() const = 0;
+	virtual ptr<object> new_copy() const = 0;
+	virtual object& operator= (const object&) = 0;
+	virtual ~object() = default;
+};
+
+template<typename T, typename... Args> static inline ptr<T> make(Args&&... args) {
+	return std::make_unique<T>(std::forward<Args>(args)...);
+}
+
+template<typename To, typename From> 
+ptr<To> as_ptr( ptr<From>&& p ) { To& t = p->template as<To>(); p.release(); return ptr<To>{ &t }; } // cast before release() so a thrown bad_cast cannot leak
+
+struct ordered : public virtual object {
+	virtual int cmp(const ordered&) const = 0;
+
+	bool operator< (const ordered& that) const { return cmp(that) < 0; }
+	bool operator<= ( const ordered& that ) const { return cmp(that) <= 0; }
+	bool operator== ( const ordered& that ) const { return cmp(that) == 0; }
+	bool operator!= ( const ordered& that ) const { return cmp(that) != 0; }
+	bool operator> ( const ordered& that ) const { return cmp(that) > 0; }
+	bool operator>= ( const ordered& that ) const { return cmp(that) >= 0; }
+};
+
+struct printable : public virtual object {
+	virtual void print(std::ostream&) const = 0;
+};
+
+class boolean : public ordered, public printable {
+	bool x;
+public:
+	boolean() = default;
+	boolean(bool x) : x(x) {}
+	boolean(const boolean&) = default;
+	boolean(boolean&&) = default;
+	ptr<object> new_inst() const override { return make<boolean>(); }
+	ptr<object> new_copy() const override { return make<boolean>(*this); }
+	boolean& operator= (const boolean& that) {
+		x = that.x;
+		return *this;	
+	}
+	object& operator= (const object& that) override { return *this = that.as<boolean>(); } /***/
+	boolean& operator= (boolean&&) = default;
+	~boolean() override = default;
+
+	int cmp(const boolean& that) const { return x == that.x ? 0 : x == false ? -1 : 1; }
+	int cmp(const ordered& that) const override { return cmp( that.as<boolean>() ); } /***/
+
+	void print(std::ostream& out) const override { out << (x ? "true" : "false"); }
+};
+
+class character : public ordered, public printable {
+	char x;
+public:
+	character() = default;
+	character(char x) : x(x) {}
+	character(const character&) = default;
+	character(character&&) = default;
+	ptr<object> new_inst() const override { return make<character>(); }
+	ptr<object> new_copy() const override { return make<character>(*this); }
+	character& operator= (const character& that) {
+		x = that.x;
+		return *this;	
+	}
+	object& operator= (const object& that) override { return *this = that.as<character>(); } /***/
+	character& operator= (character&&) = default;
+	~character() override = default;
+
+	int cmp(const character& that) const { return x == that.x ? 0 : x < that.x ? -1 : 1; }
+	int cmp(const ordered& that) const override { return cmp( that.as<character>() ); } /***/
+
+	void print(std::ostream& out) const override {
+		if ( 0x20 <= x && x <= 0x7E ) { out << "'" << x << "'"; }
+		else { out << "'\\" << std::hex << (unsigned int)(unsigned char)x << std::dec << "'"; }
+	}
+};
+
+class integer : public ordered, public printable {
+	int x;
+public:
+	integer() = default;
+	integer(int x) : x(x) {}
+	integer(const integer&) = default;
+	integer(integer&&) = default;
+	ptr<object> new_inst() const override { return make<integer>(); }
+	ptr<object> new_copy() const override { return make<integer>(*this); }
+	integer& operator= (const integer& that) {
+		x = that.x;
+		return *this;	
+	}
+	object& operator= (const object& that) override { return *this = that.as<integer>(); } /***/
+	integer& operator= (integer&&) = default;
+	~integer() override = default;
+
+	int cmp(const integer& that) const { return x == that.x ? 0 : x < that.x ? -1 : 1; }
+	int cmp(const ordered& that) const override { return cmp( that.as<integer>() ); } /***/
+
+	void print(std::ostream& out) const override { out << x; }
+};
+
+class c_string : public printable {
+	static constexpr const char* empty = "";
+	const char* s;
+public:
+	c_string() : s(empty) {}
+	c_string(const char* s) : s(s) {}
+	c_string(const c_string&) = default;
+	c_string(c_string&&) = default;
+	ptr<object> new_inst() const override { return make<c_string>(); }
+	ptr<object> new_copy() const override { return make<c_string>(s); }
+	c_string& operator= (const c_string& that) {
+		s = that.s;
+		return *this;
+	}
+	object& operator= (const object& that) override { return *this = that.as<c_string>(); } /***/
+	c_string& operator= (c_string&&) = default;
+	~c_string() override = default;
+
+	void print(std::ostream& out) const override { out << s; }
+};
+
+class pair : public ordered, public printable {
+	ptr<object> x;
+	ptr<object> y;
+public:
+	pair() = default;
+	pair(ptr<object>&& x, ptr<object>&& y) : x(std::move(x)), y(std::move(y)) {}
+	pair(const pair& that) : x(that.x->new_copy()), y(that.y->new_copy()) {}
+	pair(pair&& that) : x(std::move(that.x)), y(std::move(that.y)) {}
+	ptr<object> new_inst() const override { return make<pair>(); }
+	ptr<object> new_copy() const override { return make<pair>(x->new_copy(), y->new_copy()); }
+	pair& operator= (const pair& that) {
+		x = that.x->new_copy();
+		y = that.y->new_copy();
+		return *this;
+	}
+	object& operator= (const object& that) override { return *this = that.as<pair>(); } /***/
+	pair& operator= (pair&& that) {
+		x = std::move(that.x);
+		y = std::move(that.y);
+		return *this;
+	}
+	~pair() override = default;
+
+	int cmp(const pair& that) const {
+		int c = x->as<ordered>().cmp( that.x->as<ordered>() ); /***/
+		if ( c != 0 ) return c;
+		return y->as<ordered>().cmp( that.y->as<ordered>() ); /***/
+	}
+	int cmp(const ordered& that) const override { return cmp( that.as<pair>() ); }
+
+	void print(std::ostream& out) const override {
+		out << "[";
+		x->as<printable>().print(out); /***/
+		out << ", ";
+		y->as<printable>().print(out); /***/
+		out << "]";
+	}
+};
Index: doc/papers/general/evaluation/timing.dat
===================================================================
--- doc/papers/general/evaluation/timing.dat	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/general/evaluation/timing.dat	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,11 @@
+"400 million repetitions"	"C"	"\\CFA{}"	"\\CC{}"	"\\CC{obj}"
+"push\nint"	3002	2459	1520	3305
+"copy\nint"	2985	2057	1521	3152
+"clear\nint"	1374	827	718	1469
+"pop\nint"	1416	1221	717	5467
+"print\nint"	5656	6758	3120	3121
+"push\npair"	4214	2752	946	6826
+"copy\npair"	6127	2105	993	7330
+"clear\npair"	2881	885	711	3564
+"pop\npair"	3046	5434	783	26538
+"print\npair"	7514	10714	8717	16525
Index: doc/papers/general/evaluation/timing.gp
===================================================================
--- doc/papers/general/evaluation/timing.gp	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
+++ doc/papers/general/evaluation/timing.gp	(revision 604e76dc97ade3d13004047eb233d487f70f6289)
@@ -0,0 +1,26 @@
+# set terminal pdfcairo linewidth 3 size 6,3
+# set output "timing.pdf"
+set terminal pslatex size 6.25,2.125 color solid
+set output "timing.tex"
+
+set pointsize 2.0
+set grid linetype 0
+set style data histogram
+set style histogram cluster gap 2
+set style fill solid border -1
+set offset -0.5,-0.35
+set boxwidth 0.8
+
+set key top left reverse Left
+
+set style fill solid noborder
+set linetype 1 lc rgb 'black'
+set linetype 2 lc rgb 'red'
+set linetype 3 lc rgb 'blue'
+set linetype 4 lc rgb 'green'
+
+SCALE=1000
+set ylabel "seconds"
+
+# set datafile separator ","
+plot for [COL=2:5] 'evaluation/timing.dat' using (column(COL)/SCALE):xticlabels(1) title columnheader
