Index: doc/papers/concurrency/.gitignore
===================================================================
--- doc/papers/concurrency/.gitignore	(revision b0c32da8d2956a86a73c0000c8f8110a5c0a64ba)
+++ doc/papers/concurrency/.gitignore	(revision e2e7330de2695ce8f95408c682ce20f6f31239db)
@@ -1,2 +1,3 @@
+# generated by latex
 build/*
 *.pdf
Index: doc/papers/concurrency/Makefile
===================================================================
--- doc/papers/concurrency/Makefile	(revision b0c32da8d2956a86a73c0000c8f8110a5c0a64ba)
+++ doc/papers/concurrency/Makefile	(revision e2e7330de2695ce8f95408c682ce20f6f31239db)
@@ -2,10 +2,12 @@
 
 Build = build
+Figures = figures
 Macros = ../../LaTeXmacros
-TeXLIB = .:style:annex:${Macros}:${Build}:/usr/local/bibliographies:
-LaTeX  = TEXINPUTS=${TeXLIB} && export TEXINPUTS && latex -halt-on-error -output-directory=${Build} -interaction=nonstopmode
+TeXLIB = .:style:annex:${Macros}:${Build}:../../bibliography:
+LaTeX  = TEXINPUTS=${TeXLIB} && export TEXINPUTS && latex -halt-on-error -output-directory=${Build}
 BibTeX = BIBINPUTS=${TeXLIB} && export BIBINPUTS && bibtex
 
 MAKEFLAGS = --no-print-directory --silent #
+VPATH = ${Figures}
 
 ## Define the text source files.
@@ -41,5 +43,5 @@
 # Directives #
 
-.PHONY : all clean										# not file names
+.PHONY : all clean					# not file names
 
 all : ${DOCUMENT}
@@ -57,31 +59,30 @@
 
 ${basename ${DOCUMENT}}.dvi : Makefile ${Build} ${GRAPHS} ${PROGRAMS} ${PICTURES} ${FIGURES} ${SOURCES} ${basename ${DOCUMENT}}.tex \
-			${Macros}/common.tex ${Macros}/indexstyle annex/local.bib /usr/local/bibliographies/pl.bib
-	if [ ! -r ${basename $@}.aux ] ; then ${LaTeX} ${basename $@}.tex ; fi # Must have *.aux file containing citations for bibtex
-	-${BibTeX} ${Build}/${basename $@}					# Some citations reference others so run again to resolve these citations
+		${Macros}/common.tex ${Macros}/indexstyle annex/local.bib ../../bibliography/pl.bib
+	# Must have *.aux file containing citations for bibtex
+	if [ ! -r ${basename $@}.aux ] ; then ${LaTeX} ${basename $@}.tex ; fi
+	-${BibTeX} ${Build}/${basename $@}
+	# Some citations reference others so run again to resolve these citations
 	${LaTeX} ${basename $@}.tex
 	-${BibTeX} ${Build}/${basename $@}
-	${LaTeX} ${basename $@}.tex							# Finish citations
+	# Run again to finish citations
+	${LaTeX} ${basename $@}.tex
 
 ## Define the default recipes.
-
-vpath %.tex ${subst .zzz,,${subst .zzz ,:,${SOURCES}}}	# add prefix for source
-vpath %.fig ${subst .zzz,,${subst .zzz ,:,${SOURCES}}}	# add prefix for source
 
 ${Build}:
 	mkdir -p ${Build}
 
-%.tex : figures/%.fig
+%.tex : %.fig
 	fig2dev -L eepic $< > ${Build}/$@
 
-%.ps : figures/%.fig
+%.ps : %.fig
 	fig2dev -L ps $< > ${Build}/$@
 
-%.pstex : figures/%.fig
+%.pstex : %.fig
 	fig2dev -L pstex $< > ${Build}/$@
 	fig2dev -L pstex_t -p ${Build}/$@ $< > ${Build}/$@_t
 
 # Local Variables: #
-# tab-width : 4 #
 # compile-command: "make" #
 # End: #
Index: doc/papers/concurrency/Paper.tex
===================================================================
--- doc/papers/concurrency/Paper.tex	(revision b0c32da8d2956a86a73c0000c8f8110a5c0a64ba)
+++ doc/papers/concurrency/Paper.tex	(revision e2e7330de2695ce8f95408c682ce20f6f31239db)
@@ -1,10 +1,8 @@
-% requires tex packages: texlive-base texlive-latex-base tex-common texlive-humanities texlive-latex-extra texlive-fonts-recommended
-
-% inline code �...� (copyright symbol) emacs: C-q M-)
-% red highlighting �...� (registered trademark symbol) emacs: C-q M-.
-% blue highlighting �...� (sharp s symbol) emacs: C-q M-_
-% green highlighting �...� (cent symbol) emacs: C-q M-"
-% LaTex escape �...� (section symbol) emacs: C-q M-'
-% keyword escape �...� (pilcrow symbol) emacs: C-q M-^
+% inline code ©...© (copyright symbol) emacs: C-q M-)
+% red highlighting ®...® (registered trademark symbol) emacs: C-q M-.
+% blue highlighting ß...ß (sharp s symbol) emacs: C-q M-_
+% green highlighting ¢...¢ (cent symbol) emacs: C-q M-"
+% LaTex escape §...§ (section symbol) emacs: C-q M-'
+% keyword escape ¶...¶ (pilcrow symbol) emacs: C-q M-^
 % math escape $...$ (dollar symbol)
 
@@ -20,5 +18,4 @@
 \usepackage{epic,eepic}
 \usepackage{upquote}						% switch curled `'" to straight
-\usepackage{dirtytalk}
 \usepackage{calc}
 \usepackage{xspace}
@@ -46,7 +43,4 @@
 \urlstyle{rm}
 
-\usepackage{tikz}
-\def\checkmark{\tikz\fill[scale=0.4](0,.35) -- (.25,0) -- (1,.7) -- (.25,.15) -- cycle;}
-
 \setlength{\topmargin}{-0.45in}				% move running title into header
 \setlength{\headsep}{0.25in}
@@ -78,5 +72,5 @@
 
 \title{Concurrency in \CFA}
-\author{Thierry Delisle, Waterloo, Ontario, Canada, 2018}
+\author{Thierry Delisle and Peter A. Buhr, Waterloo, Ontario, Canada}
 
 
@@ -85,5 +79,6 @@
 
 \begin{abstract}
-\CFA is a modern, non-object-oriented extension of the C programming language. This thesis serves as a definition and an implementation for the concurrency and parallelism \CFA offers. These features are created from scratch due to the lack of concurrency in ISO C. Lightweight threads are introduced into the language. In addition, monitors are introduced as a high-level tool for control-flow based synchronization and mutual-exclusion. The main contributions of this thesis are two-fold: it extends the existing semantics of monitors introduce by~\cite{Hoare74} to handle monitors in groups and also details the engineering effort needed to introduce these features as core language features. Indeed, these features are added with respect to expectations of C programmers, and integrate with the \CFA type-system and other language features.
+\CFA is a modern, \emph{non-object-oriented} extension of the C programming language.
+This paper serves as a definition and an implementation for the concurrency and parallelism \CFA offers. These features are created from scratch due to the lack of concurrency in ISO C. Lightweight threads are introduced into the language. In addition, monitors are introduced as a high-level tool for control-flow based synchronization and mutual-exclusion. The main contributions of this paper are two-fold: it extends the existing semantics of monitors introduce by~\cite{Hoare74} to handle monitors in groups and also details the engineering effort needed to introduce these features as core language features. Indeed, these features are added with respect to expectations of C programmers, and integrate with the \CFA type-system and other language features.
 \end{abstract}
 
@@ -95,9 +90,10 @@
 \section{Introduction}
 % ======================================================================
-This thesis provides a minimal concurrency \textbf{api} that is simple, efficient and can be reused to build higher-level features. The simplest possible concurrency system is a thread and a lock but this low-level approach is hard to master. An easier approach for users is to support higher-level constructs as the basis of concurrency. Indeed, for highly productive concurrent programming, high-level approaches are much more popular~\cite{HPP:Study}. Examples are task based, message passing and implicit threading. The high-level approach and its minimal \textbf{api} are tested in a dialect of C, called \CFA. Furthermore, the proposed \textbf{api} doubles as an early definition of the \CFA language and library. This thesis also provides an implementation of the concurrency library for \CFA as well as all the required language features added to the source-to-source translator.
+
+This paper provides a minimal concurrency \textbf{api} that is simple, efficient and can be reused to build higher-level features. The simplest possible concurrency system is a thread and a lock but this low-level approach is hard to master. An easier approach for users is to support higher-level constructs as the basis of concurrency. Indeed, for highly productive concurrent programming, high-level approaches are much more popular~\cite{HPP:Study}. Examples are task based, message passing and implicit threading. The high-level approach and its minimal \textbf{api} are tested in a dialect of C, called \CFA. Furthermore, the proposed \textbf{api} doubles as an early definition of the \CFA language and library. This paper also provides an implementation of the concurrency library for \CFA as well as all the required language features added to the source-to-source translator.
 
 There are actually two problems that need to be solved in the design of concurrency for a programming language: which concurrency and which parallelism tools are available to the programmer. While these two concepts are often combined, they are in fact distinct, requiring different tools~\cite{Buhr05a}. Concurrency tools need to handle mutual exclusion and synchronization, while parallelism tools are about performance, cost and resource utilization.
 
-In the context of this thesis, a \textbf{thread} is a fundamental unit of execution that runs a sequence of code, generally on a program stack. Having multiple simultaneous threads gives rise to concurrency and generally requires some kind of locking mechanism to ensure proper execution. Correspondingly, \textbf{concurrency} is defined as the concepts and challenges that occur when multiple independent (sharing memory, timing dependencies, etc.) concurrent threads are introduced. Accordingly, \textbf{locking} (and by extension locks) are defined as a mechanism that prevents the progress of certain threads in order to avoid problems due to concurrency. Finally, in this thesis \textbf{parallelism} is distinct from concurrency and is defined as running multiple threads simultaneously. More precisely, parallelism implies \emph{actual} simultaneous execution as opposed to concurrency which only requires \emph{apparent} simultaneous execution. As such, parallelism is only observable in the differences in performance or, more generally, differences in timing.
+In the context of this paper, a \textbf{thread} is a fundamental unit of execution that runs a sequence of code, generally on a program stack. Having multiple simultaneous threads gives rise to concurrency and generally requires some kind of locking mechanism to ensure proper execution. Correspondingly, \textbf{concurrency} is defined as the concepts and challenges that occur when multiple independent (sharing memory, timing dependencies, etc.) concurrent threads are introduced. Accordingly, \textbf{locking} (and by extension locks) are defined as a mechanism that prevents the progress of certain threads in order to avoid problems due to concurrency. Finally, in this paper \textbf{parallelism} is distinct from concurrency and is defined as running multiple threads simultaneously. More precisely, parallelism implies \emph{actual} simultaneous execution as opposed to concurrency which only requires \emph{apparent} simultaneous execution. As such, parallelism is only observable in the differences in performance or, more generally, differences in timing.
 
 % ======================================================================
@@ -113,5 +109,5 @@
 
 % ======================================================================
-\section{References}
+\subsection{References}
 
 Like \CC, \CFA introduces rebind-able references providing multiple dereferencing as an alternative to pointers. In regards to concurrency, the semantic difference between pointers and references are not particularly relevant, but since this document uses mostly references, here is a quick overview of the semantics:
@@ -132,5 +128,5 @@
 
 % ======================================================================
-\section{Overloading}
+\subsection{Overloading}
 
 Another important feature of \CFA is function overloading as in Java and \CC, where routines with the same name are selected based on the number and type of the arguments. As well, \CFA uses the return type as part of the selection criteria, as in Ada~\cite{Ada}. For routines with multiple parameters and returns, the selection is complex.
@@ -153,5 +149,5 @@
 
 % ======================================================================
-\section{Operators}
+\subsection{Operators}
 Overloading also extends to operators. The syntax for denoting operator-overloading is to name a routine with the symbol of the operator and question marks where the arguments of the operation appear, e.g.:
 \begin{cfacode}
@@ -173,5 +169,5 @@
 
 % ======================================================================
-\section{Constructors/Destructors}
+\subsection{Constructors/Destructors}
 Object lifetime is often a challenge in concurrency. \CFA uses the approach of giving concurrent meaning to object lifetime as a means of synchronization and/or mutual exclusion. Since \CFA relies heavily on the lifetime of objects, constructors and destructors is a core feature required for concurrency and parallelism. \CFA uses the following syntax for constructors and destructors:
 \begin{cfacode}
@@ -208,5 +204,5 @@
 
 % ======================================================================
-\section{Parametric Polymorphism}
+\subsection{Parametric Polymorphism}
 \label{s:ParametricPolymorphism}
 Routines in \CFA can also be reused for multiple types. This capability is done using the \code{forall} clauses, which allow separately compiled routines to support generic usage over multiple types. For example, the following sum function works for any type that supports construction from 0 and addition:
@@ -241,5 +237,5 @@
 
 % ======================================================================
-\section{with Clause/Statement}
+\subsection{with Clause/Statement}
 Since \CFA lacks the concept of a receiver, certain functions end up needing to repeat variable names often. To remove this inconvenience, \CFA provides the \code{with} statement, which opens an aggregate scope making its fields directly accessible (like Pascal).
 \begin{cfacode}
@@ -286,5 +282,5 @@
 
 \section{\protect\CFA's Thread Building Blocks}
-One of the important features that are missing in C is threading\footnote{While the C11 standard defines a ``threads.h'' header, it is minimal and defined as optional. As such, library support for threading is far from widespread. At the time of writing the thesis, neither \texttt{gcc} nor \texttt{clang} support ``threads.h'' in their respective standard libraries.}. On modern architectures, a lack of threading is unacceptable~\cite{Sutter05, Sutter05b}, and therefore modern programming languages must have the proper tools to allow users to write efficient concurrent programs to take advantage of parallelism. As an extension of C, \CFA needs to express these concepts in a way that is as natural as possible to programmers familiar with imperative languages. And being a system-level language means programmers expect to choose precisely which features they need and which cost they are willing to pay.
+One of the important features that are missing in C is threading\footnote{While the C11 standard defines a ``threads.h'' header, it is minimal and defined as optional. As such, library support for threading is far from widespread. At the time of writing the paper, neither \texttt{gcc} nor \texttt{clang} support ``threads.h'' in their respective standard libraries.}. On modern architectures, a lack of threading is unacceptable~\cite{Sutter05, Sutter05b}, and therefore modern programming languages must have the proper tools to allow users to write efficient concurrent programs to take advantage of parallelism. As an extension of C, \CFA needs to express these concepts in a way that is as natural as possible to programmers familiar with imperative languages. And being a system-level language means programmers expect to choose precisely which features they need and which cost they are willing to pay.
 
 \section{Coroutines: A Stepping Stone}\label{coroutine}
@@ -664,5 +660,5 @@
 \end{cfacode}
 
-In this example, threads of type \code{foo} start execution in the \code{void main(foo &)} routine, which prints \code{"Hello World!".} While this thesis encourages this approach to enforce strongly typed programming, users may prefer to use the routine-based thread semantics for the sake of simplicity. With the static semantics it is trivial to write a thread type that takes a function pointer as a parameter and executes it on its stack asynchronously.
+In this example, threads of type \code{foo} start execution in the \code{void main(foo &)} routine, which prints \code{"Hello World!".} While this paper encourages this approach to enforce strongly typed programming, users may prefer to use the routine-based thread semantics for the sake of simplicity. With the static semantics it is trivial to write a thread type that takes a function pointer as a parameter and executes it on its stack asynchronously.
 \begin{cfacode}
 typedef void (*voidFunc)(int);
@@ -999,5 +995,5 @@
 % ======================================================================
 % ======================================================================
-In addition to mutual exclusion, the monitors at the core of \CFA's concurrency can also be used to achieve synchronization. With monitors, this capability is generally achieved with internal or external scheduling as in~\cite{Hoare74}. With \textbf{scheduling} loosely defined as deciding which thread acquires the critical section next, \textbf{internal scheduling} means making the decision from inside the critical section (i.e., with access to the shared state), while \textbf{external scheduling} means making the decision when entering the critical section (i.e., without access to the shared state). Since internal scheduling within a single monitor is mostly a solved problem, this thesis concentrates on extending internal scheduling to multiple monitors. Indeed, like the \textbf{bulk-acq} semantics, internal scheduling extends to multiple monitors in a way that is natural to the user but requires additional complexity on the implementation side.
+In addition to mutual exclusion, the monitors at the core of \CFA's concurrency can also be used to achieve synchronization. With monitors, this capability is generally achieved with internal or external scheduling as in~\cite{Hoare74}. With \textbf{scheduling} loosely defined as deciding which thread acquires the critical section next, \textbf{internal scheduling} means making the decision from inside the critical section (i.e., with access to the shared state), while \textbf{external scheduling} means making the decision when entering the critical section (i.e., without access to the shared state). Since internal scheduling within a single monitor is mostly a solved problem, this paper concentrates on extending internal scheduling to multiple monitors. Indeed, like the \textbf{bulk-acq} semantics, internal scheduling extends to multiple monitors in a way that is natural to the user but requires additional complexity on the implementation side.
 
 First, here is a simple example of internal scheduling:
@@ -1800,8 +1796,8 @@
 A \textbf{cfacluster} is a group of \textbf{kthread} executed in isolation. \textbf{uthread} are scheduled on the \textbf{kthread} of a given \textbf{cfacluster}, allowing organization between \textbf{uthread} and \textbf{kthread}. It is important that \textbf{kthread} belonging to a same \textbf{cfacluster} have homogeneous settings, otherwise migrating a \textbf{uthread} from one \textbf{kthread} to the other can cause issues. A \textbf{cfacluster} also offers a pluggable scheduler that can optimize the workload generated by the \textbf{uthread}.
 
-\textbf{cfacluster} have not been fully implemented in the context of this thesis. Currently \CFA only supports one \textbf{cfacluster}, the initial one.
+\textbf{cfacluster} have not been fully implemented in the context of this paper. Currently \CFA only supports one \textbf{cfacluster}, the initial one.
 
 \subsection{Future Work: Machine Setup}\label{machine}
-While this was not done in the context of this thesis, another important aspect of clusters is affinity. While many common desktop and laptop PCs have homogeneous CPUs, other devices often have more heterogeneous setups. For example, a system using \textbf{numa} configurations may benefit from users being able to tie clusters and/or kernel threads to certain CPU cores. OS support for CPU affinity is now common~\cite{affinityLinux, affinityWindows, affinityFreebsd, affinityNetbsd, affinityMacosx}, which means it is both possible and desirable for \CFA to offer an abstraction mechanism for portable CPU affinity.
+While this was not done in the context of this paper, another important aspect of clusters is affinity. While many common desktop and laptop PCs have homogeneous CPUs, other devices often have more heterogeneous setups. For example, a system using \textbf{numa} configurations may benefit from users being able to tie clusters and/or kernel threads to certain CPU cores. OS support for CPU affinity is now common~\cite{affinityLinux, affinityWindows, affinityFreebsd, affinityNetbsd, affinityMacosx}, which means it is both possible and desirable for \CFA to offer an abstraction mechanism for portable CPU affinity.
 
 \subsection{Paradigms}\label{cfaparadigms}
@@ -1815,5 +1811,5 @@
 The main memory concern for concurrency is queues. All blocking operations are made by parking threads onto queues and all queues are designed with intrusive nodes, where each node has pre-allocated link fields for chaining, to avoid the need for memory allocation. Since several concurrency operations can use an unbound amount of memory (depending on \textbf{bulk-acq}), statically defining information in the intrusive fields of threads is insufficient.The only way to use a variable amount of memory without requiring memory allocation is to pre-allocate large buffers of memory eagerly and store the information in these buffers. Conveniently, the call stack fits that description and is easy to use, which is why it is used heavily in the implementation of internal scheduling, particularly variable-length arrays. Since stack allocation is based on scopes, the first step of the implementation is to identify the scopes that are available to store the information, and which of these can have a variable-length array. The threads and the condition both have a fixed amount of memory, while \code{mutex} routines and blocking calls allow for an unbound amount, within the stack size.
 
-Note that since the major contributions of this thesis are extending monitor semantics to \textbf{bulk-acq} and loose object definitions, any challenges that are not resulting of these characteristics of \CFA are considered as solved problems and therefore not discussed.
+Note that since the major contributions of this paper are extending monitor semantics to \textbf{bulk-acq} and loose object definitions, any challenges that are not resulting of these characteristics of \CFA are considered as solved problems and therefore not discussed.
 
 % ======================================================================
@@ -2615,5 +2611,5 @@
 
 \section{Conclusion}
-This thesis has achieved a minimal concurrency \textbf{api} that is simple, efficient and usable as the basis for higher-level features. The approach presented is based on a lightweight thread-system for parallelism, which sits on top of clusters of processors. This M:N model is judged to be both more efficient and allow more flexibility for users. Furthermore, this document introduces monitors as the main concurrency tool for users. This thesis also offers a novel approach allowing multiple monitors to be accessed simultaneously without running into the Nested Monitor Problem~\cite{Lister77}. It also offers a full implementation of the concurrency runtime written entirely in \CFA, effectively the largest \CFA code base to date.
+This paper has achieved a minimal concurrency \textbf{api} that is simple, efficient and usable as the basis for higher-level features. The approach presented is based on a lightweight thread-system for parallelism, which sits on top of clusters of processors. This M:N model is judged to be both more efficient and allow more flexibility for users. Furthermore, this document introduces monitors as the main concurrency tool for users. This paper also offers a novel approach allowing multiple monitors to be accessed simultaneously without running into the Nested Monitor Problem~\cite{Lister77}. It also offers a full implementation of the concurrency runtime written entirely in \CFA, effectively the largest \CFA code base to date.
 
 
@@ -2625,5 +2621,5 @@
 
 \subsection{Performance} \label{futur:perf}
-This thesis presents a first implementation of the \CFA concurrency runtime. Therefore, there is still significant work to improve performance. Many of the data structures and algorithms may change in the future to more efficient versions. For example, the number of monitors in a single \textbf{bulk-acq} is only bound by the stack size, this is probably unnecessarily generous. It may be possible that limiting the number helps increase performance. However, it is not obvious that the benefit would be significant.
+This paper presents a first implementation of the \CFA concurrency runtime. Therefore, there is still significant work to improve performance. Many of the data structures and algorithms may change in the future to more efficient versions. For example, the number of monitors in a single \textbf{bulk-acq} is only bound by the stack size, this is probably unnecessarily generous. It may be possible that limiting the number helps increase performance. However, it is not obvious that the benefit would be significant.
 
 \subsection{Flexible Scheduling} \label{futur:sched}
@@ -2729,11 +2725,6 @@
 \section{Acknowledgements}
 
-I would like to thank my supervisor, Professor Peter Buhr, for his guidance through my degree as well as the editing of this document.
-
-I would like to thank Professors Martin Karsten and Gregor Richards, for reading my thesis and providing helpful feedback.
-
-Thanks to Aaron Moss, Rob Schluntz and Andrew Beach for their work on the \CFA project as well as all the discussions which have helped me concretize the ideas in this thesis.
-
-Finally, I acknowledge that this has been possible thanks to the financial help offered by the David R. Cheriton School of Computer Science and the corporate partnership with Huawei Ltd.
+Thanks to Aaron Moss, Rob Schluntz and Andrew Beach for their work on the \CFA project as well as all the discussions which helped concretize the ideas in this paper.
+Partial funding was supplied by the Natural Sciences and Engineering Research Council of Canada and a corporate partnership with Huawei Ltd.
 
 
@@ -2744,2 +2735,8 @@
 
 \end{document}
+
+% Local Variables: %
+% tab-width: 4 %
+% fill-column: 120 %
+% compile-command: "make" %
+% End: %
