Index: doc/bibliography/pl.bib
===================================================================
--- doc/bibliography/pl.bib	(revision 03bd4070187160e6426b52ebbbd78d50b591d06a)
+++ doc/bibliography/pl.bib	(revision 1dc58fda13aa9cec81035fe578fe64dcb988ab71)
@@ -375,4 +375,17 @@
     year	= 1991,
     pages	= {21-65},
+}
+
+@article{Hoare61,
+    keywords	= {quick sort},
+    contributer	= {pabuhr@plg},
+    author	= {C. A. R. Hoare},
+    title	= {Algorithms 63/64: Partition/Quicksort},
+    journal	= cacm,
+    volume	= 4,
+    number	= 7,
+    month	= jul,
+    year	= 1961,
+    pages	= {321},
 }
 
@@ -5791,5 +5804,5 @@
 @manual{Python,
     keywords	= {Python},
-    contributer	= {pabuhr},
+    contributer	= {pabuhr@plg},
     title	= {Python Reference Manual, Release 2.5},
     author	= {Guido van Rossum},
@@ -5822,15 +5835,17 @@
 }
 
-@article{Hoare61,
-    keywords	= {quick sort},
-    contributer	= {pabuhr@plg},
-    author	= {C. A. R. Hoare},
-    title	= {Algorithms 63/64: Partition/Quicksort},
-    journal	= cacm,
-    volume	= 4,
-    number	= 7,
-    month	= jul,
-    year	= 1961,
-    pages	= {321},
+@article{Nakaike15,
+    keywords	= {hardware transactional memory},
+    contributer	= {pabuhr@plg},
+    author	= {Nakaike, Takuya and Odaira, Rei and Gaudet, Matthew and Michael, Maged M. and Tomari, Hisanobu},
+    title	= {Quantitative Comparison of Hardware Transactional Memory for Blue Gene/Q, zEnterprise {EC12}, {I}ntel Core, and {POWER8}},
+    journal	= {SIGARCH Comput. Archit. News},
+    volume	= 43,
+    number	= 3,
+    month	= jun,
+    year	= 2015,
+    pages	= {144-157},
+    publisher	= {ACM},
+    address	= {New York, NY, USA},
 }
 
Index: doc/papers/concurrency/Paper.tex
===================================================================
--- doc/papers/concurrency/Paper.tex	(revision 03bd4070187160e6426b52ebbbd78d50b591d06a)
+++ doc/papers/concurrency/Paper.tex	(revision 1dc58fda13aa9cec81035fe578fe64dcb988ab71)
@@ -70,6 +70,4 @@
 %\DeclareTextCommandDefault{\textunderscore}{\leavevmode\makebox[1.2ex][c]{\rule{1ex}{0.1ex}}}
 \renewcommand{\textunderscore}{\leavevmode\makebox[1.2ex][c]{\rule{1ex}{0.075ex}}}
-%\def\myCHarFont{\fontencoding{T1}\selectfont}%
-% \def\{{\ttfamily\upshape\myCHarFont \char`\}}}%
 
 \renewcommand*{\thefootnote}{\Alph{footnote}} % hack because fnsymbol does not work
@@ -1171,4 +1169,44 @@
 The heap-based approach allows arbitrary thread-creation topologies, with respect to fork/join-style concurrency.
 
+Figure~\ref{s:ConcurrentMatrixSummation} shows concurrently adding the rows of a matrix and then totalling the subtotals sequentially, after all the row threads have terminated.
+The program uses heap-based threads because each thread needs different constructor values.
+(Python provides a simple iteration mechanism to initialize array elements to different values, allowing stack allocation.)
+The allocation/deallocation pattern appears unusual because allocated objects are immediately deleted without any intervening code.
+However, for threads, the deletion provides implicit synchronization, which is the intervening code.
+The subtotals are added in linear order rather than completion order, which slightly inhibits concurrency; however, the computation is bounded by the critical-path thread (\ie the thread that takes the longest), so the inhibited concurrency is negligible because totalling the subtotals is trivial.
+
+\begin{figure}
+\begin{cfa}
+thread Adder {
+    int * row, cols, * subtotal;			$\C{// communication}$
+};
+void ?{}( Adder & adder, int row[], int cols, int & subtotal ) {
+    adder.[ row, cols, subtotal ] = [ row, cols, &subtotal ];
+}
+void main( Adder & adder ) with( adder ) {	$\C{// thread body: sum one row}$
+    *subtotal = 0;
+    for ( int c = 0; c < cols; c += 1 ) {
+        *subtotal += row[c];
+    }
+}
+int main() {
+    const int rows = 10, cols = 1000;
+    int matrix[rows][cols], subtotals[rows], total = 0;
+    // read matrix
+    Adder * adders[rows];
+    for ( int r = 0; r < rows; r += 1 ) {	$\C{// start threads to sum rows}$
+        adders[r] = new( matrix[r], cols, &subtotals[r] );
+    }
+    for ( int r = 0; r < rows; r += 1 ) {	$\C{// wait for threads to finish}$
+        delete( adders[r] );				$\C{// termination join}$
+        total += subtotals[r];				$\C{// total subtotal}$
+    }
+    sout | total | endl;
+}
+\end{cfa}
+\caption{Concurrent Matrix Summation}
+\label{s:ConcurrentMatrixSummation}
+\end{figure}
+
 
 \section{Synchronization / Mutual Exclusion}
@@ -1183,12 +1221,12 @@
 In contrast, approaches based on stateful models more closely resemble the standard call/return programming model, resulting in a single programming paradigm.
 
-At the lowest level, concurrent control is implemented as atomic operations, upon which difference kinds of locks/approaches are constructed, \eg semaphores~\cite{Dijkstra68b} and path expressions~\cite{Campbell74}.
+At the lowest level, concurrent control is implemented as atomic operations, upon which different kinds of locking mechanisms are constructed, \eg semaphores~\cite{Dijkstra68b} and path expressions~\cite{Campbell74}.
 However, for productivity it is always desirable to use the highest-level construct that provides the necessary efficiency~\cite{Hochstein05}.
-An newer approach worth mentioning is transactional memory~\cite{Herlihy93}.
-While this approach is pursued in hardware~\cite{} and system languages, like \CC~\cite{Cpp-Transactions}, the performance and feature set is still too restrictive to be the main concurrency paradigm for system languages, which is why it was rejected as the core paradigm for concurrency in \CFA.
+A newer approach is transactional memory~\cite{Herlihy93}.
+While this approach is pursued in both hardware~\cite{Nakaike15} and system languages such as \CC~\cite{Cpp-Transactions}, its performance and feature set are still too restrictive to be the main concurrency paradigm for system languages, which is why it was rejected as the core paradigm for concurrency in \CFA.
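+
+As a brief illustration of this layering, the following sketch builds a spin lock directly from an atomic test-and-set.
+It assumes GCC's atomic builtins and is illustrative only, not the \CFA runtime implementation.
+\begin{cfa}
+typedef volatile char SpinLock;				// 0 => unlocked, 1 => locked
+void lock( SpinLock * l ) {
+    while ( __atomic_test_and_set( l, __ATOMIC_ACQUIRE ) ) {
+        // busy-wait: flag was already set, so another thread holds the lock
+    }
+}
+void unlock( SpinLock * l ) {
+    __atomic_clear( l, __ATOMIC_RELEASE );	// clear the flag, releasing a spinning thread
+}
+\end{cfa}
+For comparison, a hedged sketch of the transactional style, assuming GCC's transactional-memory language extension (compiled with -fgnu-tm):
+\begin{cfa}
+int counter = 0;
+void inc() {
+    __transaction_atomic {					// statements execute as one atomic transaction
+        counter += 1;
+    }
+}
+\end{cfa}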
 
 One of the most natural, elegant, and efficient mechanisms for synchronization and mutual exclusion for shared-memory systems is the \emph{monitor}.
 Monitors were first proposed by Brinch Hansen~\cite{Hansen73} and later described and extended by C.A.R.~Hoare~\cite{Hoare74}.
-Many programming languages---\eg Concurrent Pascal~\cite{ConcurrentPascal}, Mesa~\cite{Mesa}, Modula~\cite{Modula-2}, Turing~\cite{Turing:old}, Modula-3~\cite{Modula-3}, NeWS~\cite{NeWS}, Emerald~\cite{Emerald}, \uC~\cite{Buhr92a} and Java~\cite{Java}---provide monitors as explicit language constructs.
+Many programming languages -- \eg Concurrent Pascal~\cite{ConcurrentPascal}, Mesa~\cite{Mesa}, Modula~\cite{Modula-2}, Turing~\cite{Turing:old}, Modula-3~\cite{Modula-3}, NeWS~\cite{NeWS}, Emerald~\cite{Emerald}, \uC~\cite{Buhr92a} and Java~\cite{Java} -- provide monitors as explicit language constructs.
 In addition, operating-system kernels and device drivers have a monitor-like structure, although they often use lower-level primitives such as semaphores or locks to simulate monitors.
 For these reasons, this project proposes monitors as the core concurrency construct, upon which even higher-level approaches can be easily constructed.
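+
+As a foretaste, the following sketch shows a counter protected by a monitor; it assumes the \CFA monitor and mutex-parameter syntax developed in later sections, and is a sketch rather than a complete example.
+\begin{cfa}
+monitor Counter {							// implicit monitor lock protects the shared state
+    int cnt;
+};
+void ?{}( Counter & c ) { c.cnt = 0; }		// constructor
+void inc( Counter & mutex c ) { c.cnt += 1; }	// mutex parameter acquires/releases the monitor
+int get( Counter & mutex c ) { return c.cnt; }
+\end{cfa}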
