Index: benchmark/io/http/protocol.cfa
===================================================================
--- benchmark/io/http/protocol.cfa	(revision eb3bc528997e6f7c0319bb7a75aa224152546e3f)
+++ benchmark/io/http/protocol.cfa	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
@@ -173,21 +173,4 @@
 }
 
-static void zero_sqe(struct io_uring_sqe * sqe) {
-	sqe->flags = 0;
-	sqe->ioprio = 0;
-	sqe->fd = 0;
-	sqe->off = 0;
-	sqe->addr = 0;
-	sqe->len = 0;
-	sqe->fsync_flags = 0;
-	sqe->__pad2[0] = 0;
-	sqe->__pad2[1] = 0;
-	sqe->__pad2[2] = 0;
-	sqe->fd = 0;
-	sqe->off = 0;
-	sqe->addr = 0;
-	sqe->len = 0;
-}
-
 enum FSM_STATE {
 	Initial,
Index: doc/theses/mubeen_zulfiqar_MMath/allocator.tex
===================================================================
--- doc/theses/mubeen_zulfiqar_MMath/allocator.tex	(revision eb3bc528997e6f7c0319bb7a75aa224152546e3f)
+++ doc/theses/mubeen_zulfiqar_MMath/allocator.tex	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
@@ -1,51 +1,10 @@
 \chapter{Allocator}
 
-\noindent
-====================
-
-Writing Points:
-\begin{itemize}
-\item
-Objective of uHeapLmmm.
-\item
-Design philosophy.
-\item
-Background and previous design of uHeapLmmm.
-\item
-Distributed design of uHeapLmmm.
-
------ SHOULD WE GIVE IMPLEMENTATION DETAILS HERE? -----
-
-\PAB{Maybe. There might be an Implementation chapter.}
-\item
-figure.
-\item
-Advantages of distributed design.
-\end{itemize}
-
-The new features added to uHeapLmmm (incl. @malloc_size@ routine)
-\CFA alloc interface with examples.
-
-\begin{itemize}
-\item
-Why did we need it?
-\item
-The added benefits.
-\end{itemize}
-
-
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% uHeapLmmm Design
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-
-\section{Objective of uHeapLmmm}
-UHeapLmmm is a lightweight memory allocator. The objective behind uHeapLmmm is to design a minimal concurrent memory allocator that has new features and also fulfills GNU C Library requirements (FIX ME: cite requirements).
-
-\subsection{Design philosophy}
-The objective of uHeapLmmm's new design was to fulfill following requirements:
-\begin{itemize}
-\item It should be concurrent to be used in multi-threaded programs.
+\section{uHeap}
+uHeap is a lightweight memory allocator. The objective behind uHeap is to design a minimal concurrent memory allocator that has new features and also fulfills GNU C Library requirements (FIX ME: cite requirements).
+
+The objective of uHeap's new design was to fulfill the following requirements:
+\begin{itemize}
+\item It should be concurrent and thread-safe for multi-threaded programs.
 \item It should avoid global locks, on resources shared across all threads, as much as possible.
 \item It's performance (FIX ME: cite performance benchmarks) should be comparable to the commonly used allocators (FIX ME: cite common allocators).
@@ -55,14 +14,21 @@
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 
-\section{Background and previous design of uHeapLmmm}
-uHeapLmmm was originally designed by X in X (FIX ME: add original author after confirming with Peter).
-(FIX ME: make and add figure of previous design with description)
-
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-
-\section{Distributed design of uHeapLmmm}
-uHeapLmmm's design was reviewed and changed to fulfill new requirements (FIX ME: cite allocator philosophy). For this purpose, following two designs of uHeapLmm were proposed:
-
-\paragraph{Design 1: Decentralized}
+\section{Design choices for uHeap}
+uHeap's design was reviewed and changed to fulfill new requirements (FIX ME: cite allocator philosophy). For this purpose, the following four designs of uHeap were proposed:
+
+\paragraph{Design 1: Centralized}
+One heap, but lower bucket sizes are N-shared across kernel threads (KTs).
+This design leverages the fact that 95\% of allocation requests are less than 512 bytes and there are only 3--5 different request sizes.
+When KTs $\le$ N, the important bucket sizes are uncontended.
+When KTs $>$ N, the free buckets are contended.
+Therefore, threads are only contending for a small number of buckets, which are distributed among them to reduce contention.
+\begin{cquote}
+\centering
+\input{AllocDS2}
+\end{cquote}
+Problems: need to know when a kernel thread (KT) is created and destroyed to know when to assign a shared bucket-number.
+When no thread is assigned a bucket number, its free storage is unavailable. All KTs contend for one lock on sbrk for their initial allocations (before the free lists get populated).
+
+\paragraph{Design 2: Decentralized N Heaps}
 Fixed number of heaps: shard the heap into N heaps each with a bump-area allocated from the @sbrk@ area.
 Kernel threads (KT) are assigned to the N heaps.
@@ -77,41 +43,130 @@
 Problems: need to know when a KT is created and destroyed to know when to assign/un-assign a heap to the KT.
 
-\paragraph{Design 2: Centralized}
-One heap, but lower bucket sizes are N-shared across KTs.
-This design leverages the fact that 95\% of allocation requests are less than 512 bytes and there are only 3--5 different request sizes.
-When KTs $\le$ N, the important bucket sizes are uncontented.
-When KTs $>$ N, the free buckets are contented.
-Therefore, threads are only contending for a small number of buckets, which are distributed among them to reduce contention.
-\begin{cquote}
+\paragraph{Design 3: Decentralized Per-thread Heaps}
+Design 3 is similar to Design 2 but instead of having an M:N model, it uses a 1:1 model. So, instead of having N heaps and sharing them among M KTs, Design 3 has one heap for each KT.
+Dynamic number of heaps: create a thread-local heap for each kernel thread (KT) with a bump-area allocated from the @sbrk@ area.
+Each KT will have its own exclusive thread-local heap. The heap is uncontended between KTs regardless of how many KTs have been created.
+Operations on @sbrk@ area will still be protected by locks.
+%\begin{cquote}
+%\centering
+%\input{AllocDS3} FIXME add figs
+%\end{cquote}
+Problems: We cannot destroy a heap when its KT exits because dynamic objects have ownership: every object points back to its owner heap and is returned to that heap when the program frees it. For example, if thread A allocates an object O, passes it to thread B, and then exits, O must still return to A's heap when B frees it. Hence, A's heap must be preserved for the lifetime of the program, as there might be objects in use by other threads that were allocated by A. Also, we need to know when a KT is created and destroyed to know when to create/destroy a heap for the KT.
+
+\paragraph{Design 4: Decentralized Per-CPU Heaps}
+Design 4 is similar to Design 3 but instead of having a heap for each thread, it creates a heap for each CPU.
+Fixed number of heaps for a machine: create a heap for each CPU with a bump-area allocated from the @sbrk@ area.
+Each CPU will have its own CPU-local heap. When the program performs a dynamic memory operation, it is serviced by the heap of the CPU on which the process is currently running.
+Each CPU will have its own exclusive heap. Just like Design 3 (FIXME cite), the heap is uncontended between KTs regardless of how many KTs have been created.
+Operations on @sbrk@ area will still be protected by locks.
+To deal with preemption during a dynamic memory operation, librseq (FIXME cite) will be used to make sure that the whole dynamic memory operation completes on one CPU. librseq's restartable sequences make it possible to re-run a critical section and undo its writes if a preemption happens during the critical section's execution.
+%\begin{cquote}
+%\centering
+%\input{AllocDS4} FIXME add figs
+%\end{cquote}
+
+Problems: This approach was slower than the per-thread model. Also, librseq does not provide restartable sequences that detect preemptions in a user-level threading system, which is important to us as CFA (FIXME cite) has its own threading system that we want to support.
+
+Out of the four designs, Design 3 was chosen for the following reasons:
+\begin{itemize}
+\item
+Decentralized designs are better in general than the centralized design because their concurrency is better across all bucket sizes: Design 1 shards only a few buckets of selected sizes, while the other designs shard all the buckets. Decentralized designs shard the whole heap, which has all the buckets, and additionally shard the sbrk area. So Design 1 was eliminated.
+\item
+Design 2 was eliminated because it can suffer contention when KTs $>$ N, while Designs 3 and 4 have no contention in any scenario.
+\item
+Design 4 was eliminated because it was slower than Design 3 and it provided no way to achieve user-threading safety using librseq. We had to use CFA interruption handling to achieve user-threading safety, which has some cost. Design 4 was already slower than Design 3; adding the cost of interruption handling on top of that would have made it even slower.
+\end{itemize}
+
+
+\subsection{Advantages of distributed design}
+
+The distributed design of uHeap is concurrent so that it works in multi-threaded applications.
+
+Some key benefits of the distributed design of uHeap are as follows:
+
+\begin{itemize}
+\item
+The bump allocation is concurrent as memory taken from sbrk is sharded across all heaps as bump allocation reserve. The call to sbrk will be protected using locks but bump allocation (on memory taken from sbrk) will not be contended once the sbrk call has returned.
+\item
+Low or almost no contention on heap resources.
+\item
+It is possible to use sharing and stealing techniques to share/find unused storage, when a free list is unused or empty.
+\item
+The distributed design avoids unnecessary locks on resources shared across all KTs.
+\end{itemize}
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\section{uHeap Structure}
+
+As described in (FIXME cite 2.4), uHeap uses the following features of multi-threaded memory allocators.
+\begin{itemize}
+\item
+uHeap has multiple heaps without a global heap and uses a 1:1 model. (FIXME cite 2.5 1:1 model)
+\item
+uHeap uses object ownership. (FIXME cite 2.5.2)
+\item
+uHeap does not use object containers (FIXME cite 2.6) or any coalescing technique. Instead, each dynamic object allocated by uHeap has a header that contains bookkeeping information.
+\item
+Each thread-local heap in uHeap has its own allocation buffer that is taken from the system using sbrk() call. (FIXME cite 2.7)
+\item
+Unless a heap is freeing an object that is owned by another thread's heap or a heap is using the sbrk() system call, uHeap is mostly lock-free, which eliminates most of the contention on shared resources. (FIXME cite 2.8)
+\end{itemize}
+
+As uHeap uses a heap-per-thread model to reduce contention on heap resources, we manage a list of heaps (heap-list) that can be used by threads. The list is empty at the start of the program. When a kernel thread (KT) is created, we check if the heap-list is empty. If it is not empty, a heap is removed from the heap-list and given to the new KT to use exclusively. If it is empty, a new heap object is created in dynamic memory and given to the new KT to use exclusively. When a KT exits, its heap is not destroyed; instead, it is put on the heap-list, ready to be reused by new KTs.
+
+This reduces the memory footprint as the objects on the free lists of a KT that has exited can be reused by a new KT. Also, we preserve all the heaps that were created during the lifetime of the program until the end of the program. uHeap uses object ownership, where an object is freed to the free-buckets of the heap that allocated it. Even after a KT A has exited, its heap has to be preserved as there might be objects in use by other threads that were initially allocated by A and then passed to other threads.
+
+\begin{figure}
 \centering
-\input{AllocDS2}
-\end{cquote}
-Problems: need to know when a kernel thread (KT) is created and destroyed to know when to assign a shared bucket-number.
-When no thread is assigned a bucket number, its free storage is unavailable. All KTs will be contended for one lock on sbrk for their initial allocations (before free-lists gets populated).
-
-Out of the two designs, Design 1 was chosen because it's concurrency is better across all bucket-sizes as design-2 shards a few buckets of selected sizes while design-1 shards all the buckets. Design-2 shards the whole heap which has all the buckets with the addition of sharding sbrk area.
-
-\subsection{Advantages of distributed design}
-The distributed design of uHeapLmmm is concurrent to work in multi-threaded applications.
-
-Some key benefits of the distributed design of uHeapLmmm are as follows:
-
-\begin{itemize}
-\item
-The bump allocation is concurrent as memory taken from sbrk is sharded across all heaps as bump allocation reserve. The lock on bump allocation (on memory taken from sbrk) will only be contended if KTs $<$ N. The contention on sbrk area is less likely as it will only happen in the case if heaps assigned to two KTs get short of bump allocation reserve simultanously.
-\item
-N heaps are created at the start of the program and destroyed at the end of program. When a KT is created, we only assign it to one of the heaps. When a KT is destroyed, we only dissociate it from the assigned heap but we do not destroy that heap. That heap will go back to our pool-of-heaps, ready to be used by some new KT. And if that heap was shared among multiple KTs (like the case of KTs $<$ N) then, on deletion of one KT, that heap will be still in-use of the other KTs. This will prevent creation and deletion of heaps during run-time as heaps are re-usable which helps in keeping low-memory footprint.
-\item
-It is possible to use sharing and stealing techniques to share/find unused storage, when a free list is unused or empty.
-\item
-Distributed design avoids unnecassry locks on resources shared across all KTs.
-\end{itemize}
-
-FIX ME: Cite performance comparison of the two heap designs if required
+\includegraphics[width=0.65\textwidth]{figures/NewHeapStructure.eps}
+\caption{Heap Structure}
+\label{fig:heapStructureFig}
+\end{figure}
+
+Each heap uses segregated free-buckets that hold free objects of a specific size. Each free-bucket of a specific size has the following two lists in it:
+\begin{itemize}
+\item
+The free list is used when a thread is freeing an object that is owned by its own heap, so the free list does not use any locks/atomic-operations as it is only used by the owner KT.
+\item
+The away list is used when a thread A is freeing an object that is owned by another KT B's heap. This object should be freed to the owner heap (B's heap), so A will place the object on the away list of B. The away list is lock protected as it is shared by all other threads.
+\end{itemize}
+
+When a dynamic object of size S is requested, the thread-local heap checks if S is greater than or equal to the mmap threshold. Any request at or above the mmap threshold is fulfilled by allocating an mmap area of that size; such requests are not allocated in the sbrk area. The value of this threshold can be changed using the mallopt routine, but the new value should not be larger than our biggest free-bucket size.
+
+Algorithm~\ref{alg:heapObjectAlloc} briefly shows how an allocation request is fulfilled.
+
+\begin{algorithm}
+\caption{Dynamic object allocation of size S}\label{alg:heapObjectAlloc}
+\begin{algorithmic}[1]
+\State $\textit{O} \gets \text{NULL}$
+\If {$S < \textit{mmap-threshold}$}
+	\State $\textit{B} \gets (\text{smallest free-bucket} \geq S)$
+	\If {$\textit{B's free-list is empty}$}
+		\If {$\textit{B's away-list is empty}$}
+			\If {$\textit{heap's allocation buffer} < S$}
+				\State $\text{get allocation buffer using system call sbrk()}$
+			\EndIf
+			\State $\textit{O} \gets \text{bump allocate an object of size S from allocation buffer}$
+		\Else
+			\State $\textit{merge B's away-list into free-list}$
+			\State $\textit{O} \gets \text{pop an object from B's free-list}$
+		\EndIf
+	\Else
+		\State $\textit{O} \gets \text{pop an object from B's free-list}$
+	\EndIf
+	\State $\textit{O's owner} \gets \text{B}$
+\Else
+	\State $\textit{O} \gets \text{allocate dynamic memory using system call mmap with size S}$
+\EndIf
+\State $\Return \textit{ O}$
+\end{algorithmic}
+\end{algorithm}
+
 
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 
 \section{Added Features and Methods}
-To improve the UHeapLmmm allocator (FIX ME: cite uHeapLmmm) interface and make it more user friendly, we added a few more routines to the C allocator. Also, we built a \CFA (FIX ME: cite cforall) interface on top of C interface to increase the usability of the allocator.
+To improve the uHeap allocator (FIX ME: cite uHeap) interface and make it more user friendly, we added a few more routines to the C allocator. Also, we built a \CFA (FIX ME: cite cforall) interface on top of the C interface to increase the usability of the allocator.
 
 \subsection{C Interface}
@@ -207,5 +262,5 @@
 @addr@: the address of the currently allocated dynamic object.
 \end{itemize}
-@malloc_alignment@ returns the alignment of the given dynamic object. On failure, it return the value of default alignment of the uHeapLmmm allocator.
+@malloc_alignment@ returns the alignment of the given dynamic object. On failure, it returns the default alignment of the uHeap allocator.
 
 \subsection{\lstinline{bool malloc_zero_fill( void * addr )}}
@@ -247,5 +302,5 @@
 
 \subsection{\CFA Malloc Interface}
-We added some routines to the malloc interface of \CFA. These routines can only be used in \CFA and not in our standalone uHeapLmmm allocator as these routines use some features that are only provided by \CFA and not by C. It makes the allocator even more usable to the programmers.
+We added some routines to the malloc interface of \CFA. These routines can only be used in \CFA and not in our standalone uHeap allocator as these routines use some features that are only provided by \CFA and not by C. It makes the allocator even more usable to the programmers.
 \CFA provides the liberty to know the returned type of a call to the allocator. So, mainly in these added routines, we removed the object size parameter from the routine as allocator can calculate the size of the object from the returned type.
 
@@ -378,5 +433,5 @@
 
 \subsection{Alloc Interface}
-In addition to improve allocator interface both for \CFA and our standalone allocator uHeapLmmm in C. We also added a new alloc interface in \CFA that increases usability of dynamic memory allocation.
+In addition to improving the allocator interface for both \CFA and our standalone allocator uHeap in C, we also added a new alloc interface in \CFA that increases the usability of dynamic memory allocation.
 This interface helps programmers in three major ways.
 
Index: doc/theses/mubeen_zulfiqar_MMath/benchmarks.tex
===================================================================
--- doc/theses/mubeen_zulfiqar_MMath/benchmarks.tex	(revision eb3bc528997e6f7c0319bb7a75aa224152546e3f)
+++ doc/theses/mubeen_zulfiqar_MMath/benchmarks.tex	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
@@ -216,28 +216,2 @@
 \paragraph{Relevant Knobs}
 *** FIX ME: Insert Relevant Knobs
-
-
-
-\section{Existing Memory Allocators}
-With dynamic allocation being an important feature of C, there are many stand-alone memory allocators that have been designed for different purposes. For this thesis, we chose 7 of the most popular and widely used memory allocators.
-
-\paragraph{dlmalloc}
-dlmalloc (FIX ME: cite allocator) is a thread-safe allocator that is single threaded and single heap. dlmalloc maintains free-lists of different sizes to store freed dynamic memory. (FIX ME: cite wasik)
-
-\paragraph{hoard}
-Hoard (FIX ME: cite allocator) is a thread-safe allocator that is multi-threaded and using a heap layer framework. It has per-thread heaps that have thread-local free-lists, and a global shared heap. (FIX ME: cite wasik)
-
-\paragraph{jemalloc}
-jemalloc (FIX ME: cite allocator) is a thread-safe allocator that uses multiple arenas. Each thread is assigned an arena. Each arena has chunks that contain contagious memory regions of same size. An arena has multiple chunks that contain regions of multiple sizes.
-
-\paragraph{ptmalloc}
-ptmalloc (FIX ME: cite allocator) is a modification of dlmalloc. It is a thread-safe multi-threaded memory allocator that uses multiple heaps. ptmalloc heap has similar design to dlmalloc's heap.
-
-\paragraph{rpmalloc}
-rpmalloc (FIX ME: cite allocator) is a thread-safe allocator that is multi-threaded and uses per-thread heap. Each heap has multiple size-classes and each size-class contains memory regions of the relevant size.
-
-\paragraph{tbb malloc}
-tbb malloc (FIX ME: cite allocator) is a thread-safe allocator that is multi-threaded and uses private heap for each thread. Each private-heap has multiple bins of different sizes. Each bin contains free regions of the same size.
-
-\paragraph{tc malloc}
-tcmalloc (FIX ME: cite allocator) is a thread-safe allocator. It uses per-thread cache to store free objects that prevents contention on shared resources in multi-threaded application. A central free-list is used to refill per-thread cache when it gets empty.
Index: doc/theses/mubeen_zulfiqar_MMath/performance.tex
===================================================================
--- doc/theses/mubeen_zulfiqar_MMath/performance.tex	(revision eb3bc528997e6f7c0319bb7a75aa224152546e3f)
+++ doc/theses/mubeen_zulfiqar_MMath/performance.tex	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
@@ -18,4 +18,42 @@
 \noindent
 ====================
+
+\section{Machine Specification}
+
+The performance experiments were run on three different multicore systems to determine if there is consistency across platforms:
+\begin{itemize}
+\item
+AMD EPYC 7662, 64-core socket $\times$ 2, 2.0 GHz
+\item
+Huawei ARM TaiShan 2280 V2 Kunpeng 920, 24-core socket $\times$ 4, 2.6 GHz
+\item
+Intel Xeon Gold 5220R, 48-core socket $\times$ 2, 2.2 GHz
+\end{itemize}
+
+
+\section{Existing Memory Allocators}
+With dynamic allocation being an important feature of C, there are many stand-alone memory allocators that have been designed for different purposes. For this thesis, we chose 7 of the most popular and widely used memory allocators.
+
+\paragraph{dlmalloc}
+dlmalloc (FIX ME: cite allocator) is a thread-safe allocator that is single threaded and single heap. dlmalloc maintains free-lists of different sizes to store freed dynamic memory. (FIX ME: cite wasik)
+
+\paragraph{hoard}
+Hoard (FIX ME: cite allocator) is a thread-safe allocator that is multi-threaded and uses a heap-layer framework. It has per-thread heaps that have thread-local free-lists, and a global shared heap. (FIX ME: cite wasik)
+
+\paragraph{jemalloc}
+jemalloc (FIX ME: cite allocator) is a thread-safe allocator that uses multiple arenas. Each thread is assigned an arena. Each arena has chunks that contain contiguous memory regions of the same size. An arena has multiple chunks that contain regions of multiple sizes.
+
+\paragraph{ptmalloc}
+ptmalloc (FIX ME: cite allocator) is a modification of dlmalloc. It is a thread-safe multi-threaded memory allocator that uses multiple heaps. The ptmalloc heap has a design similar to dlmalloc's heap.
+
+\paragraph{rpmalloc}
+rpmalloc (FIX ME: cite allocator) is a thread-safe allocator that is multi-threaded and uses a per-thread heap. Each heap has multiple size-classes and each size-class contains memory regions of the relevant size.
+
+\paragraph{tbb malloc}
+tbb malloc (FIX ME: cite allocator) is a thread-safe allocator that is multi-threaded and uses a private heap for each thread. Each private heap has multiple bins of different sizes. Each bin contains free regions of the same size.
+
+\paragraph{tcmalloc}
+tcmalloc (FIX ME: cite allocator) is a thread-safe allocator. It uses a per-thread cache to store free objects, which prevents contention on shared resources in multi-threaded applications. A central free-list is used to refill the per-thread cache when it becomes empty.
+
 
 \section{Memory Allocators}
Index: doc/theses/mubeen_zulfiqar_MMath/uw-ethesis.tex
===================================================================
--- doc/theses/mubeen_zulfiqar_MMath/uw-ethesis.tex	(revision eb3bc528997e6f7c0319bb7a75aa224152546e3f)
+++ doc/theses/mubeen_zulfiqar_MMath/uw-ethesis.tex	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
@@ -86,4 +86,7 @@
 \usepackage{tabularx}
 \usepackage{subfigure}
+
+\usepackage{algorithm}
+\usepackage{algpseudocode}
 
 % Hyperlinks make it very easy to navigate an electronic document.
Index: libcfa/src/concurrency/io.cfa
===================================================================
--- libcfa/src/concurrency/io.cfa	(revision eb3bc528997e6f7c0319bb7a75aa224152546e3f)
+++ libcfa/src/concurrency/io.cfa	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
@@ -175,5 +175,5 @@
 			/* paranoid */ verify( ! __preemption_enabled() );
 
-			ctx.proc->io.pending = false;
+			__atomic_store_n(&ctx.proc->io.pending, false, __ATOMIC_RELAXED);
 		}
 
@@ -287,5 +287,5 @@
 	//=============================================================================================
 	// submission
-	static inline void __submit( struct $io_context * ctx, __u32 idxs[], __u32 have, bool lazy) {
+	static inline void __submit_only( struct $io_context * ctx, __u32 idxs[], __u32 have) {
 		// We can proceed to the fast path
 		// Get the right objects
@@ -304,6 +304,12 @@
 		sq.to_submit += have;
 
-		ctx->proc->io.pending = true;
-		ctx->proc->io.dirty   = true;
+		__atomic_store_n(&ctx->proc->io.pending, true, __ATOMIC_RELAXED);
+		__atomic_store_n(&ctx->proc->io.dirty  , true, __ATOMIC_RELAXED);
+	}
+
+	static inline void __submit( struct $io_context * ctx, __u32 idxs[], __u32 have, bool lazy) {
+		__sub_ring_t & sq = ctx->sq;
+		__submit_only(ctx, idxs, have);
+
 		if(sq.to_submit > 30) {
 			__tls_stats()->io.flush.full++;
@@ -402,8 +408,12 @@
 // I/O Arbiter
 //=============================================================================================
-	static inline void block(__outstanding_io_queue & queue, __outstanding_io & item) {
+	static inline bool enqueue(__outstanding_io_queue & queue, __outstanding_io & item) {
+		bool was_empty;
+
 		// Lock the list, it's not thread safe
 		lock( queue.lock __cfaabi_dbg_ctx2 );
 		{
+			was_empty = empty(queue.queue);
+
 			// Add our request to the list
 			add( queue.queue, item );
@@ -414,5 +424,5 @@
 		unlock( queue.lock );
 
-		wait( item.sem );
+		return was_empty;
 	}
 
@@ -432,5 +442,7 @@
 		pa.want = want;
 
-		block(this.pending, (__outstanding_io&)pa);
+		enqueue(this.pending, (__outstanding_io&)pa);
+
+		wait( pa.sem );
 
 		return pa.ctx;
@@ -485,5 +497,14 @@
 		ei.lazy = lazy;
 
-		block(ctx->ext_sq, (__outstanding_io&)ei);
+		bool we = enqueue(ctx->ext_sq, (__outstanding_io&)ei);
+
+		__atomic_store_n(&ctx->proc->io.pending, true, __ATOMIC_SEQ_CST);
+
+		if( we ) {
+			sigval_t value = { PREEMPT_IO };
+			pthread_sigqueue(ctx->proc->kernel_thread, SIGUSR1, value);
+		}
+
+		wait( ei.sem );
 
 		__cfadbg_print_safe(io, "Kernel I/O : %u submitted from arbiter\n", have);
@@ -501,5 +522,5 @@
 					__external_io & ei = (__external_io&)drop( ctx.ext_sq.queue );
 
-					__submit(&ctx, ei.idxs, ei.have, ei.lazy);
+					__submit_only(&ctx, ei.idxs, ei.have);
 
 					post( ei.sem );
Index: libcfa/src/concurrency/io/setup.cfa
===================================================================
--- libcfa/src/concurrency/io/setup.cfa	(revision eb3bc528997e6f7c0319bb7a75aa224152546e3f)
+++ libcfa/src/concurrency/io/setup.cfa	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
@@ -56,4 +56,5 @@
 
 	#include "bitmanip.hfa"
+	#include "fstream.hfa"
 	#include "kernel_private.hfa"
 	#include "thread.hfa"
@@ -258,4 +259,13 @@
 		struct __sub_ring_t & sq = this.sq;
 		struct __cmp_ring_t & cq = this.cq;
+		{
+			__u32 fhead = sq.free_ring.head;
+			__u32 ftail = sq.free_ring.tail;
+
+			__u32 total = *sq.num;
+			__u32 avail = ftail - fhead;
+
+			if(avail != total) abort | "Processor (" | (void*)this.proc | ") tearing down ring with" | (total - avail) | "entries allocated but not submitted, out of" | total;
+		}
 
 		// unmap the submit queue entries
Index: libcfa/src/concurrency/io/types.hfa
===================================================================
--- libcfa/src/concurrency/io/types.hfa	(revision eb3bc528997e6f7c0319bb7a75aa224152546e3f)
+++ libcfa/src/concurrency/io/types.hfa	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
@@ -23,4 +23,5 @@
 #include "bits/locks.hfa"
 #include "bits/queue.hfa"
+#include "iofwd.hfa"
 #include "kernel/fwd.hfa"
 
@@ -170,21 +171,2 @@
 	// void __ioctx_prepare_block($io_context & ctx);
 #endif
-
-//-----------------------------------------------------------------------
-// IO user data
-struct io_future_t {
-	future_t self;
-	__s32 result;
-};
-
-static inline {
-	thread$ * fulfil( io_future_t & this, __s32 result, bool do_unpark = true ) {
-		this.result = result;
-		return fulfil(this.self, do_unpark);
-	}
-
-	// Wait for the future to be fulfilled
-	bool wait     ( io_future_t & this ) { return wait     (this.self); }
-	void reset    ( io_future_t & this ) { return reset    (this.self); }
-	bool available( io_future_t & this ) { return available(this.self); }
-}
Index: libcfa/src/concurrency/iofwd.hfa
===================================================================
--- libcfa/src/concurrency/iofwd.hfa	(revision eb3bc528997e6f7c0319bb7a75aa224152546e3f)
+++ libcfa/src/concurrency/iofwd.hfa	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
@@ -19,4 +19,5 @@
 extern "C" {
 	#include <asm/types.h>
+	#include <sys/stat.h> // needed for mode_t
 	#if CFA_HAVE_LINUX_IO_URING_H
 		#include <linux/io_uring.h>
@@ -24,4 +25,5 @@
 }
 #include "bits/defs.hfa"
+#include "kernel/fwd.hfa"
 #include "time.hfa"
 
@@ -47,5 +49,4 @@
 
 struct cluster;
-struct io_future_t;
 struct $io_context;
 
@@ -57,4 +58,23 @@
 
 struct io_uring_sqe;
+
+//-----------------------------------------------------------------------
+// IO user data
+struct io_future_t {
+	future_t self;
+	__s32 result;
+};
+
+static inline {
+	thread$ * fulfil( io_future_t & this, __s32 result, bool do_unpark = true ) {
+		this.result = result;
+		return fulfil(this.self, do_unpark);
+	}
+
+	// Wait for the future to be fulfilled
+	bool wait     ( io_future_t & this ) { return wait     (this.self); }
+	void reset    ( io_future_t & this ) { return reset    (this.self); }
+	bool available( io_future_t & this ) { return available(this.self); }
+}
 
 //----------
@@ -133,2 +153,21 @@
 // Check if a function is blocks a only the user thread
 bool has_user_level_blocking( fptr_t func );
+
+#if CFA_HAVE_LINUX_IO_URING_H
+	static inline void zero_sqe(struct io_uring_sqe * sqe) {
+		sqe->flags = 0;
+		sqe->ioprio = 0;
+		sqe->fd = 0;
+		sqe->off = 0;
+		sqe->addr = 0;
+		sqe->len = 0;
+		sqe->fsync_flags = 0;
+		sqe->__pad2[0] = 0;
+		sqe->__pad2[1] = 0;
+		sqe->__pad2[2] = 0;
+		sqe->fd = 0;
+		sqe->off = 0;
+		sqe->addr = 0;
+		sqe->len = 0;
+	}
+#endif
Index: libcfa/src/concurrency/kernel.cfa
===================================================================
--- libcfa/src/concurrency/kernel.cfa	(revision eb3bc528997e6f7c0319bb7a75aa224152546e3f)
+++ libcfa/src/concurrency/kernel.cfa	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
@@ -251,5 +251,5 @@
 			if( __atomic_load_n(&this->do_terminate, __ATOMIC_SEQ_CST) ) break MAIN_LOOP;
 
-			if(this->io.pending && !this->io.dirty) {
+			if(__atomic_load_n(&this->io.pending, __ATOMIC_RELAXED) && !__atomic_load_n(&this->io.dirty, __ATOMIC_RELAXED)) {
 				__IO_STATS__(true, io.flush.dirty++; )
 				__cfa_io_flush( this, 0 );
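
Reviewer note on the hunk above: a minimal sketch, not the runtime's actual code, of reading the two I/O flags with relaxed atomic loads instead of plain reads. The names `io_state` and `should_flush` are hypothetical and exist only for illustration; `__atomic_load_n` and `__ATOMIC_RELAXED` are the GCC/Clang builtins the patch itself uses.

```cpp
// Hypothetical stand-in for the processor's `io` sub-structure.
struct io_state {
	volatile bool pending;
	volatile bool dirty;
};

// Mirrors the patched condition: flush only when work is pending and the
// context is not marked dirty. Each flag is read once with a relaxed load.
inline bool should_flush( io_state & io ) {
	return __atomic_load_n( &io.pending, __ATOMIC_RELAXED )
		&& !__atomic_load_n( &io.dirty, __ATOMIC_RELAXED );
}
```

A relaxed load only makes each read atomic; it imposes no ordering against other memory operations, so the sketch assumes, as the patched condition appears to, that a slightly stale value is acceptable here.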
Index: libcfa/src/concurrency/kernel.hfa
===================================================================
--- libcfa/src/concurrency/kernel.hfa	(revision eb3bc528997e6f7c0319bb7a75aa224152546e3f)
+++ libcfa/src/concurrency/kernel.hfa	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
@@ -92,6 +92,6 @@
 	struct {
 		$io_context * ctx;
-		bool pending;
-		bool dirty;
+		volatile bool pending;
+		volatile bool dirty;
 	} io;
 
Index: libcfa/src/concurrency/kernel/fwd.hfa
===================================================================
--- libcfa/src/concurrency/kernel/fwd.hfa	(revision eb3bc528997e6f7c0319bb7a75aa224152546e3f)
+++ libcfa/src/concurrency/kernel/fwd.hfa	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
@@ -347,5 +347,5 @@
 					struct oneshot * want = expected == 0p ? 1p : 2p;
 					if(__atomic_compare_exchange_n(&this.ptr, &expected, want, false, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)) {
-						if( expected == 0p ) { /* paranoid */ verify( this.ptr == 1p); return 0p; }
+						if( expected == 0p ) { return 0p; }
 						thread$ * ret = post( *expected, do_unpark );
 						__atomic_store_n( &this.ptr, 1p, __ATOMIC_SEQ_CST);
Index: libcfa/src/concurrency/kernel_private.hfa
===================================================================
--- libcfa/src/concurrency/kernel_private.hfa	(revision eb3bc528997e6f7c0319bb7a75aa224152546e3f)
+++ libcfa/src/concurrency/kernel_private.hfa	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
@@ -60,4 +60,10 @@
 extern bool __preemption_enabled();
 
+enum {
+	PREEMPT_NORMAL    = 0,
+	PREEMPT_TERMINATE = 1,
+	PREEMPT_IO = 2,
+};
+
 static inline void __disable_interrupts_checked() {
 	/* paranoid */ verify( __preemption_enabled() );
Index: libcfa/src/concurrency/preemption.cfa
===================================================================
--- libcfa/src/concurrency/preemption.cfa	(revision eb3bc528997e6f7c0319bb7a75aa224152546e3f)
+++ libcfa/src/concurrency/preemption.cfa	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
@@ -96,9 +96,4 @@
 	lock{};
 }
-
-enum {
-	PREEMPT_NORMAL    = 0,
-	PREEMPT_TERMINATE = 1,
-};
 
 //=============================================================================================
@@ -664,4 +659,5 @@
 	choose(sfp->si_value.sival_int) {
 		case PREEMPT_NORMAL   : ;// Normal case, nothing to do here
+		case PREEMPT_IO       : ;// I/O asked to stop spinning, nothing to do here
 		case PREEMPT_TERMINATE: verify( __atomic_load_n( &__cfaabi_tls.this_processor->do_terminate, __ATOMIC_SEQ_CST ) );
 		default:
Index: src/AST/GenericSubstitution.cpp
===================================================================
--- src/AST/GenericSubstitution.cpp	(revision eb3bc528997e6f7c0319bb7a75aa224152546e3f)
+++ src/AST/GenericSubstitution.cpp	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
@@ -45,5 +45,5 @@
 			visit_children = false;
 			const AggregateDecl * aggr = ty->aggr();
-			sub = TypeSubstitution{ aggr->params.begin(), aggr->params.end(), ty->params.begin() };
+			sub = TypeSubstitution( aggr->params, ty->params );
 		}
 
Index: src/AST/TypeSubstitution.hpp
===================================================================
--- src/AST/TypeSubstitution.hpp	(revision eb3bc528997e6f7c0319bb7a75aa224152546e3f)
+++ src/AST/TypeSubstitution.hpp	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
@@ -37,4 +37,6 @@
   public:
 	TypeSubstitution();
+	template< typename FormalContainer, typename ActualContainer >
+	TypeSubstitution( FormalContainer formals, ActualContainer actuals );
 	template< typename FormalIterator, typename ActualIterator >
 	TypeSubstitution( FormalIterator formalBegin, FormalIterator formalEnd, ActualIterator actualBegin );
@@ -76,6 +78,8 @@
 	bool empty() const;
 
+	template< typename FormalContainer, typename ActualContainer >
+	void addAll( FormalContainer formals, ActualContainer actuals );
 	template< typename FormalIterator, typename ActualIterator >
-	void add( FormalIterator formalBegin, FormalIterator formalEnd, ActualIterator actualBegin );
+	void addAll( FormalIterator formalBegin, FormalIterator formalEnd, ActualIterator actualBegin );
 
 	/// create a new TypeSubstitution using bindings from env containing all of the type variables in expr
@@ -112,7 +116,24 @@
 };
 
+template< typename FormalContainer, typename ActualContainer >
+TypeSubstitution::TypeSubstitution( FormalContainer formals, ActualContainer actuals ) {
+	assert( formals.size() == actuals.size() );
+	addAll( formals.begin(), formals.end(), actuals.begin() );
+}
+
+template< typename FormalIterator, typename ActualIterator >
+TypeSubstitution::TypeSubstitution( FormalIterator formalBegin, FormalIterator formalEnd, ActualIterator actualBegin ) {
+	addAll( formalBegin, formalEnd, actualBegin );
+}
+
+template< typename FormalContainer, typename ActualContainer >
+void TypeSubstitution::addAll( FormalContainer formals, ActualContainer actuals ) {
+	assert( formals.size() == actuals.size() );
+	addAll( formals.begin(), formals.end(), actuals.begin() );
+}
+
 // this is the only place where type parameters outside a function formal may be substituted.
 template< typename FormalIterator, typename ActualIterator >
-void TypeSubstitution::add( FormalIterator formalBegin, FormalIterator formalEnd, ActualIterator actualBegin ) {
+void TypeSubstitution::addAll( FormalIterator formalBegin, FormalIterator formalEnd, ActualIterator actualBegin ) {
 	// FormalIterator points to a TypeDecl
 	// ActualIterator points to a Type
@@ -129,16 +150,8 @@
 			} // if
 		} else {
-			
+			// Is this an error?
 		} // if
 	} // for
 }
-
-
-
-template< typename FormalIterator, typename ActualIterator >
-TypeSubstitution::TypeSubstitution( FormalIterator formalBegin, FormalIterator formalEnd, ActualIterator actualBegin ) {
-	add( formalBegin, formalEnd, actualBegin );
-}
-
 
 } // namespace ast
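
Reviewer note on the hunks above: a minimal sketch of the pattern the patch introduces, where a container overload checks that both sequences have the same length and then delegates to the iterator overload. `Substitution`, `lookup` and the string-to-string map are hypothetical stand-ins for `ast::TypeSubstitution` and its real bindings. Unlike the patch, the sketch takes the containers by const reference, which is an assumption of this example and not what the changeset does.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in: binds formal names to actual names.
class Substitution {
public:
	Substitution() = default;

	// Container overload: verify matching sizes, then delegate.
	template< typename FormalContainer, typename ActualContainer >
	Substitution( const FormalContainer & formals, const ActualContainer & actuals ) {
		assert( formals.size() == actuals.size() );
		addAll( formals.begin(), formals.end(), actuals.begin() );
	}

	// Iterator overload: walks both sequences in lock step.
	template< typename FormalIterator, typename ActualIterator >
	void addAll( FormalIterator formalBegin, FormalIterator formalEnd, ActualIterator actualBegin ) {
		for ( ; formalBegin != formalEnd; ++formalBegin, ++actualBegin ) {
			bindings[ *formalBegin ] = *actualBegin;
		}
	}

	std::size_t size() const { return bindings.size(); }
	const std::string & lookup( const std::string & formal ) const {
		return bindings.at( formal );
	}

private:
	std::map<std::string, std::string> bindings;
};
```

The size check in the container overload is what the iterator-only interface could not express, since the iterator overload receives no end iterator for the actuals.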
Index: src/Common/Examine.cc
===================================================================
--- src/Common/Examine.cc	(revision eb3bc528997e6f7c0319bb7a75aa224152546e3f)
+++ src/Common/Examine.cc	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
@@ -5,16 +5,18 @@
 // file "LICENCE" distributed with Cforall.
 //
-// Examine.h --
+// Examine.cc -- Helpers for examining AST code.
 //
 // Author           : Andrew Beach
 // Created On       : Wed Sept 2 14:02 2020
 // Last Modified By : Andrew Beach
-// Last Modified On : Wed Sep  8 12:15 2020
-// Update Count     : 0
+// Last Modified On : Fri Dec 10 10:27 2021
+// Update Count     : 1
 //
 
 #include "Common/Examine.h"
 
+#include "AST/Type.hpp"
 #include "CodeGen/OperatorTable.h"
+#include "InitTweak/InitTweak.h"
 
 DeclarationWithType * isMainFor( FunctionDecl * func, AggregateDecl::Aggregate kind ) {
@@ -36,4 +38,35 @@
 
 namespace {
+
+// getTypeofThis but does some extra checks used in this module.
+const ast::Type * getTypeofThisSolo( const ast::FunctionDecl * func ) {
+	if ( 1 != func->params.size() ) {
+		return nullptr;
+	}
+	auto ref = func->type->params.front().as<ast::ReferenceType>();
+	return (ref) ? ref->base : nullptr;
+}
+
+}
+
+const ast::DeclWithType * isMainFor(
+		const ast::FunctionDecl * func, ast::AggregateDecl::Aggregate kind ) {
+	if ( "main" != func->name ) return nullptr;
+	if ( 1 != func->params.size() ) return nullptr;
+
+	auto param = func->params.front();
+
+	auto type = dynamic_cast<const ast::ReferenceType *>( param->get_type() );
+	if ( !type ) return nullptr;
+
+	auto obj = type->base.as<ast::StructInstType>();
+	if ( !obj ) return nullptr;
+
+	if ( kind != obj->base->kind ) return nullptr;
+
+	return param;
+}
+
+namespace {
 	Type * getDestructorParam( FunctionDecl * func ) {
 		if ( !CodeGen::isDestructor( func->name ) ) return nullptr;
@@ -48,4 +81,11 @@
 		return nullptr;
 	}
+
+const ast::Type * getDestructorParam( const ast::FunctionDecl * func ) {
+	if ( !CodeGen::isDestructor( func->name ) ) return nullptr;
+	//return InitTweak::getParamThis( func )->type;
+	return getTypeofThisSolo( func );
+}
+
 }
 
@@ -57,2 +97,11 @@
 	return false;
 }
+
+bool isDestructorFor(
+		const ast::FunctionDecl * func, const ast::StructDecl * type_decl ) {
+	if ( const ast::Type * type = getDestructorParam( func ) ) {
+		auto stype = dynamic_cast<const ast::StructInstType *>( type );
+		return stype && stype->base.get() == type_decl;
+	}
+	return false;
+}
Index: src/Common/Examine.h
===================================================================
--- src/Common/Examine.h	(revision eb3bc528997e6f7c0319bb7a75aa224152546e3f)
+++ src/Common/Examine.h	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
@@ -5,19 +5,24 @@
 // file "LICENCE" distributed with Cforall.
 //
-// Examine.h --
+// Examine.h -- Helpers for examining AST code.
 //
 // Author           : Andrew Beach
 // Created On       : Wed Sept 2 13:57 2020
 // Last Modified By : Andrew Beach
-// Last Modified On : Wed Sep  8 12:08 2020
-// Update Count     : 0
+// Last Modified On : Fri Dec 10 10:28 2021
+// Update Count     : 1
 //
 
+#include "AST/Decl.hpp"
 #include "SynTree/Declaration.h"
 
 /// Check if this is a main function for a type of an aggregate kind.
 DeclarationWithType * isMainFor( FunctionDecl * func, AggregateDecl::Aggregate kind );
+const ast::DeclWithType * isMainFor(
+	const ast::FunctionDecl * func, ast::AggregateDecl::Aggregate kind );
 // Returns a pointer to the parameter if true, nullptr otherwise.
 
 /// Check if this function is a destructor for the given structure.
 bool isDestructorFor( FunctionDecl * func, StructDecl * type_decl );
+bool isDestructorFor(
+	const ast::FunctionDecl * func, const ast::StructDecl * type );
Index: src/Concurrency/Keywords.cc
===================================================================
--- src/Concurrency/Keywords.cc	(revision eb3bc528997e6f7c0319bb7a75aa224152546e3f)
+++ src/Concurrency/Keywords.cc	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
@@ -422,6 +422,7 @@
 			;
 		else if ( auto param = isMainFor( decl, cast_target ) ) {
-			// This should never trigger.
-			assert( vtable_decl );
+			if ( !vtable_decl ) {
+				SemanticError( decl, context_error );
+			}
 			// Should be safe because of isMainFor.
 			StructInstType * struct_type = static_cast<StructInstType *>(
Index: src/Concurrency/KeywordsNew.cpp
===================================================================
--- src/Concurrency/KeywordsNew.cpp	(revision eb3bc528997e6f7c0319bb7a75aa224152546e3f)
+++ src/Concurrency/KeywordsNew.cpp	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
@@ -10,6 +10,6 @@
 // Created On       : Tue Nov 16  9:53:00 2021
 // Last Modified By : Andrew Beach
-// Last Modified On : Wed Dec  1 11:24:00 2021
-// Update Count     : 1
+// Last Modified On : Fri Mar 11 10:40:00 2022
+// Update Count     : 2
 //
 
@@ -18,10 +18,15 @@
 #include "AST/Copy.hpp"
 #include "AST/Decl.hpp"
+#include "AST/Expr.hpp"
 #include "AST/Pass.hpp"
 #include "AST/Stmt.hpp"
+#include "AST/DeclReplacer.hpp"
 #include "AST/TranslationUnit.hpp"
 #include "CodeGen/OperatorTable.h"
+#include "Common/Examine.h"
 #include "Common/utility.h"
+#include "ControlStruct/LabelGeneratorNew.hpp"
 #include "InitTweak/InitTweak.h"
+#include "Virtual/Tables.h"
 
 namespace Concurrency {
@@ -29,9 +34,841 @@
 namespace {
 
-inline static bool isThread( const ast::DeclWithType * decl ) {
+// --------------------------------------------------------------------------
+// Loose Helper Functions:
+
+/// Detect threads constructed with the keyword thread.
+bool isThread( const ast::DeclWithType * decl ) {
 	auto baseType = decl->get_type()->stripDeclarator();
 	auto instType = dynamic_cast<const ast::StructInstType *>( baseType );
 	if ( nullptr == instType ) { return false; }
 	return instType->base->is_thread();
+}
+
+/// Get the virtual type id if given a type name.
+std::string typeIdType( std::string const & exception_name ) {
+	return exception_name.empty() ? std::string()
+		: Virtual::typeIdType( exception_name );
+}
+
+/// Get the vtable type name if given a type name.
+std::string vtableTypeName( std::string const & exception_name ) {
+	return exception_name.empty() ? std::string()
+		: Virtual::vtableTypeName( exception_name );
+}
+
+static ast::Type * mutate_under_references( ast::ptr<ast::Type>& type ) {
+	ast::Type * mutType = type.get_and_mutate();
+	for ( ast::ReferenceType * mutRef
+		; (mutRef = dynamic_cast<ast::ReferenceType *>( mutType ))
+		; mutType = mutRef->base.get_and_mutate() );
+	return mutType;
+}
+
+// Adds the generic parameters of the struct, and the uses of those generic
+// parameters, to the function and its first "this" argument.
+ast::FunctionDecl * fixupGenerics(
+		const ast::FunctionDecl * func, const ast::StructDecl * decl ) {
+	const CodeLocation & location = decl->location;
+	// We have to update both the declaration and its function type.
+	auto mutFunc = ast::mutate( func );
+	auto mutType = mutFunc->type.get_and_mutate();
+
+	if ( decl->params.empty() ) {
+		return mutFunc;
+	}
+
+	assert( 0 != mutFunc->params.size() );
+	assert( 0 != mutType->params.size() );
+
+	// Add the "forall" clause information.
+	for ( const ast::ptr<ast::TypeDecl> & typeParam : decl->params ) {
+		auto typeDecl = ast::deepCopy( typeParam );
+		mutFunc->type_params.push_back( typeDecl );
+		mutType->forall.push_back(
+			new ast::TypeInstType( typeDecl->name, typeDecl ) );
+		for ( auto & assertion : typeDecl->assertions ) {
+			mutFunc->assertions.push_back( assertion );
+			mutType->assertions.emplace_back(
+				new ast::VariableExpr( location, assertion ) );
+		}
+		typeDecl->assertions.clear();
+	}
+
+	// Even chain_mutate is not powerful enough for this:
+	ast::ptr<ast::Type>& paramType = strict_dynamic_cast<ast::ObjectDecl *>(
+		mutFunc->params[0].get_and_mutate() )->type;
+	auto paramTypeInst = strict_dynamic_cast<ast::StructInstType *>(
+		mutate_under_references( paramType ) );
+	auto typeParamInst = strict_dynamic_cast<ast::StructInstType *>(
+		mutate_under_references( mutType->params[0] ) );
+
+	for ( const ast::ptr<ast::TypeDecl> & typeDecl : mutFunc->type_params ) {
+		paramTypeInst->params.push_back(
+			new ast::TypeExpr( location,
+				new ast::TypeInstType( typeDecl->name, typeDecl ) ) );
+		typeParamInst->params.push_back(
+			new ast::TypeExpr( location,
+				new ast::TypeInstType( typeDecl->name, typeDecl ) ) );
+	}
+
+	return mutFunc;
+}
+
+// --------------------------------------------------------------------------
+struct ConcurrentSueKeyword : public ast::WithDeclsToAdd<> {
+	ConcurrentSueKeyword(
+		std::string&& type_name, std::string&& field_name,
+		std::string&& getter_name, std::string&& context_error,
+		std::string&& exception_name,
+		bool needs_main, ast::AggregateDecl::Aggregate cast_target
+	) :
+		type_name( type_name ), field_name( field_name ),
+		getter_name( getter_name ), context_error( context_error ),
+		exception_name( exception_name ),
+		typeid_name( typeIdType( exception_name ) ),
+		vtable_name( vtableTypeName( exception_name ) ),
+		needs_main( needs_main ), cast_target( cast_target )
+	{}
+
+	virtual ~ConcurrentSueKeyword() {}
+
+	const ast::Decl * postvisit( const ast::StructDecl * decl );
+	const ast::DeclWithType * postvisit( const ast::FunctionDecl * decl );
+	const ast::Expr * postvisit( const ast::KeywordCastExpr * expr );
+
+	struct StructAndField {
+		const ast::StructDecl * decl;
+		const ast::ObjectDecl * field;
+	};
+
+	const ast::StructDecl * handleStruct( const ast::StructDecl * );
+	void handleMain( const ast::FunctionDecl *, const ast::StructInstType * );
+	void addTypeId( const ast::StructDecl * );
+	void addVtableForward( const ast::StructDecl * );
+	const ast::FunctionDecl * forwardDeclare( const ast::StructDecl * );
+	StructAndField addField( const ast::StructDecl * );
+	void addGetRoutines( const ast::ObjectDecl *, const ast::FunctionDecl * );
+	void addLockUnlockRoutines( const ast::StructDecl * );
+
+private:
+	const std::string type_name;
+	const std::string field_name;
+	const std::string getter_name;
+	const std::string context_error;
+	const std::string exception_name;
+	const std::string typeid_name;
+	const std::string vtable_name;
+	const bool needs_main;
+	const ast::AggregateDecl::Aggregate cast_target;
+
+	const ast::StructDecl   * type_decl = nullptr;
+	const ast::FunctionDecl * dtor_decl = nullptr;
+	const ast::StructDecl * except_decl = nullptr;
+	const ast::StructDecl * typeid_decl = nullptr;
+	const ast::StructDecl * vtable_decl = nullptr;
+};
+
+// Handles thread type declarations:
+//
+// thread MyThread {                         struct MyThread {
+//  int data;                                  int data;
+//  a_struct_t more_data;                      a_struct_t more_data;
+//                                =>             thread$ __thrd_d;
+// };                                        };
+//                                           static inline thread$ * get_thread( MyThread * this ) { return &this->__thrd_d; }
+//
+struct ThreadKeyword final : public ConcurrentSueKeyword {
+	ThreadKeyword() : ConcurrentSueKeyword(
+		"thread$",
+		"__thrd",
+		"get_thread",
+		"thread keyword requires threads to be in scope, add #include <thread.hfa>\n",
+		"ThreadCancelled",
+		true,
+		ast::AggregateDecl::Thread )
+	{}
+
+	virtual ~ThreadKeyword() {}
+};
+
+// Handles coroutine type declarations:
+//
+// coroutine MyCoroutine {                   struct MyCoroutine {
+//  int data;                                  int data;
+//  a_struct_t more_data;                      a_struct_t more_data;
+//                                =>             coroutine$ __cor_d;
+// };                                        };
+//                                           static inline coroutine$ * get_coroutine( MyCoroutine * this ) { return &this->__cor_d; }
+//
+struct CoroutineKeyword final : public ConcurrentSueKeyword {
+	CoroutineKeyword() : ConcurrentSueKeyword(
+		"coroutine$",
+		"__cor",
+		"get_coroutine",
+		"coroutine keyword requires coroutines to be in scope, add #include <coroutine.hfa>\n",
+		"CoroutineCancelled",
+		true,
+		ast::AggregateDecl::Coroutine )
+	{}
+
+	virtual ~CoroutineKeyword() {}
+};
+
+// Handles monitor type declarations:
+//
+// monitor MyMonitor {                       struct MyMonitor {
+//  int data;                                  int data;
+//  a_struct_t more_data;                      a_struct_t more_data;
+//                                =>             monitor$ __mon_d;
+// };                                        };
+//                                           static inline monitor$ * get_monitor( MyMonitor * this ) {
+//                                               return &this->__mon_d;
+//                                           }
+//                                           void lock(MyMonitor & this) {
+//                                               lock(get_monitor(this));
+//                                           }
+//                                           void unlock(MyMonitor & this) {
+//                                               unlock(get_monitor(this));
+//                                           }
+//
+struct MonitorKeyword final : public ConcurrentSueKeyword {
+	MonitorKeyword() : ConcurrentSueKeyword(
+		"monitor$",
+		"__mon",
+		"get_monitor",
+		"monitor keyword requires monitors to be in scope, add #include <monitor.hfa>\n",
+		"",
+		false,
+		ast::AggregateDecl::Monitor )
+	{}
+
+	virtual ~MonitorKeyword() {}
+};
+
+// Handles generator type declarations:
+//
+// generator MyGenerator {                   struct MyGenerator {
+//  int data;                                  int data;
+//  a_struct_t more_data;                      a_struct_t more_data;
+//                                =>             int __generator_state;
+// };                                        };
+//
+struct GeneratorKeyword final : public ConcurrentSueKeyword {
+	GeneratorKeyword() : ConcurrentSueKeyword(
+		"generator$",
+		"__generator_state",
+		"get_generator",
+		"Unable to find builtin type generator$\n",
+		"",
+		true,
+		ast::AggregateDecl::Generator )
+	{}
+
+	virtual ~GeneratorKeyword() {}
+};
+
+const ast::Decl * ConcurrentSueKeyword::postvisit(
+		const ast::StructDecl * decl ) {
+	if ( !decl->body ) {
+		return decl;
+	} else if ( cast_target == decl->kind ) {
+		return handleStruct( decl );
+	} else if ( type_name == decl->name ) {
+		assert( !type_decl );
+		type_decl = decl;
+	} else if ( exception_name == decl->name ) {
+		assert( !except_decl );
+		except_decl = decl;
+	} else if ( typeid_name == decl->name ) {
+		assert( !typeid_decl );
+		typeid_decl = decl;
+	} else if ( vtable_name == decl->name ) {
+		assert( !vtable_decl );
+		vtable_decl = decl;
+	}
+	return decl;
+}
+
+// Try to get the full definition, but raise an error on conflicts.
+const ast::FunctionDecl * getDefinition(
+		const ast::FunctionDecl * old_decl,
+		const ast::FunctionDecl * new_decl ) {
+	if ( !new_decl->stmts ) {
+		return old_decl;
+	} else if ( !old_decl->stmts ) {
+		return new_decl;
+	} else {
+		assert( !old_decl->stmts || !new_decl->stmts );
+		return nullptr;
+	}
+}
+
+const ast::DeclWithType * ConcurrentSueKeyword::postvisit(
+		const ast::FunctionDecl * decl ) {
+	if ( type_decl && isDestructorFor( decl, type_decl ) ) {
+		// Check for forward declarations, try to get the full definition.
+		dtor_decl = (dtor_decl) ? getDefinition( dtor_decl, decl ) : decl;
+	} else if ( !vtable_name.empty() && decl->has_body() ) {
+		if (const ast::DeclWithType * param = isMainFor( decl, cast_target )) {
+			if ( !vtable_decl ) {
+				SemanticError( decl, context_error );
+			}
+			// Should be safe because of isMainFor.
+			const ast::StructInstType * struct_type =
+				static_cast<const ast::StructInstType *>(
+					static_cast<const ast::ReferenceType *>(
+						param->get_type() )->base.get() );
+
+			handleMain( decl, struct_type );
+		}
+	}
+	return decl;
+}
+
+const ast::Expr * ConcurrentSueKeyword::postvisit(
+		const ast::KeywordCastExpr * expr ) {
+	if ( cast_target == expr->target ) {
+		// Convert `(thread &)ex` to `(thread$ &)*get_thread(ex)`, etc.
+		if ( !type_decl || !dtor_decl ) {
+			SemanticError( expr, context_error );
+		}
+		assert( nullptr == expr->result );
+		auto cast = ast::mutate( expr );
+		cast->result = new ast::ReferenceType( new ast::StructInstType( type_decl ) );
+		cast->concrete_target.field  = field_name;
+		cast->concrete_target.getter = getter_name;
+		return cast;
+	}
+	return expr;
+}
+
+const ast::StructDecl * ConcurrentSueKeyword::handleStruct(
+		const ast::StructDecl * decl ) {
+	assert( decl->body );
+
+	if ( !type_decl || !dtor_decl ) {
+		SemanticError( decl, context_error );
+	}
+
+	if ( !exception_name.empty() ) {
+		if( !typeid_decl || !vtable_decl ) {
+			SemanticError( decl, context_error );
+		}
+		addTypeId( decl );
+		addVtableForward( decl );
+	}
+
+	const ast::FunctionDecl * func = forwardDeclare( decl );
+	StructAndField addFieldRet = addField( decl );
+	decl = addFieldRet.decl;
+	const ast::ObjectDecl * field = addFieldRet.field;
+
+	addGetRoutines( field, func );
+	// Add routines to monitors for use by mutex stmt.
+	if ( ast::AggregateDecl::Monitor == cast_target ) {
+		addLockUnlockRoutines( decl );
+	}
+
+	return decl;
+}
+
+void ConcurrentSueKeyword::handleMain(
+		const ast::FunctionDecl * decl, const ast::StructInstType * type ) {
+	assert( vtable_decl );
+	assert( except_decl );
+
+	const CodeLocation & location = decl->location;
+
+	std::vector<ast::ptr<ast::Expr>> poly_args = {
+		new ast::TypeExpr( location, type ),
+	};
+	ast::ObjectDecl * vtable_object = Virtual::makeVtableInstance(
+		location,
+		"_default_vtable_object_declaration",
+		new ast::StructInstType( vtable_decl, copy( poly_args ) ),
+		type,
+		nullptr
+	);
+	declsToAddAfter.push_back( vtable_object );
+	declsToAddAfter.push_back(
+		new ast::ObjectDecl(
+			location,
+			Virtual::concurrentDefaultVTableName(),
+			new ast::ReferenceType( vtable_object->type, ast::CV::Const ),
+			new ast::SingleInit( location,
+				new ast::VariableExpr( location, vtable_object ) ),
+			ast::Storage::Classes(),
+			ast::Linkage::Cforall
+		)
+	);
+	declsToAddAfter.push_back( Virtual::makeGetExceptionFunction(
+		location,
+		vtable_object,
+		new ast::StructInstType( except_decl, copy( poly_args ) )
+	) );
+}
+
+void ConcurrentSueKeyword::addTypeId( const ast::StructDecl * decl ) {
+	assert( typeid_decl );
+	const CodeLocation & location = decl->location;
+
+	ast::StructInstType * typeid_type =
+		new ast::StructInstType( typeid_decl, ast::CV::Const );
+	typeid_type->params.push_back(
+		new ast::TypeExpr( location, new ast::StructInstType( decl ) ) );
+	declsToAddBefore.push_back(
+		Virtual::makeTypeIdInstance( location, typeid_type ) );
+	// If the typeid_type is going to be kept, the other reference will have
+	// been made by now, but we also get to avoid extra mutates.
+	ast::ptr<ast::StructInstType> typeid_cleanup = typeid_type;
+}
+
+void ConcurrentSueKeyword::addVtableForward( const ast::StructDecl * decl ) {
+	assert( vtable_decl );
+	const CodeLocation& location = decl->location;
+
+	std::vector<ast::ptr<ast::Expr>> poly_args = {
+		new ast::TypeExpr( location, new ast::StructInstType( decl ) ),
+	};
+	declsToAddBefore.push_back( Virtual::makeGetExceptionForward(
+		location,
+		new ast::StructInstType( vtable_decl, copy( poly_args ) ),
+		new ast::StructInstType( except_decl, copy( poly_args ) )
+	) );
+	ast::ObjectDecl * vtable_object = Virtual::makeVtableForward(
+		location,
+		"_default_vtable_object_declaration",
+		new ast::StructInstType( vtable_decl, std::move( poly_args ) )
+	);
+	declsToAddBefore.push_back( vtable_object );
+	declsToAddBefore.push_back(
+		new ast::ObjectDecl(
+			location,
+			Virtual::concurrentDefaultVTableName(),
+			new ast::ReferenceType( vtable_object->type, ast::CV::Const ),
+			nullptr,
+			ast::Storage::Extern,
+			ast::Linkage::Cforall
+		)
+	);
+}
+
+const ast::FunctionDecl * ConcurrentSueKeyword::forwardDeclare(
+		const ast::StructDecl * decl ) {
+	const CodeLocation & location = decl->location;
+
+	ast::StructDecl * forward = ast::deepCopy( decl );
+	{
+		// If removing members makes ref-count go to zero, do not free.
+		ast::ptr<ast::StructDecl> forward_ptr = forward;
+		forward->body = false;
+		forward->members.clear();
+		forward_ptr.release();
+	}
+
+	ast::ObjectDecl * this_decl = new ast::ObjectDecl(
+		location,
+		"this",
+		new ast::ReferenceType( new ast::StructInstType( decl ) ),
+		nullptr,
+		ast::Storage::Classes(),
+		ast::Linkage::Cforall
+	);
+
+	ast::ObjectDecl * ret_decl = new ast::ObjectDecl(
+		location,
+		"ret",
+		new ast::PointerType( new ast::StructInstType( type_decl ) ),
+		nullptr,
+		ast::Storage::Classes(),
+		ast::Linkage::Cforall
+	);
+
+	ast::FunctionDecl * get_decl = new ast::FunctionDecl(
+		location,
+		getter_name,
+		{}, // forall
+		{ this_decl }, // params
+		{ ret_decl }, // returns
+		nullptr, // stmts
+		ast::Storage::Static,
+		ast::Linkage::Cforall,
+		{ new ast::Attribute( "const" ) },
+		ast::Function::Inline
+	);
+	get_decl = fixupGenerics( get_decl, decl );
+
+	ast::FunctionDecl * main_decl = nullptr;
+	if ( needs_main ) {
+		// `this_decl` is copied here because the original was used above.
+		main_decl = new ast::FunctionDecl(
+			location,
+			"main",
+			{},
+			{ ast::deepCopy( this_decl ) },
+			{},
+			nullptr,
+			ast::Storage::Classes(),
+			ast::Linkage::Cforall
+		);
+		main_decl = fixupGenerics( main_decl, decl );
+	}
+
+	declsToAddBefore.push_back( forward );
+	if ( needs_main ) declsToAddBefore.push_back( main_decl );
+	declsToAddBefore.push_back( get_decl );
+
+	return get_decl;
+}
+
+ConcurrentSueKeyword::StructAndField ConcurrentSueKeyword::addField(
+		const ast::StructDecl * decl ) {
+	const CodeLocation & location = decl->location;
+
+	ast::ObjectDecl * field = new ast::ObjectDecl(
+		location,
+		field_name,
+		new ast::StructInstType( type_decl ),
+		nullptr,
+		ast::Storage::Classes(),
+		ast::Linkage::Cforall
+	);
+
+	auto mutDecl = ast::mutate( decl );
+	mutDecl->members.push_back( field );
+
+	return {mutDecl, field};
+}
+
+void ConcurrentSueKeyword::addGetRoutines(
+		const ast::ObjectDecl * field, const ast::FunctionDecl * forward ) {
+	// Say it is generated at the "same" place as the forward declaration.
+	const CodeLocation & location = forward->location;
+
+	const ast::DeclWithType * param = forward->params.front();
+	ast::Stmt * stmt = new ast::ReturnStmt( location,
+		new ast::AddressExpr( location,
+			new ast::MemberExpr( location,
+				field,
+				new ast::CastExpr( location,
+					new ast::VariableExpr( location, param ),
+					ast::deepCopy( param->get_type()->stripReferences() ),
+					ast::ExplicitCast
+				)
+			)
+		)
+	);
+
+	ast::FunctionDecl * decl = ast::deepCopy( forward );
+	decl->stmts = new ast::CompoundStmt( location, { stmt } );
+	declsToAddAfter.push_back( decl );
+}
+
+void ConcurrentSueKeyword::addLockUnlockRoutines(
+		const ast::StructDecl * decl ) {
+	// This should only be used on monitors.
+	assert( ast::AggregateDecl::Monitor == cast_target );
+
+	const CodeLocation & location = decl->location;
+
+	// The parameter for both routines.
+	ast::ObjectDecl * this_decl = new ast::ObjectDecl(
+		location,
+		"this",
+		new ast::ReferenceType( new ast::StructInstType( decl ) ),
+		nullptr,
+		ast::Storage::Classes(),
+		ast::Linkage::Cforall
+	);
+
+	ast::FunctionDecl * lock_decl = new ast::FunctionDecl(
+		location,
+		"lock",
+		{ /* forall */ },
+		{
+			// Copy the declaration of this.
+			ast::deepCopy( this_decl ),
+		},
+		{ /* returns */ },
+		nullptr,
+		ast::Storage::Static,
+		ast::Linkage::Cforall,
+		{ /* attributes */ },
+		ast::Function::Inline
+	);
+	lock_decl = fixupGenerics( lock_decl, decl );
+
+	lock_decl->stmts = new ast::CompoundStmt( location, {
+		new ast::ExprStmt( location,
+			new ast::UntypedExpr( location,
+				new ast::NameExpr( location, "lock" ),
+				{
+					new ast::UntypedExpr( location,
+						new ast::NameExpr( location, "get_monitor" ),
+						{ new ast::VariableExpr( location,
+							InitTweak::getParamThis( lock_decl ) ) }
+					)
+				}
+			)
+		)
+	} );
+
+	ast::FunctionDecl * unlock_decl = new ast::FunctionDecl(
+		location,
+		"unlock",
+		{ /* forall */ },
+		{
+			// Last use, consume the declaration of this.
+			this_decl,
+		},
+		{ /* returns */ },
+		nullptr,
+		ast::Storage::Static,
+		ast::Linkage::Cforall,
+		{ /* attributes */ },
+		ast::Function::Inline
+	);
+	unlock_decl = fixupGenerics( unlock_decl, decl );
+
+	unlock_decl->stmts = new ast::CompoundStmt( location, {
+		new ast::ExprStmt( location,
+			new ast::UntypedExpr( location,
+				new ast::NameExpr( location, "unlock" ),
+				{
+					new ast::UntypedExpr( location,
+						new ast::NameExpr( location, "get_monitor" ),
+						{ new ast::VariableExpr( location,
+							InitTweak::getParamThis( unlock_decl ) ) }
+					)
+				}
+			)
+		)
+	} );
+
+	declsToAddAfter.push_back( lock_decl );
+	declsToAddAfter.push_back( unlock_decl );
+}
+
+
+// --------------------------------------------------------------------------
+struct SuspendKeyword final :
+		public ast::WithStmtsToAdd<>, public ast::WithGuards {
+	SuspendKeyword() = default;
+	virtual ~SuspendKeyword() = default;
+
+	void previsit( const ast::FunctionDecl * );
+	const ast::DeclWithType * postvisit( const ast::FunctionDecl * );
+	const ast::Stmt * postvisit( const ast::SuspendStmt * );
+
+private:
+	bool is_real_suspend( const ast::FunctionDecl * );
+
+	const ast::Stmt * make_generator_suspend( const ast::SuspendStmt * );
+	const ast::Stmt * make_coroutine_suspend( const ast::SuspendStmt * );
+
+	struct LabelPair {
+		ast::Label obj;
+		int idx;
+	};
+
+	LabelPair make_label(const ast::Stmt * stmt ) {
+		labels.push_back( ControlStruct::newLabel( "generator", stmt ) );
+		return { labels.back(), int(labels.size()) };
+	}
+
+	const ast::DeclWithType * in_generator = nullptr;
+	const ast::FunctionDecl * decl_suspend = nullptr;
+	std::vector<ast::Label> labels;
+};
+
+void SuspendKeyword::previsit( const ast::FunctionDecl * decl ) {
+	GuardValue( in_generator ); in_generator = nullptr;
+
+	// If it is the real suspend, grab it if we don't have one already.
+	if ( is_real_suspend( decl ) ) {
+		decl_suspend = decl_suspend ? decl_suspend : decl;
+		return;
+	}
+
+	// Otherwise check if this is a generator main and, if so, handle it.
+	auto param = isMainFor( decl, ast::AggregateDecl::Generator );
+	if ( !param ) return;
+
+	if ( 0 != decl->returns.size() ) {
+		SemanticError( decl->location, "Generator main must return void" );
+	}
+
+	in_generator = param;
+	GuardValue( labels ); labels.clear();
+}
+
+const ast::DeclWithType * SuspendKeyword::postvisit(
+		const ast::FunctionDecl * decl ) {
+	// Only modify a full definition of a generator with states.
+	if ( !decl->stmts || !in_generator || labels.empty() ) return decl;
+
+	const CodeLocation & location = decl->location;
+
+	// Create a new function body:
+	// static void * __generator_labels[] = {&&s0, &&s1, ...};
+	// void * __generator_label = __generator_labels[GEN.__generator_state];
+	// goto * __generator_label;
+	// s0: ;
+	// OLD_BODY
+
+	// This is the null statement inserted right before the body.
+	ast::NullStmt * noop = new ast::NullStmt( location );
+	noop->labels.push_back( ControlStruct::newLabel( "generator", noop ) );
+	const ast::Label & first_label = noop->labels.back();
+
+	// Add each label to the init, starting with the first label.
+	std::vector<ast::ptr<ast::Init>> inits = {
+		new ast::SingleInit( location,
+			new ast::LabelAddressExpr( location, copy( first_label ) ) ) };
+	// Then go through all the stored labels, and clear the store.
+	for ( auto && label : labels ) {
+		inits.push_back( new ast::SingleInit( label.location,
+			new ast::LabelAddressExpr( label.location, std::move( label )
+			) ) );
+	}
+	labels.clear();
+	// Then construct the initializer itself.
+	auto init = new ast::ListInit( location, std::move( inits ) );
+
+	ast::ObjectDecl * generatorLabels = new ast::ObjectDecl(
+		location,
+		"__generator_labels",
+		new ast::ArrayType(
+			new ast::PointerType( new ast::VoidType() ),
+			nullptr,
+			ast::FixedLen,
+			ast::DynamicDim
+		),
+		init,
+		ast::Storage::Classes(),
+		ast::Linkage::AutoGen
+	);
+
+	ast::ObjectDecl * generatorLabel = new ast::ObjectDecl(
+		location,
+		"__generator_label",
+		new ast::PointerType( new ast::VoidType() ),
+		new ast::SingleInit( location,
+			new ast::UntypedExpr( location,
+				new ast::NameExpr( location, "?[?]" ),
+				{
+					// TODO: Could be a variable expr.
+					new ast::NameExpr( location, "__generator_labels" ),
+					new ast::UntypedMemberExpr( location,
+						new ast::NameExpr( location, "__generator_state" ),
+						new ast::VariableExpr( location, in_generator )
+					)
+				}
+			)
+		),
+		ast::Storage::Classes(),
+		ast::Linkage::AutoGen
+	);
+
+	ast::BranchStmt * theGoTo = new ast::BranchStmt(
+		location, new ast::VariableExpr( location, generatorLabel )
+	);
+
+	// The noop goes here in order.
+
+	ast::CompoundStmt * body = new ast::CompoundStmt( location, {
+		{ new ast::DeclStmt( location, generatorLabels ) },
+		{ new ast::DeclStmt( location, generatorLabel ) },
+		{ theGoTo },
+		{ noop },
+		{ decl->stmts },
+	} );
+
+	auto mutDecl = ast::mutate( decl );
+	mutDecl->stmts = body;
+	return mutDecl;
+}
+
+const ast::Stmt * SuspendKeyword::postvisit( const ast::SuspendStmt * stmt ) {
+	switch ( stmt->type ) {
+	case ast::SuspendStmt::None:
+		// Use the context to determine the implicit target.
+		if ( in_generator ) {
+			return make_generator_suspend( stmt );
+		} else {
+			return make_coroutine_suspend( stmt );
+		}
+	case ast::SuspendStmt::Coroutine:
+		return make_coroutine_suspend( stmt );
+	case ast::SuspendStmt::Generator:
+		// Generator suspends must be directly in a generator.
+		if ( !in_generator ) SemanticError( stmt->location, "'suspend generator' must be used inside main of generator type." );
+		return make_generator_suspend( stmt );
+	}
+	assert( false );
+	return stmt;
+}
+
+/// Find the real/official suspend declaration.
+bool SuspendKeyword::is_real_suspend( const ast::FunctionDecl * decl ) {
+	return ( !decl->linkage.is_mangled
+		&& 0 == decl->params.size()
+		&& 0 == decl->returns.size()
+		&& "__cfactx_suspend" == decl->name );
+}
+
+const ast::Stmt * SuspendKeyword::make_generator_suspend(
+		const ast::SuspendStmt * stmt ) {
+	assert( in_generator );
+	// Target code is:
+	//   GEN.__generator_state = X;
+	//   THEN
+	//   return;
+	//   __gen_X:;
+
+	const CodeLocation & location = stmt->location;
+
+	LabelPair label = make_label( stmt );
+
+	// This is the context saving statement.
+	stmtsToAddBefore.push_back( new ast::ExprStmt( location,
+		new ast::UntypedExpr( location,
+			new ast::NameExpr( location, "?=?" ),
+			{
+				new ast::UntypedMemberExpr( location,
+					new ast::NameExpr( location, "__generator_state" ),
+					new ast::VariableExpr( location, in_generator )
+				),
+				ast::ConstantExpr::from_int( location, label.idx ),
+			}
+		)
+	) );
+
+	// The THEN component is conditional (return is not).
+	if ( stmt->then ) {
+		stmtsToAddBefore.push_back( stmt->then.get() );
+	}
+	stmtsToAddBefore.push_back( new ast::ReturnStmt( location, nullptr ) );
+
+	// The null statement replaces the old suspend statement.
+	return new ast::NullStmt( location, { label.obj } );
+}
+
+const ast::Stmt * SuspendKeyword::make_coroutine_suspend(
+		const ast::SuspendStmt * stmt ) {
+	// The only thing we need from the old statement is the location.
+	const CodeLocation & location = stmt->location;
+
+	if ( !decl_suspend ) {
+		SemanticError( location, "suspend keyword applied to coroutines requires coroutines to be in scope, add #include <coroutine.hfa>\n" );
+	}
+	if ( stmt->then ) {
+		SemanticError( location, "Compound statement following coroutines is not implemented." );
+	}
+
+	return new ast::ExprStmt( location,
+		new ast::UntypedExpr( location,
+			ast::VariableExpr::functionPointer( location, decl_suspend ) )
+	);
 }
 
@@ -251,5 +1088,5 @@
 				{
 					new ast::SingleInit( location,
-						new ast::AddressExpr(
+						new ast::AddressExpr( location,
 							new ast::VariableExpr( location, monitor ) ) ),
 					new ast::SingleInit( location,
@@ -564,8 +1401,12 @@
 
 // --------------------------------------------------------------------------
+// Interface Functions:
 
 void implementKeywords( ast::TranslationUnit & translationUnit ) {
-	(void)translationUnit;
-	assertf(false, "Apply Keywords not implemented." );
+	ast::Pass<ThreadKeyword>::run( translationUnit );
+	ast::Pass<CoroutineKeyword>::run( translationUnit );
+	ast::Pass<MonitorKeyword>::run( translationUnit );
+	ast::Pass<GeneratorKeyword>::run( translationUnit );
+	ast::Pass<SuspendKeyword>::run( translationUnit );
 }
 
Index: src/Validate/ForallPointerDecay.cpp
===================================================================
--- src/Validate/ForallPointerDecay.cpp	(revision eb3bc528997e6f7c0319bb7a75aa224152546e3f)
+++ src/Validate/ForallPointerDecay.cpp	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
@@ -70,9 +70,5 @@
 		AssertionList assertions;
 		// Substitute trait decl parameters for instance parameters.
-		ast::TypeSubstitution sub(
-			inst->base->params.begin(),
-			inst->base->params.end(),
-			inst->params.begin()
-		);
+		ast::TypeSubstitution sub( inst->base->params, inst->params );
 		for ( const ast::ptr<ast::Decl> & decl : inst->base->members ) {
 			ast::ptr<ast::DeclWithType> copy =
Index: src/Virtual/Tables.cc
===================================================================
--- src/Virtual/Tables.cc	(revision eb3bc528997e6f7c0319bb7a75aa224152546e3f)
+++ src/Virtual/Tables.cc	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
@@ -10,8 +10,15 @@
 // Created On       : Mon Aug 31 11:11:00 2020
 // Last Modified By : Andrew Beach
-// Last Modified On : Wed Apr 21 15:36:00 2021
-// Update Count     : 2
-//
-
+// Last Modified On : Fri Mar 11 10:40:00 2022
+// Update Count     : 3
+//
+
+#include "AST/Attribute.hpp"
+#include "AST/Copy.hpp"
+#include "AST/Decl.hpp"
+#include "AST/Expr.hpp"
+#include "AST/Init.hpp"
+#include "AST/Stmt.hpp"
+#include "AST/Type.hpp"
 #include <SynTree/Attribute.h>
 #include <SynTree/Declaration.h>
@@ -77,7 +84,31 @@
 }
 
+static ast::ObjectDecl * makeVtableDeclaration(
+		CodeLocation const & location, std::string const & name,
+		ast::StructInstType const * type, ast::Init const * init ) {
+	ast::Storage::Classes storage;
+	if ( nullptr == init ) {
+		storage.is_extern = true;
+	}
+	return new ast::ObjectDecl(
+		location,
+		name,
+		type,
+		init,
+		storage,
+		ast::Linkage::Cforall
+	);
+}
+
 ObjectDecl * makeVtableForward( std::string const & name, StructInstType * type ) {
 	assert( type );
 	return makeVtableDeclaration( name, type, nullptr );
+}
+
+ast::ObjectDecl * makeVtableForward(
+		CodeLocation const & location, std::string const & name,
+		ast::StructInstType const * vtableType ) {
+	assert( vtableType );
+	return makeVtableDeclaration( location, name, vtableType, nullptr );
 }
 
@@ -123,4 +154,70 @@
 }
 
+static std::vector<ast::ptr<ast::Init>> buildInits(
+		CodeLocation const & location,
+		//std::string const & name,
+		ast::StructInstType const * vtableType,
+		ast::Type const * objectType ) {
+	ast::StructDecl const * vtableStruct = vtableType->base;
+
+	std::vector<ast::ptr<ast::Init>> inits;
+	inits.reserve( vtableStruct->members.size() );
+
+	// This is designed to run before the resolver.
+	for ( auto field : vtableStruct->members ) {
+		if ( std::string( "parent" ) == field->name ) {
+			// This will not work with polymorphic state.
+			auto oField = field.strict_as<ast::ObjectDecl>();
+			auto fieldType = oField->type.strict_as<ast::PointerType>();
+			auto parentType = fieldType->base.strict_as<ast::StructInstType>();
+			std::string const & parentInstance = instanceName( parentType->name );
+			inits.push_back(
+					new ast::SingleInit( location, new ast::AddressExpr( new ast::NameExpr( location, parentInstance ) ) ) );
+		} else if ( std::string( "__cfavir_typeid" ) == field->name ) {
+			std::string const & baseType = baseTypeName( vtableType->name );
+			std::string const & typeId = typeIdName( baseType );
+			inits.push_back( new ast::SingleInit( location, new ast::AddressExpr( new ast::NameExpr( location, typeId ) ) ) );
+		} else if ( std::string( "size" ) == field->name ) {
+			inits.push_back( new ast::SingleInit( location, new ast::SizeofExpr( location, objectType )
+			) );
+		} else if ( std::string( "align" ) == field->name ) {
+			inits.push_back( new ast::SingleInit( location,
+				new ast::AlignofExpr( location, objectType )
+			) );
+		} else {
+			inits.push_back( new ast::SingleInit( location,
+				new ast::NameExpr( location, field->name )
+			) );
+		}
+		//ast::Expr * expr = buildInitExpr(...);
+		//inits.push_back( new ast::SingleInit( location, expr ) )
+	}
+
+	return inits;
+}
+
+ast::ObjectDecl * makeVtableInstance(
+		CodeLocation const & location,
+		std::string const & name,
+		ast::StructInstType const * vtableType,
+		ast::Type const * objectType,
+		ast::Init const * init ) {
+	assert( vtableType );
+	assert( objectType );
+
+	// Build the initialization.
+	if ( nullptr == init ) {
+		init = new ast::ListInit( location,
+			buildInits( location, vtableType, objectType ) );
+
+	// The provided init should initialize everything except the parent
+	// pointer, the size-of and align-of fields. These should be inserted.
+	} else {
+		// Except this is not yet supported.
+		assert(false);
+	}
+	return makeVtableDeclaration( location, name, vtableType, init );
+}
+
 namespace {
 	std::string const functionName = "get_exception_vtable";
@@ -140,5 +237,5 @@
 		new ReferenceType( noQualifiers, vtableType ),
 		nullptr,
-        { new Attribute("unused") }
+		{ new Attribute("unused") }
 	) );
 	type->parameters.push_back( new ObjectDecl(
@@ -157,4 +254,31 @@
 		type,
 		nullptr
+	);
+}
+
+ast::FunctionDecl * makeGetExceptionForward(
+		CodeLocation const & location,
+		ast::Type const * vtableType,
+		ast::Type const * exceptType ) {
+	assert( vtableType );
+	assert( exceptType );
+	return new ast::FunctionDecl(
+		location,
+		functionName,
+		{ /* forall */ },
+		{ new ast::ObjectDecl(
+			location,
+			"__unused",
+			new ast::PointerType( exceptType )
+		) },
+		{ new ast::ObjectDecl(
+			location,
+			"_retvalue",
+			new ast::ReferenceType( vtableType )
+		) },
+		nullptr,
+		ast::Storage::Classes(),
+		ast::Linkage::Cforall,
+		{ new ast::Attribute( "unused" ) }
 	);
 }
@@ -172,4 +296,17 @@
 }
 
+ast::FunctionDecl * makeGetExceptionFunction(
+		CodeLocation const & location,
+		ast::ObjectDecl const * vtableInstance, ast::Type const * exceptType ) {
+	assert( vtableInstance );
+	assert( exceptType );
+	ast::FunctionDecl * func = makeGetExceptionForward(
+			location, ast::deepCopy( vtableInstance->type ), exceptType );
+	func->stmts = new ast::CompoundStmt( location, {
+		new ast::ReturnStmt( location, new ast::VariableExpr( location, vtableInstance ) )
+	} );
+	return func;
+}
+
 ObjectDecl * makeTypeIdInstance( StructInstType const * typeIdType ) {
 	assert( typeIdType );
@@ -191,3 +328,26 @@
 }
 
-}
+ast::ObjectDecl * makeTypeIdInstance(
+		CodeLocation const & location,
+		ast::StructInstType const * typeIdType ) {
+	assert( typeIdType );
+	ast::StructInstType * type = ast::mutate( typeIdType );
+	type->set_const( true );
+	std::string const & typeid_name = typeIdTypeToInstance( typeIdType->name );
+	return new ast::ObjectDecl(
+		location,
+		typeid_name,
+		type,
+		new ast::ListInit( location, {
+			new ast::SingleInit( location,
+				new ast::AddressExpr( location,
+					new ast::NameExpr( location, "__cfatid_exception_t" ) ) )
+		} ),
+		ast::Storage::Classes(),
+		ast::Linkage::Cforall,
+		nullptr,
+		{ new ast::Attribute( "cfa_linkonce" ) }
+	);
+}
+
+}
Index: src/Virtual/Tables.h
===================================================================
--- src/Virtual/Tables.h	(revision eb3bc528997e6f7c0319bb7a75aa224152546e3f)
+++ src/Virtual/Tables.h	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
@@ -10,10 +10,12 @@
 // Created On       : Mon Aug 31 11:07:00 2020
 // Last Modified By : Andrew Beach
-// Last Modified On : Wed Apr 21 10:30:00 2021
-// Update Count     : 2
+// Last Modified On : Wed Dec  8 16:58:00 2021
+// Update Count     : 3
 //
 
 #include <list>  // for list
 
+#include <string>
+#include "AST/Fwd.hpp"
 class Declaration;
 class StructDecl;
@@ -35,4 +37,7 @@
  * vtableType node is consumed.
  */
+ast::ObjectDecl * makeVtableForward(
+	CodeLocation const & location, std::string const & name,
+	ast::StructInstType const * vtableType );
 
 ObjectDecl * makeVtableInstance(
@@ -43,4 +48,10 @@
  * vtableType and init (if provided) nodes are consumed.
  */
+ast::ObjectDecl * makeVtableInstance(
+	CodeLocation const & location,
+	std::string const & name,
+	ast::StructInstType const * vtableType,
+	ast::Type const * objectType,
+	ast::Init const * init = nullptr );
 
 // Some special code for how exceptions interact with virtual tables.
@@ -49,4 +60,8 @@
  * linking the vtableType to the exceptType. Both nodes are consumed.
  */
+ast::FunctionDecl * makeGetExceptionForward(
+	CodeLocation const & location,
+	ast::Type const * vtableType,
+	ast::Type const * exceptType );
 
 FunctionDecl * makeGetExceptionFunction(
@@ -55,4 +70,7 @@
  * exceptType node is consumed.
  */
+ast::FunctionDecl * makeGetExceptionFunction(
+	CodeLocation const & location,
+	ast::ObjectDecl const * vtableInstance, ast::Type const * exceptType );
 
 ObjectDecl * makeTypeIdInstance( StructInstType const * typeIdType );
@@ -60,4 +78,6 @@
  * TODO: Should take the parent type. Currently locked to the exception_t.
  */
+ast::ObjectDecl * makeTypeIdInstance(
+	const CodeLocation & location, ast::StructInstType const * typeIdType );
 
 }
Index: src/main.cc
===================================================================
--- src/main.cc	(revision eb3bc528997e6f7c0319bb7a75aa224152546e3f)
+++ src/main.cc	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
@@ -10,6 +10,6 @@
 // Created On       : Fri May 15 23:12:02 2015
 // Last Modified By : Andrew Beach
-// Last Modified On : Wed Jan 26 14:09:00 2022
-// Update Count     : 670
+// Last Modified On : Fri Mar 11 10:39:00 2022
+// Update Count     : 671
 //
 
@@ -333,9 +333,4 @@
 
 		if( useNewAST ) {
-			PASS( "Implement Concurrent Keywords", Concurrency::applyKeywords( translationUnit ) );
-			//PASS( "Forall Pointer Decay - A", SymTab::decayForallPointersA( translationUnit ) );
-			//PASS( "Forall Pointer Decay - B", SymTab::decayForallPointersB( translationUnit ) );
-			//PASS( "Forall Pointer Decay - C", SymTab::decayForallPointersC( translationUnit ) );
-			//PASS( "Forall Pointer Decay - D", SymTab::decayForallPointersD( translationUnit ) );
 			CodeTools::fillLocations( translationUnit );
 
@@ -347,4 +342,6 @@
 
 			forceFillCodeLocations( transUnit );
+
+			PASS( "Implement Concurrent Keywords", Concurrency::implementKeywords( transUnit ) );
 
 			// Must be after implement concurrent keywords; because uniqueIds
@@ -497,6 +494,4 @@
 			PASS( "Translate Tries" , ControlStruct::translateTries( translationUnit ) );
 		}
-
-		
 
 		PASS( "Gen Waitfor" , Concurrency::generateWaitFor( translationUnit ) );
Index: tests/Makefile.am
===================================================================
--- tests/Makefile.am	(revision eb3bc528997e6f7c0319bb7a75aa224152546e3f)
+++ tests/Makefile.am	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
@@ -66,6 +66,6 @@
 PRETTY_PATH=mkdir -p $(dir $(abspath ${@})) && cd ${srcdir} &&
 
-.PHONY: list .validate
-.INTERMEDIATE: .validate .validate.cfa
+.PHONY: list .validate .test_makeflags
+.INTERMEDIATE: .validate .validate.cfa .test_makeflags
 EXTRA_PROGRAMS = avl_test linkonce .dummy_hack # build but do not install
 EXTRA_DIST = test.py \
@@ -123,4 +123,7 @@
 	@+${TEST_PY} --list ${concurrent}
 
+.test_makeflags:
+	@echo "${MAKEFLAGS}"
+
 .validate: .validate.cfa
 	$(CFACOMPILE) .validate.cfa -fsyntax-only -Wall -Wextra -Werror
Index: tests/concurrent/.expect/keywordErrors.nast.txt
===================================================================
--- tests/concurrent/.expect/keywordErrors.nast.txt	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
+++ tests/concurrent/.expect/keywordErrors.nast.txt	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
@@ -0,0 +1,6 @@
+concurrent/keywordErrors.cfa:1:1 error: thread keyword requires threads to be in scope, add #include <thread.hfa>
+thread A with body
+
+concurrent/keywordErrors.cfa:6:1 error: thread keyword requires threads to be in scope, add #include <thread.hfa>
+thread B with body
+
Index: tests/concurrent/.expect/keywordErrors.oast.txt
===================================================================
--- tests/concurrent/.expect/keywordErrors.oast.txt	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
+++ tests/concurrent/.expect/keywordErrors.oast.txt	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
@@ -0,0 +1,6 @@
+concurrent/keywordErrors.cfa:1:1 error: thread keyword requires threads to be in scope, add #include <thread.hfa>
+thread A: with body 1
+
+concurrent/keywordErrors.cfa:6:1 error: thread keyword requires threads to be in scope, add #include <thread.hfa>
+thread B: with body 1
+
Index: tests/concurrent/.expect/keywordErrors.txt
===================================================================
--- tests/concurrent/.expect/keywordErrors.txt	(revision eb3bc528997e6f7c0319bb7a75aa224152546e3f)
+++ 	(revision )
@@ -1,6 +1,0 @@
-concurrent/keywordErrors.cfa:1:1 error: thread keyword requires threads to be in scope, add #include <thread.hfa>
-thread A: with body 1
-
-concurrent/keywordErrors.cfa:6:1 error: thread keyword requires threads to be in scope, add #include <thread.hfa>
-thread B: with body 1
-
Index: tests/concurrent/.expect/mainError.nast.txt
===================================================================
--- tests/concurrent/.expect/mainError.nast.txt	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
+++ tests/concurrent/.expect/mainError.nast.txt	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
@@ -0,0 +1,11 @@
+concurrent/mainError.cfa:1:1 error: thread keyword requires threads to be in scope, add #include <thread.hfa>
+thread Test with body
+
+concurrent/mainError.cfa:2:1 error: thread keyword requires threads to be in scope, add #include <thread.hfa>
+main: function
+... with parameters
+  reference to instance of struct Test with body
+... returning nothing
+ with body
+  Compound Statement:
+
Index: tests/concurrent/.expect/mainError.oast.txt
===================================================================
--- tests/concurrent/.expect/mainError.oast.txt	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
+++ tests/concurrent/.expect/mainError.oast.txt	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
@@ -0,0 +1,11 @@
+concurrent/mainError.cfa:1:1 error: thread keyword requires threads to be in scope, add #include <thread.hfa>
+thread Test: with body 1
+
+concurrent/mainError.cfa:2:1 error: thread keyword requires threads to be in scope, add #include <thread.hfa>
+main: function
+... with parameters
+  reference to instance of struct Test with body 1
+... returning nothing
+... with body
+  CompoundStmt
+
Index: tests/concurrent/mainError.cfa
===================================================================
--- tests/concurrent/mainError.cfa	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
+++ tests/concurrent/mainError.cfa	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
@@ -0,0 +1,2 @@
+thread Test {};
+void main(Test&) {}
Index: tests/io/.expect/away_fair.txt
===================================================================
--- tests/io/.expect/away_fair.txt	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
+++ tests/io/.expect/away_fair.txt	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
@@ -0,0 +1,12 @@
+starting
+100
+200
+300
+400
+500
+600
+700
+800
+900
+1000
+done
Index: tests/io/away_fair.cfa
===================================================================
--- tests/io/away_fair.cfa	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
+++ tests/io/away_fair.cfa	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
@@ -0,0 +1,106 @@
+//
+// Cforall Version 1.0.0 Copyright (C) 2022 University of Waterloo
+//
+// The contents of this file are covered under the licence agreement in the
+// file "LICENCE" distributed with Cforall.
+//
+// away_fair.cfa -- Test that spinning doesn't cause submissions to get stuck.
+//                  This test should work without io_uring but isn't very useful without it.
+//
+// Author           : Thierry Delisle
+// Created On       : Wed Mar 2 12:56:51 2022
+// Last Modified By :
+// Last Modified On :
+// Update Count     :
+//
+
+#include <bits/defs.hfa>
+#include <fstream.hfa>
+#include <kernel.hfa>
+#include <thread.hfa>
+#include <iofwd.hfa>
+
+Duration default_preemption() {
+	return 0;
+}
+
+enum { TIMES = 1000 };
+
+volatile unsigned counter = 0;
+
+// ----- Spinner -----
+// spins trying to prevent other threads from getting to this processor
+thread Spinner {};
+void ^?{}(Spinner &mutex ) {}
+void main(Spinner &) {
+	unsigned last = 0;
+	for() {
+		unsigned curr = __atomic_load_n(&counter, __ATOMIC_SEQ_CST);
+
+		if(curr >= TIMES) return;
+
+		if(last == curr) {
+			Pause();
+			continue;
+		}
+
+		last = curr;
+		yield();
+	}
+}
+
+// ----- Submitter -----
+// try to submit io but yield so that it's likely we are moved to the slow path
+thread Submitter {};
+void ^?{}(Submitter &mutex ) {}
+void main(Submitter & this) {
+	for(TIMES) {
+		#if CFA_HAVE_LINUX_IO_URING_H
+			io_future_t f;
+			struct io_uring_sqe * sqe;
+			__u32 idx;
+			struct $io_context * ctx = cfa_io_allocate(&sqe, &idx, 1);
+
+			zero_sqe(sqe);
+			sqe->opcode = IORING_OP_NOP;
+			sqe->user_data = (uintptr_t)&f;
+		#endif
+
+		yield( prng( this, 15 ) );
+
+		#if CFA_HAVE_LINUX_IO_URING_H
+			// Submit everything
+			asm volatile("": : :"memory");
+			cfa_io_submit( ctx, &idx, 1, false );
+		#endif
+
+		unsigned i = __atomic_add_fetch( &counter, 1, __ATOMIC_SEQ_CST );
+		if(0 == (i % 100)) sout | i;
+
+		#if CFA_HAVE_LINUX_IO_URING_H
+			wait( f );
+		#endif
+	}
+}
+
+// ----- Yielder -----
+// Add some chaos into the mix
+thread Yielder {};
+void ^?{}(Yielder &mutex ) {}
+void main(Yielder&) {
+	while(TIMES > __atomic_load_n(&counter, __ATOMIC_SEQ_CST)) {
+		yield();
+	}
+}
+
+
+int main() {
+	processor p;
+	sout | "starting";
+	{
+		Yielder y;
+		Spinner s;
+		Submitter io;
+	}
+	sout | "done";
+}
Index: tests/io/many_read.cfa
===================================================================
--- tests/io/many_read.cfa	(revision eb3bc528997e6f7c0319bb7a75aa224152546e3f)
+++ tests/io/many_read.cfa	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
@@ -5,5 +5,5 @@
 // file "LICENCE" distributed with Cforall.
 //
-// many_read.cfa -- Make sure that multiple concurrent reads to mess up.
+// many_read.cfa -- Make sure that multiple concurrent reads don't mess up.
 //
 // Author           : Thierry Delisle
Index: tests/pybin/settings.py
===================================================================
--- tests/pybin/settings.py	(revision eb3bc528997e6f7c0319bb7a75aa224152546e3f)
+++ tests/pybin/settings.py	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
@@ -155,4 +155,5 @@
 	global generating
 	global make
+	global make_jobfds
 	global output_width
 	global timeout
@@ -168,4 +169,5 @@
 	generating   = options.regenerate_expected
 	make         = ['make']
+	make_jobfds  = []
 	output_width = 24
 	timeout      = Timeouts(options.timeout, options.global_timeout)
@@ -177,8 +179,11 @@
 		os.putenv('DISTCC_LOG', os.path.join(BUILDDIR, 'distcc_error.log'))
 
-def update_make_cmd(force, jobs):
+def update_make_cmd(flags):
 	global make
-
-	make = ['make'] if not force else ['make', "-j%i" % jobs]
+	make = ['make', *flags]
+
+def update_make_fds(r, w):
+	global make_jobfds
+	make_jobfds = (r, w)
 
 def validate():
@@ -187,15 +192,9 @@
 	global distcc
 	distcc       = "DISTCC_CFA_PATH=~/.cfadistcc/%s/cfa" % tools.config_hash()
-	errf = os.path.join(BUILDDIR, ".validate.err")
-	make_ret, out = tools.make( ".validate", error_file = errf, output_file=subprocess.DEVNULL, error=subprocess.DEVNULL )
+	make_ret, out, err = tools.make( ".validate", output_file=subprocess.PIPE, error=subprocess.PIPE )
 	if make_ret != 0:
-		with open (errf, "r") as myfile:
-			error=myfile.read()
 		print("ERROR: Invalid configuration %s:%s" % (arch.string, debug.string), file=sys.stderr)
-		print("       verify returned : \n%s" % error, file=sys.stderr)
-		tools.rm(errf)
+		print("       verify returned : \n%s" % err, file=sys.stderr)
 		sys.exit(1)
-
-	tools.rm(errf)
 
 def prep_output(tests):
Index: tests/pybin/tools.py
===================================================================
--- tests/pybin/tools.py	(revision eb3bc528997e6f7c0319bb7a75aa224152546e3f)
+++ tests/pybin/tools.py	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
@@ -23,5 +23,5 @@
 
 # helper functions to run terminal commands
-def sh(*cmd, timeout = False, output_file = None, input_file = None, input_text = None, error = subprocess.STDOUT, ignore_dry_run = False):
+def sh(*cmd, timeout = False, output_file = None, input_file = None, input_text = None, error = subprocess.STDOUT, ignore_dry_run = False, pass_fds = []):
 	try:
 		cmd = list(cmd)
@@ -65,22 +65,23 @@
 				**({'input' : bytes(input_text, encoding='utf-8')} if input_text else {'stdin' : input_file}),
 				stdout  = output_file,
-				stderr  = error
+				stderr  = error,
+				pass_fds = pass_fds
 			) as proc:
 
 				try:
-					out, _ = proc.communicate(
+					out, errout = proc.communicate(
 						timeout = settings.timeout.single if timeout else None
 					)
 
-					return proc.returncode, out.decode("latin-1") if out else None
+					return proc.returncode, out.decode("latin-1") if out else None, errout.decode("latin-1") if errout else None
 				except subprocess.TimeoutExpired:
 					if settings.timeout2gdb:
 						print("Process {} timeout".format(proc.pid))
 						proc.communicate()
-						return 124, str(None)
+						return 124, str(None), "Subprocess Timeout 2 gdb"
 					else:
 						proc.send_signal(signal.SIGABRT)
 						proc.communicate()
-						return 124, str(None)
+						return 124, str(None), "Subprocess Timeout 2 gdb"
 
 	except Exception as ex:
@@ -105,7 +106,7 @@
 		return (False, "No file")
 
-	code, out = sh("file", fname, output_file=subprocess.PIPE)
+	code, out, err = sh("file", fname, output_file=subprocess.PIPE)
 	if code != 0:
-		return (False, "'file EXPECT' failed with code {}".format(code))
+		return (False, "'file EXPECT' failed with code {} '{}'".format(code, err))
 
 	match = re.search(".*: (.*)", out)
@@ -190,5 +191,5 @@
 	]
 	cmd = [s for s in cmd if s]
-	return sh(*cmd, output_file=output_file, error=error)
+	return sh(*cmd, output_file=output_file, error=error, pass_fds=settings.make_jobfds)
 
 def make_recon(target):
@@ -241,10 +242,10 @@
 # move a file
 def mv(source, dest):
-	ret, _ = sh("mv", source, dest)
+	ret, _, _ = sh("mv", source, dest)
 	return ret
 
 # cat one file into the other
 def cat(source, dest):
-	ret, _ = sh("cat", source, output_file=dest)
+	ret, _, _ = sh("cat", source, output_file=dest)
 	return ret
 
@@ -289,33 +290,141 @@
 #               system
 ################################################################################
+def jobserver_version():
+	make_ret, out, err = sh('make', '.test_makeflags', '-j2', output_file=subprocess.PIPE, error=subprocess.PIPE)
+	if make_ret != 0:
+		print("ERROR: cannot find Makefile jobserver version", file=sys.stderr)
+		print("       test returned : {} '{}'".format(make_ret, err), file=sys.stderr)
+		sys.exit(1)
+
+	re_jobs = re.search("--jobserver-(auth|fds)", out)
+	if not re_jobs:
+		print("ERROR: cannot find Makefile jobserver version", file=sys.stderr)
+		print("       MAKEFLAGS are : '{}'".format(out), file=sys.stderr)
+		sys.exit(1)
+
+	return "--jobserver-{}".format(re_jobs.group(1))
+
+def prep_recursive_make(N):
+	if N < 2:
+		return []
+
+	# create the pipe
+	(r, w) = os.pipe()
+
+	# fill it with N-1 tokens (N-1 rather than N because the calling process already holds one implicit job slot, see the manpage for make)
+	os.write(w, b'+' * (N - 1))
+
+	# prep the flags for make
+	make_flags = ["-j{}".format(N), "--jobserver-auth={},{}".format(r, w)]
+
+	# tell make about the pipes
+	os.environ["MAKEFLAGS"] = os.environ["MFLAGS"] = " ".join(make_flags)
+
+	# make sure we pass the pipes to our children
+	settings.update_make_fds(r, w)
+
+	return make_flags
+
+def prep_unlimited_recursive_make():
+	# prep the flags for make
+	make_flags = ["-j"]
+
+	# tell make to run with unlimited parallelism (no jobserver pipes in this case)
+	os.environ["MAKEFLAGS"] = os.environ["MFLAGS"] = "-j"
+
+	return make_flags
+
+
+def eval_hardware():
+	# we can create as many things as we want
+	# how much hardware do we have?
+	if settings.distribute:
+		# remote hardware is allowed
+		# how much do we have?
+		ret, jstr, _ = sh("distcc", "-j", output_file=subprocess.PIPE, ignore_dry_run=True)
+		return int(jstr.strip()) if ret == 0 else multiprocessing.cpu_count()
+	else:
+		# remote isn't allowed, use local cpus
+		return multiprocessing.cpu_count()
+
 # count number of jobs to create
-def job_count( options, tests ):
+def job_count( options ):
 	# check if the user already passed in a number of jobs for multi-threading
-	if not options.jobs:
-		make_flags = os.environ.get('MAKEFLAGS')
-		force = bool(make_flags)
-		make_jobs_fds = re.search("--jobserver-(auth|fds)=\s*([0-9]+),([0-9]+)", make_flags) if make_flags else None
-		if make_jobs_fds :
-			tokens = os.read(int(make_jobs_fds.group(2)), 1024)
-			options.jobs = len(tokens)
-			os.write(int(make_jobs_fds.group(3)), tokens)
-		else :
-			if settings.distribute:
-				ret, jstr = sh("distcc", "-j", output_file=subprocess.PIPE, ignore_dry_run=True)
-				if ret == 0:
-					options.jobs = int(jstr.strip())
-				else :
-					options.jobs = multiprocessing.cpu_count()
-			else:
-				options.jobs = multiprocessing.cpu_count()
+	make_env = os.environ.get('MAKEFLAGS')
+	make_flags = make_env.split() if make_env else None
+	jobstr = jobserver_version()
+
+	if options.jobs and make_flags:
+		print('WARNING: -j options should not be specified when called from Make', file=sys.stderr)
+
+	# Top level make is calling the shots, just follow
+	if make_flags:
+		# do we have -j and --jobserver-...
+		jobopt = None
+		exists_fds = None
+		for f in make_flags:
+			jobopt = f if f.startswith("-j") else jobopt
+			exists_fds = f if f.startswith(jobstr) else exists_fds
+
+		# do we have limited parallelism?
+		if exists_fds :
+			try:
+				rfd, wfd = tuple(exists_fds.split('=')[1].split(','))
+			except:
+				print("ERROR: jobserver has unrecognizable format, was '{}'".format(exists_fds), file=sys.stderr)
+				sys.exit(1)
+
+			# read the token pipe to count number of available tokens and restore the pipe
+			# this assumes the test suite script isn't invoked in parallel with something else
+			tokens = os.read(int(rfd), 65536)
+			os.write(int(wfd), tokens)
+
+			# the number of tokens is off by one for an obscure but well-documented reason
+			# see man make for more details
+			options.jobs = len(tokens) + 1
+
+		# do we have unlimited parallelism?
+		elif jobopt and jobopt != "-j1":
+			# check that this actually makes sense
+			if jobopt != "-j":
+				print("ERROR: -j option passed by make but no {}, was '{}'".format(jobstr, jobopt), file=sys.stderr)
+				sys.exit(1)
+
+			options.jobs = eval_hardware()
+			flags = prep_unlimited_recursive_make()
+
+
+		# then no parallelism
+		else:
+			options.jobs = 1
+
+		# keep all flags make passed along, except the weird 'w' which is about subdirectories
+		flags = [f for f in make_flags if f != 'w']
+
+	# Arguments are calling the shots, fake the top level make
+	elif options.jobs :
+
+		# make sure we have a valid number of jobs that corresponds to user input
+		if options.jobs < 0 :
+			print('ERROR: Invalid number of jobs', file=sys.stderr)
+			sys.exit(1)
+
+		flags = prep_recursive_make(options.jobs)
+
+	# Arguments are calling the shots, fake the top level make, but 0 is a special case
+	elif options.jobs == 0:
+		options.jobs = eval_hardware()
+		flags = prep_unlimited_recursive_make()
+
+	# No one says to run in parallel, then don't
 	else :
-		force = True
-
-	# make sure we have a valid number of jobs that corresponds to user input
-	if options.jobs <= 0 :
-		print('ERROR: Invalid number of jobs', file=sys.stderr)
-		sys.exit(1)
-
-	return min( options.jobs, len(tests) ), force
+		options.jobs = 1
+		flags = []
+
+	# Make sure we call make as expected
+	settings.update_make_cmd( flags )
+
+	# return the job count
+	return options.jobs
 
 # enable core dumps for all the test children
@@ -334,5 +443,5 @@
 	distcc_hash = os.path.join(settings.SRCDIR, '../tools/build/distcc_hash')
 	config = "%s-%s" % (settings.arch.target, settings.debug.path)
-	_, out = sh(distcc_hash, config, output_file=subprocess.PIPE, ignore_dry_run=True)
+	_, out, _ = sh(distcc_hash, config, output_file=subprocess.PIPE, ignore_dry_run=True)
 	return out.strip()
 
@@ -374,8 +483,12 @@
 
 	if not os.path.isfile(core):
-		return 1, "ERR No core dump (limit soft: {} hard: {})".format(*resource.getrlimit(resource.RLIMIT_CORE))
+		return 1, "ERR No core dump, expected '{}' (limit soft: {} hard: {})".format(core, *resource.getrlimit(resource.RLIMIT_CORE))
 
 	try:
-		return sh('gdb', '-n', path, core, '-batch', '-x', cmd, output_file=subprocess.PIPE)
+		ret, out, err = sh('gdb', '-n', path, core, '-batch', '-x', cmd, output_file=subprocess.PIPE)
+		if ret == 0:
+			return 0, out
+		else:
+			return 1, err
 	except:
 		return 1, "ERR Could not read core with gdb"
Index: tests/test.py
===================================================================
--- tests/test.py	(revision eb3bc528997e6f7c0319bb7a75aa224152546e3f)
+++ tests/test.py	(revision 510e6f9430710d9525c9d3b03d0d08e6f2fe2de9)
@@ -140,5 +140,5 @@
 	parser.add_argument('--regenerate-expected', help='Regenerate the .expect by running the specified tets, can be used with --all option', action='store_true')
 	parser.add_argument('--archive-errors', help='If called with a valid path, on test crashes the test script will copy the core dump and the executable to the specified path.', type=str, default='')
-	parser.add_argument('-j', '--jobs', help='Number of tests to run simultaneously', type=int)
+	parser.add_argument('-j', '--jobs', help='Number of tests to run simultaneously, 0 (default) for unlimited', nargs='?', const=0, type=int)
 	parser.add_argument('--list-comp', help='List all valide arguments', action='store_true')
 	parser.add_argument('--list-dist', help='List all tests for distribution', action='store_true')
@@ -195,5 +195,5 @@
 	# build, skipping to next test on error
 	with Timed() as comp_dur:
-		make_ret, _ = make( test.target(), output_file=subprocess.DEVNULL, error=out_file, error_file = err_file )
+		make_ret, _, _ = make( test.target(), output_file=subprocess.DEVNULL, error=out_file, error_file = err_file )
 
 	# ----------
@@ -208,5 +208,5 @@
 				if settings.dry_run or is_exe(exe_file):
 					# run test
-					retcode, _ = sh(exe_file, output_file=out_file, input_file=in_file, timeout=True)
+					retcode, _, _ = sh(exe_file, output_file=out_file, input_file=in_file, timeout=True)
 				else :
 					# simply cat the result into the output
@@ -226,5 +226,5 @@
 			else :
 				# fetch return code and error from the diff command
-				retcode, error = diff(cmp_file, out_file)
+				retcode, error, _ = diff(cmp_file, out_file)
 
 		else:
@@ -366,8 +366,8 @@
 			print(os.path.relpath(t.expect(), settings.SRCDIR), end=' ')
 			print(os.path.relpath(t.input() , settings.SRCDIR), end=' ')
-			code, out = make_recon(t.target())
+			code, out, err = make_recon(t.target())
 
 			if code != 0:
-				print('ERROR: recond failed for test {}'.format(t.target()), file=sys.stderr)
+				print('ERROR: recon failed for test {}: {} \'{}\''.format(t.target(), code, err), file=sys.stderr)
 				sys.exit(1)
 
@@ -417,4 +417,6 @@
 			if is_empty(t.expect()):
 				print('WARNING: test "{}" has empty .expect file'.format(t.target()), file=sys.stderr)
+
+	options.jobs = job_count( options )
 
 	# for each build configurations, run the test
@@ -430,9 +432,8 @@
 			local_tests = settings.ast.filter( tests )
 			local_tests = settings.arch.filter( local_tests )
-			options.jobs, forceJobs = job_count( options, local_tests )
-			settings.update_make_cmd(forceJobs, options.jobs)
 
 			# check the build configuration works
 			settings.validate()
+			jobs = min(options.jobs, len(local_tests))
 
 			# print configuration
@@ -440,5 +441,5 @@
 				'Regenerating' if settings.generating else 'Running',
 				len(local_tests),
-				options.jobs,
+				jobs,
 				settings.ast.string,
 				settings.arch.string,
@@ -450,5 +451,5 @@
 
 			# otherwise run all tests and make sure to return the correct error code
-			failed = run_tests(local_tests, options.jobs)
+			failed = run_tests(local_tests, jobs)
 			if failed:
 				if not settings.continue_:
