Index: doc/theses/colby_parsons_MMAth/text/actors.tex
===================================================================
--- doc/theses/colby_parsons_MMAth/text/actors.tex	(revision dc136d78a012caaea32981d24dfb812b9dcde436)
+++ doc/theses/colby_parsons_MMAth/text/actors.tex	(revision 14e105369536401baf2229cef802a60315acf849)
@@ -5,5 +5,4 @@
 % ======================================================================
 
-% C_TODO: add citations throughout chapter
 Actors are an indirect concurrent feature that abstracts threading away from a programmer, and instead provides \gls{actor}s and messages as building blocks for concurrency, where message passing means there is no shared data to protect, making actors amenable in a distributed environment.
 Actors are another message passing concurrency feature, similar to channels but with more abstraction, and are in the realm of \gls{impl_concurrency}, where programmers write concurrent code without dealing with explicit thread creation or interaction.
@@ -423,6 +422,4 @@
 Push operations are amortized $O(1)$ since pushes may cause doubling reallocations of the underlying dynamic-sized array (like \CC @vector@).
 
-% C_TODO: maybe make copy_queue diagram
-
 Since the copy queue is an array, envelopes are allocated first on the stack and then copied into the copy queue to persist until they are no longer needed.
 For many workload, the copy queues grow in size to facilitate the average number of messages in flight and there is no further dynamic allocations.
@@ -834,7 +831,6 @@
 In another example, if the average gulp size is very high, it could indicate that the executor could use more queue sharding.
 
-% C_TODO cite poison pill messages and add languages
 Another productivity feature that is included is a group of poison-pill messages.
-Poison-pill messages are common across actor systems, including Akka and ProtoActor \cite{}.
+Poison-pill messages are common across actor systems, and are used in actor libraries Akka and ProtoActor~\cite{Akka,ProtoActor}.
 Poison-pill messages inform an actor to terminate.
 In \CFA, due to the allocation of actors and lack of garbage collection, there needs to be a suite of poison-pills.
@@ -881,7 +877,7 @@
 	& \multicolumn{1}{c|}{\CFA (100M)} & \multicolumn{1}{c|}{CAF (10M)} & \multicolumn{1}{c|}{Akka (100M)} & \multicolumn{1}{c|}{\uC (100M)} & \multicolumn{1}{c@{}}{ProtoActor (100M)} \\
 	\hline
-	AMD		& \input{data/pykeSendStatic} \\
+	AMD		& \input{data/nasusSendStatic} \\
 	\hline
-	Intel	& \input{data/nasusSendStatic}
+	Intel	& \input{data/pykeSendStatic}
 \end{tabular}
 
@@ -894,7 +890,7 @@
 	& \multicolumn{1}{c|}{\CFA (20M)} & \multicolumn{1}{c|}{CAF (2M)} & \multicolumn{1}{c|}{Akka (2M)} & \multicolumn{1}{c|}{\uC (20M)} & \multicolumn{1}{c@{}}{ProtoActor (2M)} \\
 	\hline
-	AMD		& \input{data/pykeSendDynamic} \\
+	AMD		& \input{data/nasusSendDynamic} \\
 	\hline
-	Intel	& \input{data/nasusSendDynamic}
+	Intel	& \input{data/pykeSendDynamic}
 \end{tabular}
 \end{table}
@@ -916,4 +912,8 @@
 The results from the static/dynamic send benchmarks are shown in Figures~\ref{t:StaticActorMessagePerformance} and \ref{t:DynamicActorMessagePerformance} respectively.
 \CFA leads the charts in both benchmarks, largely due to the copy queue removing the majority of the envelope allocations.
+Additionally, the receive of all messages sent in \CFA is statically known and is determined via a function pointer cast, which incurs a compile-time cost.
+All the other systems use their virtual system to find the correct behaviour at message send.
+This requires two virtual dispatch operations, which is an additional runtime send cost that \CFA does not have.
+Note that Akka also statically checks message sends, but still uses their virtual system at runtime.
 In the static send benchmark all systems except CAF have static send costs that are in the same ballpark, only varying by ~70ns.
 In the dynamic send benchmark all systems experience slower message sends, as expected due to the extra allocations.
@@ -1084,6 +1084,7 @@
 
 Figure~\ref{t:ExecutorMemory} shows the high memory watermark of the actor systems when running the executor benchmark on 48 cores.
-\CFA has a high watermark relative to the other non-garbage collected systems \uC, and CAF.
+\CFA has a high watermark relative to the other non-garbage-collected systems, \uC and CAF.
 This is a result of the copy queue data structure, as it will over-allocate storage and not clean up eagerly, whereas the per envelope allocations will always allocate exactly the amount of storage needed.
+Despite having a higher watermark, the \CFA memory usage remains comparable to other non-garbage-collected systems.
 
 \subsection{Matrix Multiply}
Index: doc/theses/colby_parsons_MMAth/text/conclusion.tex
===================================================================
--- doc/theses/colby_parsons_MMAth/text/conclusion.tex	(revision 14e105369536401baf2229cef802a60315acf849)
+++ doc/theses/colby_parsons_MMAth/text/conclusion.tex	(revision 14e105369536401baf2229cef802a60315acf849)
@@ -0,0 +1,33 @@
+% ======================================================================
+% ======================================================================
+\chapter{Conclusion}\label{s:conclusion}
+% ======================================================================
+% ======================================================================
+This thesis presented a suite of safe and efficient concurrency tools that provide users with the means to write scalable programs in \CFA through many avenues.
+If users prefer the message passing paradigm of concurrency, \CFA now provides tools in the form of a performant actor system and channels.
+For shared memory concurrency the mutex statement provides a safe and easy-to-use interface for mutual exclusion.
+The waituntil statement provided by this work aids in writing concurrent programs in both the message passing and shared memory worlds of concurrency.
+Furthermore, no other language provides a synchronous multiplexing tool polymorphic over resources like \CFA's waituntil.
+From the novel copy queue data structure in the actor system, to the plethora of user-supporting safety features, all these utilities build upon existing tools with value added.
+
+\section{Future Work}
+\subsection{Further Implicit Concurrency}
+This thesis scratches the surface of implicit concurrency by providing an actor system.
+There is room for more implicit concurrency tools in \CFA.
+User-defined implicit concurrency in the form of annotated loops or recursive functions exists in many other languages~\cite{} and could be implemented and expanded on in \CFA.
+Additionally, automatic parallelization of sequential programs by the compiler is an interesting research space that other languages have approached~\cite{} and that could also be explored in \CFA.
+
+
+\subsection{Synchronously Multiplexing System Calls}
+There are many tools that try to synchronize on or asynchronously check I/O, since improvements in this area pay dividends in many areas of computer science~\cite{}. %cite all the poll/iouring utilities
+Research on improving user-space tools to synchronize over I/O and other system calls is an interesting area to explore in the world of concurrent tooling.
+
+\subsection{Better Atomics}
+When writing low-level concurrent programs, especially lock/wait-free programs, low-level atomic instructions need to be used.
+In C, the gcc-builtin atomics~\cite{} are commonly used, but leave much to be desired.
+Some of the problems include the following.
+Archaic and opaque macros often have to be used to ensure that atomic assembly is generated instead of locks.
+The builtins are polymorphic, but not type safe since they use void pointers.
+The semantics and safety of these builtins require careful navigation, since the user must understand concurrent memory-ordering models well enough to choose the correct ordering flags.
+Furthermore, these atomics also often require a user to understand how to fence appropriately to ensure correctness.
+All these problems and more could benefit from language support, and adding said language support in \CFA could constitute a great research contribution, and allow for easier writing of low-level safe concurrent code.
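The ordering-flag burden described in the Better Atomics subsection can be sketched in C with the GCC @__atomic@ builtins. This is only an illustrative sketch: the shared @counter@ and the helper names (@counter_add@, @counter_read@, @counter_cas@, @counter_demo@) are hypothetical and do not come from the thesis, and the memory orders chosen here are one plausible selection rather than a recommendation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shared counter, used only for illustration. */
static uint64_t counter = 0;

/* Atomically add delta and return the new value.
   The ordering is an explicit flag the programmer must choose. */
uint64_t counter_add(uint64_t delta) {
    return __atomic_add_fetch(&counter, delta, __ATOMIC_SEQ_CST);
}

/* Read the counter with acquire ordering. */
uint64_t counter_read(void) {
    return __atomic_load_n(&counter, __ATOMIC_ACQUIRE);
}

/* Try to change the counter from expected to desired.
   Two separate ordering flags are needed: one for success, one for failure. */
bool counter_cas(uint64_t expected, uint64_t desired) {
    return __atomic_compare_exchange_n(&counter, &expected, desired,
                                       false /* strong variant */,
                                       __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
}

/* Single-threaded usage example; assumes counter is still 0 on entry. */
void counter_demo(void) {
    assert(counter_add(5) == 5);
    assert(counter_cas(5, 9));   /* succeeds: counter was 5 */
    assert(!counter_cas(5, 1));  /* fails: counter is now 9 */
}
```

Even in this small sketch, each call carries one or two ordering flags, and nothing in the builtin signatures prevents a mismatched choice; this is the kind of burden that language-level support could reduce.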
Index: doc/theses/colby_parsons_MMAth/text/waituntil.tex
===================================================================
--- doc/theses/colby_parsons_MMAth/text/waituntil.tex	(revision dc136d78a012caaea32981d24dfb812b9dcde436)
+++ doc/theses/colby_parsons_MMAth/text/waituntil.tex	(revision 14e105369536401baf2229cef802a60315acf849)
@@ -80,5 +80,5 @@
 This enables fully expressive \gls{synch_multiplex} predicates.
 
-There are many other languages that provide \gls{synch_multiplex}, including Rust's @select!@ over futures~\cite{rust:select}, OCaml's @select@ over channels~\cite{ocaml:channe}, and C++14's @when_any@ over futures~\cite{cpp:whenany}.
+There are many other languages that provide \gls{synch_multiplex}, including Rust's @select!@ over futures~\cite{rust:select}, OCaml's @select@ over channels~\cite{ocaml:channel}, and C++14's @when_any@ over futures~\cite{cpp:whenany}.
 Note that while C++14 and Rust provide \gls{synch_multiplex}, their implemetations leave much to be desired as they both rely on busy-waiting polling to wait on multiple resources.
 
@@ -99,4 +99,6 @@
 All of the \gls{synch_multiplex} features mentioned so far are monomorphic, only supporting one resource to wait on, select(2) supports file descriptors, Go's select supports channel operations, \uC's select supports futures, and Ada's select supports monitor method calls.
 The waituntil statement in \CFA is polymorphic and provides \gls{synch_multiplex} over any objects that satisfy the trait in Figure~\ref{f:wu_trait}.
+No other language provides a synchronous multiplexing tool polymorphic over resources like \CFA's waituntil.
+All others tie themselves to a specific type of resource.
 
 \begin{figure}
@@ -370,18 +372,173 @@
 
 \subsection{Channel Benchmark}
-The channel microbenchmark compares \CFA's waituntil and Go's select, where the resource being waited on is a set of channels.
-
-%C_TODO explain benchmark
-
-%C_TODO show results
-
-%C_TODO discuss results
+The channel multiplexing microbenchmarks compare \CFA's waituntil and Go's select, where the resource being waited on is a set of channels.
+The basic structure of the microbenchmark has the number of cores split evenly between producer and consumer threads, \ie, with 8 cores there would be 4 producer threads and 4 consumer threads.
+The number of clauses @C@ is also varied, with results shown with 2, 4, and 8 clauses.
+Each clause has a respective channel that it operates on.
+Each producer and consumer repeatedly waits to either produce or consume from one of the @C@ clauses and respective channels.
+An example in \CFA syntax of the work loop in the consumer main with @C = 4@ clauses follows.
+
+\begin{cfa}
+    for (;;)
+        waituntil( val << chans[0] ) {} or waituntil( val << chans[1] ) {} 
+        or waituntil( val << chans[2] ) {} or waituntil( val << chans[3] ) {}
+\end{cfa}
+A successful consumption is counted as a channel operation, and the throughput of these operations is measured over 10 seconds.
+The first microbenchmark measures throughput of the producers and consumers synchronously waiting on the channels, and the second has the threads asynchronously wait on the channels.
+The results are shown in Figures~\ref{f:select_contend_bench} and~\ref{f:select_spin_bench} respectively.
+
+\begin{figure}
+	\centering
+    \captionsetup[subfloat]{labelfont=footnotesize,textfont=footnotesize}
+	\subfloat[AMD]{
+		\resizebox{0.5\textwidth}{!}{\input{figures/nasus_Contend_2.pgf}}
+	}
+	\subfloat[Intel]{
+		\resizebox{0.5\textwidth}{!}{\input{figures/pyke_Contend_2.pgf}}
+	}
+    \bigskip
+
+	\subfloat[AMD]{
+		\resizebox{0.5\textwidth}{!}{\input{figures/nasus_Contend_4.pgf}}
+	}
+	\subfloat[Intel]{
+		\resizebox{0.5\textwidth}{!}{\input{figures/pyke_Contend_4.pgf}}
+	}
+    \bigskip
+
+	\subfloat[AMD]{
+		\resizebox{0.5\textwidth}{!}{\input{figures/nasus_Contend_8.pgf}}
+	}
+	\subfloat[Intel]{
+		\resizebox{0.5\textwidth}{!}{\input{figures/pyke_Contend_8.pgf}}
+	}
+	\caption{The channel synchronous multiplexing benchmark comparing Go select and \CFA waituntil statement throughput (higher is better).}
+	\label{f:select_contend_bench}
+\end{figure}
+
+\begin{figure}
+	\centering
+    \captionsetup[subfloat]{labelfont=footnotesize,textfont=footnotesize}
+	\subfloat[AMD]{
+		\resizebox{0.5\textwidth}{!}{\input{figures/nasus_Spin_2.pgf}}
+	}
+	\subfloat[Intel]{
+		\resizebox{0.5\textwidth}{!}{\input{figures/pyke_Spin_2.pgf}}
+	}
+    \bigskip
+
+	\subfloat[AMD]{
+		\resizebox{0.5\textwidth}{!}{\input{figures/nasus_Spin_4.pgf}}
+	}
+	\subfloat[Intel]{
+		\resizebox{0.5\textwidth}{!}{\input{figures/pyke_Spin_4.pgf}}
+	}
+    \bigskip
+
+	\subfloat[AMD]{
+		\resizebox{0.5\textwidth}{!}{\input{figures/nasus_Spin_8.pgf}}
+	}
+	\subfloat[Intel]{
+		\resizebox{0.5\textwidth}{!}{\input{figures/pyke_Spin_8.pgf}}
+	}
+	\caption{The asynchronous multiplexing channel benchmark comparing Go select and \CFA waituntil statement throughput (higher is better).}
+	\label{f:select_spin_bench}
+\end{figure}
+
+Both Figures~\ref{f:select_contend_bench} and~\ref{f:select_spin_bench} have similar results when comparing @select@ and @waituntil@.
+In the AMD benchmarks, the performance is very similar as the number of cores scales.
+The AMD machine has been observed to have higher caching contention cost, which creates a bottleneck on the channel locks and results in similar scaling between \CFA and Go.
+At low core counts, Go has significantly better performance, which is likely due to an optimization in their scheduler.
+Go heavily optimizes thread handoffs on their local runqueue, which can result in very good performance for low numbers of threads that are parking/unparking each other~\cite{go:sched}.
+In the Intel benchmarks, \CFA performs better than Go as both the number of cores and the number of clauses increase.
+This is likely due to Go's implementation choice of acquiring all channel locks when registering and unregistering channels on a @select@.
+Go then has to hold a lock for every channel, so it follows that this results in worse performance as the number of channels increases.
+In \CFA, since races are consolidated without holding all locks, it scales much better both with cores and clauses since more work can occur in parallel.
+This scalability difference is more significant on the Intel machine than the AMD machine since the Intel machine has been observed to have lower cache contention costs.
+
+The Go approach of holding all internal channel locks in the select has some additional drawbacks.
+This approach results in some pathological cases where Go's system throughput on channels can greatly suffer.
+Consider the case where there are two channels, @A@ and @B@.
+There are both a producer thread and a consumer thread, @P1@ and @C1@, selecting both @A@ and @B@.
+Additionally, there is another producer and another consumer thread, @P2@ and @C2@, that are both operating solely on @B@.
+Compared to \CFA this setup results in significantly worse performance since @P2@ and @C2@ cannot operate in parallel with @P1@ and @C1@ due to all locks being acquired.
+This case may not be as pathological as it may seem.
+If the set of channels belonging to one select overlaps with the set of another select, the two selects lose the ability to operate in parallel.
+The implementation in \CFA only ever holds a single lock at a time, resulting in better locking granularity.
+Comparison of this pathological case is shown in Table~\ref{t:pathGo}.
+The AMD results highlight the worst case scenario for Go since contention is more costly on this machine than the Intel machine.
+
+\begin{table}[t]
+\centering
+\setlength{\extrarowheight}{2pt}
+\setlength{\tabcolsep}{5pt}
+
+\caption{Throughput (channel operations per second) of \CFA and Go for a pathologically bad case for contention in Go's select implementation}
+\label{t:pathGo}
+\begin{tabular}{*{5}{r|}r}
+    & \multicolumn{1}{c|}{\CFA} & \multicolumn{1}{c@{}}{Go} \\
+    \hline
+    AMD		& \input{data/nasus_Order} \\
+    \hline
+    Intel	& \input{data/pyke_Order}
+\end{tabular}
+\end{table}
+
+Another difference between Go and \CFA is the order of clause selection when multiple clauses are available.
+Go ``randomly'' selects a clause, but \CFA chooses clauses in the order they are listed~\cite{go:select}.
+This \CFA design decision allows users to set implicit priorities, which can result in more predictable behaviour, and even better performance in certain cases, such as the case shown in Table~\ref{}.
+If \CFA did not have priorities, the performance difference in Table~\ref{} would be less significant since @P1@ and @C1@ would try to compete to operate on @B@ more often with random selection.
 
 \subsection{Future Benchmark}
 The future benchmark compares \CFA's waituntil with \uC's @_Select@, with both utilities waiting on futures.
-
-%C_TODO explain benchmark
-
-%C_TODO show results
-
-%C_TODO discuss results
+Both \CFA's @waituntil@ and \uC's @_Select@ have very similar semantics; however, @_Select@ can only wait on futures, whereas @waituntil@ is polymorphic.
+They both support @and@ and @or@ operators, but the underlying implementation of the operators differs between @waituntil@ and @_Select@.
+The @waituntil@ statement checks for statement completion using a predicate function, whereas the @_Select@ statement maintains a tree that represents the state of the internal predicate.
+
+\begin{figure}
+	\centering
+	\subfloat[AMD Future Synchronization Benchmark]{
+		\resizebox{0.5\textwidth}{!}{\input{figures/nasus_Future.pgf}}
+		\label{f:futureAMD}
+	}
+	\subfloat[Intel Future Synchronization Benchmark]{
+		\resizebox{0.5\textwidth}{!}{\input{figures/pyke_Future.pgf}}
+		\label{f:futureIntel}
+	}
+	\caption{\CFA waituntil and \uC \_Select statement throughput synchronizing on a set of futures with varying wait predicates (higher is better).}
+    \caption{}
+	\label{f:futurePerf}
+\end{figure}
+
+This microbenchmark aims to measure the impact of various predicates on the performance of the @waituntil@ and @_Select@ statements.
+This benchmark and section do not try to directly compare the @waituntil@ and @_Select@ statements since the performance of futures in \CFA and \uC differs by a significant margin, making them incomparable.
+Results of this benchmark are shown in Figure~\ref{f:futurePerf}.
+Each set of columns is marked with a name representing the predicate for that set of columns.
+The predicate names and corresponding waituntil statements are shown below:
+
+\begin{cfa}
+#ifdef OR
+waituntil( A ) { get( A ); }
+or waituntil( B ) { get( B ); }
+or waituntil( C ) { get( C ); }
+#endif
+#ifdef AND
+waituntil( A ) { get( A ); }
+and waituntil( B ) { get( B ); }
+and waituntil( C ) { get( C ); }
+#endif
+#ifdef ANDOR
+waituntil( A ) { get( A ); }
+and waituntil( B ) { get( B ); }
+or waituntil( C ) { get( C ); }
+#endif
+#ifdef ORAND
+(waituntil( A ) { get( A ); }
+or waituntil( B ) { get( B ); }) // brackets create higher precedence for or
+and waituntil( C ) { get( C ); }
+#endif
+\end{cfa}
+
+In Figure~\ref{f:futurePerf}, the @OR@ column for \CFA is more performant than the other \CFA predicates, likely due to the special-casing of waituntil statements with only @or@ operators.
+For both \uC and \CFA the @AND@ column is the least performant, which is expected since all three futures need to be fulfilled for each statement completion, unlike any of the other operators.
+Interestingly, \CFA has lower variation across predicates on the AMD (excluding the special OR case), whereas \uC has lower variation on the Intel.
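The four benchmark predicates can be read as boolean functions over the completion status of futures @A@, @B@, and @C@. The following C sketch is an illustration only and is not the \CFA or \uC implementation: the @clause_status@ type and the @pred_*@ names are hypothetical, and it assumes that @and@ binds tighter than @or@, as the bracketing comment in the @ORAND@ case suggests.

```c
#include <stdbool.h>

/* Hypothetical completion flags for the three futures A, B, and C. */
typedef struct {
    bool a, b, c;
} clause_status;

/* OR: any single fulfilled future completes the statement. */
bool pred_or(const clause_status *s) { return s->a || s->b || s->c; }

/* AND: all three futures must be fulfilled. */
bool pred_and(const clause_status *s) { return s->a && s->b && s->c; }

/* ANDOR: A and B or C, assuming "and" has higher precedence than "or". */
bool pred_andor(const clause_status *s) { return (s->a && s->b) || s->c; }

/* ORAND: (A or B) and C, with explicit brackets around the "or". */
bool pred_orand(const clause_status *s) { return (s->a || s->b) && s->c; }
```

Under this reading, only the @AND@ predicate requires all three futures on every completion, which is consistent with it being the slowest column for both systems.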
