Index: doc/theses/colby_parsons_MMAth/text/waituntil.tex
===================================================================
--- doc/theses/colby_parsons_MMAth/text/waituntil.tex	(revision d4c1d1f51797f1bdf08c6cfa4ebd67c513877cfc)
+++ doc/theses/colby_parsons_MMAth/text/waituntil.tex	(revision 1ae3f1852e2160c2258f0f9f108b1d6c566000aa)
@@ -24,5 +24,6 @@
 Waiting for the first appropriate stall (resource) that becomes available without spinning is an example of \gls{synch_multiplex}: the ability to wait synchronously for one or more resources based on some selection criteria.
 
-\section{History of Synchronous Multiplexing}
+\section{History of Synchronous Multiplexing}\label{s:History}
+
 There is a history of tools that provide \gls{synch_multiplex}.
 Some well known \gls{synch_multiplex} tools include Unix system utilities: @select@~\cite{linux:select}, @poll@~\cite{linux:poll}, and @epoll@~\cite{linux:epoll}, and the @select@ statement provided by Go~\cite{go:selectref}, Ada~\cite[\S~9.7]{Ada16}, and \uC~\cite[\S~3.3.1]{uC++}.
@@ -560,11 +561,11 @@
 \multicolumn{2}{@{}l@{}}{\lstinline{channel A, B; // zero size channels}} \\
 thread 1 & thread 2 \\
-\begin{cfa}
+\begin{cfa}[moredelim={**[is][\color{blue}]{\#}{\#}}]
 waituntil( @i << A@ ) {}
-or waituntil( @i << B@ ) {}
+or waituntil( #i << B# ) {}
 \end{cfa}
 &
-\begin{cfa}
-waituntil( @B << 2}@ ) {}
+\begin{cfa}[moredelim={**[is][\color{blue}]{\#}{\#}}]
+waituntil( #B << 2# ) {}
 or waituntil( @A << 1@ ) {}
 \end{cfa}
@@ -620,8 +621,8 @@
 In the following example, the code blocks run once their corresponding predicate inside the round braces is satisfied.
 % C_TODO put this is uC++ code style not cfa-style
-\begin{lstlisting}[language=uC++]
+\begin{lstlisting}[language=uC++,{moredelim=**[is][\color{red}]{@}{@}}]
 Future_ISM<int> A, B, C, D;
-_Select( A || B && C ) { ... }
-and _Select( D && E ) { ... }
+_Select( @A || B && C@ ) { ... }
+and _Select( @D && E@ ) { ... }
 \end{lstlisting}
 This is more expressive that the @waituntil@ statement in \CFA, allowing the code block for @&&@ to only run after both resources are available.
@@ -699,20 +700,20 @@
 \section{Waituntil Performance}
 
-Similar facilities to @waituntil@ are discussed at the start of this chapter in C, Ada, Rust, \CC, and OCaml.
+Similar facilities to @waituntil@ are discussed in Section~\ref{s:History} covering C, Ada, Rust, \CC, and OCaml.
 However, these facilities are either not meaningful or feasible to benchmark against.
 The UNIX @select@ and related utilities are not comparable since they are system calls that go into the kernel and operate on file descriptors, whereas the @waituntil@ exists solely in user space.
-Ada's @select@ only operates on methods, which is done in \CFA via the @waitfor@ utility so it is not meaningful to benchmark against the @waituntil@, which cannot wait on the same resource.
-Rust and \CC only offer a busy-wait based approach, which is not comparable to a blocking approach.
+Ada's \lstinline[language=Ada]{select} and \uC's \lstinline[language=uC++]{_Accept} only operates on method calls, which is done in \CFA via the @waitfor@ statement, so it is not meaningful to benchmark against the @waituntil@, which cannot wait on this resource.
+Rust and \CC only offer a busy-wait approach, which is not comparable to a blocking approach.
 OCaml's @select@ waits on channels that are not comparable with \CFA and Go channels, so OCaml @select@ is not benchmarked against Go's @select@ and \CFA's @waituntil@.
 
-The two \gls{synch_multiplex} utilities that are in the realm of comparability with the \CFA @waituntil@ statement are the Go @select@ statement and the \uC @_Select@ statement.
-As such, two microbenchmarks are presented, one for Go and one for \uC to contrast the systems.
-Given the differences in features, polymorphism, and expressibility between @waituntil@ and @select@, and @_Select@, the aim of the microbenchmarking in this chapter is to show that these implementations lie in the same realm of performance, not to pick a winner.
+The two \gls{synch_multiplex} utilities that are in the realm of comparability with the \CFA @waituntil@ statement are the Go \lstinline[language=Go]{select} statement and the \uC \lstinline[language=uC++]{_Select} statement.
+As such, two microbenchmarks are presented, one for Go and one for \uC to contrast this feature.
+Given the differences in features, polymorphism, and expressibility between @waituntil@ and \lstinline[language=Go]{select}, and \uC \lstinline[language=uC++]{_Select}, the aim of the microbenchmarking in this chapter is to show that these implementations lie in the same realm of performance, not to pick a winner.
 
 \subsection{Channel Benchmark}
 
-The channel multiplexing microbenchmarks compare \CFA's @waituntil@ and Go's \lstinline[language=Go]{select}, where the resource being waited on is a set of channels.
-The basic structure of the microbenchmark has the number of cores split evenly between producer and consumer threads, \ie, with 8 cores there are 4 producer and 4 consumer threads.
-The number of resource clauses $C$ is also varied across 2, 4, and 8 clauses, where each clause has different channel that is waits on.
+The channel multiplexing benchmarks compare \CFA's @waituntil@ and Go's \lstinline[language=Go]{select}, where the resource being waited on is a set of channels.
+The basic structure of the benchmark has the number of cores split evenly between producer and consumer threads, \ie, with 8 cores there are 4 producer and 4 consumer threads.
+The number of resource clauses $C$ is also varied across 2, 4, and 8 clauses, where each clause has a different channel that is waits on.
 Each producer and consumer repeatedly waits to either produce or consume from one of the $C$ clauses and respective channels.
 For example, in \CFA syntax, the work loop in the consumer main with $C = 4$ clauses is:
@@ -723,5 +724,5 @@
 \end{cfa}
 A successful consumption is counted as a channel operation, and the throughput of these operations is measured over 10 seconds.
-The first microbenchmark measures throughput of the producers and consumer synchronously waiting on the channels and the second has the threads asynchronously wait on the channels.
+The first benchmark measures throughput of the producers and consumer synchronously waiting on the channels and the second has the threads asynchronously wait on the channels using the Go @default@ and \CFA @else@ clause.
 The results are shown in Figures~\ref{f:select_contend_bench} and~\ref{f:select_spin_bench} respectively.
 
@@ -784,24 +785,47 @@
 \end{figure}
 
-Both Figures~\ref{f:select_contend_bench} and~\ref{f:select_spin_bench} have similar results when comparing @select@ and @waituntil@.
-In the AMD benchmarks, the performance is very similar as the number of cores scale.
-The AMD machine has been observed to have higher caching contention cost, which creates on a bottleneck on the channel locks, which results in similar scaling between \CFA and Go.
-At low cores, Go has significantly better performance, which is likely due to an optimization in their scheduler.
-Go heavily optimizes thread handoffs on their local run-queue, which can result in very good performance for low numbers of threads which are parking/unparking each other~\cite{go:sched}.
-In the Intel benchmarks, \CFA performs better than Go as the number of cores scale and as the number of clauses scale.
-This is likely due to Go's implementation choice of acquiring all channel locks when registering and unregistering channels on a @select@.
-Go then has to hold a lock for every channel, so it follows that this results in worse performance as the number of channels increase.
+Both Figures~\ref{f:select_contend_bench} and~\ref{f:select_spin_bench} have similar results when comparing \lstinline[language=Go]{select} and @waituntil@.
+In the AMD benchmarks (left column), the performance is very similar as the number of cores scale.
+The AMD machine has a high-caching contention cost because of its \emph{chicklet} L3 cache (\ie many L3 caches servicing a small number of cores), which creates a bottleneck on the channel locks and dominates the shape of the performance curve for both \CFA and Go.
+Hence, it is difficult to detect differences in the \gls{synch_multiplex}, except at low cores, where Go has significantly better performance, due to an optimization in its scheduler.
+Go heavily optimizes thread handoffs on the local run-queue, which can result in very good performance for low numbers of threads parking/unparking each other~\cite{go:sched}.
+In the Intel benchmarks (right column), \CFA performs better than Go as the number of cores scales past 2/4 and as the number of clauses increase.
+This difference is due to Go's acquiring all channel locks when registering and unregistering channels on a \lstinline[language=Go]{select}.
+Go then is holding a lock for every channel, resulting in worse performance as the number of channels increase.
 In \CFA, since races are consolidated without holding all locks, it scales much better both with cores and clauses since more work can occur in parallel.
-This scalability difference is more significant on the Intel machine than the AMD machine since the Intel machine has been observed to have lower cache contention costs.
-
-The Go approach of holding all internal channel locks in the select has some additional drawbacks.
-This approach results in some pathological cases where Go's system throughput on channels can greatly suffer.
-Consider the case where there are two channels, @A@ and @B@.
-There are both a producer thread and a consumer thread, @P1@ and @C1@, selecting both @A@ and @B@.
-Additionally, there is another producer and another consumer thread, @P2@ and @C2@, that are both operating solely on @B@.
-Compared to \CFA this setup results in significantly worse performance since @P2@ and @C2@ cannot operate in parallel with @P1@ and @C1@ due to all locks being acquired.
-This case may not be as pathological as it may seem.
-If the set of channels belonging to a select have channels that overlap with the set of another select, they lose the ability to operate on their select in parallel.
-The implementation in \CFA only ever holds a single lock at a time, resulting in better locking granularity.
+This scalability difference is more significant on the Intel machine than the AMD machine since the Intel machine has lower cache contention costs.
+
+The Go approach of holding all internal channel-locks in the \lstinline[language=Go]{select} has additional drawbacks.
+There are pathological cases where Go's throughput has significant jitter.
+Consider a producer and consumer thread, @P1@ and @C1@, selecting from both channels @A@ and @B@.
+\begin{cquote}
+\begin{tabular}{@{}ll@{}}
+@P1@ & @C1@ \\
+\begin{cfa}
+waituntil( A << i ); or waituntil( B << i );
+\end{cfa}
+&
+\begin{cfa}
+waituntil( val << A ); or waituntil( val << B );
+\end{cfa}
+\end{tabular}
+\end{cquote}
+Additionally, there is another producer and consumer thread, @P2@ and @C2@, operating solely on @B@.
+\begin{cquote}
+\begin{tabular}{@{}ll@{}}
+@P2@ & @C2@ \\
+\begin{cfa}
+B << val;
+\end{cfa}
+&
+\begin{cfa}
+val << B;
+\end{cfa}
+\end{tabular}
+\end{cquote}
+In Go, this setup results in significantly worse performance since @P2@ and @C2@ cannot operate in parallel with @P1@ and @C1@ due to all locks being acquired.
+Interesting, this case may not be as pathological as it seems.
+If the set of channels belonging to a select have channels that overlap with the set of another select, these statements lose the ability to operate in parallel.
+The implementation in \CFA only holds a single lock at a time, resulting in better locking granularity, and hence, more parallelism.
 Comparison of this pathological case is shown in Table~\ref{t:pathGo}.
 The AMD results highlight the worst case scenario for Go since contention is more costly on this machine than the Intel machine.
@@ -812,8 +836,8 @@
 \setlength{\tabcolsep}{5pt}
 
-\caption{Throughput (channel operations per second) of \CFA and Go for a pathologically bad case for contention in Go's select implementation}
+\caption{Throughput (channel operations per second) of \CFA and Go for a pathologically case for contention in Go's select implementation}
 \label{t:pathGo}
-\begin{tabular}{*{5}{r|}r}
-	& \multicolumn{1}{c|}{\CFA} & \multicolumn{1}{c@{}}{Go} \\
+\begin{tabular}{r|r|r}
+			& \multicolumn{1}{c|}{\CFA} & \multicolumn{1}{c}{Go} \\
 	\hline
 	AMD		& \input{data/nasus_Order} \\
@@ -824,13 +848,15 @@
 
 Another difference between Go and \CFA is the order of clause selection when multiple clauses are available.
-Go "randomly" selects a clause, but \CFA chooses the clause in the order they are listed~\cite{go:select}.
-This \CFA design decision allows users to set implicit priorities, which can result in more predictable behaviour, and even better performance in certain cases, such as the case shown in  Table~\ref{t:pathGo}.
-If \CFA didn't have priorities, the performance difference in Table~\ref{t:pathGo} would be less significant since @P1@ and @C1@ would try to compete to operate on @B@ more often with random selection.
+Go \emph{randomly} selects a clause~\cite{go:select}, but \CFA chooses in the order clauses are listed.
+This \CFA design decision allows users to set implicit priorities, which can result in more predictable behaviour and even better performance.
+In the previous example, threads @P1@ and @C1@ prioritize channel @A@ in the @waituntil@, which can reduce contention for threads @P2@ and @C2@ accessing channel @B@.
+If \CFA did not have priorities, the performance difference in Table~\ref{t:pathGo} would be significant less due to extra contention on channel @B@.
 
 \subsection{Future Benchmark}
-The future benchmark compares \CFA's @waituntil@ with \uC's @_Select@, with both utilities waiting on futures.
-Both \CFA's @waituntil@ and \uC's @_Select@ have very similar semantics, however @_Select@ can only wait on futures, whereas the @waituntil@ is polymorphic. 
-They both support @and@ and @or@ operators, but the underlying implementation of the operators differs between @waituntil@ and @_Select@.
-The @waituntil@ statement checks for statement completion using a predicate function, whereas the @_Select@ statement maintains a tree that represents the state of the internal predicate.
+
+The future benchmark compares \CFA's @waituntil@ with \uC's \lstinline[language=uC++]{_Select}, with both utilities waiting on futures.
+While both statements have very similar semantics, supporting @and@ and @or@ operators, \lstinline[language=uC++]{_Select} can only wait on futures, whereas the @waituntil@ is polymorphic. 
+As such, the underlying implementation of the operators differs between @waituntil@ and \lstinline[language=uC++]{_Select}.
+The @waituntil@ statement checks for statement completion using a predicate function, whereas the \lstinline[language=uC++]{_Select} statement maintains a tree that represents the state of the internal predicate.
 
 \begin{figure}
@@ -849,34 +875,47 @@
 \end{figure}
 
-This microbenchmark aims to measure the impact of various predicates on the performance of the @waituntil@ and @_Select@ statements.
-This benchmark and section does not try to directly compare the @waituntil@ and @_Select@ statements since the performance of futures in \CFA and \uC differ by a significant margin, making them incomparable.
-Results of this benchmark are shown in Figure~\ref{f:futurePerf}.
-Each set of columns is marked with a name representing the predicate for that set of columns.
-The predicate name and corresponding @waituntil@ statement is shown below:
-
-\begin{cfa}
-#ifdef OR
+This benchmark aims to indirectly measure the impact of various predicates on the performance of the @waituntil@ and \lstinline[language=uC++]{_Select} statements.
+The benchmark is indirect since the performance of futures in \CFA and \uC differ by a significant margin.
+The experiment has a server, which cycles fulfilling three futures, @A@, @B@, and @C@, and a client, which waits for these futures to be fulfilled using four different kinds of predicates given in \CFA:
+\begin{cquote}
+\begin{tabular}{@{}l|l@{}}
+OR & AND \\
+\hline
+\begin{cfa}
 waituntil( A ) { get( A ); }
 or waituntil( B ) { get( B ); }
 or waituntil( C ) { get( C ); }
-#endif
-#ifdef AND
+\end{cfa}
+&
+\begin{cfa}
 waituntil( A ) { get( A ); }
 and waituntil( B ) { get( B ); }
 and waituntil( C ) { get( C ); }
-#endif
-#ifdef ANDOR
+\end{cfa}
+\\
+\multicolumn{2}{@{}c@{}}{} \\
+AND-OR & OR-AND \\
+\hline
+\begin{cfa}
 waituntil( A ) { get( A ); }
 and waituntil( B ) { get( B ); }
 or waituntil( C ) { get( C ); }
-#endif
-#ifdef ORAND
-(waituntil( A ) { get( A ); }
-or waituntil( B ) { get( B ); }) // brackets create higher precedence for or
+\end{cfa}
+&
+\begin{cfa}
+@(@ waituntil( A ) { get( A ); }
+or waituntil( B ) { get( B ); } @)@
 and waituntil( C ) { get( C ); }
-#endif
-\end{cfa}
-
-In Figure~\ref{f:futurePerf}, the @OR@ column for \CFA is more performant than the other \CFA predicates, likely due to the special-casing of @waituntil@ statements with only @or@ operators.
-For both \uC and \CFA the @AND@ column is the least performant, which is expected since all three futures need to be fulfilled for each statement completion, unlike any of the other operators.
+\end{cfa}
+\end{tabular}
+\end{cquote}
+The server and client use a low cost synchronize after each fulfillment, so the server does not race ahead of the client.
+
+Results of this benchmark are shown in Figure~\ref{f:futurePerf}.
+Each pair of bars is marked with the predicate name for that experiment and the value at the top of each bar is the standard deviation..
+In detail, \uC results are lower in all cases due to the performance difference between futures and the more complex \gls{synch_multiplex} implementation.
+However, the bars for both systems have similar height patterns across the experiments.
+The @OR@ column for \CFA is more performant than the other \CFA predicates, due to the special-casing of @waituntil@ statements with only @or@ operators.
+For both \uC and \CFA, the @AND@ experiment is the least performant, which is expected since all three futures need to be fulfilled for each statement completion.
 Interestingly, \CFA has lower variation across predicates on the AMD (excluding the special OR case), whereas \uC has lower variation on the Intel.
+Given the differences in semantics and implementation between \uC and \CFA, this test only illustrates the overall costs among the different kinds of predicates.