Index: doc/theses/colby_parsons_MMath/text/waituntil.tex
===================================================================
--- doc/theses/colby_parsons_MMath/text/waituntil.tex	(revision f945fa7ba076d346ab1aa115747f79712d70331d)
+++ doc/theses/colby_parsons_MMath/text/waituntil.tex	(revision f945fa7ba076d346ab1aa115747f79712d70331d)
@@ -0,0 +1,1041 @@
+% ======================================================================
+% ======================================================================
+\chapter{Waituntil}\label{s:waituntil}
+% ======================================================================
+% ======================================================================
+
+Consider the following motivating problem.
+There are $N$ stalls (resources) in a bathroom and there are $M$ people (threads) using the bathroom.
+Each stall has its own lock since only one person may occupy a stall at a time.
+Humans solve this problem in the following way.
+They check if all of the stalls are occupied.
+If not, they enter and claim an available stall.
+If they are all occupied, people queue and watch the stalls until one is free, and then enter and lock the stall.
+This solution can be implemented on a computer easily, if all threads are waiting on all stalls and agree to queue.
+
+Now the problem is extended.
+Some stalls are wheelchair accessible and some stalls have gender identification.
+Each person (thread) may be limited to only one kind of stall or may choose among different kinds of stalls that match their criteria.
+Immediately, the problem becomes more difficult.
+A single queue no longer solves the problem.
+What happens when there is a stall available that the person at the front of the queue cannot choose?
+The na\"ive solution has each thread spin indefinitely continually checking every matching kind of stall(s) until a suitable one is free.
+This approach is insufficient since it wastes cycles and results in unfairness among waiting threads as a thread can acquire the first matching stall without regard to the waiting time of other threads.
+Waiting for the first appropriate stall (resource) that becomes available without spinning is an example of \gls{synch_multiplex}: the ability to wait synchronously for one or more resources based on some selection criteria.
+
+\section{History of Synchronous Multiplexing}\label{s:History}
+
+There is a history of tools that provide \gls{synch_multiplex}.
+Some well known \gls{synch_multiplex} tools include Unix system utilities: @select@~\cite{linux:select}, @poll@~\cite{linux:poll}, and @epoll@~\cite{linux:epoll}, and the @select@ statement provided by Go~\cite{go:selectref}, Ada~\cite[\S~9.7]{Ada16}, and \uC~\cite[\S~3.3.1]{uC++}.
+The concept and theory surrounding \gls{synch_multiplex} was introduced by Hoare in his 1985 book, Communicating Sequential Processes (CSP)~\cite{Hoare85},
+\begin{quote}
+A communication is an event that is described by a pair $c.v$ where $c$ is the name of the channel on which the communication takes place and $v$ is the value of the message which passes.~\cite[p.~113]{Hoare85}
+\end{quote}
+The ideas in CSP were implemented by Roscoe and Hoare in the language Occam~\cite{Roscoe88}.
+
+Both CSP and Occam include the ability to wait for a \Newterm{choice} among receiver channels and \Newterm{guards} to toggle which receives are valid.
+For example,
+\begin{cfa}[mathescape]
+(@G1@(x) $\rightarrow$ P @|@ @G2@(y) $\rightarrow$ Q )
+\end{cfa}
+waits for either channel @x@ or @y@ to have a value, if and only if guards @G1@ and @G2@ are true;
+if only one guard is true, only one channel receives, and if both guards are false, no receive occurs.
+% extended CSP with a \gls{synch_multiplex} construct @ALT@, which waits for one resource to be available and then executes a corresponding block of code.
+In detail, waiting for one resource out of a set of resources can be thought of as a logical exclusive-or over the set of resources.
+Guards are a conditional operator similar to an @if@, except they apply to the resource being waited on.
+If a guard is false, then the resource it guards is not in the set of resources being waited on.
+If all guards are false, the ALT, Occam's \gls{synch_multiplex} statement, does nothing and the thread continues.
+Guards can be simulated using @if@ statements as shown in~\cite[rule~2.4, p~183]{Roscoe88}
+\begin{lstlisting}[basicstyle=\rm,mathescape]
+ALT( $b$ & $g$ $P$, $G$ ) = IF ( $b$ ALT($\,g$ $P$, $G$ ), $\neg\,$b ALT( $G$ ) )			 (boolean guard elim).
+\end{lstlisting}
+but require $2^N-1$ @if@ statements, where $N$ is the number of guards.
+The exponential blowup comes from applying rule 2.4 repeatedly, since it works on one guard at a time.
+Figure~\ref{f:wu_if} shows in \CFA an example of applying rule 2.4 for three guards.
+Also, notice the additional code duplication for statements @S1@, @S2@, and @S3@.
+
+\begin{figure}
+\centering
+\begin{lrbox}{\myboxA}
+\begin{cfa}
+when( G1 )
+	waituntil( R1 ) S1 
+or when( G2 )
+	waituntil( R2 ) S2
+or when( G3 )
+	waituntil( R3 ) S3
+
+
+
+
+
+
+
+\end{cfa}
+\end{lrbox}
+
+\begin{lrbox}{\myboxB}
+\begin{cfa}
+if ( G1 )
+	if ( G2 )
+		if ( G3 ) waituntil( R1 ) S1 or waituntil( R2 ) S2 or waituntil( R3 ) S3
+		else waituntil( R1 ) S1 or waituntil( R2 ) S2
+	else
+		if ( G3 ) waituntil( R1 ) S1 or waituntil( R3 ) S3
+		else waituntil( R1 ) S1
+else
+	if ( G2 )
+		if ( G3 ) waituntil( R2 ) S2 or waituntil( R3 ) S3
+		else waituntil( R2 ) S2
+	else
+		if ( G3 ) waituntil( R3 ) S3
+\end{cfa}
+\end{lrbox}
+
+\subfloat[Guards]{\label{l:guards}\usebox\myboxA}
+\hspace*{5pt}
+\vrule
+\hspace*{5pt}
+\subfloat[Simulated Guards]{\label{l:simulated_guards}\usebox\myboxB}
+\caption{\CFA guard simulated with \lstinline{if} statement.}
+\label{f:wu_if}
+\end{figure}
+
+When discussing \gls{synch_multiplex} implementations, the resource being multiplexed is important.
+While CSP waits on channels, the earliest known implementation of synch\-ronous multiplexing is Unix's @select@~\cite{linux:select}, multiplexing over file descriptors, which conceptually differ from channels in name only.
+The @select@ system-call is passed three sets of file descriptors (read, write, exceptional) to wait on and an optional timeout.
+@select@ blocks until either some subset of file descriptors are available or the timeout expires.
+All file descriptors that are ready are returned by modifying the argument sets to only contain the ready descriptors.
+
+This early implementation differs from the theory presented in CSP: when the call from @select@ returns it may provide more than one ready file descriptor.
+As such, @select@ has logical-or multiplexing semantics, whereas the theory described exclusive-or semantics.
+It is possible to achieve exclusive-or semantics with @select@ by arbitrarily operating on only one of the returned descriptors.
+@select@ passes the interest set of file descriptors between application and kernel in the form of a worst-case sized bit-mask, where the worst-case is the largest numbered file descriptor.
+@poll@ reduces the size of the interest sets changing from a bit mask to a linked data structure, independent of the file-descriptor values.
+@epoll@ further reduces the data passed per call by keeping the interest set in the kernel, rather than supplying it on every call.
+
+These early \gls{synch_multiplex} tools interact directly with the operating system and others are used to communicate among processes.
+Later, \gls{synch_multiplex} started to appear in applications, via programming languages, to support fast multiplexed concurrent communication among threads.
+An early example of \gls{synch_multiplex} is the @select@ statement in Ada~\cite[\S~9.7]{Ichbiah79}.
+This @select@ allows a task object, with their own threads, to multiplex over a subset of asynchronous calls to its methods.
+The Ada @select@ has the same exclusive-or semantics and guards as Occam ALT;
+however, it multiplexes over methods rather than channels.
+
+\begin{figure}
+\begin{lstlisting}[language=ada,literate=,{moredelim={**[is][\color{red}]{@}{@}}}]
+task type buffer is -- thread type
+	... -- buffer declarations
+	count : integer := 0;
+begin -- thread starts here
+	loop
+		select
+			@when count < Size@ => -- guard
+			@accept insert( elem : in ElemType )@ do  -- method
+				... -- add to buffer
+				count := count + 1;
+			end;
+			-- executed if this accept called
+		or
+			@when count > 0@ => -- guard
+			@accept remove( elem : out ElemType )@ do  -- method
+				... --remove and return from buffer via parameter
+				count := count - 1;
+			end;
+			-- executed if this accept called
+		or @delay 10.0@;  -- unblock after 10 seconds without call
+		or @else@ -- do not block, cannot appear with delay
+		end select;
+	end loop;
+end buffer;
+var buf : buffer; -- create task object and start thread in task body
+\end{lstlisting}
+\caption{Ada Bounded Buffer}
+\label{f:BB_Ada}
+\end{figure}
+
+Figure~\ref{f:BB_Ada} shows the outline of a bounded buffer implemented with an Ada task.
+Note that a task method is associated with the \lstinline[language=ada]{accept} clause of the \lstinline[language=ada]{select} statement, rather than being a separate routine.
+The thread executing the loop in the task body blocks at the \lstinline[language=ada]{select} until a call occurs to @insert@ or @remove@.
+Then the appropriate \lstinline[language=ada]{accept} method is run with the called arguments.
+Hence, the \lstinline[language=ada]{select} statement provides rendezvous points for threads, rather than providing channels with message passing.
+The \lstinline[language=ada]{select} statement also provides a timeout and @else@ (nonblocking), which changes synchronous multiplexing to asynchronous.
+Now the thread polls rather than blocks.
+
+Another example of programming-language \gls{synch_multiplex} is Go using a @select@ statement with channels~\cite{go:selectref}.
+Figure~\ref{l:BB_Go} shows the outline of a bounded buffer administrator implemented with a Go routine.
+Here two channels are used for inserting and removing by client producers and consumers, respectively.
+(The @done@ channel is used to synchronize with the program main.)
+Go's @select@ has the same exclusive-or semantics as the ALT primitive from Occam and associated code blocks for each clause like ALT and Ada.
+However, unlike Ada and ALT, Go does not provide guards for the \lstinline[language=go]{case} clauses of the \lstinline[language=go]{select}.
+As such, the exponential blowup can be seen comparing Go and \uC in Figure~\ref{f:AdaMultiplexing}.
+Go also provides a timeout via a channel and a @default@ clause like Ada @else@ for asynchronous multiplexing.
+
+\begin{figure}
+\centering
+
+\begin{lrbox}{\myboxA}
+\begin{lstlisting}[language=go,literate=,{moredelim={**[is][\color{red}]{@}{@}}}]
+insert := make( chan int )
+remove := make( chan * int )
+buffer := make( chan int Size )
+done := make( chan int )
+count := 0
+func in_buf( int val ) {
+	buffer <- val
+	count++
+}
+func out_buf( int * ptr ) {
+	*ptr := <-buffer
+	count--
+}
+func BoundedBuffer {
+	L: for {
+		if ( count < Size && count > 0 ) {
+			select { // wait for message
+				@case i := <- insert: in_buf( i )@
+				@case p := <- remove: out_buf( p )@
+				case <- done: break L
+			}
+		} else if ( count < Size ) {
+			select { // wait for message
+				@case i := <- insert: in_buf( i )@
+				case <- done: break L
+			}
+		} else ( count > 0 ) {
+			select { // wait for message
+				@case p := <- remove: out_buf( p )@
+				case <- done: break L;
+			}
+		}
+	}
+	done <- 0
+}
+func main() {
+	go BoundedBuffer() // start administrator
+}
+\end{lstlisting}
+\end{lrbox}
+
+\begin{lrbox}{\myboxB}
+\begin{lstlisting}[language=uC++,{moredelim={**[is][\color{red}]{@}{@}}}]
+
+
+
+
+
+
+
+
+
+
+
+
+_Task BoundedBuffer {
+	int * buffer;
+	int front = back = count = 0;
+  public:
+	// ... constructor implementation
+	void insert( int elem ) {
+		buffer[front] = elem;
+		front = ( front + 1 ) % Size;
+		count += 1;
+	}
+	int remove() {
+		int ret = buffer[back];
+		back = ( back + 1 ) % Size;
+		count -= 1;
+		return ret;
+	}
+  private:
+	void main() {
+		for ( ;; ) {
+			_Accept( ~buffer ) break;
+			@or _When ( count < Size ) _Accept( insert )@;
+			@or _When ( count > 0 ) _Accept( remove )@;
+		}
+	}
+};
+buffer buf; // start thread in main method
+\end{lstlisting}
+\end{lrbox}
+
+\subfloat[Go]{\label{l:BB_Go}\usebox\myboxA}
+\hspace*{5pt}
+\vrule
+\hspace*{5pt}
+\subfloat[\uC]{\label{l:BB_uC++}\usebox\myboxB}
+
+\caption{Bounded Buffer}
+\label{f:AdaMultiplexing}
+\end{figure}
+
+Finally, \uC provides \gls{synch_multiplex} with Ada-style @select@ over monitor and task methods with the @_Accept@ statement~\cite[\S~2.9.2.1]{uC++}, and over futures with the @_Select@ statement~\cite[\S~3.3.1]{uC++}.
+The @_Select@ statement extends the ALT/Go @select@ by offering both @and@ and @or@ semantics, which can be used together in the same statement.
+Both @_Accept@ and @_Select@ statements provide guards for multiplexing clauses, as well as, timeout, and @else@ clauses.
+Figure~\ref{l:BB_uC++} shows the outline of a bounded buffer administrator implemented with \uC @_Accept@ statements.
+
+There are other languages that provide \gls{synch_multiplex}, including Rust's @select!@ over futures~\cite{rust:select}, Java's @allof@/@anyof@ over futures~\cite{java:allof:anyof}, OCaml's @select@ over channels~\cite{ocaml:channel}, and C++14's @when_any@ over futures~\cite{cpp:whenany}.
+Note that while C++14 and Rust provide \gls{synch_multiplex}, the implementations leave much to be desired as both rely on polling to wait on multiple resources.
+
+\section{Other Approaches to Synchronous Multiplexing}
+
+To avoid the need for \gls{synch_multiplex}, all communication among threads/processes must come from a single source.
+For example, in Erlang each process has a single heterogeneous mailbox that is the sole source of concurrent communication, removing the need for \gls{synch_multiplex} as there is only one place to wait on resources.
+Similar, actor systems circumvent the \gls{synch_multiplex} problem as actors only block when waiting for the next message never in a behaviour.
+While these approaches solve the \gls{synch_multiplex} problem, they introduce other issues.
+Consider the case where a thread has a single source of communication and it wants a set of $N$ resources.
+It must sequentially request the $N$ resources and wait for each response.
+During the receives for the $N$ resources, it can receive other communication, and has to save and postpone these communications, or discard them.
+% If the requests for the other resources need to be retracted, the burden falls on the programmer to determine how to synchronize appropriately to ensure that only one resource is delivered.
+
+\section{\CFA's Waituntil Statement}
+
+The new \CFA \gls{synch_multiplex} utility introduced in this work is the @waituntil@ statement.
+There already exists a @waitfor@ statement in \CFA that supports Ada-style \gls{synch_multiplex} over monitor methods~\cite{Delisle21}, so this @waituntil@ focuses on synchronizing over other resources.
+All of the \gls{synch_multiplex} features mentioned so far are monomorphic, only waiting on one kind of resource: Unix @select@ supports file descriptors, Go's @select@ supports channel operations, \uC's @select@ supports futures, and Ada's @select@ supports monitor method calls.
+The \CFA @waituntil@ is polymorphic and provides \gls{synch_multiplex} over any objects that satisfy the trait in Figure~\ref{f:wu_trait}.
+No other language provides a synchronous multiplexing tool polymorphic over resources like \CFA's @waituntil@.
+
+\begin{figure}
+\begin{cfa}
+forall(T & | sized(T))
+trait is_selectable {
+	// For registering a waituntil stmt on a selectable type
+	bool register_select( T &, select_node & );
+
+	// For unregistering a waituntil stmt from a selectable type
+	bool unregister_select( T &, select_node & );
+
+	// on_selected is run on the selecting thread prior to executing
+	// the statement associated with the select_node
+	bool on_selected( T &, select_node & );
+};
+\end{cfa}
+\caption{Trait for types that can be passed into \CFA's \lstinline{waituntil} statement.}
+\label{f:wu_trait}
+\end{figure}
+
+Currently locks, channels, futures and timeouts are supported by the @waituntil@ statement, and this set can be expanded through the @is_selectable@ trait as other use-cases arise.
+The @waituntil@ statement supports guard clauses, both @or@ and @and@ semantics, and timeout and @else@ for asynchronous multiplexing.
+Figure~\ref{f:wu_example} shows a \CFA @waituntil@ usage, which is waiting for either @Lock@ to be available \emph{or} for a value to be read from @Channel@ into @i@ \emph{and} for @Future@ to be fulfilled \emph{or} a timeout of one second. 
+Note that the expression inside a @waituntil@ clause is evaluated once at the start of the @waituntil@ algorithm.
+
+\section{Waituntil Semantics}
+
+The @waituntil@ semantics has two parts: the semantics of the statement itself, \ie @and@, @or@, @when@ guards, and @else@ semantics, and the semantics of how the @waituntil@ interacts with types like locks, channels, and futures.
+
+\subsection{Statement Semantics}
+
+The @or@ semantics are the most straightforward and nearly match those laid out in the ALT statement from Occam.
+The clauses have an exclusive-or relationship where the first available one is run and only one clause is run.
+\CFA's @or@ semantics differ from ALT semantics: instead of randomly picking a clause when multiple are available, the first clause in the @waituntil@ that is available is executed.
+For example, in the following example, if @foo@ and @bar@ are both available, @foo@ is always selected since it comes first in the order of @waituntil@ clauses.
+\begin{cfa}
+future(int) bar, foo;
+waituntil( foo ) { ... } or waituntil( bar ) { ... } // prioritize foo
+\end{cfa}
+The reason for these semantics is that prioritizing resources can be useful in certain problems, such as shutdown.
+In the rare case where there is a starvation problem with the ordering, it possible to follow a @waituntil@ with its reverse form, alternating which resource has the highest priority:
+\begin{cfa}
+waituntil( foo ) { ... } or waituntil( bar ) { ... } // prioritize foo
+waituntil( bar ) { ... } or waituntil( foo ) { ... } // prioritize bar
+\end{cfa}
+While this approach is not general for many resources, it handles many basic cases.
+
+\begin{figure}
+\begin{cfa}
+future(int) Future;
+channel(int) Channel;
+owner_lock Lock;
+int i = 0;
+
+waituntil( @Lock@ ) { ... }
+or when( i == 0 ) waituntil( i << @Channel@ ) { ... }
+and waituntil( @Future@ ) { ... }
+or waituntil( @timeout( 1`s )@ ) { ... }
+// else { ... }
+\end{cfa}
+\caption{Example of \CFA's waituntil statement}
+\label{f:wu_example}
+\end{figure}
+
+The \CFA @and@ semantics match the @and@ semantics of \uC \lstinline[language=uC++]{_Select}.
+When multiple clauses are joined by @and@, the @waituntil@ makes a thread wait for all to be available, but still runs the corresponding code blocks \emph{as they become available}.
+When an @and@ clause becomes available, the waiting thread unblocks and runs that clause's code-block, and then the thread waits again for the next available clause or the @waituntil@ statement is now satisfied.
+This semantics allows work to be done in parallel while synchronizing over a set of resources, and furthermore, gives a good reason to use the @and@ operator.
+If the @and@ operator waited for all clauses to be available before running, it is the same as just acquiring those resources consecutively by a sequence of @waituntil@ statements.
+
+As with normal C expressions, the @and@ operator binds more tightly than the @or@.
+To give an @or@ operator higher precedence, parenthesis are used.
+For example, the following @waituntil@ unconditionally waits for @C@ and one of either @A@ or @B@, since the @or@ is given higher precedence via parenthesis.
+\begin{cfa}
+@(@ waituntil( A ) { ... }		// bind tightly to or
+or waituntil( B ) { ... } @)@
+and waituntil( C ) { ... }
+\end{cfa}
+
+The guards in the @waituntil@ statement are called @when@ clauses.
+Each boolean expression inside a @when@ is evaluated \emph{once} before the @waituntil@ statement is run.
+Like Occam's ALT, the guards toggle clauses on and off, where a @waituntil@ clause is only evaluated and waited on if the corresponding guard is @true@.
+In addition, the @waituntil@ guards require some nuance since both @and@ and @or@ operators are supported \see{Section~\ref{s:wu_guards}}.
+When a guard is false and a clause is removed, it can be thought of as removing that clause and its preceding operation from the statement.
+For example, in the following, the two @waituntil@ statements are semantically equivalent.
+
+\begin{lrbox}{\myboxA}
+\begin{cfa}
+when( true ) waituntil( A ) { ... }
+or when( false ) waituntil( B ) { ... }
+and waituntil( C ) { ... }
+\end{cfa}
+\end{lrbox}
+
+\begin{lrbox}{\myboxB}
+\begin{cfa}
+waituntil( A ) { ... }
+and waituntil( C ) { ... }
+
+\end{cfa}
+\end{lrbox}
+
+\begin{tabular}{@{}lcl@{}}
+\usebox\myboxA & $\equiv$ & \usebox\myboxB
+\end{tabular}
+
+The @else@ clause on the @waituntil@ has identical semantics to the @else@ clause in Ada.
+If all resources are not immediately available and there is an @else@ clause, the @else@ clause is run and the thread continues.
+
+\subsection{Type Semantics}
+
+As mentioned, to support interaction with the @waituntil@ statement a type must support the trait in Figure~\ref{f:wu_trait}.
+The @waituntil@ statement expects types to register and unregister themselves via calls to @register_select@ and @unregister_select@, respectively.
+When a resource becomes available, @on_selected@ is run, and if it returns false, the corresponding code block is not run.
+Many types do not need @on_selected@, but it is provided if a type needs to perform work or checks before the resource can be accessed in the code block.
+The register/unregister routines in the trait also return booleans.
+The return value of @register_select@ is @true@, if the resource is immediately available and @false@ otherwise.
+The return value of @unregister_select@ is @true@, if the corresponding code block should be run after unregistration and @false@ otherwise.
+The routine @on_selected@ and the return value of @unregister_select@ are needed to support channels as a resource.
+More detail on channels and their interaction with @waituntil@ appear in Section~\ref{s:wu_chans}.
+
+The trait can be used directly by having a blocking object support the @is_selectable@ trait, or it can be used indirectly through routines that take the object as an argument.
+When used indirectly, the object's routine returns a type that supports the @is_selectable@ trait.
+This feature leverages \CFA's ability to overload on return type to select the correct overloaded routine for the @waituntil@ context.
+Indirect support through routines is needed for types that want to support multiple operations such as channels that allow both reading and writing.
+
+\section{\lstinline{waituntil} Implementation}
+
+The @waituntil@ statement is not inherently complex, and Figure~\ref{f:WU_Impl} shows the basic outline of the @waituntil@ algorithm.
+Complexity comes from the consideration of race conditions and synchronization needed when supporting various primitives.
+The following sections use examples to fill in complexity details missing in Figure~\ref{f:WU_Impl}.
+After which, the full pseudocode for the @waituntil@ algorithm is presented in Figure~\ref{f:WU_Full_Impl}.
+
+\begin{figure}
+\begin{cfa}
+select_nodes s[N];								$\C[3.25in]{// declare N select nodes}$
+for ( node in s )								$\C{// register nodes}$
+	register_select( resource, node );
+while ( statement predicate not satisfied ) {	$\C{// check predicate}$
+	// block until clause(s) satisfied
+	for ( resource in waituntil statement ) {	$\C{// run true code blocks}$
+		if ( resource is avail ) run code block
+		if ( statement predicate is satisfied ) break;
+	}
+}
+for ( node in s )								$\C{// deregister nodes}\CRT$
+	if ( unregister_select( resource, node ) ) run code block
+\end{cfa}
+\caption{\lstinline{waituntil} Implementation}
+\label{f:WU_Impl}
+\end{figure}
+
+The basic steps of the algorithm are:
+\begin{enumerate}
+\item
+The @waituntil@ statement declares $N$ @select_node@s, one per resource that is being waited on, which stores any @waituntil@ data pertaining to that resource.
+
+\item
+Each @select_node@ is then registered with the corresponding resource.
+
+\item
+The thread executing the @waituntil@ then loops until the statement's predicate is satisfied.
+In each iteration, if the predicate is unsatisfied, the @waituntil@ thread blocks.
+When another thread satisfies a resource clause (\eg sends to a  channel), it unblocks the @waituntil@ thread.
+This thread checks all clauses for completion, and any completed clauses have their code blocks run.
+While checking clause completion, if enough clauses have been run such that the statement predicate is satisfied, the loop exits early.
+
+\item
+Once the thread escapes the loop, the @select_nodes@ are unregistered from the resources.
+\end{enumerate}
+These steps give a basic overview of how the statement works.
+The following sections shed light on the specific changes and provide more implementation detail.
+
+\subsection{Locks}\label{s:wu_locks}
+
+The \CFA runtime supports a number of spinning and blocking locks, \eg semaphore, MCS, futex, Go mutex, spinlock, owner, \etc.
+Many of these locks satisfy the @is_selectable@ trait, and hence, are resources supported by the @waituntil@ statement.
+For example, the following waits until the thread has acquired lock @l1@ or locks @l2@ and @l3@.
+\begin{cfa}
+owner_lock l1, l2, l3;
+waituntil ( l1 ) { ... }
+or waituntil( l2 ) { ... }
+and waituntil( l3 ) { ... }
+\end{cfa}
+Implicitly, the @waituntil@ is calling the lock acquire for each of these locks to establish a position in the lock's queue of waiting threads.
+When the lock schedules this thread, it unblocks and runs the code block associated with the lock and then releases the lock.
+
+In detail, when a thread waits on multiple locks via a @waituntil@, it enqueues a @select_node@ in each of the lock's waiting queues.
+When a @select_node@ reaches the front of the lock's queue and gains ownership, the thread blocked on the @waituntil@ is unblocked.
+Now, the lock is held by the @waituntil@ thread until the code block is executed, and then the node is unregistered, during which the lock is released.
+Immediately releasing the lock after the code block prevents the waiting thread from holding multiple locks and potentially introducing a deadlock.
+As such, the only unregistered nodes associated with locks are the ones that have not run.
+
+\subsection{Timeouts}
+
+A timeout for the @waituntil@ statement is a duration passed to routine \lstinline[deletekeywords={timeout}]{timeout}\footnote{\lstinline{timeout} is a quasi-keyword in \CFA, allowing it to be used an identifier.}, \eg:
+\begin{cquote}
+\begin{tabular}{@{}l|l@{}}
+\multicolumn{2}{@{}l@{}}{\lstinline{Duration D1\{ 1`ms \}, D2\{ 2`ms \}, D3\{ 3`ms \};}} \\
+\begin{cfa}[deletekeywords={timeout}]
+waituntil( i << C1 ) {}
+or waituntil( i << C2 ) {}
+or waituntil( i << C3 ) {}
+or waituntil( timeout( D1 ) ) {}
+or waituntil( timeout( D2 ) ) {}
+or waituntil( timeout( D3 ) ) {}
+\end{cfa}
+&
+\begin{cfa}[deletekeywords={timeout}]
+waituntil( i << C1 ) {}
+or waituntil( i << C2 ) {}
+or waituntil( i << C3 ) {}
+or waituntil( timeout( min( D1, D2, D3 ) ) ) {}
+
+
+\end{cfa}
+\end{tabular}
+\end{cquote}
+These two examples are basically equivalent.
+Here, the multiple timeouts are useful because the durations can change during execution and the separate clauses provide different code blocks if a timeout triggers.
+Multiple timeouts can also be used with @and@ to provide a minimal delay before proceeding.
+In following example, either channel @C1@ or @C2@ must be satisfied but nothing can be done for at least 1 or 3 seconds after the channel read, respectively.
+\begin{cfa}[deletekeywords={timeout}]
+waituntil( i << C1 ){} and waituntil( timeout( 1`s ) ){}
+or waituntil( i << C2 ){} and waituntil( timeout( 3`s ) ){}
+\end{cfa}
+If only @C2@ is satisfied, \emph{both} timeout code-blocks trigger because 1 second occurs before 3 seconds.
+Note that the \CFA @waitfor@ statement only provides a single @timeout@ clause because it only supports @or@ semantics.
+
+The \lstinline[deletekeywords={timeout}]{timeout} routine is different from UNIX @sleep@, which blocks for the specified duration and returns the amount of time elapsed since the call started.
+Instead, \lstinline[deletekeywords={timeout}]{timeout} returns a type that supports the @is_selectable@ trait, allowing the type system to select the correct overloaded routine for this context.
+For the @waituntil@, it is more idiomatic for the \lstinline[deletekeywords={timeout}]{timeout} to use the same syntax as other blocking operations instead of having a special language clause.
+
+\subsection{Channels}\label{s:wu_chans}
+
+Channels require more complexity to allow synchronous multiplexing.
+For locks, when an outside thread releases a lock and unblocks the @waituntil@ thread (WUT), the lock's MX property is passed to the WUT (no spinning locks).
+For futures, the outside thread deliveries a value to the future and unblocks any waiting threads, including WUTs.
+In either case, after the WUT unblocks, it is safe to execute its corresponding code block knowing access to the resource is protected by the lock or the read-only state of the future.
+Similarly, for channels, when an outside thread inserts a value into a channel, it must unblock the WUT.
+However, for channels, there is a race condition that poses an issue.
+If the outside thread inserts into the channel and unblocks the WUT, there is a race where another thread can remove the channel data, so after the WUT unblocks and attempts to remove from the buffer, it fails, and the WUT must reblock (busy waiting).
+This scenario is a \gls{toctou} race that needs to be consolidated.
+To close the race, the outside thread must detect this case and insert directly into the left-hand side of the channel expression (@i << chan@) rather than into the channel, and then unblock the WUT.
+Now the unblocked WUT is guaranteed to have a satisfied resource and its code block can safely executed.
+The insertion circumvents the channel buffer via the wait-morphing in the \CFA channel implementation \see{Section~\ref{s:chan_impl}}, allowing @waituntil@ channel unblocking to not be special-cased.
+Note that all channel operations are fair and no preference is given between @waituntil@ and direct channel operations when unblocking.
+
+Furthermore, if both @and@ and @or@ operators are used, the @or@ operations stop behaving like exclusive-or due to the race among channel operations, \eg:
+\begin{cfa}
+int i;
+waituntil( i << A ) {} and waituntil( i << B ) {}
+or waituntil( i << C ) {} and waituntil( i << D ) {}
+\end{cfa}
+If exclusive-or semantics are followed, only the code blocks for @A@ and @B@ are run, or the code blocks for @C@ and @D@.
+However, four outside threads inserting into each channel can simultaneously put values into @i@ and attempt to unblock the WUT to run the four code-blocks.
+This case introduces a race with complexity that increases with the size of the @waituntil@ statement.
+However, due to TOCTOU issues, it is impossible to know if all resources are available without acquiring all the internal locks of channels in the subtree of the @waituntil@ clauses.
+This approach is a poor solution for two reasons.
+It is possible that once all the locks are acquired, the subtree is not satisfied and the locks must be released.
+This work incurs a high cost for signalling threads and heavily increase contention on internal channel locks.
+Furthermore, the @waituntil@ statement is polymorphic and can support resources that do not have internal locks, which also makes this approach infeasible.
+As such, the exclusive-or semantics are lost when using both @and@ and @or@ operators since it cannot be supported without significant complexity and significantly affects @waituntil@ performance.
+Therefore, the example of reusing variable @i@ by multiple output channels is considered a user error without exclusive-or semantics.
+Given aliasing in C, it is impossible to even warn of the potential race.
+In the future, it would be interesting to support Go-like syntax, \lstinline[language=Go]{case i := <- ...}, defining a new scoped @i@ variable for each clause.
+
+It was deemed important that exclusive-or semantics are maintained when only @or@ operators are used, so this situation has been special-cased, and is handled by having all clauses race to set a value \emph{before} operating on the channel.
+Consider the following example where thread 1 is reading and threads 2 and 3 are writing to channels @A@ and @B@ concurrently.
+\begin{cquote}
+\begin{tabular}{@{}l|l|l@{}}
+\multicolumn{3}{@{}l@{}}{\lstinline{channel(int) A, B; // zero size channels}} \\
+thread 1 & thread 2 & thread 3 \\
+\begin{cfa}
+waituntil( i << A ) {}
+or waituntil( i << B ) {}
+\end{cfa}
+&
+\begin{cfa}
+A << 1;
+
+\end{cfa}
+&
+\begin{cfa}
+B << 2;
+
+\end{cfa}
+\end{tabular}
+\end{cquote}
+For thread 1 to have exclusive-or semantics, it must only consume from exactly one of @A@ or @B@.
+As such, thread 2 and 3 must race to establish the winning clause of the @waituntil@ in thread 1.
+This race is consolidated by thread 2 and 3 each attempting to set a pointer to the winning clause's @select_node@ address using \gls{cas}.
+The winner bypasses the channel and inserts into the WUT's left-hand, and signals thread 1.
+The loser continues checking if there is space in the channel, and if so performs the channel insert operation with a possible signal of a waiting remove thread;
+otherwise, if there is no space, the loser blocks.
+It is important the race occurs \emph{before} operating on the channel, because channel actions are different with respect to each thread.
+If the race was consolidated after the operation, both thread 2 and 3 could potentially write into @i@ concurrently.
+
+Channels introduce another interesting implementation issue.
+Supporting both reading and writing to a channel in a @waituntil@ means that one @waituntil@ clause may be the notifier of another @waituntil@ clause.
+This conjunction poses a problem when dealing with the special-cased @or@ where the clauses need to win a race to operate on a channel.
+Consider the following example, alongside a described ordering of events to highlight the race.
+\begin{cquote}
+\begin{tabular}{@{}l|l@{}}
+\multicolumn{2}{@{}l@{}}{\lstinline{channel(int) A, B; // zero size channels}} \\
+thread 1 & thread 2 \\
+\begin{cfa}[moredelim={**[is][\color{blue}]{\#}{\#}}]
+waituntil( @i << A@ ) {}
+or waituntil( #i << B# ) {}
+\end{cfa}
+&
+\begin{cfa}[moredelim={**[is][\color{blue}]{\#}{\#}}]
+waituntil( #B << 2# ) {}
+or waituntil( @A << 1@ ) {}
+\end{cfa}
+\end{tabular}
+\end{cquote}
+Assume thread 1 executes first, registers with channel @A@ and proceeds in the @waituntil@.
+Since @A@ is empty, thread 1 cannot remove, and then thread 1 is interrupted before registering with @B@.
+Thread 2 similarly registers with channel @B@, and proceeds in the @waituntil@.
+Since @B@ is zero size there is no space to insert, and then thread 2 is interrupted before registering with @A@.
+At this point, thread 1 and 2 resume execution.
+Remember from above, each exclusive-or @waituntil@ holds a race to set the winning clause of the statement.
+The issue that arises is that these two @waituntil@ statements must have matching winning clauses (both @A@ clauses or both @B@ clauses) to preserve the exclusive-or semantics, since a zero-sized channel needs an insert/remove pair for an operation to occur.
+If threads 1 and 2 race to set a winner only in their own @waituntil@, thread 1 can think it successfully removed from @B@, and thread 2 can think it successfully inserted into @A@, which is an error.
+Hence, there is a race on two fronts.
+If thread 1 wins the race and sees that @B@ has a waiting insertion, then thread 2 must execute the first clause of its @waituntil@ and thread 1 execute its second.
+Correspondingly, if thread 2 wins the race and sees that @A@ has a waiting removal, then thread 1 must execute the first clause of its @waituntil@ and thread 2 execute its second.
+Any other execution scenario is incorrect for exclusive-or semantics.
+Note that priority execution of multiple satisfied @waituntil@ causes (\ie top to bottom) is not violated because, in this scenario, there is only one satisfied clause for either thread.
+
+The Go @select@ solves this problem by acquiring all the internal locks of the channels before registering the @select@ on the channels.
+This approach eliminates the race shown above since thread 1 and 2 cannot both be registering at the same time.
+However, this approach cannot be used in \CFA, since the @waituntil@ is polymorphic.
+Not all types in a @waituntil@ have an internal lock, and when using non-channel types, acquiring all the locks incurs extra unneeded overhead.
+Instead, this race is consolidated in \CFA in two phases by having an intermediate pending status value for the race.
+This race case is detectable, and if detected, each thread first races to set its own @waituntil@ race pointer to be pending.
+If it succeeds, it then attempts to set the other thread's @waituntil@ race pointer to its success value.
+If either thread successfully sets the the other thread's @waituntil@ race pointer, then the operation can proceed, if not the signalling threads set its own race pointer back to the initial value and repeats.
+This retry mechanism can potentially introduce a livelock, but in practice a livelock here is highly unlikely.
+Furthermore, the likelihood of a livelock here is zero unless the program is in the niche case of having two or more exclusive-or @waituntil@s with two or more clauses in reverse order of priority.
+This livelock case can be fully eliminated using locks like Go, or if a \gls{dcas} instruction is available.
+If any other threads attempt to set a WUT's race pointer and see a pending value, they wait until the value changes before proceeding to ensure that, in the case the WUT fails, the signal is not lost.
+This protocol ensures that signals cannot be lost and that the two races can be resolved in a safe manner.
+The implementation of this protocol is shown in Figure~\ref{f:WU_DeadlockAvoidance}.
+
+\begin{figure}
+\begin{cfa}
+bool pending_set_other( select_node & other, select_node & mine ) {
+	unsigned long int cmp_status = UNSAT;
+
+	// Try to set other status, if we succeed break and return true
+	while( ! CAS( other.clause_status, &cmp_status, SAT ) ) {
+		if ( cmp_status == SAT )
+			return false; // If other status is SAT we lost so return false
+
+		// Toggle own status flag to allow other thread to potentially win
+		mine.status = UNSAT;
+
+		// Reset compare flag
+		cmp_status = UNSAT;
+
+		// Attempt to set own status flag back to PENDING to retry
+		if ( ! CAS( mine.clause_status, &cmp_status, PENDING ) )
+			return false; // If we fail then we lost so return false
+		
+		// Reset compare flag
+		cmp_status = UNSAT;
+	}
+	return true;
+}
+\end{cfa}
+\caption{Exclusive-or \lstinline{waituntil} channel deadlock avoidance protocol}
+\label{f:WU_DeadlockAvoidance}
+\end{figure}
+
+Channels in \CFA have exception-based shutdown mechanisms that the @waituntil@ statement needs to support.
+These exception mechanisms are supported through the @on_selected@ routine.
+This routine is needed by channels to detect if they are closed after unblocking in a @waituntil@ statement, to ensure the appropriate behaviour is taken and an exception is thrown.
+Hence, the channel close-down mechanism is handled correctly.
+
+\subsection{Guards and Statement Predicate}\label{s:wu_guards}
+
+It is trivial to check when a synchronous multiplexing utility is done for the or/xor relationship, since any resource becoming available means that the blocked thread can proceed and the @waituntil@ statement is finished.
+In \uC and \CFA, the \gls{synch_multiplex} mechanism have both an and/or relationship, which along with guards, make the problem of checking for completion of the statement difficult.
+Consider the following @waituntil@.
+\begin{cfa}
+when( GA ) waituntil( A ) {}
+and when( GB ) waituntil( B ) {}
+or when( GC ) waituntil( C ) {}
+\end{cfa}
+When the @waituntil@ thread wakes up, the following predicate represents satisfaction:
+\begin{cfa}
+A && B || C || ! GA && B || ! GB && A || ! GA && ! GB && ! GC
+\end{cfa}
+which can be simplified to:
+\begin{cfa}
+( A || ! GA ) && ( B || ! GB ) || C || ! GA && ! GB && ! GC
+\end{cfa}
+Checking the complete predicate on each iteration of the pending @waituntil@ is expensive so \uC and \CFA both take steps to simplify checking for statement completion.
+
+In the \uC @_Select@ statement, this problem is solved by constructing a tree of the resources, where the internal nodes are operators and the leaves are booleans storing the state of each resource.
+A diagram of the tree for the complete predicate above is shown in Figure~\ref{f:uC_select_tree}, alongside the modification of the tree that occurs when @GA@ is @false@.
+Each internal node stores the statuses of the two subtrees beneath it.
+When resources become available, their corresponding leaf node status is modified, which percolates up the tree to update the state of the statement.
+Once the root of the tree has both subtrees marked as @true@ then the statement is complete.
+As an optimization, when the internal nodes are updated, the subtrees marked as @true@ are pruned and not examined again.
+To support statement guards in \uC, the tree is modified to remove an internal node if a guard is false to maintain the appropriate predicate representation.
+
+\begin{figure}
+\begin{center}
+\input{diagrams/uCpp_select_tree.tikz}
+\end{center}
+\caption{\uC \lstinline[language=uC++]{select} tree modification}
+\label{f:uC_select_tree}
+\end{figure}
+
+The \CFA @waituntil@ statement blocks a thread until a set of resources have become available that satisfy the complete predicate of a @waituntil@.
+The waiting condition of the @waituntil@ statement is implemented as the complete predicate over the resources, joined by the @waituntil@ operators, where a resource is @true@ if it is available, and @false@ otherwise.
+This complete predicate is used as the mechanism to check if a thread is done waiting on a @waituntil@.
+Leveraging the compiler, a predicate routine is generated per @waituntil@.
+This predicate routine accepts the statuses of the resources being waited on as arguments.
+A resource status is an integer that indicates whether a resource is either not available, available, or has already run its associated code block.
+The predicate routine returns @true@ when the @waituntil@ is done, \ie enough resources have run their associated code blocks to satisfy the @waituntil@'s predicate, and false otherwise.
+To support guards on the \CFA @waituntil@ statement, the status of a resource disabled by a guard is set to a boolean value that ensures that the predicate function behaves as if that resource is no longer part of the predicate.
+The generated code allows the predicate that is checked with each iteration to be simplified to not check guard values.
+For example, the following is generated for the complete predicate above:
+\begin{cfa}
+// statement completion predicate
+bool check_completion( select_node * nodes ) {
+	return nodes[0].status && nodes[1].status || nodes[2].status;
+}
+if ( GA || GB || GC ) {				$\C{// skip statement if all guards false}$
+	select_node nodes[3];
+	nodes[0].status = ! GA && GB;	$\C{// A's status}$
+	nodes[1].status = ! GB && GA;	$\C{// B's status}$
+	nodes[2].status = ! GC;			$\C{// C's status}$
+	// ... rest of waituntil codegen ...
+}
+\end{cfa}
+
+\uC's @_Select@, supports operators both inside and outside of the \lstinline[language=uC++]{_Select} clauses.
+In the following example, the code blocks run once their corresponding predicate inside the round braces is satisfied.
+\begin{lstlisting}[language=uC++,{moredelim=**[is][\color{red}]{@}{@}}]
+Future_ISM<int> A, B, C, D;
+_Select( @A || B && C@ ) { ... }
+and _Select( @D && E@ ) { ... }
+\end{lstlisting}
+This feature is more expressive than the @waituntil@ statement in \CFA, allowing the code block for @&&@ to only run after \emph{both} resources are available.
+
+In \CFA, since the @waituntil@ statement supports more resources than just futures, implementing operators inside clauses is avoided for a few reasons.
+As a motivating example, suppose \CFA supported operators inside clauses as in:
+\begin{cfa}
+owner_lock A, B, C, D;
+waituntil( A && B ) { ... }
+or waituntil( C && D ) { ... }
+\end{cfa}
+If the @waituntil@ acquires each lock as it becomes available, there is a possible deadlock since it is in a hold-and-wait situation.
+Other semantics are needed to ensure this operation is safe.
+One possibility is to use \CC's @scoped_lock@ approach described in Section~\ref{s:DeadlockAvoidance};
+however, that opens the potential for livelock.
+Another possibility is to use resource ordering similar to \CFA's @mutex@ statement, but that alone is insufficient, if the resource ordering is not used universally.
+One other way this could be implemented is to wait until all resources for a given clause are available before proceeding to acquire them, but this also quickly becomes a poor approach.
+This approach does not work due to \gls{toctou} issues, \ie it is impossible to ensure that the full set of resources are available without holding them all first.
+Operators inside clauses in \CFA could potentially be implemented with careful circumvention of the problems.
+Finally, the problem of operators inside clauses is also a difficult to handle when supporting channels.
+It would require some way to ensure channels used with internal operators are modified, if and only if, the corresponding code block is run.
+However, that is not feasible due to reasons described in the exclusive-or portion of Section~\ref{s:wu_chans}.
+Ultimately, this feature was not considered crucial after taking into account the complexity and runtime cost.
+
+\subsection{The full \lstinline{waituntil} picture}
+
+Given all the complex details handled by the @waituntil@, its full pseudocode is presented in Figure \ref{f:WU_Full_Impl}.
+Some things to note are as follows.
+The @finally@ blocks provide exception-safe \gls{raii} unregistering of nodes, and in particular, the @finally@ inside the innermost loop performs the immediate unregistering required for deadlock-freedom mentioned in Section~\ref{s:wu_locks}.
+The @when_conditions@ array is used to store the boolean result of evaluating each guard at the beginning of the @waituntil@, and it is used to conditionally omit operations on resources with @false@ guards.
+As discussed in Section~\ref{s:wu_chans}, this pseudocode includes conditional code-block execution based on the result of both @on_selected@ and @unregister_select@, which allows the channel implementation to ensure all available channel resources have their corresponding code block run.
+
+\begin{figure}
+\begin{cfa}
+bool when_conditions[N];
+for ( node in nodes )									$\C[3.75in]{// evaluate guards}$
+	if ( node has guard ) 
+		when_conditions[node] = node_guard;
+	else 
+		when_conditions[node] = true;
+
+if ( any when_conditions[node] are true ) {
+	select_nodes nodes[N];									$\C{// declare N select nodes}$
+	try {
+		// ... set statuses for nodes with when_conditions[node] == false ...
+
+		for ( node in nodes )								$\C{// register nodes}$
+			if ( when_conditions[node] )
+				register_select( resource, node );
+
+		while ( !check_completion( nodes ) ) {	$\C{// check predicate}$
+			// block
+			for ( resource in waituntil statement ) {	$\C{// run true code blocks}$
+				if ( check_completion( nodes ) ) break;
+				if ( resource is avail )
+					try {
+						if( on_selected( resource ) )	$\C{// conditionally run block}$
+							run code block
+					} finally							$\C{// for exception safety}$
+						unregister_select( resource, node ); $\C{// immediate unregister}$
+			}
+		}
+	} finally {											$\C{// for exception safety}$
+		for ( registered nodes in nodes )					$\C{// deregister nodes}$
+			if ( when_conditions[node] && unregister_select( resource, node )
+					&& on_selected( resource ) ) 
+				run code block							$\C{// run code block upon unregister}\CRT$
+	}
+}
+\end{cfa}
+\caption{Full \lstinline{waituntil} Pseudocode Implementation}
+\label{f:WU_Full_Impl}
+\end{figure}
+
+\section{\lstinline{waituntil} Performance}
+
+Similar facilities to @waituntil@ are discussed in Section~\ref{s:History} covering C, Ada, Rust, \CC, and OCaml.
+However, these facilities are either not meaningful or feasible to benchmark against.
+The UNIX @select@ and related utilities are not comparable since they are system calls that go into the kernel and operate on file descriptors, whereas the @waituntil@ exists solely in user space.
+Ada's \lstinline[language=Ada]{select} and \uC's \lstinline[language=uC++]{_Accept} only operate on method calls, which is done in \CFA via the @waitfor@ statement.
+Rust and \CC only offer a busy-wait approach, which is not comparable to a blocking approach.
+OCaml's @select@ waits on channels that are not comparable with \CFA and Go channels, so OCaml @select@ is not benchmarked against Go's @select@ and \CFA's @waituntil@.
+
+The two \gls{synch_multiplex} utilities that are in the realm of comparability with the \CFA @waituntil@ statement are the Go \lstinline[language=Go]{select} statement and the \uC \lstinline[language=uC++]{_Select} statement.
+As such, two microbenchmarks are presented, one for Go and one for \uC to contrast this feature.
+Given the differences in features, polymorphism, and expressibility between @waituntil@ and \lstinline[language=Go]{select}, and \uC \lstinline[language=uC++]{_Select}, the aim of the microbenchmarking in this chapter is to show that these implementations lie in the same realm of performance, not to pick a winner.
+
+\subsection{Channel Benchmark}
+
+The channel multiplexing benchmarks compare \CFA's @waituntil@ and Go's \lstinline[language=Go]{select}, where the resource being waited on is a set of channels.
+Although Unix's select, poll and epoll multiplex over file descriptors, which are effectively channels, these utilities are not included in the benchmarks.
+Reading or writing to a file descriptor requires a system call which is much more expensive than operating on a user-space channel.
+As such, it is infeasible to compare Unix's select, poll and epoll with \CFA's @waituntil@ and Go's \lstinline[language=Go]{select}.
+The basic structure of the benchmark has the number of cores split evenly between producer and consumer threads, \ie, with 8 cores there are 4 producer and 4 consumer threads.
+The number of resource clauses $C$ is also varied across 2, 4, and 8 clauses, where each clause has a different channel that it waits on.
+Each producer and consumer repeatedly waits to either produce or consume from one of the $C$ clauses and respective channels.
+For example, in \CFA syntax, the work loop in the consumer main with $C = 4$ clauses is:
+\begin{cfa}
+for ()
+	waituntil( val << chans[0] ); or waituntil( val << chans[1] ); 
+	or waituntil( val << chans[2] ); or waituntil( val << chans[3] );
+\end{cfa}
+A successful consumption is counted as a channel operation, and the throughput of these operations is measured over 10 seconds.
+The first benchmark measures throughput of the producers and consumer synchronously waiting on the channels and the second has the threads asynchronously wait on the channels using the Go @default@ and \CFA @else@ clause.
+The results are shown in Figures~\ref{f:select_contend_bench} and~\ref{f:select_spin_bench} respectively.
+
+\begin{figure}
+	\centering
+	\captionsetup[subfloat]{labelfont=footnotesize,textfont=footnotesize}
+	\subfloat[AMD]{
+		\resizebox{0.5\textwidth}{!}{\input{figures/nasus_Contend_2.pgf}}
+	}
+	\subfloat[Intel]{
+		\resizebox{0.5\textwidth}{!}{\input{figures/pyke_Contend_2.pgf}}
+	}
+	\bigskip
+
+	\subfloat[AMD]{
+		\resizebox{0.5\textwidth}{!}{\input{figures/nasus_Contend_4.pgf}}
+	}
+	\subfloat[Intel]{
+		\resizebox{0.5\textwidth}{!}{\input{figures/pyke_Contend_4.pgf}}
+	}
+	\bigskip
+
+	\subfloat[AMD]{
+		\resizebox{0.5\textwidth}{!}{\input{figures/nasus_Contend_8.pgf}}
+	}
+	\subfloat[Intel]{
+		\resizebox{0.5\textwidth}{!}{\input{figures/pyke_Contend_8.pgf}}
+	}
+	\caption{The channel synchronous multiplexing benchmark comparing Go select and \CFA \lstinline{waituntil} statement throughput (higher is better).}
+	\label{f:select_contend_bench}
+\end{figure}
+
+\begin{figure}
+	\centering
+	\captionsetup[subfloat]{labelfont=footnotesize,textfont=footnotesize}
+	\subfloat[AMD]{
+		\resizebox{0.5\textwidth}{!}{\input{figures/nasus_Spin_2.pgf}}
+	}
+	\subfloat[Intel]{
+		\resizebox{0.5\textwidth}{!}{\input{figures/pyke_Spin_2.pgf}}
+	}
+	\bigskip
+
+	\subfloat[AMD]{
+		\resizebox{0.5\textwidth}{!}{\input{figures/nasus_Spin_4.pgf}}
+	}
+	\subfloat[Intel]{
+		\resizebox{0.5\textwidth}{!}{\input{figures/pyke_Spin_4.pgf}}
+	}
+	\bigskip
+
+	\subfloat[AMD]{
+		\resizebox{0.5\textwidth}{!}{\input{figures/nasus_Spin_8.pgf}}
+	}
+	\subfloat[Intel]{
+		\resizebox{0.5\textwidth}{!}{\input{figures/pyke_Spin_8.pgf}}
+	}
+	\caption{The asynchronous multiplexing channel benchmark comparing Go select and \CFA \lstinline{waituntil} statement throughput (higher is better).}
+	\label{f:select_spin_bench}
+\end{figure}
+
+Both Figures~\ref{f:select_contend_bench} and~\ref{f:select_spin_bench} have similar results when comparing \lstinline[language=Go]{select} and @waituntil@.
+In the AMD benchmarks (left column), the performance is very similar as the number of cores scale.
+The AMD machine has a high-caching contention cost because of its \emph{chiplet} L3 cache (\ie many L3 caches servicing a small number of cores), which creates a bottleneck on the channel locks and dominates the shape of the performance curve for both \CFA and Go.
+Hence, it is difficult to detect differences in the \gls{synch_multiplex}, except at low cores, where Go has significantly better performance, due to an optimization in its scheduler.
+Go heavily optimizes thread handoffs on the local run-queue, which can result in very good performance for low numbers of threads parking/unparking each other~\cite{go:sched}.
+In the Intel benchmarks (right column), \CFA performs better than Go as the number of cores scales past 2/4 and as the number of clauses increase.
+This difference is due to Go's acquiring all channel locks when registering and unregistering channels on a \lstinline[language=Go]{select}.
+Go then is holding a lock for every channel, resulting in worse performance as the number of channels increase.
+In \CFA, since races are consolidated without holding all locks, it scales much better both with cores and clauses since more work can occur in parallel.
+This scalability difference is more significant on the Intel machine than the AMD machine since the Intel has lower cache-contention costs.
+
+The Go approach of holding all internal channel-locks in the \lstinline[language=Go]{select} has additional drawbacks.
+There are pathological cases where Go's throughput has significant jitter.
+Consider a producer and consumer thread, @P1@ and @C1@, selecting from both channels @A@ and @B@.
+\begin{cquote}
+\begin{tabular}{@{}ll@{}}
+@P1@ & @C1@ \\
+\begin{cfa}
+waituntil( A << i ); or waituntil( B << i );
+\end{cfa}
+&
+\begin{cfa}
+waituntil( val << A ); or waituntil( val << B );
+\end{cfa}
+\end{tabular}
+\end{cquote}
+Additionally, there is another producer and consumer thread, @P2@ and @C2@, operating solely on @B@.
+\begin{cquote}
+\begin{tabular}{@{}ll@{}}
+@P2@ & @C2@ \\
+\begin{cfa}
+B << val;
+\end{cfa}
+&
+\begin{cfa}
+val << B;
+\end{cfa}
+\end{tabular}
+\end{cquote}
+In Go, this setup results in significantly worse performance since @P2@ and @C2@ cannot operate in parallel with @P1@ and @C1@ due to all locks being acquired.
+Interestingly, this case may not be as pathological as it seems.
+If the set of channels belonging to a \lstinline[language=Go]{select} have channels that overlap with the set of another \lstinline[language=Go]{select}, these statements lose the ability to operate in parallel.
+The implementation in \CFA only holds a single lock at a time, resulting in better locking granularity, and hence, more parallelism.
+Comparison of this pathological case is shown in Table~\ref{t:pathGo}.
+The AMD results highlight the worst-case scenario for Go since contention is more costly on this machine than the Intel machine.
+
+\begin{table}[t]
+\centering
+\setlength{\extrarowheight}{2pt}
+\setlength{\tabcolsep}{5pt}
+
+\caption{Throughput (channel operations per second) of \CFA and Go in a pathological case for contention in Go's select implementation}
+\label{t:pathGo}
+\begin{tabular}{r|r|r}
+			& \multicolumn{1}{c|}{\CFA} & \multicolumn{1}{c}{Go} \\
+	\hline
+	AMD		& \input{data/nasus_Order} \\
+	\hline
+	Intel	& \input{data/pyke_Order}
+\end{tabular}
+\end{table}
+
+Another difference between Go and \CFA is the order of clause selection when multiple clauses are available.
+Go \emph{randomly} selects a clause~\cite{go:select}, but \CFA chooses in the order clauses are listed.
+This \CFA design decision allows users to set implicit priorities, which can result in more predictable behaviour and even better performance.
+In the previous example, threads @P1@ and @C1@ prioritize channel @A@ in the @waituntil@, which can reduce contention for threads @P2@ and @C2@ accessing channel @B@.
+If \CFA did not have priorities, the performance difference in Table~\ref{t:pathGo} would be significant less due to extra contention on channel @B@.
+
+\subsection{Future Benchmark}
+
+The future benchmark compares \CFA's @waituntil@ with \uC's \lstinline[language=uC++]{_Select}, with both utilities waiting on futures.
+While both statements have very similar semantics, supporting @and@ and @or@ operators, \lstinline[language=uC++]{_Select} can only wait on futures, whereas the @waituntil@ is polymorphic. 
+As such, the underlying implementation of the operators differs between @waituntil@ and \lstinline[language=uC++]{_Select}.
+The @waituntil@ statement checks for statement completion using a predicate function, whereas the \lstinline[language=uC++]{_Select} statement maintains a tree that represents the state of the internal predicate.
+
+This benchmark aims to indirectly measure the impact of various predicates on the performance of the @waituntil@ and \lstinline[language=uC++]{_Select} statements.
+The benchmark is indirect since the performance of futures in \CFA and \uC differ by a significant margin.
+The experiment has a server, which cycles fulfilling three futures, @A@, @B@, and @C@, and a client, which waits for these futures to be fulfilled using four different kinds of predicates given in \CFA:
+\begin{cquote}
+\begin{tabular}{@{}l|l@{}}
+OR & AND \\
+\hline
+\begin{cfa}
+waituntil( A ) { get( A ); }
+or waituntil( B ) { get( B ); }
+or waituntil( C ) { get( C ); }
+\end{cfa}
+&
+\begin{cfa}
+waituntil( A ) { get( A ); }
+and waituntil( B ) { get( B ); }
+and waituntil( C ) { get( C ); }
+\end{cfa}
+\\
+\multicolumn{2}{@{}c@{}}{} \\
+AND-OR & OR-AND \\
+\hline
+\begin{cfa}
+waituntil( A ) { get( A ); }
+and waituntil( B ) { get( B ); }
+or waituntil( C ) { get( C ); }
+\end{cfa}
+&
+\begin{cfa}
+@(@ waituntil( A ) { get( A ); }
+or waituntil( B ) { get( B ); } @)@
+and waituntil( C ) { get( C ); }
+\end{cfa}
+\end{tabular}
+\end{cquote}
+The server and client use a low cost synchronize after each fulfillment, so the server does not race ahead of the client.
+
+Results of this benchmark are shown in Figure~\ref{f:futurePerf}.
+Each pair of bars is marked with the predicate name for that experiment and the value at the top of each bar is the standard deviation.
+In detail, \uC results are lower in all cases due to the performance difference between futures and the more complex \gls{synch_multiplex} implementation.
+However, the bars for both systems have similar height patterns across the experiments.
+The @OR@ column for \CFA is more performant than the other \CFA predicates, due to the special-casing of @waituntil@ statements with only @or@ operators.
+For both \uC and \CFA, the @AND@ experiment is the least performant, which is expected since all three futures need to be fulfilled for each statement completion.
+Interestingly, \CFA has lower variation across predicates on the AMD (excluding the special OR case), whereas \uC has lower variation on the Intel.
+Given the differences in semantics and implementation between \uC and \CFA, this test only illustrates the overall costs among the different kinds of predicates.
+
+\begin{figure}
+	\centering
+	\subfloat[AMD Future Synchronization Benchmark]{
+		\resizebox{0.5\textwidth}{!}{\input{figures/nasus_Future.pgf}}
+		\label{f:futureAMD}
+	}
+	\subfloat[Intel Future Synchronization Benchmark]{
+		\resizebox{0.5\textwidth}{!}{\input{figures/pyke_Future.pgf}}
+		\label{f:futureIntel}
+	}
+	\caption{\CFA \lstinline{waituntil} and \uC \lstinline{_Select} statement throughput synchronizing on a set of futures with varying wait predicates (higher is better).}
+	\label{f:futurePerf}
+\end{figure}
