1 | % ====================================================================== |
---|
2 | % ====================================================================== |
---|
3 | \chapter{Waituntil}\label{s:waituntil} |
---|
4 | % ====================================================================== |
---|
5 | % ====================================================================== |
---|
6 | |
---|
7 | Consider the following motivating problem. |
---|
8 | There are $N$ stalls (resources) in a bathroom and there are $M$ people (threads) using the bathroom. |
---|
9 | Each stall has its own lock since only one person may occupy a stall at a time. |
---|
10 | Humans solve this problem in the following way. |
---|
11 | They check if all of the stalls are occupied. |
---|
12 | If not, they enter and claim an available stall. |
---|
13 | If they are all occupied, people queue and watch the stalls until one is free, and then enter and lock the stall. |
---|
This solution is easy to implement on a computer if all threads wait on all stalls and agree to queue.
---|
15 | |
---|
16 | Now the problem is extended. |
---|
17 | Some stalls are wheelchair accessible and some stalls have gender identification. |
---|
18 | Each person (thread) may be limited to only one kind of stall or may choose among different kinds of stalls that match their criteria. |
---|
19 | Immediately, the problem becomes more difficult. |
---|
20 | A single queue no longer solves the problem. |
---|
21 | What happens when there is a stall available that the person at the front of the queue cannot choose? |
---|
The na\"ive solution has each thread spin indefinitely, continually checking every matching kind of stall until a suitable one is free.
This approach is insufficient because it wastes cycles and is unfair to waiting threads, as a thread can acquire the first matching stall without regard to the waiting time of other threads.
---|
24 | Waiting for the first appropriate stall (resource) that becomes available without spinning is an example of \gls{synch_multiplex}: the ability to wait synchronously for one or more resources based on some selection criteria. |
---|
25 | |
---|
26 | \section{History of Synchronous Multiplexing}\label{s:History} |
---|
27 | |
---|
28 | There is a history of tools that provide \gls{synch_multiplex}. |
---|
29 | Some well known \gls{synch_multiplex} tools include Unix system utilities: @select@~\cite{linux:select}, @poll@~\cite{linux:poll}, and @epoll@~\cite{linux:epoll}, and the @select@ statement provided by Go~\cite{go:selectref}, Ada~\cite[\S~9.7]{Ada16}, and \uC~\cite[\S~3.3.1]{uC++}. |
---|
30 | The concept and theory surrounding \gls{synch_multiplex} was introduced by Hoare in his 1985 book, Communicating Sequential Processes (CSP)~\cite{Hoare85}, |
---|
31 | \begin{quote} |
---|
32 | A communication is an event that is described by a pair $c.v$ where $c$ is the name of the channel on which the communication takes place and $v$ is the value of the message which passes.~\cite[p.~113]{Hoare85} |
---|
33 | \end{quote} |
---|
34 | The ideas in CSP were implemented by Roscoe and Hoare in the language Occam~\cite{Roscoe88}. |
---|
35 | |
---|
36 | Both CSP and Occam include the ability to wait for a \Newterm{choice} among receiver channels and \Newterm{guards} to toggle which receives are valid. |
---|
37 | For example, |
---|
38 | \begin{cfa}[mathescape] |
---|
39 | (@G1@(x) $\rightarrow$ P @|@ @G2@(y) $\rightarrow$ Q ) |
---|
40 | \end{cfa} |
---|
waits for either channel @x@ or @y@ to have a value, where each channel is only considered if its guard, @G1@ or @G2@ respectively, is true;
if only one guard is true, only that channel can receive, and if both guards are false, no receive occurs.
---|
43 | % extended CSP with a \gls{synch_multiplex} construct @ALT@, which waits for one resource to be available and then executes a corresponding block of code. |
---|
44 | In detail, waiting for one resource out of a set of resources can be thought of as a logical exclusive-or over the set of resources. |
---|
45 | Guards are a conditional operator similar to an @if@, except they apply to the resource being waited on. |
---|
46 | If a guard is false, then the resource it guards is not in the set of resources being waited on. |
---|
47 | If all guards are false, the ALT, Occam's \gls{synch_multiplex} statement, does nothing and the thread continues. |
---|
Guards can be simulated using @if@ statements as shown in~\cite[rule~2.4, p.~183]{Roscoe88}
---|
49 | \begin{lstlisting}[basicstyle=\rm,mathescape] |
---|
50 | ALT( $b$ & $g$ $P$, $G$ ) = IF ( $b$ ALT($\,g$ $P$, $G$ ), $\neg\,$b ALT( $G$ ) ) (boolean guard elim). |
---|
51 | \end{lstlisting} |
---|
but the simulation requires $2^N-1$ @if@ statements, where $N$ is the number of guards.
---|
53 | The exponential blowup comes from applying rule 2.4 repeatedly, since it works on one guard at a time. |
---|
Figure~\ref{f:wu_if} shows a \CFA example of applying rule 2.4 to three guards.
---|
55 | Also, notice the additional code duplication for statements @S1@, @S2@, and @S3@. |
---|
56 | |
---|
57 | \begin{figure} |
---|
58 | \centering |
---|
59 | \begin{lrbox}{\myboxA} |
---|
60 | \begin{cfa} |
---|
61 | when( G1 ) |
---|
62 | waituntil( R1 ) S1 |
---|
63 | or when( G2 ) |
---|
64 | waituntil( R2 ) S2 |
---|
65 | or when( G3 ) |
---|
66 | waituntil( R3 ) S3 |
---|
67 | |
---|
68 | |
---|
69 | |
---|
70 | |
---|
71 | |
---|
72 | |
---|
73 | |
---|
74 | \end{cfa} |
---|
75 | \end{lrbox} |
---|
76 | |
---|
77 | \begin{lrbox}{\myboxB} |
---|
78 | \begin{cfa} |
---|
79 | if ( G1 ) |
---|
80 | if ( G2 ) |
---|
81 | if ( G3 ) waituntil( R1 ) S1 or waituntil( R2 ) S2 or waituntil( R3 ) S3 |
---|
82 | else waituntil( R1 ) S1 or waituntil( R2 ) S2 |
---|
83 | else |
---|
84 | if ( G3 ) waituntil( R1 ) S1 or waituntil( R3 ) S3 |
---|
85 | else waituntil( R1 ) S1 |
---|
86 | else |
---|
87 | if ( G2 ) |
---|
88 | if ( G3 ) waituntil( R2 ) S2 or waituntil( R3 ) S3 |
---|
89 | else waituntil( R2 ) S2 |
---|
90 | else |
---|
91 | if ( G3 ) waituntil( R3 ) S3 |
---|
92 | \end{cfa} |
---|
93 | \end{lrbox} |
---|
94 | |
---|
95 | \subfloat[Guards]{\label{l:guards}\usebox\myboxA} |
---|
96 | \hspace*{5pt} |
---|
97 | \vrule |
---|
98 | \hspace*{5pt} |
---|
99 | \subfloat[Simulated Guards]{\label{l:simulated_guards}\usebox\myboxB} |
---|
100 | \caption{\CFA guard simulated with \lstinline{if} statement.} |
---|
101 | \label{f:wu_if} |
---|
102 | \end{figure} |
---|
103 | |
---|
104 | When discussing \gls{synch_multiplex} implementations, the resource being multiplexed is important. |
---|
105 | While CSP waits on channels, the earliest known implementation of synch\-ronous multiplexing is Unix's @select@~\cite{linux:select}, multiplexing over file descriptors, which conceptually differ from channels in name only. |
---|
106 | The @select@ system-call is passed three sets of file descriptors (read, write, exceptional) to wait on and an optional timeout. |
---|
107 | @select@ blocks until either some subset of file descriptors are available or the timeout expires. |
---|
108 | All file descriptors that are ready are returned by modifying the argument sets to only contain the ready descriptors. |
---|
109 | |
---|
This early implementation differs from the theory presented in CSP: when a call to @select@ returns, it may provide more than one ready file descriptor.
As such, @select@ has logical-or multiplexing semantics, whereas the theory describes exclusive-or semantics.
---|
112 | It is possible to achieve exclusive-or semantics with @select@ by arbitrarily operating on only one of the returned descriptors. |
---|
113 | @select@ passes the interest set of file descriptors between application and kernel in the form of a worst-case sized bit-mask, where the worst-case is the largest numbered file descriptor. |
---|
@poll@ reduces the size of the interest sets by changing from a bit mask to a linked data structure, independent of the file-descriptor values.
---|
115 | @epoll@ further reduces the data passed per call by keeping the interest set in the kernel, rather than supplying it on every call. |
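As a brief illustration, the following standard C sketch (also accepted by \CFA) waits up to one second for either of two already-open descriptors, @fd1@ and @fd2@, to become readable; handling only the first ready descriptor recovers the exclusive-or semantics discussed above.
\begin{cfa}
#include <sys/select.h>
// fd1 and fd2 are assumed to be open file descriptors
fd_set rfds;
FD_ZERO( &rfds );								// clear the read interest set
FD_SET( fd1, &rfds );  FD_SET( fd2, &rfds );	// add both descriptors
struct timeval tv = { .tv_sec = 1, .tv_usec = 0 };	// one-second timeout
int nfds = ( fd1 > fd2 ? fd1 : fd2 ) + 1;		// worst-case sized bit-mask
if ( select( nfds, &rfds, NULL, NULL, &tv ) > 0 ) {	// logical-or: several descriptors may be ready
	if ( FD_ISSET( fd1, &rfds ) ) { /* read fd1 */ }
	else if ( FD_ISSET( fd2, &rfds ) ) { /* read fd2 */ }	// handle only one ready descriptor
}
\end{cfa}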
---|
116 | |
---|
These early \gls{synch_multiplex} tools interact directly with the operating system and are often used to communicate among processes.
---|
118 | Later, \gls{synch_multiplex} started to appear in applications, via programming languages, to support fast multiplexed concurrent communication among threads. |
---|
119 | An early example of \gls{synch_multiplex} is the @select@ statement in Ada~\cite[\S~9.7]{Ichbiah79}. |
---|
This @select@ allows a task object, which has its own thread, to multiplex over a subset of asynchronous calls to its methods.
---|
121 | The Ada @select@ has the same exclusive-or semantics and guards as Occam ALT; |
---|
122 | however, it multiplexes over methods rather than channels. |
---|
123 | |
---|
124 | \begin{figure} |
---|
125 | \begin{lstlisting}[language=ada,literate=,{moredelim={**[is][\color{red}]{@}{@}}}] |
---|
126 | task type buffer is -- thread type |
---|
127 | ... -- buffer declarations |
---|
128 | count : integer := 0; |
---|
129 | begin -- thread starts here |
---|
130 | loop |
---|
131 | select |
---|
132 | @when count < Size@ => -- guard |
---|
133 | @accept insert( elem : in ElemType )@ do -- method |
---|
134 | ... -- add to buffer |
---|
135 | count := count + 1; |
---|
136 | end; |
---|
137 | -- executed if this accept called |
---|
138 | or |
---|
139 | @when count > 0@ => -- guard |
---|
140 | @accept remove( elem : out ElemType )@ do -- method |
---|
141 | ... --remove and return from buffer via parameter |
---|
142 | count := count - 1; |
---|
143 | end; |
---|
144 | -- executed if this accept called |
---|
145 | or @delay 10.0@; -- unblock after 10 seconds without call |
---|
146 | or @else@ -- do not block, cannot appear with delay |
---|
147 | end select; |
---|
148 | end loop; |
---|
149 | end buffer; |
---|
buf : buffer; -- create task object and start thread in task body
---|
151 | \end{lstlisting} |
---|
152 | \caption{Ada Bounded Buffer} |
---|
153 | \label{f:BB_Ada} |
---|
154 | \end{figure} |
---|
155 | |
---|
156 | Figure~\ref{f:BB_Ada} shows the outline of a bounded buffer implemented with an Ada task. |
---|
157 | Note that a task method is associated with the \lstinline[language=ada]{accept} clause of the \lstinline[language=ada]{select} statement, rather than being a separate routine. |
---|
158 | The thread executing the loop in the task body blocks at the \lstinline[language=ada]{select} until a call occurs to @insert@ or @remove@. |
---|
159 | Then the appropriate \lstinline[language=ada]{accept} method is run with the called arguments. |
---|
160 | Hence, the \lstinline[language=ada]{select} statement provides rendezvous points for threads, rather than providing channels with message passing. |
---|
161 | The \lstinline[language=ada]{select} statement also provides a timeout and @else@ (nonblocking), which changes synchronous multiplexing to asynchronous. |
---|
162 | Now the thread polls rather than blocks. |
---|
163 | |
---|
164 | Another example of programming-language \gls{synch_multiplex} is Go using a @select@ statement with channels~\cite{go:selectref}. |
---|
165 | Figure~\ref{l:BB_Go} shows the outline of a bounded buffer administrator implemented with a Go routine. |
---|
166 | Here two channels are used for inserting and removing by client producers and consumers, respectively. |
---|
167 | (The @done@ channel is used to synchronize with the program main.) |
---|
Go's @select@ has the same exclusive-or semantics as the Occam ALT primitive and, like ALT and Ada, associates a code block with each clause.
---|
169 | However, unlike Ada and ALT, Go does not provide guards for the \lstinline[language=go]{case} clauses of the \lstinline[language=go]{select}. |
---|
As such, the exponential blowup can be seen by comparing Go and \uC in Figure~\ref{f:AdaMultiplexing}.
---|
171 | Go also provides a timeout via a channel and a @default@ clause like Ada @else@ for asynchronous multiplexing. |
---|
172 | |
---|
173 | \begin{figure} |
---|
174 | \centering |
---|
175 | |
---|
176 | \begin{lrbox}{\myboxA} |
---|
177 | \begin{lstlisting}[language=go,literate=,{moredelim={**[is][\color{red}]{@}{@}}}] |
---|
178 | insert := make( chan int ) |
---|
179 | remove := make( chan * int ) |
---|
buffer := make( chan int, Size )
---|
181 | done := make( chan int ) |
---|
182 | count := 0 |
---|
func in_buf( val int ) {
---|
184 | buffer <- val |
---|
185 | count++ |
---|
186 | } |
---|
func out_buf( ptr *int ) {
*ptr = <-buffer
---|
189 | count-- |
---|
190 | } |
---|
func BoundedBuffer() {
---|
192 | L: for { |
---|
193 | if ( count < Size && count > 0 ) { |
---|
194 | select { // wait for message |
---|
195 | @case i := <- insert: in_buf( i )@ |
---|
196 | @case p := <- remove: out_buf( p )@ |
---|
197 | case <- done: break L |
---|
198 | } |
---|
199 | } else if ( count < Size ) { |
---|
200 | select { // wait for message |
---|
201 | @case i := <- insert: in_buf( i )@ |
---|
202 | case <- done: break L |
---|
203 | } |
---|
} else if ( count > 0 ) {
---|
205 | select { // wait for message |
---|
206 | @case p := <- remove: out_buf( p )@ |
---|
207 | case <- done: break L; |
---|
208 | } |
---|
209 | } |
---|
210 | } |
---|
211 | done <- 0 |
---|
212 | } |
---|
213 | func main() { |
---|
214 | go BoundedBuffer() // start administrator |
---|
215 | } |
---|
216 | \end{lstlisting} |
---|
217 | \end{lrbox} |
---|
218 | |
---|
219 | \begin{lrbox}{\myboxB} |
---|
220 | \begin{lstlisting}[language=uC++,{moredelim={**[is][\color{red}]{@}{@}}}] |
---|
221 | |
---|
222 | |
---|
223 | |
---|
224 | |
---|
225 | |
---|
226 | |
---|
227 | |
---|
228 | |
---|
229 | |
---|
230 | |
---|
231 | |
---|
232 | |
---|
233 | _Task BoundedBuffer { |
---|
234 | int * buffer; |
---|
int front = 0, back = 0, count = 0;
---|
236 | public: |
---|
237 | // ... constructor implementation |
---|
238 | void insert( int elem ) { |
---|
239 | buffer[front] = elem; |
---|
240 | front = ( front + 1 ) % Size; |
---|
241 | count += 1; |
---|
242 | } |
---|
243 | int remove() { |
---|
244 | int ret = buffer[back]; |
---|
245 | back = ( back + 1 ) % Size; |
---|
246 | count -= 1; |
---|
247 | return ret; |
---|
248 | } |
---|
249 | private: |
---|
250 | void main() { |
---|
251 | for ( ;; ) { |
---|
_Accept( ~BoundedBuffer ) break;
---|
253 | @or _When ( count < Size ) _Accept( insert )@; |
---|
254 | @or _When ( count > 0 ) _Accept( remove )@; |
---|
255 | } |
---|
256 | } |
---|
257 | }; |
---|
BoundedBuffer buf; // start thread in main method
---|
259 | \end{lstlisting} |
---|
260 | \end{lrbox} |
---|
261 | |
---|
262 | \subfloat[Go]{\label{l:BB_Go}\usebox\myboxA} |
---|
263 | \hspace*{5pt} |
---|
264 | \vrule |
---|
265 | \hspace*{5pt} |
---|
266 | \subfloat[\uC]{\label{l:BB_uC++}\usebox\myboxB} |
---|
267 | |
---|
268 | \caption{Bounded Buffer} |
---|
269 | \label{f:AdaMultiplexing} |
---|
270 | \end{figure} |
---|
271 | |
---|
272 | Finally, \uC provides \gls{synch_multiplex} with Ada-style @select@ over monitor and task methods with the @_Accept@ statement~\cite[\S~2.9.2.1]{uC++}, and over futures with the @_Select@ statement~\cite[\S~3.3.1]{uC++}. |
---|
273 | The @_Select@ statement extends the ALT/Go @select@ by offering both @and@ and @or@ semantics, which can be used together in the same statement. |
---|
Both the @_Accept@ and @_Select@ statements provide guards for multiplexing clauses, as well as timeout and @else@ clauses.
---|
275 | Figure~\ref{l:BB_uC++} shows the outline of a bounded buffer administrator implemented with \uC @_Accept@ statements. |
---|
276 | |
---|
There are other languages that provide \gls{synch_multiplex}, including Rust's @select!@ over futures~\cite{rust:select}, Java's @allOf@/@anyOf@ over futures~\cite{java:allof:anyof}, OCaml's @select@ over channels~\cite{ocaml:channel}, and C++14's @when_any@ over futures~\cite{cpp:whenany}.
---|
278 | Note that while C++14 and Rust provide \gls{synch_multiplex}, the implementations leave much to be desired as both rely on polling to wait on multiple resources. |
---|
279 | |
---|
280 | \section{Other Approaches to Synchronous Multiplexing} |
---|
281 | |
---|
282 | To avoid the need for \gls{synch_multiplex}, all communication among threads/processes must come from a single source. |
---|
283 | For example, in Erlang each process has a single heterogeneous mailbox that is the sole source of concurrent communication, removing the need for \gls{synch_multiplex} as there is only one place to wait on resources. |
---|
Similarly, actor systems circumvent the \gls{synch_multiplex} problem as actors only block when waiting for the next message, never within a behaviour.
---|
285 | While these approaches solve the \gls{synch_multiplex} problem, they introduce other issues. |
---|
286 | Consider the case where a thread has a single source of communication and it wants a set of $N$ resources. |
---|
287 | It must sequentially request the $N$ resources and wait for each response. |
---|
While receiving the $N$ resources, it can receive other communication, which it must either save and postpone, or discard.
---|
289 | % If the requests for the other resources need to be retracted, the burden falls on the programmer to determine how to synchronize appropriately to ensure that only one resource is delivered. |
---|
290 | |
---|
291 | \section{\CFA's Waituntil Statement} |
---|
292 | |
---|
293 | The new \CFA \gls{synch_multiplex} utility introduced in this work is the @waituntil@ statement. |
---|
294 | There already exists a @waitfor@ statement in \CFA that supports Ada-style \gls{synch_multiplex} over monitor methods~\cite{Delisle21}, so this @waituntil@ focuses on synchronizing over other resources. |
---|
295 | All of the \gls{synch_multiplex} features mentioned so far are monomorphic, only waiting on one kind of resource: Unix @select@ supports file descriptors, Go's @select@ supports channel operations, \uC's @select@ supports futures, and Ada's @select@ supports monitor method calls. |
---|
296 | The \CFA @waituntil@ is polymorphic and provides \gls{synch_multiplex} over any objects that satisfy the trait in Figure~\ref{f:wu_trait}. |
---|
297 | No other language provides a synchronous multiplexing tool polymorphic over resources like \CFA's @waituntil@. |
---|
298 | |
---|
299 | \begin{figure} |
---|
300 | \begin{cfa} |
---|
301 | forall(T & | sized(T)) |
---|
302 | trait is_selectable { |
---|
303 | // For registering a waituntil stmt on a selectable type |
---|
304 | bool register_select( T &, select_node & ); |
---|
305 | |
---|
306 | // For unregistering a waituntil stmt from a selectable type |
---|
307 | bool unregister_select( T &, select_node & ); |
---|
308 | |
---|
309 | // on_selected is run on the selecting thread prior to executing |
---|
310 | // the statement associated with the select_node |
---|
311 | bool on_selected( T &, select_node & ); |
---|
312 | }; |
---|
313 | \end{cfa} |
---|
314 | \caption{Trait for types that can be passed into \CFA's \lstinline{waituntil} statement.} |
---|
315 | \label{f:wu_trait} |
---|
316 | \end{figure} |
---|
317 | |
---|
318 | Currently locks, channels, futures and timeouts are supported by the @waituntil@ statement, and this set can be expanded through the @is_selectable@ trait as other use-cases arise. |
---|
319 | The @waituntil@ statement supports guard clauses, both @or@ and @and@ semantics, and timeout and @else@ for asynchronous multiplexing. |
---|
320 | Figure~\ref{f:wu_example} shows a \CFA @waituntil@ usage, which is waiting for either @Lock@ to be available \emph{or} for a value to be read from @Channel@ into @i@ \emph{and} for @Future@ to be fulfilled \emph{or} a timeout of one second. |
---|
321 | Note that the expression inside a @waituntil@ clause is evaluated once at the start of the @waituntil@ algorithm. |
---|
322 | |
---|
323 | \section{Waituntil Semantics} |
---|
324 | |
---|
325 | The @waituntil@ semantics has two parts: the semantics of the statement itself, \ie @and@, @or@, @when@ guards, and @else@ semantics, and the semantics of how the @waituntil@ interacts with types like locks, channels, and futures. |
---|
326 | |
---|
327 | \subsection{Statement Semantics} |
---|
328 | |
---|
329 | The @or@ semantics are the most straightforward and nearly match those laid out in the ALT statement from Occam. |
---|
330 | The clauses have an exclusive-or relationship where the first available one is run and only one clause is run. |
---|
331 | \CFA's @or@ semantics differ from ALT semantics: instead of randomly picking a clause when multiple are available, the first clause in the @waituntil@ that is available is executed. |
---|
For example, in the following, if @foo@ and @bar@ are both available, @foo@ is always selected because it comes first in the order of @waituntil@ clauses.
---|
333 | \begin{cfa} |
---|
334 | future(int) bar, foo; |
---|
335 | waituntil( foo ) { ... } or waituntil( bar ) { ... } // prioritize foo |
---|
336 | \end{cfa} |
---|
337 | The reason for these semantics is that prioritizing resources can be useful in certain problems, such as shutdown. |
---|
In the rare case where there is a starvation problem with the ordering, it is possible to follow a @waituntil@ with its reverse form, alternating which resource has the highest priority:
---|
339 | \begin{cfa} |
---|
340 | waituntil( foo ) { ... } or waituntil( bar ) { ... } // prioritize foo |
---|
341 | waituntil( bar ) { ... } or waituntil( foo ) { ... } // prioritize bar |
---|
342 | \end{cfa} |
---|
343 | While this approach is not general for many resources, it handles many basic cases. |
---|
344 | |
---|
345 | \begin{figure} |
---|
346 | \begin{cfa} |
---|
347 | future(int) Future; |
---|
348 | channel(int) Channel; |
---|
349 | owner_lock Lock; |
---|
350 | int i = 0; |
---|
351 | |
---|
352 | waituntil( @Lock@ ) { ... } |
---|
353 | or when( i == 0 ) waituntil( i << @Channel@ ) { ... } |
---|
354 | and waituntil( @Future@ ) { ... } |
---|
355 | or waituntil( @timeout( 1`s )@ ) { ... } |
---|
356 | // else { ... } |
---|
357 | \end{cfa} |
---|
358 | \caption{Example of \CFA's waituntil statement} |
---|
359 | \label{f:wu_example} |
---|
360 | \end{figure} |
---|
361 | |
---|
362 | The \CFA @and@ semantics match the @and@ semantics of \uC \lstinline[language=uC++]{_Select}. |
---|
363 | When multiple clauses are joined by @and@, the @waituntil@ makes a thread wait for all to be available, but still runs the corresponding code blocks \emph{as they become available}. |
---|
When an @and@ clause becomes available, the waiting thread unblocks and runs that clause's code block, and then the thread waits again for the next available clause, unless the @waituntil@ statement is now satisfied.
---|
365 | This semantics allows work to be done in parallel while synchronizing over a set of resources, and furthermore, gives a good reason to use the @and@ operator. |
---|
If the @and@ operator instead waited for all clauses to be available before running any code blocks, it would be the same as acquiring those resources consecutively with a sequence of @waituntil@ statements.
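For example, in the following sketch, if @B@ is fulfilled before @A@, @B@'s code block runs immediately; the thread then waits for @A@, runs @A@'s code block, and the statement completes.
\begin{cfa}
future(int) A, B;
waituntil( A ) { ... }		// runs as soon as A is fulfilled
and waituntil( B ) { ... }	// runs as soon as B is fulfilled
// the statement completes only once both code blocks have run
\end{cfa}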
---|
367 | |
---|
368 | As with normal C expressions, the @and@ operator binds more tightly than the @or@. |
---|
To give an @or@ operator higher precedence, parentheses are used.
For example, the following @waituntil@ unconditionally waits for @C@ and one of either @A@ or @B@, since the @or@ is given higher precedence via parentheses.
---|
371 | \begin{cfa} |
---|
372 | @(@ waituntil( A ) { ... } // bind tightly to or |
---|
373 | or waituntil( B ) { ... } @)@ |
---|
374 | and waituntil( C ) { ... } |
---|
375 | \end{cfa} |
---|
376 | |
---|
377 | The guards in the @waituntil@ statement are called @when@ clauses. |
---|
378 | Each boolean expression inside a @when@ is evaluated \emph{once} before the @waituntil@ statement is run. |
---|
379 | Like Occam's ALT, the guards toggle clauses on and off, where a @waituntil@ clause is only evaluated and waited on if the corresponding guard is @true@. |
---|
380 | In addition, the @waituntil@ guards require some nuance since both @and@ and @or@ operators are supported \see{Section~\ref{s:wu_guards}}. |
---|
381 | When a guard is false and a clause is removed, it can be thought of as removing that clause and its preceding operation from the statement. |
---|
382 | For example, in the following, the two @waituntil@ statements are semantically equivalent. |
---|
383 | |
---|
384 | \begin{lrbox}{\myboxA} |
---|
385 | \begin{cfa} |
---|
386 | when( true ) waituntil( A ) { ... } |
---|
387 | or when( false ) waituntil( B ) { ... } |
---|
388 | and waituntil( C ) { ... } |
---|
389 | \end{cfa} |
---|
390 | \end{lrbox} |
---|
391 | |
---|
392 | \begin{lrbox}{\myboxB} |
---|
393 | \begin{cfa} |
---|
394 | waituntil( A ) { ... } |
---|
395 | and waituntil( C ) { ... } |
---|
396 | |
---|
397 | \end{cfa} |
---|
398 | \end{lrbox} |
---|
399 | |
---|
400 | \begin{tabular}{@{}lcl@{}} |
---|
401 | \usebox\myboxA & $\equiv$ & \usebox\myboxB |
---|
402 | \end{tabular} |
---|
403 | |
---|
404 | The @else@ clause on the @waituntil@ has identical semantics to the @else@ clause in Ada. |
---|
If the required resources are not immediately available and there is an @else@ clause, the @else@ clause is run and the thread continues.
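For example, the following sketch checks two futures without blocking; if neither is fulfilled, the @else@ block runs immediately and the thread continues.
\begin{cfa}
future(int) A, B;
waituntil( A ) { ... }
or waituntil( B ) { ... }
else { ... }	// neither A nor B is ready; do not block
\end{cfa}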
---|
406 | |
---|
407 | \subsection{Type Semantics} |
---|
408 | |
---|
409 | As mentioned, to support interaction with the @waituntil@ statement a type must support the trait in Figure~\ref{f:wu_trait}. |
---|
410 | The @waituntil@ statement expects types to register and unregister themselves via calls to @register_select@ and @unregister_select@, respectively. |
---|
411 | When a resource becomes available, @on_selected@ is run, and if it returns false, the corresponding code block is not run. |
---|
412 | Many types do not need @on_selected@, but it is provided if a type needs to perform work or checks before the resource can be accessed in the code block. |
---|
413 | The register/unregister routines in the trait also return booleans. |
---|
The return value of @register_select@ is @true@ if the resource is immediately available and @false@ otherwise.
The return value of @unregister_select@ is @true@ if the corresponding code block should be run after unregistration and @false@ otherwise.
---|
416 | The routine @on_selected@ and the return value of @unregister_select@ are needed to support channels as a resource. |
---|
417 | More detail on channels and their interaction with @waituntil@ appear in Section~\ref{s:wu_chans}. |
---|
418 | |
---|
419 | The trait can be used directly by having a blocking object support the @is_selectable@ trait, or it can be used indirectly through routines that take the object as an argument. |
---|
420 | When used indirectly, the object's routine returns a type that supports the @is_selectable@ trait. |
---|
421 | This feature leverages \CFA's ability to overload on return type to select the correct overloaded routine for the @waituntil@ context. |
---|
422 | Indirect support through routines is needed for types that want to support multiple operations such as channels that allow both reading and writing. |
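As a minimal sketch, a hypothetical resource type that is always immediately available could satisfy the trait directly as follows; the type and its trivial semantics are illustrative only, not part of the \CFA runtime.
\begin{cfa}
struct always_ready {};	// hypothetical resource that never blocks
// resource is immediately available, so registration reports availability
bool register_select( always_ready & this, select_node & node ) { return true; }
// nothing is left pending, so no code block needs to run at unregistration
bool unregister_select( always_ready & this, select_node & node ) { return false; }
// no extra work or checks are required before running the code block
bool on_selected( always_ready & this, select_node & node ) { return true; }
\end{cfa}
A real resource, such as a lock or channel, instead enqueues the @select_node@ on an internal waiting list in @register_select@ and removes it in @unregister_select@, as described in the following sections.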
---|
423 | |
---|
424 | \section{\lstinline{waituntil} Implementation} |
---|
425 | |
---|
426 | The @waituntil@ statement is not inherently complex, and Figure~\ref{f:WU_Impl} shows the basic outline of the @waituntil@ algorithm. |
---|
427 | Complexity comes from the consideration of race conditions and synchronization needed when supporting various primitives. |
---|
428 | The following sections use examples to fill in complexity details missing in Figure~\ref{f:WU_Impl}. |
---|
Afterwards, the full pseudocode for the @waituntil@ algorithm is presented in Figure~\ref{f:WU_Full_Impl}.
---|
430 | |
---|
431 | \begin{figure} |
---|
432 | \begin{cfa} |
---|
433 | select_nodes s[N]; $\C[3.25in]{// declare N select nodes}$ |
---|
434 | for ( node in s ) $\C{// register nodes}$ |
---|
435 | register_select( resource, node ); |
---|
436 | while ( statement predicate not satisfied ) { $\C{// check predicate}$ |
---|
437 | // block until clause(s) satisfied |
---|
438 | for ( resource in waituntil statement ) { $\C{// run true code blocks}$ |
---|
439 | if ( resource is avail ) run code block |
---|
440 | if ( statement predicate is satisfied ) break; |
---|
441 | } |
---|
442 | } |
---|
443 | for ( node in s ) $\C{// deregister nodes}\CRT$ |
---|
444 | if ( unregister_select( resource, node ) ) run code block |
---|
445 | \end{cfa} |
---|
446 | \caption{\lstinline{waituntil} Implementation} |
---|
447 | \label{f:WU_Impl} |
---|
448 | \end{figure} |
---|
449 | |
---|
450 | The basic steps of the algorithm are: |
---|
451 | \begin{enumerate} |
---|
452 | \item |
---|
453 | The @waituntil@ statement declares $N$ @select_node@s, one per resource that is being waited on, which stores any @waituntil@ data pertaining to that resource. |
---|
454 | |
---|
455 | \item |
---|
456 | Each @select_node@ is then registered with the corresponding resource. |
---|
457 | |
---|
458 | \item |
---|
459 | The thread executing the @waituntil@ then loops until the statement's predicate is satisfied. |
---|
460 | In each iteration, if the predicate is unsatisfied, the @waituntil@ thread blocks. |
---|
461 | When another thread satisfies a resource clause (\eg sends to a channel), it unblocks the @waituntil@ thread. |
---|
462 | This thread checks all clauses for completion, and any completed clauses have their code blocks run. |
---|
463 | While checking clause completion, if enough clauses have been run such that the statement predicate is satisfied, the loop exits early. |
---|
464 | |
---|
465 | \item |
---|
466 | Once the thread escapes the loop, the @select_nodes@ are unregistered from the resources. |
---|
467 | \end{enumerate} |
---|
468 | These steps give a basic overview of how the statement works. |
---|
469 | The following sections shed light on the specific changes and provide more implementation detail. |
---|
470 | |
---|
471 | \subsection{Locks}\label{s:wu_locks} |
---|
472 | |
---|
473 | The \CFA runtime supports a number of spinning and blocking locks, \eg semaphore, MCS, futex, Go mutex, spinlock, owner, \etc. |
---|
474 | Many of these locks satisfy the @is_selectable@ trait, and hence, are resources supported by the @waituntil@ statement. |
---|
475 | For example, the following waits until the thread has acquired lock @l1@ or locks @l2@ and @l3@. |
---|
476 | \begin{cfa} |
---|
477 | owner_lock l1, l2, l3; |
---|
478 | waituntil ( l1 ) { ... } |
---|
479 | or waituntil( l2 ) { ... } |
---|
480 | and waituntil( l3 ) { ... } |
---|
481 | \end{cfa} |
---|
482 | Implicitly, the @waituntil@ is calling the lock acquire for each of these locks to establish a position in the lock's queue of waiting threads. |
---|
483 | When the lock schedules this thread, it unblocks and runs the code block associated with the lock and then releases the lock. |
---|
484 | |
---|
485 | In detail, when a thread waits on multiple locks via a @waituntil@, it enqueues a @select_node@ in each of the lock's waiting queues. |
---|
486 | When a @select_node@ reaches the front of the lock's queue and gains ownership, the thread blocked on the @waituntil@ is unblocked. |
---|
Now, the lock is held by the @waituntil@ thread until the code block has executed, after which the node is unregistered and the lock is released.
Immediately releasing the lock after the code block prevents the waiting thread from holding multiple locks and potentially introducing a deadlock.
As such, the only lock nodes left to unregister at the end of the statement are those whose code blocks have not run.
---|
490 | |
---|
491 | \subsection{Timeouts} |
---|
492 | |
---|
A timeout for the @waituntil@ statement is a duration passed to routine \lstinline[deletekeywords={timeout}]{timeout}\footnote{\lstinline{timeout} is a quasi-keyword in \CFA, allowing it to be used as an identifier.}, \eg:
---|
494 | \begin{cquote} |
---|
495 | \begin{tabular}{@{}l|l@{}} |
---|
496 | \multicolumn{2}{@{}l@{}}{\lstinline{Duration D1\{ 1`ms \}, D2\{ 2`ms \}, D3\{ 3`ms \};}} \\ |
---|
497 | \begin{cfa}[deletekeywords={timeout}] |
---|
498 | waituntil( i << C1 ) {} |
---|
499 | or waituntil( i << C2 ) {} |
---|
500 | or waituntil( i << C3 ) {} |
---|
501 | or waituntil( timeout( D1 ) ) {} |
---|
502 | or waituntil( timeout( D2 ) ) {} |
---|
503 | or waituntil( timeout( D3 ) ) {} |
---|
504 | \end{cfa} |
---|
505 | & |
---|
506 | \begin{cfa}[deletekeywords={timeout}] |
---|
507 | waituntil( i << C1 ) {} |
---|
508 | or waituntil( i << C2 ) {} |
---|
509 | or waituntil( i << C3 ) {} |
---|
510 | or waituntil( timeout( min( D1, D2, D3 ) ) ) {} |
---|
511 | |
---|
512 | |
---|
513 | \end{cfa} |
---|
514 | \end{tabular} |
---|
515 | \end{cquote} |
---|
516 | These two examples are basically equivalent. |
---|
517 | Here, the multiple timeouts are useful because the durations can change during execution and the separate clauses provide different code blocks if a timeout triggers. |
---|
518 | Multiple timeouts can also be used with @and@ to provide a minimal delay before proceeding. |
---|
In the following example, either channel @C1@ or @C2@ must be satisfied, but nothing can be done for at least 1 or 3 seconds after the channel read, respectively.
---|
520 | \begin{cfa}[deletekeywords={timeout}] |
---|
521 | waituntil( i << C1 ); and waituntil( timeout( 1`s ) ); |
---|
522 | or waituntil( i << C2 ); and waituntil( timeout( 3`s ) ); |
---|
523 | \end{cfa} |
---|
524 | If only @C2@ is satisfied, \emph{both} timeout code-blocks trigger because 1 second occurs before 3 seconds. |
---|
525 | Note that the \CFA @waitfor@ statement only provides a single @timeout@ clause because it only supports @or@ semantics. |
---|
526 | |
---|
527 | The \lstinline[deletekeywords={timeout}]{timeout} routine is different from UNIX @sleep@, which blocks for the specified duration and returns the amount of time elapsed since the call started. |
---|
528 | Instead, \lstinline[deletekeywords={timeout}]{timeout} returns a type that supports the @is_selectable@ trait, allowing the type system to select the correct overloaded routine for this context. |
---|
529 | For the @waituntil@, it is more idiomatic for the \lstinline[deletekeywords={timeout}]{timeout} to use the same syntax as other blocking operations instead of having a special language clause. |
---|
530 | |
---|
531 | \subsection{Channels}\label{s:wu_chans} |
---|
532 | |
---|
533 | Channels require more complexity to allow synchronous multiplexing. |
---|
534 | For locks, when an outside thread releases a lock and unblocks the @waituntil@ thread (WUT), the lock's MX property is passed to the WUT (no spinning locks). |
---|
For futures, the outside thread delivers a value to the future and unblocks any waiting threads, including WUTs.
---|
536 | In either case, after the WUT unblocks, it is safe to execute its corresponding code block knowing access to the resource is protected by the lock or the read-only state of the future. |
---|
537 | Similarly, for channels, when an outside thread inserts a value into a channel, it must unblock the WUT. |
---|
538 | However, for channels, there is a race condition that poses an issue. |
---|
539 | If the outside thread inserts into the channel and unblocks the WUT, there is a race where another thread can remove the channel data, so after the WUT unblocks and attempts to remove from the buffer, it fails, and the WUT must reblock (busy waiting). |
---|
This scenario is a \gls{toctou} race that must be closed.
To do so, the outside thread must detect this case and insert directly into the left-hand side of the channel expression (@i << chan@) rather than into the channel, and then unblock the WUT.
Now the unblocked WUT is guaranteed to have a satisfied resource, and its code block can be safely executed.
---|
543 | The insertion circumvents the channel buffer via the wait-morphing in the \CFA channel implementation \see{Section~\ref{s:chan_impl}}, allowing @waituntil@ channel unblocking to not be special-cased. |
---|
544 | |
---|
545 | Furthermore, if both @and@ and @or@ operators are used, the @or@ operations stop behaving like exclusive-or due to the race among channel operations, \eg: |
---|
546 | \begin{cfa} |
---|
547 | int i; |
---|
548 | waituntil( i << A ) {} and waituntil( i << B ) {} |
---|
549 | or waituntil( i << C ) {} and waituntil( i << D ) {} |
---|
550 | \end{cfa} |
---|
551 | If exclusive-or semantics are followed, only the code blocks for @A@ and @B@ are run, or the code blocks for @C@ and @D@. |
---|
However, four outside threads, each inserting into a different one of the four channels, can simultaneously put values into @i@ and attempt to unblock the WUT to run the four code blocks.
---|
553 | This case introduces a race with complexity that increases with the size of the @waituntil@ statement. |
---|
However, due to \gls{toctou} issues, it is impossible to know if all resources are available without acquiring all the internal locks of the channels in the subtree of the @waituntil@ clauses.
---|
555 | This approach is a poor solution for two reasons. |
---|
556 | It is possible that once all the locks are acquired, the subtree is not satisfied and the locks must be released. |
---|
This work incurs a high cost for signalling threads and heavily increases contention on the internal channel locks.
---|
558 | Furthermore, the @waituntil@ statement is polymorphic and can support resources that do not have internal locks, which also makes this approach infeasible. |
---|
As such, the exclusive-or semantics are lost when using both @and@ and @or@ operators, since they cannot be supported without significant complexity that would also significantly degrade @waituntil@ performance.
---|
560 | Therefore, the example of reusing variable @i@ by multiple output channels is considered a user error without exclusive-or semantics. |
---|
561 | Given aliasing in C, it is impossible to even warn of the potential race. |
---|
562 | In the future, it would be interesting to support Go-like syntax, \lstinline[language=Go]{case i := <- ...}, defining a new scoped @i@ variable for each clause. |
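In the meantime, a simple way to avoid the data race in the example above is to give each clause its own variable; this does not restore exclusive-or semantics, but concurrent writers no longer target the same location.
\begin{cfa}
int ia, ib, ic, id;	// one variable per clause avoids concurrent writes to a shared i
waituntil( ia << A ) {} and waituntil( ib << B ) {}
or waituntil( ic << C ) {} and waituntil( id << D ) {}
\end{cfa}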
---|
563 | |
---|
564 | It was deemed important that exclusive-or semantics are maintained when only @or@ operators are used, so this situation has been special-cased, and is handled by having all clauses race to set a value \emph{before} operating on the channel. |
---|
565 | Consider the following example where thread 1 is reading and threads 2 and 3 are writing to channels @A@ and @B@ concurrently. |
---|
566 | \begin{cquote} |
---|
567 | \begin{tabular}{@{}l|l|l@{}} |
---|
568 | \multicolumn{3}{@{}l@{}}{\lstinline{channel(int) A, B; // zero size channels}} \\ |
---|
569 | thread 1 & thread 2 & thread 3 \\ |
---|
570 | \begin{cfa} |
---|
571 | waituntil( i << A ) {} |
---|
572 | or waituntil( i << B ) {} |
---|
573 | \end{cfa} |
---|
574 | & |
---|
575 | \begin{cfa} |
---|
576 | A << 1; |
---|
577 | |
---|
578 | \end{cfa} |
---|
579 | & |
---|
580 | \begin{cfa} |
---|
581 | B << 2; |
---|
582 | |
---|
583 | \end{cfa} |
---|
584 | \end{tabular} |
---|
585 | \end{cquote} |
---|
586 | For thread 1 to have exclusive-or semantics, it must only consume from exactly one of @A@ or @B@. |
---|
587 | As such, thread 2 and 3 must race to establish the winning clause of the @waituntil@ in thread 1. |
---|
This race is resolved by threads 2 and 3 each attempting to set a pointer to the winning clause's @select_node@ address using \gls{cas}.
The winner bypasses the channel, inserts directly into the left-hand side of the WUT's clause, and signals thread 1.
---|
590 | The loser continues checking if there is space in the channel, and if so performs the channel insert operation with a possible signal of a waiting remove thread; |
---|
591 | otherwise, if there is no space, the loser blocks. |
---|
592 | It is important the race occurs \emph{before} operating on the channel, because channel actions are different with respect to each thread. |
---|
If the race were resolved after the operation, both threads 2 and 3 could potentially write into @i@ concurrently.
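A minimal sketch of the signalling side of this race is a single \gls{cas} on state stored in the WUT's @select_node@; the names mirror Figure~\ref{f:WU_DeadlockAvoidance} but are illustrative, and the full protocol, which also handles the two-@waituntil@ case described next, appears in that figure.
\begin{cfa}
// illustrative sketch: an outside thread attempts to win the exclusive-or race
bool race_to_win( select_node & clause ) {
	unsigned long int expected = UNSAT;	// clause not yet satisfied
	// attempt to mark this clause as the single satisfied clause of the waituntil
	return CAS( clause.clause_status, &expected, SAT );
}
// winner: write the value directly into the left-hand side of the clause
// (bypassing the channel buffer) and unblock the waituntil thread
// loser: fall back to a normal channel insert, blocking if the channel is full
\end{cfa}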
---|
594 | |
---|
595 | Channels introduce another interesting implementation issue. |
---|
596 | Supporting both reading and writing to a channel in a @waituntil@ means that one @waituntil@ clause may be the notifier of another @waituntil@ clause. |
---|
597 | This conjunction poses a problem when dealing with the special-cased @or@ where the clauses need to win a race to operate on a channel. |
---|
598 | Consider the following example, alongside a described ordering of events to highlight the race. |
---|
599 | \begin{cquote} |
---|
600 | \begin{tabular}{@{}l|l@{}} |
---|
601 | \multicolumn{2}{@{}l@{}}{\lstinline{channel(int) A, B; // zero size channels}} \\ |
---|
602 | thread 1 & thread 2 \\ |
---|
603 | \begin{cfa}[moredelim={**[is][\color{blue}]{\#}{\#}}] |
---|
604 | waituntil( @i << A@ ) {} |
---|
605 | or waituntil( #i << B# ) {} |
---|
606 | \end{cfa} |
---|
607 | & |
---|
608 | \begin{cfa}[moredelim={**[is][\color{blue}]{\#}{\#}}] |
---|
609 | waituntil( #B << 2# ) {} |
---|
610 | or waituntil( @A << 1@ ) {} |
---|
611 | \end{cfa} |
---|
612 | \end{tabular} |
---|
613 | \end{cquote} |
---|
614 | Assume thread 1 executes first, registers with channel @A@ and proceeds in the @waituntil@. |
---|
615 | Since @A@ is empty, thread 1 cannot remove, and then thread 1 is interrupted before registering with @B@. |
---|
616 | Thread 2 similarly registers with channel @B@, and proceeds in the @waituntil@. |
---|
617 | Since @B@ is zero size there is no space to insert, and then thread 2 is interrupted before registering with @A@. |
---|
618 | At this point, thread 1 and 2 resume execution. |
---|
619 | Remember from above, each exclusive-or @waituntil@ holds a race to set the winning clause of the statement. |
---|
620 | The issue that arises is that these two @waituntil@ statements must have matching winning clauses (both @A@ clauses or both @B@ clauses) to preserve the exclusive-or semantics, since a zero-sized channel needs an insert/remove pair for an operation to occur. |
---|
621 | If threads 1 and 2 race to set a winner only in their own @waituntil@, thread 1 can think it successfully removed from @B@, and thread 2 can think it successfully inserted into @A@, which is an error. |
---|
622 | Hence, there is a race on two fronts. |
---|
623 | If thread 1 wins the race and sees that @B@ has a waiting insertion, then thread 2 must execute the first clause of its @waituntil@ and thread 1 execute its second. |
---|
624 | Correspondingly, if thread 2 wins the race and sees that @A@ has a waiting removal, then thread 1 must execute the first clause of its @waituntil@ and thread 2 execute its second. |
---|
625 | Any other execution scenario is incorrect for exclusive-or semantics. |
---|
Note that the priority execution of multiple satisfied @waituntil@ clauses (\ie top to bottom) is not violated because, in this scenario, there is only one satisfied clause for either thread.
---|
627 | |
---|
628 | The Go @select@ solves this problem by acquiring all the internal locks of the channels before registering the @select@ on the channels. |
---|
629 | This approach eliminates the race shown above since thread 1 and 2 cannot both be registering at the same time. |
---|
630 | However, this approach cannot be used in \CFA, since the @waituntil@ is polymorphic. |
---|
631 | Not all types in a @waituntil@ have an internal lock, and when using non-channel types, acquiring all the locks incurs extra unneeded overhead. |
---|
Instead, this race is resolved in \CFA in two phases by introducing an intermediate pending status value for the race.
This race case is detectable, and if detected, each thread first races to set its own @waituntil@ race pointer to be pending.
If it succeeds, it then attempts to set the other thread's @waituntil@ race pointer to its success value.
If either thread successfully sets the other thread's @waituntil@ race pointer, then the operation can proceed; if not, the signalling thread sets its own race pointer back to the initial value and retries.
---|
This retry mechanism can potentially introduce a livelock; however, a livelock is only possible in the niche case of two or more exclusive-or @waituntil@s, each with two or more clauses in reverse order of priority, and even then it is highly unlikely in practice.
---|
638 | This livelock case can be fully eliminated using locks like Go, or if a \gls{dcas} instruction is available. |
---|
639 | If any other threads attempt to set a WUT's race pointer and see a pending value, they wait until the value changes before proceeding to ensure that, in the case the WUT fails, the signal is not lost. |
---|
640 | This protocol ensures that signals cannot be lost and that the two races can be resolved in a safe manner. |
---|
641 | The implementation of this protocol is shown in Figure~\ref{f:WU_DeadlockAvoidance}. |
---|
642 | |
---|
643 | \begin{figure} |
---|
644 | \begin{cfa} |
---|
645 | bool pending_set_other( select_node & other, select_node & mine ) { |
---|
646 | unsigned long int cmp_status = UNSAT; |
---|
647 | |
---|
648 | // Try to set other status, if we succeed break and return true |
---|
649 | while( ! CAS( other.clause_status, &cmp_status, SAT ) ) { |
---|
650 | if ( cmp_status == SAT ) |
---|
651 | return false; // If other status is SAT we lost so return false |
---|
652 | |
---|
653 | // Toggle own status flag to allow other thread to potentially win |
---|
mine.clause_status = UNSAT;
---|
655 | |
---|
656 | // Reset compare flag |
---|
657 | cmp_status = UNSAT; |
---|
658 | |
---|
659 | // Attempt to set own status flag back to PENDING to retry |
---|
660 | if ( ! CAS( mine.clause_status, &cmp_status, PENDING ) ) |
---|
661 | return false; // If we fail then we lost so return false |
---|
662 | |
---|
663 | // Reset compare flag |
---|
664 | cmp_status = UNSAT; |
---|
665 | } |
---|
666 | return true; |
---|
667 | } |
---|
668 | \end{cfa} |
---|
669 | \caption{Exclusive-or \lstinline{waituntil} channel deadlock avoidance protocol} |
---|
670 | \label{f:WU_DeadlockAvoidance} |
---|
671 | \end{figure} |
---|
672 | |
---|
673 | Channels in \CFA have exception-based shutdown mechanisms that the @waituntil@ statement needs to support. |
---|
674 | These exception mechanisms are supported through the @on_selected@ routine. |
---|
675 | This routine is needed by channels to detect if they are closed after unblocking in a @waituntil@ statement, to ensure the appropriate behaviour is taken and an exception is thrown. |
---|
676 | Hence, the channel close-down mechanism is handled correctly. |
---|
677 | |
---|
678 | \subsection{Guards and Statement Predicate}\label{s:wu_guards} |
---|
679 | |
---|
680 | It is trivial to check when a synchronous multiplexing utility is done for the or/xor relationship, since any resource becoming available means that the blocked thread can proceed and the @waituntil@ statement is finished. |
---|
In \uC and \CFA, the \gls{synch_multiplex} mechanisms have both and/or relationships, which, along with guards, make the problem of checking for statement completion difficult.
---|
682 | Consider the following @waituntil@. |
---|
683 | \begin{cfa} |
---|
684 | when( GA ) waituntil( A ) {} |
---|
685 | and when( GB ) waituntil( B ) {} |
---|
686 | or when( GC ) waituntil( C ) {} |
---|
687 | \end{cfa} |
---|
688 | When the @waituntil@ thread wakes up, the following predicate represents satisfaction: |
---|
689 | \begin{cfa} |
---|
690 | A && B || C || ! GA && B || ! GB && A || ! GA && ! GB && ! GC |
---|
691 | \end{cfa} |
---|
692 | which can be simplified to: |
---|
693 | \begin{cfa} |
---|
694 | ( A || ! GA ) && ( B || ! GB ) || C || ! GA && ! GB && ! GC |
---|
695 | \end{cfa} |
---|
696 | Checking the complete predicate on each iteration of the pending @waituntil@ is expensive so \uC and \CFA both take steps to simplify checking for statement completion. |
---|
697 | |
---|
698 | In the \uC @_Select@ statement, this problem is solved by constructing a tree of the resources, where the internal nodes are operators and the leaves are booleans storing the state of each resource. |
---|
699 | A diagram of the tree for the complete predicate above is shown in Figure~\ref{f:uC_select_tree}, alongside the modification of the tree that occurs when @GA@ is @false@. |
---|
700 | Each internal node stores the statuses of the two subtrees beneath it. |
---|
701 | When resources become available, their corresponding leaf node status is modified, which percolates up the tree to update the state of the statement. |
---|
702 | Once the root of the tree has both subtrees marked as @true@ then the statement is complete. |
---|
703 | As an optimization, when the internal nodes are updated, the subtrees marked as @true@ are pruned and not examined again. |
---|
704 | To support statement guards in \uC, the tree is modified to remove an internal node if a guard is false to maintain the appropriate predicate representation. |
---|
705 | |
---|
706 | \begin{figure} |
---|
707 | \begin{center} |
---|
708 | \input{diagrams/uCpp_select_tree.tikz} |
---|
709 | \end{center} |
---|
710 | \caption{\uC \lstinline[language=uC++]{select} tree modification} |
---|
711 | \label{f:uC_select_tree} |
---|
712 | \end{figure} |
---|
713 | |
---|
714 | The \CFA @waituntil@ statement blocks a thread until a set of resources have become available that satisfy the complete predicate of a @waituntil@. |
---|
715 | The waiting condition of the @waituntil@ statement is implemented as the complete predicate over the resources, joined by the @waituntil@ operators, where a resource is @true@ if it is available, and @false@ otherwise. |
---|
716 | This complete predicate is used as the mechanism to check if a thread is done waiting on a @waituntil@. |
---|
717 | Leveraging the compiler, a predicate routine is generated per @waituntil@. |
---|
718 | This predicate routine accepts the statuses of the resources being waited on as arguments. |
---|
719 | A resource status is an integer that indicates whether a resource is either not available, available, or has already run its associated code block. |
---|
720 | The predicate routine returns @true@ when the @waituntil@ is done, \ie enough resources have run their associated code blocks to satisfy the @waituntil@'s predicate, and false otherwise. |
---|
721 | To support guards on the \CFA @waituntil@ statement, the status of a resource disabled by a guard is set to a boolean value that ensures that the predicate function behaves as if that resource is no longer part of the predicate. |
---|
722 | The generated code allows the predicate that is checked with each iteration to be simplified to not check guard values. |
---|
723 | For example, the following is generated for the complete predicate above: |
---|
724 | \begin{cfa} |
---|
725 | // statement completion predicate |
---|
726 | bool check_completion( select_node * nodes ) { |
---|
727 | return nodes[0].status && nodes[1].status || nodes[2].status; |
---|
728 | } |
---|
729 | if ( GA || GB || GC ) { $\C{// skip statement if all guards false}$ |
---|
730 | select_node nodes[3]; |
---|
731 | nodes[0].status = ! GA && GB; $\C{// A's status}$ |
---|
732 | nodes[1].status = ! GB && GA; $\C{// B's status}$ |
---|
733 | nodes[2].status = ! GC; $\C{// C's status}$ |
---|
734 | // ... rest of waituntil codegen ... |
---|
735 | } |
---|
736 | \end{cfa} |
---|
737 | |
---|
\uC's @_Select@ supports operators both inside and outside of the \lstinline[language=uC++]{_Select} clauses.
---|
739 | In the following example, the code blocks run once their corresponding predicate inside the round braces is satisfied. |
---|
740 | \begin{lstlisting}[language=uC++,{moredelim=**[is][\color{red}]{@}{@}}] |
---|
Future_ISM<int> A, B, C, D, E;
---|
742 | _Select( @A || B && C@ ) { ... } |
---|
743 | and _Select( @D && E@ ) { ... } |
---|
744 | \end{lstlisting} |
---|
745 | This feature is more expressive than the @waituntil@ statement in \CFA, allowing the code block for @&&@ to only run after \emph{both} resources are available. |
---|
746 | |
---|
747 | In \CFA, since the @waituntil@ statement supports more resources than just futures, implementing operators inside clauses is avoided for a few reasons. |
---|
748 | As a motivating example, suppose \CFA supported operators inside clauses as in: |
---|
749 | \begin{cfa} |
---|
750 | owner_lock A, B, C, D; |
---|
751 | waituntil( A && B ) { ... } |
---|
752 | or waituntil( C && D ) { ... } |
---|
753 | \end{cfa} |
---|
754 | If the @waituntil@ acquires each lock as it becomes available, there is a possible deadlock since it is in a hold-and-wait situation. |
---|
755 | Other semantics are needed to ensure this operation is safe. |
---|
756 | One possibility is to use \CC's @scoped_lock@ approach described in Section~\ref{s:DeadlockAvoidance}; |
---|
757 | however, that opens the potential for livelock. |
---|
Another possibility is to use resource ordering similar to \CFA's @mutex@ statement, but that alone is insufficient if the resource ordering is not used universally.
---|
One other way is to wait until all the resources for a given clause are available before acquiring them; however, this does not work due to \gls{toctou} issues, \ie it is impossible to ensure the full set of resources remains available without holding them all first.
---|
761 | Operators inside clauses in \CFA could potentially be implemented with careful circumvention of the problems. |
---|
Finally, the problem of operators inside clauses is also difficult to handle when supporting channels.
---|
763 | It would require some way to ensure channels used with internal operators are modified, if and only if, the corresponding code block is run. |
---|
764 | However, that is not feasible due to reasons described in the exclusive-or portion of Section~\ref{s:wu_chans}. |
---|
765 | Ultimately, this feature was not considered crucial after taking into account the complexity and runtime cost. |
---|
766 | |
---|
767 | \subsection{The full \lstinline{waituntil} picture} |
---|
768 | |
---|
Given all the complex details handled by the @waituntil@, its full pseudocode is presented in Figure~\ref{f:WU_Full_Impl}.
---|
770 | Some things to note are as follows. |
---|
771 | The @finally@ blocks provide exception-safe \gls{raii} unregistering of nodes, and in particular, the @finally@ inside the innermost loop performs the immediate unregistering required for deadlock-freedom mentioned in Section~\ref{s:wu_locks}. |
---|
772 | The @when_conditions@ array is used to store the boolean result of evaluating each guard at the beginning of the @waituntil@, and it is used to conditionally omit operations on resources with @false@ guards. |
---|
773 | As discussed in Section~\ref{s:wu_chans}, this pseudocode includes conditional code-block execution based on the result of both @on_selected@ and @unregister_select@, which allows the channel implementation to ensure all available channel resources have their corresponding code block run. |
---|
774 | |
---|
775 | \begin{figure} |
---|
776 | \begin{cfa} |
---|
777 | bool when_conditions[N]; |
---|
778 | for ( node in nodes ) $\C[3.75in]{// evaluate guards}$ |
---|
779 | if ( node has guard ) |
---|
780 | when_conditions[node] = node_guard; |
---|
781 | else |
---|
782 | when_conditions[node] = true; |
---|
783 | |
---|
784 | if ( any when_conditions[node] are true ) { |
---|
785 | select_nodes nodes[N]; $\C{// declare N select nodes}$ |
---|
786 | try { |
---|
787 | // ... set statuses for nodes with when_conditions[node] == false ... |
---|
788 | |
---|
789 | for ( node in nodes ) $\C{// register nodes}$ |
---|
790 | if ( when_conditions[node] ) |
---|
791 | register_select( resource, node ); |
---|
792 | |
---|
793 | while ( !check_completion( nodes ) ) { $\C{// check predicate}$ |
---|
794 | // block |
---|
795 | for ( resource in waituntil statement ) { $\C{// run true code blocks}$ |
---|
796 | if ( check_completion( nodes ) ) break; |
---|
797 | if ( resource is avail ) |
---|
798 | try { |
---|
799 | if( on_selected( resource ) ) $\C{// conditionally run block}$ |
---|
800 | run code block |
---|
801 | } finally $\C{// for exception safety}$ |
---|
802 | unregister_select( resource, node ); $\C{// immediate unregister}$ |
---|
803 | } |
---|
804 | } |
---|
805 | } finally { $\C{// for exception safety}$ |
---|
806 | for ( registered nodes in nodes ) $\C{// deregister nodes}$ |
---|
807 | if ( when_conditions[node] && unregister_select( resource, node ) |
---|
808 | && on_selected( resource ) ) |
---|
809 | run code block $\C{// run code block upon unregister}\CRT$ |
---|
810 | } |
---|
811 | } |
---|
812 | \end{cfa} |
---|
813 | \caption{Full \lstinline{waituntil} Pseudocode Implementation} |
---|
814 | \label{f:WU_Full_Impl} |
---|
815 | \end{figure} |
---|
816 | |
---|
817 | \section{\lstinline{waituntil} Performance} |
---|
818 | |
---|
Facilities similar to @waituntil@ are discussed in Section~\ref{s:History}, covering C, Ada, Rust, \CC, and OCaml.
However, these facilities are either not meaningful or not feasible to benchmark against.
---|
821 | The UNIX @select@ and related utilities are not comparable since they are system calls that go into the kernel and operate on file descriptors, whereas the @waituntil@ exists solely in user space. |
---|
822 | Ada's \lstinline[language=Ada]{select} and \uC's \lstinline[language=uC++]{_Accept} only operate on method calls, which is done in \CFA via the @waitfor@ statement. |
---|
823 | Rust and \CC only offer a busy-wait approach, which is not comparable to a blocking approach. |
---|
OCaml's @select@ waits on channels that are not comparable with \CFA and Go channels, so it is not benchmarked against Go's @select@ or \CFA's @waituntil@.
---|
825 | |
---|
826 | The two \gls{synch_multiplex} utilities that are in the realm of comparability with the \CFA @waituntil@ statement are the Go \lstinline[language=Go]{select} statement and the \uC \lstinline[language=uC++]{_Select} statement. |
---|
As such, two microbenchmarks are presented, one for Go and one for \uC, to contrast this feature.
Given the differences in features, polymorphism, and expressibility among @waituntil@, the Go \lstinline[language=Go]{select}, and the \uC \lstinline[language=uC++]{_Select}, the aim of the microbenchmarks in this chapter is to show that these implementations lie in the same realm of performance, not to pick a winner.
---|
829 | |
---|
830 | \subsection{Channel Benchmark} |
---|
831 | |
---|
832 | The channel multiplexing benchmarks compare \CFA's @waituntil@ and Go's \lstinline[language=Go]{select}, where the resource being waited on is a set of channels. |
---|
Although Unix's @select@, @poll@, and @epoll@ multiplex over file descriptors, which are effectively channels, these utilities are not included in the benchmarks.
Reading or writing to a file descriptor requires a system call, which is much more expensive than operating on a user-space channel.
As such, it is not meaningful to compare Unix's @select@, @poll@, and @epoll@ with \CFA's @waituntil@ and Go's \lstinline[language=Go]{select}.
---|
836 | The basic structure of the benchmark has the number of cores split evenly between producer and consumer threads, \ie, with 8 cores there are 4 producer and 4 consumer threads. |
---|
837 | The number of resource clauses $C$ is also varied across 2, 4, and 8 clauses, where each clause has a different channel that it waits on. |
---|
838 | Each producer and consumer repeatedly waits to either produce or consume from one of the $C$ clauses and respective channels. |
---|
839 | For example, in \CFA syntax, the work loop in the consumer main with $C = 4$ clauses is: |
---|
840 | \begin{cfa} |
---|
841 | for () |
---|
842 | waituntil( val << chans[0] ); or waituntil( val << chans[1] ); |
---|
843 | or waituntil( val << chans[2] ); or waituntil( val << chans[3] ); |
---|
844 | \end{cfa} |
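The corresponding producer work loop mirrors this structure, inserting rather than removing a value; the following is a sketch, assuming the same @chans@ array and a local variable @val@ holding the value to insert:
\begin{cfa}
for ()	// producer main: insert into whichever of the 4 channels accepts first
	waituntil( chans[0] << val ); or waituntil( chans[1] << val );
	or waituntil( chans[2] << val ); or waituntil( chans[3] << val );
\end{cfa}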
---|
845 | A successful consumption is counted as a channel operation, and the throughput of these operations is measured over 10 seconds. |
---|
The first benchmark measures the throughput of the producers and consumers synchronously waiting on the channels, and the second has the threads asynchronously wait on the channels using the Go @default@ and \CFA @else@ clause.
---|
847 | The results are shown in Figures~\ref{f:select_contend_bench} and~\ref{f:select_spin_bench} respectively. |
---|
848 | |
---|
849 | \begin{figure} |
---|
850 | \centering |
---|
851 | \captionsetup[subfloat]{labelfont=footnotesize,textfont=footnotesize} |
---|
852 | \subfloat[AMD]{ |
---|
853 | \resizebox{0.5\textwidth}{!}{\input{figures/nasus_Contend_2.pgf}} |
---|
854 | } |
---|
855 | \subfloat[Intel]{ |
---|
856 | \resizebox{0.5\textwidth}{!}{\input{figures/pyke_Contend_2.pgf}} |
---|
857 | } |
---|
858 | \bigskip |
---|
859 | |
---|
860 | \subfloat[AMD]{ |
---|
861 | \resizebox{0.5\textwidth}{!}{\input{figures/nasus_Contend_4.pgf}} |
---|
862 | } |
---|
863 | \subfloat[Intel]{ |
---|
864 | \resizebox{0.5\textwidth}{!}{\input{figures/pyke_Contend_4.pgf}} |
---|
865 | } |
---|
866 | \bigskip |
---|
867 | |
---|
868 | \subfloat[AMD]{ |
---|
869 | \resizebox{0.5\textwidth}{!}{\input{figures/nasus_Contend_8.pgf}} |
---|
870 | } |
---|
871 | \subfloat[Intel]{ |
---|
872 | \resizebox{0.5\textwidth}{!}{\input{figures/pyke_Contend_8.pgf}} |
---|
873 | } |
---|
874 | \caption{The channel synchronous multiplexing benchmark comparing Go select and \CFA \lstinline{waituntil} statement throughput (higher is better).} |
---|
875 | \label{f:select_contend_bench} |
---|
876 | \end{figure} |
---|
877 | |
---|
878 | \begin{figure} |
---|
879 | \centering |
---|
880 | \captionsetup[subfloat]{labelfont=footnotesize,textfont=footnotesize} |
---|
881 | \subfloat[AMD]{ |
---|
882 | \resizebox{0.5\textwidth}{!}{\input{figures/nasus_Spin_2.pgf}} |
---|
883 | } |
---|
884 | \subfloat[Intel]{ |
---|
885 | \resizebox{0.5\textwidth}{!}{\input{figures/pyke_Spin_2.pgf}} |
---|
886 | } |
---|
887 | \bigskip |
---|
888 | |
---|
889 | \subfloat[AMD]{ |
---|
890 | \resizebox{0.5\textwidth}{!}{\input{figures/nasus_Spin_4.pgf}} |
---|
891 | } |
---|
892 | \subfloat[Intel]{ |
---|
893 | \resizebox{0.5\textwidth}{!}{\input{figures/pyke_Spin_4.pgf}} |
---|
894 | } |
---|
895 | \bigskip |
---|
896 | |
---|
897 | \subfloat[AMD]{ |
---|
898 | \resizebox{0.5\textwidth}{!}{\input{figures/nasus_Spin_8.pgf}} |
---|
899 | } |
---|
900 | \subfloat[Intel]{ |
---|
901 | \resizebox{0.5\textwidth}{!}{\input{figures/pyke_Spin_8.pgf}} |
---|
902 | } |
---|
903 | \caption{The asynchronous multiplexing channel benchmark comparing Go select and \CFA \lstinline{waituntil} statement throughput (higher is better).} |
---|
904 | \label{f:select_spin_bench} |
---|
905 | \end{figure} |
---|
906 | |
---|
907 | Both Figures~\ref{f:select_contend_bench} and~\ref{f:select_spin_bench} have similar results when comparing \lstinline[language=Go]{select} and @waituntil@. |
---|
In the AMD benchmarks (left column), the performance is very similar as the number of cores scales.
The AMD machine has a high cache-contention cost because of its \emph{chiplet} L3 cache (\ie many L3 caches servicing a small number of cores), which creates a bottleneck on the channel locks and dominates the shape of the performance curve for both \CFA and Go.
Hence, it is difficult to detect differences in the \gls{synch_multiplex}, except at low core counts, where Go has significantly better performance due to an optimization in its scheduler.
---|
911 | Go heavily optimizes thread handoffs on the local run-queue, which can result in very good performance for low numbers of threads parking/unparking each other~\cite{go:sched}. |
---|
In the Intel benchmarks (right column), \CFA performs better than Go as the number of cores scales past 2/4 and as the number of clauses increases.
---|
913 | This difference is due to Go's acquiring all channel locks when registering and unregistering channels on a \lstinline[language=Go]{select}. |
---|
Go is then holding a lock for every channel, resulting in worse performance as the number of channels increases.
In \CFA, races are consolidated without holding all the locks, so the implementation scales much better with both cores and clauses because more work can occur in parallel.
---|
916 | This scalability difference is more significant on the Intel machine than the AMD machine since the Intel has lower cache-contention costs. |
---|
917 | |
---|
918 | The Go approach of holding all internal channel-locks in the \lstinline[language=Go]{select} has additional drawbacks. |
---|
919 | There are pathological cases where Go's throughput has significant jitter. |
---|
920 | Consider a producer and consumer thread, @P1@ and @C1@, selecting from both channels @A@ and @B@. |
---|
921 | \begin{cquote} |
---|
922 | \begin{tabular}{@{}ll@{}} |
---|
923 | @P1@ & @C1@ \\ |
---|
924 | \begin{cfa} |
---|
925 | waituntil( A << i ); or waituntil( B << i ); |
---|
926 | \end{cfa} |
---|
927 | & |
---|
928 | \begin{cfa} |
---|
929 | waituntil( val << A ); or waituntil( val << B ); |
---|
930 | \end{cfa} |
---|
931 | \end{tabular} |
---|
932 | \end{cquote} |
---|
933 | Additionally, there is another producer and consumer thread, @P2@ and @C2@, operating solely on @B@. |
---|
934 | \begin{cquote} |
---|
935 | \begin{tabular}{@{}ll@{}} |
---|
936 | @P2@ & @C2@ \\ |
---|
937 | \begin{cfa} |
---|
938 | B << val; |
---|
939 | \end{cfa} |
---|
940 | & |
---|
941 | \begin{cfa} |
---|
942 | val << B; |
---|
943 | \end{cfa} |
---|
944 | \end{tabular} |
---|
945 | \end{cquote} |
---|
946 | In Go, this setup results in significantly worse performance since @P2@ and @C2@ cannot operate in parallel with @P1@ and @C1@ due to all locks being acquired. |
---|
947 | Interestingly, this case may not be as pathological as it seems. |
---|
If the set of channels belonging to one \lstinline[language=Go]{select} overlaps with the set of another \lstinline[language=Go]{select}, these statements lose the ability to operate in parallel.
---|
949 | The implementation in \CFA only holds a single lock at a time, resulting in better locking granularity, and hence, more parallelism. |
---|
950 | Comparison of this pathological case is shown in Table~\ref{t:pathGo}. |
---|
951 | The AMD results highlight the worst-case scenario for Go since contention is more costly on this machine than the Intel machine. |
---|
952 | |
---|
953 | \begin{table}[t] |
---|
954 | \centering |
---|
955 | \setlength{\extrarowheight}{2pt} |
---|
956 | \setlength{\tabcolsep}{5pt} |
---|
957 | |
---|
958 | \caption{Throughput (channel operations per second) of \CFA and Go in a pathological case for contention in Go's select implementation} |
---|
959 | \label{t:pathGo} |
---|
960 | \begin{tabular}{r|r|r} |
---|
961 | & \multicolumn{1}{c|}{\CFA} & \multicolumn{1}{c}{Go} \\ |
---|
962 | \hline |
---|
963 | AMD & \input{data/nasus_Order} \\ |
---|
964 | \hline |
---|
965 | Intel & \input{data/pyke_Order} |
---|
966 | \end{tabular} |
---|
967 | \end{table} |
---|
968 | |
---|
969 | Another difference between Go and \CFA is the order of clause selection when multiple clauses are available. |
---|
970 | Go \emph{randomly} selects a clause~\cite{go:select}, but \CFA chooses in the order clauses are listed. |
---|
971 | This \CFA design decision allows users to set implicit priorities, which can result in more predictable behaviour and even better performance. |
---|
972 | In the previous example, threads @P1@ and @C1@ prioritize channel @A@ in the @waituntil@, which can reduce contention for threads @P2@ and @C2@ accessing channel @B@. |
---|
If \CFA did not have priorities, the performance difference in Table~\ref{t:pathGo} would be significantly less due to extra contention on channel @B@.
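To make the implicit priority concrete, the following sketch contrasts the two possible clause orderings for consumer @C1@ above; the only difference is which channel is checked first when both have values:
\begin{cfa}
// prioritize the uncontended channel A, easing pressure on B (as used above)
waituntil( val << A ); or waituntil( val << B );
// prioritize the shared channel B, competing with C2 whenever B has values
waituntil( val << B ); or waituntil( val << A );
\end{cfa}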
---|
974 | |
---|
975 | \subsection{Future Benchmark} |
---|
976 | |
---|
977 | The future benchmark compares \CFA's @waituntil@ with \uC's \lstinline[language=uC++]{_Select}, with both utilities waiting on futures. |
---|
978 | While both statements have very similar semantics, supporting @and@ and @or@ operators, \lstinline[language=uC++]{_Select} can only wait on futures, whereas the @waituntil@ is polymorphic. |
---|
979 | As such, the underlying implementation of the operators differs between @waituntil@ and \lstinline[language=uC++]{_Select}. |
---|
980 | The @waituntil@ statement checks for statement completion using a predicate function, whereas the \lstinline[language=uC++]{_Select} statement maintains a tree that represents the state of the internal predicate. |
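As a rough illustration of the \CFA side, assuming each select node records a flag indicating whether its resource has been satisfied (the type and field names below are hypothetical, not the actual generated code), the predicate for a statement such as @waituntil( A ) and waituntil( B ) or waituntil( C )@ amounts to a flattened boolean expression over those flags:
\begin{cfa}
// hypothetical predicate for: waituntil( A ) and waituntil( B ) or waituntil( C )
bool check_completion( select_node nodes[] ) {
	// nodes[0..2] correspond to the clauses for A, B, and C
	return (nodes[0].sat && nodes[1].sat) || nodes[2].sat;	// (A and B) or C
}
\end{cfa}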
---|
981 | |
---|
982 | This benchmark aims to indirectly measure the impact of various predicates on the performance of the @waituntil@ and \lstinline[language=uC++]{_Select} statements. |
---|
The benchmark is indirect since the performance of futures in \CFA and \uC differs by a significant margin.
---|
984 | The experiment has a server, which cycles fulfilling three futures, @A@, @B@, and @C@, and a client, which waits for these futures to be fulfilled using four different kinds of predicates given in \CFA: |
---|
985 | \begin{cquote} |
---|
986 | \begin{tabular}{@{}l|l@{}} |
---|
987 | OR & AND \\ |
---|
988 | \hline |
---|
989 | \begin{cfa} |
---|
990 | waituntil( A ) { get( A ); } |
---|
991 | or waituntil( B ) { get( B ); } |
---|
992 | or waituntil( C ) { get( C ); } |
---|
993 | \end{cfa} |
---|
994 | & |
---|
995 | \begin{cfa} |
---|
996 | waituntil( A ) { get( A ); } |
---|
997 | and waituntil( B ) { get( B ); } |
---|
998 | and waituntil( C ) { get( C ); } |
---|
999 | \end{cfa} |
---|
1000 | \\ |
---|
1001 | \multicolumn{2}{@{}c@{}}{} \\ |
---|
1002 | AND-OR & OR-AND \\ |
---|
1003 | \hline |
---|
1004 | \begin{cfa} |
---|
1005 | waituntil( A ) { get( A ); } |
---|
1006 | and waituntil( B ) { get( B ); } |
---|
1007 | or waituntil( C ) { get( C ); } |
---|
1008 | \end{cfa} |
---|
1009 | & |
---|
1010 | \begin{cfa} |
---|
1011 | @(@ waituntil( A ) { get( A ); } |
---|
1012 | or waituntil( B ) { get( B ); } @)@ |
---|
1013 | and waituntil( C ) { get( C ); } |
---|
1014 | \end{cfa} |
---|
1015 | \end{tabular} |
---|
1016 | \end{cquote} |
---|
The server and client use a low-cost synchronization after each fulfillment, so the server does not race ahead of the client.
---|
1018 | |
---|
1019 | Results of this benchmark are shown in Figure~\ref{f:futurePerf}. |
---|
1020 | Each pair of bars is marked with the predicate name for that experiment and the value at the top of each bar is the standard deviation. |
---|
1021 | In detail, \uC results are lower in all cases due to the performance difference between futures and the more complex \gls{synch_multiplex} implementation. |
---|
1022 | However, the bars for both systems have similar height patterns across the experiments. |
---|
The @OR@ predicate for \CFA outperforms the other \CFA predicates, due to the special-casing of @waituntil@ statements containing only @or@ operators.
---|
1024 | For both \uC and \CFA, the @AND@ experiment is the least performant, which is expected since all three futures need to be fulfilled for each statement completion. |
---|
1025 | Interestingly, \CFA has lower variation across predicates on the AMD (excluding the special OR case), whereas \uC has lower variation on the Intel. |
---|
1026 | Given the differences in semantics and implementation between \uC and \CFA, this test only illustrates the overall costs among the different kinds of predicates. |
---|
1027 | |
---|
1028 | \begin{figure} |
---|
1029 | \centering |
---|
1030 | \subfloat[AMD Future Synchronization Benchmark]{ |
---|
1031 | \resizebox{0.5\textwidth}{!}{\input{figures/nasus_Future.pgf}} |
---|
1032 | \label{f:futureAMD} |
---|
1033 | } |
---|
1034 | \subfloat[Intel Future Synchronization Benchmark]{ |
---|
1035 | \resizebox{0.5\textwidth}{!}{\input{figures/pyke_Future.pgf}} |
---|
1036 | \label{f:futureIntel} |
---|
1037 | } |
---|
1038 | \caption{\CFA \lstinline{waituntil} and \uC \lstinline{_Select} statement throughput synchronizing on a set of futures with varying wait predicates (higher is better).} |
---|
1039 | \label{f:futurePerf} |
---|
1040 | \end{figure} |
---|