Timestamp:
Jun 27, 2023, 4:45:40 PM (12 months ago)
Author:
caparsons <caparson@…>
Branches:
master
Children:
a1f0cb6
Parents:
917e1fd
Message:

first draft of full waituntil chapter and conclusion chapter. Lots of graph/plotting utilities cleanup. Reran all CFA actor benchmarks after recent changes. Small changes to actor.tex in performance section

File:
1 edited

  • doc/theses/colby_parsons_MMAth/text/waituntil.tex

    r917e1fd r14e1053  
    8080This enables fully expressive \gls{synch_multiplex} predicates.
    8181
    82 There are many other languages that provide \gls{synch_multiplex}, including Rust's @select!@ over futures~\cite{rust:select}, OCaml's @select@ over channels~\cite{ocaml:channe}, and C++14's @when_any@ over futures~\cite{cpp:whenany}.
     82There are many other languages that provide \gls{synch_multiplex}, including Rust's @select!@ over futures~\cite{rust:select}, OCaml's @select@ over channels~\cite{ocaml:channel}, and C++14's @when_any@ over futures~\cite{cpp:whenany}.
    8383Note that while C++14 and Rust provide \gls{synch_multiplex}, their implementations leave much to be desired, as both rely on busy-wait polling to wait on multiple resources.
    8484
     
    9999All of the \gls{synch_multiplex} features mentioned so far are monomorphic, each supporting only one kind of resource to wait on: select(2) supports file descriptors, Go's select supports channel operations, \uC's select supports futures, and Ada's select supports monitor method calls.
    100100The waituntil statement in \CFA is polymorphic and provides \gls{synch_multiplex} over any objects that satisfy the trait in Figure~\ref{f:wu_trait}.
     101No other language provides a synchronous multiplexing tool polymorphic over resources like \CFA's waituntil.
     102All others tie themselves to a specific type of resource.
    101103
    102104\begin{figure}
     
    370372
    371373\subsection{Channel Benchmark}
    372 The channel microbenchmark compares \CFA's waituntil and Go's select, where the resource being waited on is a set of channels.
    373 
    374 %C_TODO explain benchmark
    375 
    376 %C_TODO show results
    377 
    378 %C_TODO discuss results
     374The channel multiplexing microbenchmarks compare \CFA's waituntil and Go's select, where the resource being waited on is a set of channels.
     375The basic structure of the microbenchmark has the number of cores split evenly between producer and consumer threads, \ie, with 8 cores there would be 4 producer threads and 4 consumer threads.
     376The number of clauses @C@ is also varied, with results shown with 2, 4, and 8 clauses.
     377Each clause has a respective channel that it operates on.
     378Each producer and consumer repeatedly waits to either produce to or consume from one of the @C@ channels via its respective clause.
     379An example, in \CFA syntax, of the work loop in the consumer main with @C = 4@ clauses follows.
     380
     381\begin{cfa}
     382    for (;;)
     383        waituntil( val << chans[0] ) {} or waituntil( val << chans[1] ) {}
     384        or waituntil( val << chans[2] ) {} or waituntil( val << chans[3] ) {}
     385\end{cfa}
     386A successful consumption is counted as a channel operation, and the throughput of these operations is measured over 10 seconds.
     387The first microbenchmark measures the throughput of the producers and consumers synchronously waiting on the channels, and the second has the threads asynchronously wait on the channels.
     388The results are shown in Figures~\ref{f:select_contend_bench} and~\ref{f:select_spin_bench} respectively.
     389
     390\begin{figure}
     391        \centering
     392    \captionsetup[subfloat]{labelfont=footnotesize,textfont=footnotesize}
     393        \subfloat[AMD]{
     394                \resizebox{0.5\textwidth}{!}{\input{figures/nasus_Contend_2.pgf}}
     395        }
     396        \subfloat[Intel]{
     397                \resizebox{0.5\textwidth}{!}{\input{figures/pyke_Contend_2.pgf}}
     398        }
     399    \bigskip
     400
     401        \subfloat[AMD]{
     402                \resizebox{0.5\textwidth}{!}{\input{figures/nasus_Contend_4.pgf}}
     403        }
     404        \subfloat[Intel]{
     405                \resizebox{0.5\textwidth}{!}{\input{figures/pyke_Contend_4.pgf}}
     406        }
     407    \bigskip
     408
     409        \subfloat[AMD]{
     410                \resizebox{0.5\textwidth}{!}{\input{figures/nasus_Contend_8.pgf}}
     411        }
     412        \subfloat[Intel]{
     413                \resizebox{0.5\textwidth}{!}{\input{figures/pyke_Contend_8.pgf}}
     414        }
     415        \caption{The channel synchronous multiplexing benchmark comparing Go select and \CFA waituntil statement throughput (higher is better).}
     416        \label{f:select_contend_bench}
     417\end{figure}
     418
     419\begin{figure}
     420        \centering
     421    \captionsetup[subfloat]{labelfont=footnotesize,textfont=footnotesize}
     422        \subfloat[AMD]{
     423                \resizebox{0.5\textwidth}{!}{\input{figures/nasus_Spin_2.pgf}}
     424        }
     425        \subfloat[Intel]{
     426                \resizebox{0.5\textwidth}{!}{\input{figures/pyke_Spin_2.pgf}}
     427        }
     428    \bigskip
     429
     430        \subfloat[AMD]{
     431                \resizebox{0.5\textwidth}{!}{\input{figures/nasus_Spin_4.pgf}}
     432        }
     433        \subfloat[Intel]{
     434                \resizebox{0.5\textwidth}{!}{\input{figures/pyke_Spin_4.pgf}}
     435        }
     436    \bigskip
     437
     438        \subfloat[AMD]{
     439                \resizebox{0.5\textwidth}{!}{\input{figures/nasus_Spin_8.pgf}}
     440        }
     441        \subfloat[Intel]{
     442                \resizebox{0.5\textwidth}{!}{\input{figures/pyke_Spin_8.pgf}}
     443        }
     444        \caption{The asynchronous multiplexing channel benchmark comparing Go select and \CFA waituntil statement throughput (higher is better).}
     445        \label{f:select_spin_bench}
     446\end{figure}
     447
     448Both Figures~\ref{f:select_contend_bench} and~\ref{f:select_spin_bench} have similar results when comparing @select@ and @waituntil@.
     449In the AMD benchmarks, the performance is very similar as the number of cores scales.
     450The AMD machine has been observed to have a higher cache contention cost, which creates a bottleneck on the channel locks and results in similar scaling between \CFA and Go.
     451At low core counts, Go has significantly better performance, likely due to an optimization in its scheduler.
     452Go heavily optimizes thread handoffs on its local runqueue, which can result in very good performance for low numbers of threads that park/unpark each other~\cite{go:sched}.
     453In the Intel benchmarks, \CFA performs better than Go as the number of cores and the number of clauses scale.
     454This is likely due to Go's implementation choice of acquiring all channel locks when registering and unregistering channels on a @select@.
     455Go must hold a lock for every channel, so performance worsens as the number of channels increases.
     456In \CFA, since races are consolidated without holding all locks, the implementation scales much better with both cores and clauses, as more work can occur in parallel.
     457This scalability difference is more significant on the Intel machine than the AMD machine since the Intel machine has been observed to have lower cache contention costs.
     458
     459The Go approach of holding all internal channel locks in the select has additional drawbacks.
     460It results in pathological cases where Go's throughput on channels can greatly suffer.
     461Consider the case where there are two channels, @A@ and @B@.
     462There are both a producer thread and a consumer thread, @P1@ and @C1@, selecting both @A@ and @B@.
     463Additionally, there is another producer and another consumer thread, @P2@ and @C2@, that are both operating solely on @B@.
     464Compared to \CFA, this setup results in significantly worse performance since @P2@ and @C2@ cannot operate in parallel with @P1@ and @C1@ due to all locks being acquired.
     465This case may not be as contrived as it seems.
     466If the set of channels belonging to one select overlaps with the set of another select, those selects lose the ability to operate in parallel.
     467The implementation in \CFA only ever holds a single lock at a time, resulting in better locking granularity.
     468Comparison of this pathological case is shown in Table~\ref{t:pathGo}.
     469The AMD results highlight the worst-case scenario for Go, since contention is more costly on this machine than on the Intel machine.
     470
     471\begin{table}[t]
     472\centering
     473\setlength{\extrarowheight}{2pt}
     474\setlength{\tabcolsep}{5pt}
     475
     476\caption{Throughput (channel operations per second) of \CFA and Go for a pathologically bad case for contention in Go's select implementation.}
     477\label{t:pathGo}
     478\begin{tabular}{*{5}{r|}r}
     479    & \multicolumn{1}{c|}{\CFA} & \multicolumn{1}{c@{}}{Go} \\
     480    \hline
     481    AMD         & \input{data/nasus_Order} \\
     482    \hline
     483    Intel       & \input{data/pyke_Order}
     484\end{tabular}
     485\end{table}
     486
     487Another difference between Go and \CFA is the order of clause selection when multiple clauses are available.
     488Go ``randomly'' selects a clause~\cite{go:select}, whereas \CFA chooses clauses in the order they are listed.
     489This \CFA design decision allows users to set implicit priorities, which can result in more predictable behaviour, and even better performance in certain cases, such as the case shown in Table~\ref{t:pathGo}.
     490If \CFA did not have priorities, the performance difference in Table~\ref{t:pathGo} would be less significant, since @P1@ and @C1@ would compete to operate on @B@ more often under random selection.
    379491
    380492\subsection{Future Benchmark}
    381493The future benchmark compares \CFA's waituntil with \uC's @_Select@, with both utilities waiting on futures.
    382 
    383 %C_TODO explain benchmark
    384 
    385 %C_TODO show results
    386 
    387 %C_TODO discuss results
      494Both \CFA's @waituntil@ and \uC's @_Select@ have very similar semantics; however, @_Select@ can only wait on futures, whereas @waituntil@ is polymorphic.
     495They both support @and@ and @or@ operators, but the underlying implementation of the operators differs between @waituntil@ and @_Select@.
     496The @waituntil@ statement checks for statement completion using a predicate function, whereas the @_Select@ statement maintains a tree that represents the state of the internal predicate.
     497
     498\begin{figure}
     499        \centering
     500        \subfloat[AMD Future Synchronization Benchmark]{
     501                \resizebox{0.5\textwidth}{!}{\input{figures/nasus_Future.pgf}}
     502                \label{f:futureAMD}
     503        }
     504        \subfloat[Intel Future Synchronization Benchmark]{
     505                \resizebox{0.5\textwidth}{!}{\input{figures/pyke_Future.pgf}}
     506                \label{f:futureIntel}
     507        }
     508        \caption{\CFA waituntil and \uC \_Select statement throughput synchronizing on a set of futures with varying wait predicates (higher is better).}
     510        \label{f:futurePerf}
     511\end{figure}
     512
     513This microbenchmark aims to measure the impact of various predicates on the performance of the @waituntil@ and @_Select@ statements.
      514This benchmark and section do not directly compare the @waituntil@ and @_Select@ statements, since the performance of futures in \CFA and \uC differs by a significant margin, making them incomparable.
     515Results of this benchmark are shown in Figure~\ref{f:futurePerf}.
     516Each set of columns is marked with a name representing the predicate for that set of columns.
      517The predicate names and corresponding @waituntil@ statements are shown below:
     518
     519\begin{cfa}
     520#ifdef OR
     521waituntil( A ) { get( A ); }
     522or waituntil( B ) { get( B ); }
     523or waituntil( C ) { get( C ); }
     524#endif
     525#ifdef AND
     526waituntil( A ) { get( A ); }
     527and waituntil( B ) { get( B ); }
     528and waituntil( C ) { get( C ); }
     529#endif
     530#ifdef ANDOR
     531waituntil( A ) { get( A ); }
     532and waituntil( B ) { get( B ); }
     533or waituntil( C ) { get( C ); }
     534#endif
     535#ifdef ORAND
     536(waituntil( A ) { get( A ); }
      537or waituntil( B ) { get( B ); }) // parentheses give the or higher precedence
     538and waituntil( C ) { get( C ); }
     539#endif
     540\end{cfa}
     541
     542In Figure~\ref{f:futurePerf}, the @OR@ column for \CFA is more performant than the other \CFA predicates, likely due to the special-casing of waituntil statements with only @or@ operators.
      543For both \uC and \CFA, the @AND@ column is the least performant, which is expected since all three futures need to be fulfilled for each statement completion, unlike the other predicates.
      544Interestingly, \CFA has lower variation across predicates on the AMD machine (excluding the special @OR@ case), whereas \uC has lower variation on the Intel machine.