Ignore:
Timestamp:
Jul 21, 2023, 11:57:24 AM (11 months ago)
Author:
Peter A. Buhr <pabuhr@…>
Branches:
master
Children:
c1b6bc6
Parents:
d4c1d1f
Message:

more proofreading of waituntil chapter

File:
1 edited

Legend:

Unmodified
Added
Removed
  • doc/theses/colby_parsons_MMAth/text/waituntil.tex

    rd4c1d1f r1ae3f185  
    2424Waiting for the first appropriate stall (resource) that becomes available without spinning is an example of \gls{synch_multiplex}: the ability to wait synchronously for one or more resources based on some selection criteria.
    2525
    26 \section{History of Synchronous Multiplexing}
     26\section{History of Synchronous Multiplexing}\label{s:History}
     27
    2728There is a history of tools that provide \gls{synch_multiplex}.
    2829Some well known \gls{synch_multiplex} tools include Unix system utilities: @select@~\cite{linux:select}, @poll@~\cite{linux:poll}, and @epoll@~\cite{linux:epoll}, and the @select@ statement provided by Go~\cite{go:selectref}, Ada~\cite[\S~9.7]{Ada16}, and \uC~\cite[\S~3.3.1]{uC++}.
     
    560561\multicolumn{2}{@{}l@{}}{\lstinline{channel A, B; // zero size channels}} \\
    561562thread 1 & thread 2 \\
    562 \begin{cfa}
     563\begin{cfa}[moredelim={**[is][\color{blue}]{\#}{\#}}]
    563564waituntil( @i << A@ ) {}
    564 or waituntil( @i << B@ ) {}
     565or waituntil( #i << B# ) {}
    565566\end{cfa}
    566567&
    567 \begin{cfa}
    568 waituntil( @B << 2}@ ) {}
     568\begin{cfa}[moredelim={**[is][\color{blue}]{\#}{\#}}]
     569waituntil( #B << 2# ) {}
    569570or waituntil( @A << 1@ ) {}
    570571\end{cfa}
     
    620621In the following example, the code blocks run once their corresponding predicate inside the round braces is satisfied.
    621622% C_TODO put this is uC++ code style not cfa-style
    622 \begin{lstlisting}[language=uC++]
     623\begin{lstlisting}[language=uC++,{moredelim=**[is][\color{red}]{@}{@}}]
    623624Future_ISM<int> A, B, C, D;
    624 _Select( A || B && C ) { ... }
    625 and _Select( D && E ) { ... }
     625_Select( @A || B && C@ ) { ... }
     626and _Select( @D && E@ ) { ... }
    626627\end{lstlisting}
    627628This is more expressive that the @waituntil@ statement in \CFA, allowing the code block for @&&@ to only run after both resources are available.
     
    699700\section{Waituntil Performance}
    700701
    701 Similar facilities to @waituntil@ are discussed at the start of this chapter in C, Ada, Rust, \CC, and OCaml.
     702Similar facilities to @waituntil@ are discussed in Section~\ref{s:History} covering C, Ada, Rust, \CC, and OCaml.
    702703However, these facilities are either not meaningful or feasible to benchmark against.
    703704The UNIX @select@ and related utilities are not comparable since they are system calls that go into the kernel and operate on file descriptors, whereas the @waituntil@ exists solely in user space.
    704 Ada's @select@ only operates on methods, which is done in \CFA via the @waitfor@ utility so it is not meaningful to benchmark against the @waituntil@, which cannot wait on the same resource.
    705 Rust and \CC only offer a busy-wait based approach, which is not comparable to a blocking approach.
     705Ada's \lstinline[language=Ada]{select} and \uC's \lstinline[language=uC++]{_Accept} only operates on method calls, which is done in \CFA via the @waitfor@ statement, so it is not meaningful to benchmark against the @waituntil@, which cannot wait on this resource.
     706Rust and \CC only offer a busy-wait approach, which is not comparable to a blocking approach.
    706707OCaml's @select@ waits on channels that are not comparable with \CFA and Go channels, so OCaml @select@ is not benchmarked against Go's @select@ and \CFA's @waituntil@.
    707708
    708 The two \gls{synch_multiplex} utilities that are in the realm of comparability with the \CFA @waituntil@ statement are the Go @select@ statement and the \uC @_Select@ statement.
    709 As such, two microbenchmarks are presented, one for Go and one for \uC to contrast the systems.
    710 Given the differences in features, polymorphism, and expressibility between @waituntil@ and @select@, and @_Select@, the aim of the microbenchmarking in this chapter is to show that these implementations lie in the same realm of performance, not to pick a winner.
     709The two \gls{synch_multiplex} utilities that are in the realm of comparability with the \CFA @waituntil@ statement are the Go \lstinline[language=Go]{select} statement and the \uC \lstinline[language=uC++]{_Select} statement.
     710As such, two microbenchmarks are presented, one for Go and one for \uC to contrast this feature.
     711Given the differences in features, polymorphism, and expressibility between @waituntil@ and \lstinline[language=Go]{select}, and \uC \lstinline[language=uC++]{_Select}, the aim of the microbenchmarking in this chapter is to show that these implementations lie in the same realm of performance, not to pick a winner.
    711712
    712713\subsection{Channel Benchmark}
    713714
    714 The channel multiplexing microbenchmarks compare \CFA's @waituntil@ and Go's \lstinline[language=Go]{select}, where the resource being waited on is a set of channels.
    715 The basic structure of the microbenchmark has the number of cores split evenly between producer and consumer threads, \ie, with 8 cores there are 4 producer and 4 consumer threads.
    716 The number of resource clauses $C$ is also varied across 2, 4, and 8 clauses, where each clause has different channel that is waits on.
     715The channel multiplexing benchmarks compare \CFA's @waituntil@ and Go's \lstinline[language=Go]{select}, where the resource being waited on is a set of channels.
     716The basic structure of the benchmark has the number of cores split evenly between producer and consumer threads, \ie, with 8 cores there are 4 producer and 4 consumer threads.
     717The number of resource clauses $C$ is also varied across 2, 4, and 8 clauses, where each clause has a different channel that is waits on.
    717718Each producer and consumer repeatedly waits to either produce or consume from one of the $C$ clauses and respective channels.
    718719For example, in \CFA syntax, the work loop in the consumer main with $C = 4$ clauses is:
     
    723724\end{cfa}
    724725A successful consumption is counted as a channel operation, and the throughput of these operations is measured over 10 seconds.
    725 The first microbenchmark measures throughput of the producers and consumer synchronously waiting on the channels and the second has the threads asynchronously wait on the channels.
     726The first benchmark measures throughput of the producers and consumer synchronously waiting on the channels and the second has the threads asynchronously wait on the channels using the Go @default@ and \CFA @else@ clause.
    726727The results are shown in Figures~\ref{f:select_contend_bench} and~\ref{f:select_spin_bench} respectively.
    727728
     
    784785\end{figure}
    785786
    786 Both Figures~\ref{f:select_contend_bench} and~\ref{f:select_spin_bench} have similar results when comparing @select@ and @waituntil@.
    787 In the AMD benchmarks, the performance is very similar as the number of cores scale.
    788 The AMD machine has been observed to have higher caching contention cost, which creates on a bottleneck on the channel locks, which results in similar scaling between \CFA and Go.
    789 At low cores, Go has significantly better performance, which is likely due to an optimization in their scheduler.
    790 Go heavily optimizes thread handoffs on their local run-queue, which can result in very good performance for low numbers of threads which are parking/unparking each other~\cite{go:sched}.
    791 In the Intel benchmarks, \CFA performs better than Go as the number of cores scale and as the number of clauses scale.
    792 This is likely due to Go's implementation choice of acquiring all channel locks when registering and unregistering channels on a @select@.
    793 Go then has to hold a lock for every channel, so it follows that this results in worse performance as the number of channels increase.
     787Both Figures~\ref{f:select_contend_bench} and~\ref{f:select_spin_bench} have similar results when comparing \lstinline[language=Go]{select} and @waituntil@.
     788In the AMD benchmarks (left column), the performance is very similar as the number of cores scale.
     789The AMD machine has a high-caching contention cost because of its \emph{chicklet} L3 cache (\ie many L3 caches servicing a small number of cores), which creates a bottleneck on the channel locks and dominates the shape of the performance curve for both \CFA and Go.
     790Hence, it is difficult to detect differences in the \gls{synch_multiplex}, except at low cores, where Go has significantly better performance, due to an optimization in its scheduler.
     791Go heavily optimizes thread handoffs on the local run-queue, which can result in very good performance for low numbers of threads parking/unparking each other~\cite{go:sched}.
     792In the Intel benchmarks (right column), \CFA performs better than Go as the number of cores scales past 2/4 and as the number of clauses increase.
     793This difference is due to Go's acquiring all channel locks when registering and unregistering channels on a \lstinline[language=Go]{select}.
     794Go then is holding a lock for every channel, resulting in worse performance as the number of channels increase.
    794795In \CFA, since races are consolidated without holding all locks, it scales much better both with cores and clauses since more work can occur in parallel.
    795 This scalability difference is more significant on the Intel machine than the AMD machine since the Intel machine has been observed to have lower cache contention costs.
    796 
    797 The Go approach of holding all internal channel locks in the select has some additional drawbacks.
    798 This approach results in some pathological cases where Go's system throughput on channels can greatly suffer.
    799 Consider the case where there are two channels, @A@ and @B@.
    800 There are both a producer thread and a consumer thread, @P1@ and @C1@, selecting both @A@ and @B@.
    801 Additionally, there is another producer and another consumer thread, @P2@ and @C2@, that are both operating solely on @B@.
    802 Compared to \CFA this setup results in significantly worse performance since @P2@ and @C2@ cannot operate in parallel with @P1@ and @C1@ due to all locks being acquired.
    803 This case may not be as pathological as it may seem.
    804 If the set of channels belonging to a select have channels that overlap with the set of another select, they lose the ability to operate on their select in parallel.
    805 The implementation in \CFA only ever holds a single lock at a time, resulting in better locking granularity.
     796This scalability difference is more significant on the Intel machine than the AMD machine since the Intel machine has lower cache contention costs.
     797
     798The Go approach of holding all internal channel-locks in the \lstinline[language=Go]{select} has additional drawbacks.
     799There are pathological cases where Go's throughput has significant jitter.
     800Consider a producer and consumer thread, @P1@ and @C1@, selecting from both channels @A@ and @B@.
     801\begin{cquote}
     802\begin{tabular}{@{}ll@{}}
     803@P1@ & @C1@ \\
     804\begin{cfa}
     805waituntil( A << i ); or waituntil( B << i );
     806\end{cfa}
     807&
     808\begin{cfa}
     809waituntil( val << A ); or waituntil( val << B );
     810\end{cfa}
     811\end{tabular}
     812\end{cquote}
     813Additionally, there is another producer and consumer thread, @P2@ and @C2@, operating solely on @B@.
     814\begin{cquote}
     815\begin{tabular}{@{}ll@{}}
     816@P2@ & @C2@ \\
     817\begin{cfa}
     818B << val;
     819\end{cfa}
     820&
     821\begin{cfa}
     822val << B;
     823\end{cfa}
     824\end{tabular}
     825\end{cquote}
     826In Go, this setup results in significantly worse performance since @P2@ and @C2@ cannot operate in parallel with @P1@ and @C1@ due to all locks being acquired.
     827Interesting, this case may not be as pathological as it seems.
     828If the set of channels belonging to a select have channels that overlap with the set of another select, these statements lose the ability to operate in parallel.
     829The implementation in \CFA only holds a single lock at a time, resulting in better locking granularity, and hence, more parallelism.
    806830Comparison of this pathological case is shown in Table~\ref{t:pathGo}.
    807831The AMD results highlight the worst case scenario for Go since contention is more costly on this machine than the Intel machine.
     
    812836\setlength{\tabcolsep}{5pt}
    813837
    814 \caption{Throughput (channel operations per second) of \CFA and Go for a pathologically bad case for contention in Go's select implementation}
     838\caption{Throughput (channel operations per second) of \CFA and Go for a pathologically case for contention in Go's select implementation}
    815839\label{t:pathGo}
    816 \begin{tabular}{*{5}{r|}r}
    817         & \multicolumn{1}{c|}{\CFA} & \multicolumn{1}{c@{}}{Go} \\
     840\begin{tabular}{r|r|r}
     841                        & \multicolumn{1}{c|}{\CFA} & \multicolumn{1}{c}{Go} \\
    818842        \hline
    819843        AMD             & \input{data/nasus_Order} \\
     
    824848
    825849Another difference between Go and \CFA is the order of clause selection when multiple clauses are available.
    826 Go "randomly" selects a clause, but \CFA chooses the clause in the order they are listed~\cite{go:select}.
    827 This \CFA design decision allows users to set implicit priorities, which can result in more predictable behaviour, and even better performance in certain cases, such as the case shown in  Table~\ref{t:pathGo}.
    828 If \CFA didn't have priorities, the performance difference in Table~\ref{t:pathGo} would be less significant since @P1@ and @C1@ would try to compete to operate on @B@ more often with random selection.
     850Go \emph{randomly} selects a clause~\cite{go:select}, but \CFA chooses in the order clauses are listed.
     851This \CFA design decision allows users to set implicit priorities, which can result in more predictable behaviour and even better performance.
     852In the previous example, threads @P1@ and @C1@ prioritize channel @A@ in the @waituntil@, which can reduce contention for threads @P2@ and @C2@ accessing channel @B@.
     853If \CFA did not have priorities, the performance difference in Table~\ref{t:pathGo} would be significant less due to extra contention on channel @B@.
    829854
    830855\subsection{Future Benchmark}
    831 The future benchmark compares \CFA's @waituntil@ with \uC's @_Select@, with both utilities waiting on futures.
    832 Both \CFA's @waituntil@ and \uC's @_Select@ have very similar semantics, however @_Select@ can only wait on futures, whereas the @waituntil@ is polymorphic.
    833 They both support @and@ and @or@ operators, but the underlying implementation of the operators differs between @waituntil@ and @_Select@.
    834 The @waituntil@ statement checks for statement completion using a predicate function, whereas the @_Select@ statement maintains a tree that represents the state of the internal predicate.
     856
     857The future benchmark compares \CFA's @waituntil@ with \uC's \lstinline[language=uC++]{_Select}, with both utilities waiting on futures.
     858While both statements have very similar semantics, supporting @and@ and @or@ operators, \lstinline[language=uC++]{_Select} can only wait on futures, whereas the @waituntil@ is polymorphic.
     859As such, the underlying implementation of the operators differs between @waituntil@ and \lstinline[language=uC++]{_Select}.
     860The @waituntil@ statement checks for statement completion using a predicate function, whereas the \lstinline[language=uC++]{_Select} statement maintains a tree that represents the state of the internal predicate.
    835861
    836862\begin{figure}
     
    849875\end{figure}
    850876
    851 This microbenchmark aims to measure the impact of various predicates on the performance of the @waituntil@ and @_Select@ statements.
    852 This benchmark and section does not try to directly compare the @waituntil@ and @_Select@ statements since the performance of futures in \CFA and \uC differ by a significant margin, making them incomparable.
    853 Results of this benchmark are shown in Figure~\ref{f:futurePerf}.
    854 Each set of columns is marked with a name representing the predicate for that set of columns.
    855 The predicate name and corresponding @waituntil@ statement is shown below:
    856 
    857 \begin{cfa}
    858 #ifdef OR
     877This benchmark aims to indirectly measure the impact of various predicates on the performance of the @waituntil@ and \lstinline[language=uC++]{_Select} statements.
     878The benchmark is indirect since the performance of futures in \CFA and \uC differ by a significant margin.
     879The experiment has a server, which cycles fulfilling three futures, @A@, @B@, and @C@, and a client, which waits for these futures to be fulfilled using four different kinds of predicates given in \CFA:
     880\begin{cquote}
     881\begin{tabular}{@{}l|l@{}}
     882OR & AND \\
     883\hline
     884\begin{cfa}
    859885waituntil( A ) { get( A ); }
    860886or waituntil( B ) { get( B ); }
    861887or waituntil( C ) { get( C ); }
    862 #endif
    863 #ifdef AND
     888\end{cfa}
     889&
     890\begin{cfa}
    864891waituntil( A ) { get( A ); }
    865892and waituntil( B ) { get( B ); }
    866893and waituntil( C ) { get( C ); }
    867 #endif
    868 #ifdef ANDOR
     894\end{cfa}
     895\\
     896\multicolumn{2}{@{}c@{}}{} \\
     897AND-OR & OR-AND \\
     898\hline
     899\begin{cfa}
    869900waituntil( A ) { get( A ); }
    870901and waituntil( B ) { get( B ); }
    871902or waituntil( C ) { get( C ); }
    872 #endif
    873 #ifdef ORAND
    874 (waituntil( A ) { get( A ); }
    875 or waituntil( B ) { get( B ); }) // brackets create higher precedence for or
     903\end{cfa}
     904&
     905\begin{cfa}
     906@(@ waituntil( A ) { get( A ); }
     907or waituntil( B ) { get( B ); } @)@
    876908and waituntil( C ) { get( C ); }
    877 #endif
    878 \end{cfa}
    879 
    880 In Figure~\ref{f:futurePerf}, the @OR@ column for \CFA is more performant than the other \CFA predicates, likely due to the special-casing of @waituntil@ statements with only @or@ operators.
    881 For both \uC and \CFA the @AND@ column is the least performant, which is expected since all three futures need to be fulfilled for each statement completion, unlike any of the other operators.
     909\end{cfa}
     910\end{tabular}
     911\end{cquote}
     912The server and client use a low cost synchronize after each fulfillment, so the server does not race ahead of the client.
     913
     914Results of this benchmark are shown in Figure~\ref{f:futurePerf}.
     915Each pair of bars is marked with the predicate name for that experiment and the value at the top of each bar is the standard deviation..
     916In detail, \uC results are lower in all cases due to the performance difference between futures and the more complex \gls{synch_multiplex} implementation.
     917However, the bars for both systems have similar height patterns across the experiments.
     918The @OR@ column for \CFA is more performant than the other \CFA predicates, due to the special-casing of @waituntil@ statements with only @or@ operators.
     919For both \uC and \CFA, the @AND@ experiment is the least performant, which is expected since all three futures need to be fulfilled for each statement completion.
    882920Interestingly, \CFA has lower variation across predicates on the AMD (excluding the special OR case), whereas \uC has lower variation on the Intel.
     921Given the differences in semantics and implementation between \uC and \CFA, this test only illustrates the overall costs among the different kinds of predicates.
Note: See TracChangeset for help on using the changeset viewer.