Changeset 1ae3f185 for doc/theses/colby_parsons_MMAth/text/waituntil.tex
- Timestamp: Jul 21, 2023, 11:57:24 AM
- Branches: master
- Children: c1b6bc6
- Parents: d4c1d1f
- Files: 1 edited
doc/theses/colby_parsons_MMAth/text/waituntil.tex
(d4c1d1f → 1ae3f185)

…
 Waiting for the first appropriate stall (resource) that becomes available without spinning is an example of \gls{synch_multiplex}: the ability to wait synchronously for one or more resources based on some selection criteria.

-\section{History of Synchronous Multiplexing}
+\section{History of Synchronous Multiplexing}\label{s:History}

+There is a history of tools that provide \gls{synch_multiplex}.
 Some well known \gls{synch_multiplex} tools include Unix system utilities: @select@~\cite{linux:select}, @poll@~\cite{linux:poll}, and @epoll@~\cite{linux:epoll}, and the @select@ statement provided by Go~\cite{go:selectref}, Ada~\cite[\S~9.7]{Ada16}, and \uC~\cite[\S~3.3.1]{uC++}.
…
 \multicolumn{2}{@{}l@{}}{\lstinline{channel A, B; // zero size channels}} \\
 thread 1 & thread 2 \\
-\begin{cfa}
+\begin{cfa}[moredelim={**[is][\color{blue}]{\#}{\#}}]
 waituntil( @i << A@ ) {}
-or waituntil( @i << B@ ) {}
+or waituntil( #i << B# ) {}
 \end{cfa}
 &
-\begin{cfa}
+\begin{cfa}[moredelim={**[is][\color{blue}]{\#}{\#}}]
-waituntil( @B << 2}@ ) {}
+waituntil( #B << 2# ) {}
 or waituntil( @A << 1@ ) {}
 \end{cfa}
…
 In the following example, the code blocks run once their corresponding predicate inside the round braces is satisfied.
 % C_TODO put this is uC++ code style not cfa-style
-\begin{lstlisting}[language=uC++]
+\begin{lstlisting}[language=uC++,{moredelim=**[is][\color{red}]{@}{@}}]
 Future_ISM<int> A, B, C, D;
-_Select( A || B && C ) { ... }
-and _Select( D && E ) { ... }
+_Select( @A || B && C@ ) { ... }
+and _Select( @D && E@ ) { ... }
 \end{lstlisting}
 This is more expressive than the @waituntil@ statement in \CFA, allowing the code block for @&&@ to only run after both resources are available.
…
 \section{Waituntil Performance}

-Similar facilities to @waituntil@ are discussed at the start of this chapter in C, Ada, Rust, \CC, and OCaml.
+Similar facilities to @waituntil@ are discussed in Section~\ref{s:History} covering C, Ada, Rust, \CC, and OCaml.
 However, these facilities are either not meaningful or feasible to benchmark against.
 The UNIX @select@ and related utilities are not comparable since they are system calls that go into the kernel and operate on file descriptors, whereas the @waituntil@ exists solely in user space.
-Ada's @select@ only operates on methods, which is done in \CFA via the @waitfor@ utility so it is not meaningful to benchmark against the @waituntil@, which cannot wait on the same resource.
+Ada's \lstinline[language=Ada]{select} and \uC's \lstinline[language=uC++]{_Accept} only operate on method calls, which is done in \CFA via the @waitfor@ statement, so it is not meaningful to benchmark against the @waituntil@, which cannot wait on this resource.
-Rust and \CC only offer a busy-wait based approach, which is not comparable to a blocking approach.
+Rust and \CC only offer a busy-wait approach, which is not comparable to a blocking approach.
 OCaml's @select@ waits on channels that are not comparable with \CFA and Go channels, so OCaml @select@ is not benchmarked against Go's @select@ and \CFA's @waituntil@.

-The two \gls{synch_multiplex} utilities that are in the realm of comparability with the \CFA @waituntil@ statement are the Go @select@ statement and the \uC @_Select@ statement.
+The two \gls{synch_multiplex} utilities that are in the realm of comparability with the \CFA @waituntil@ statement are the Go \lstinline[language=Go]{select} statement and the \uC \lstinline[language=uC++]{_Select} statement.
-As such, two microbenchmarks are presented, one for Go and one for \uC to contrast the systems.
+As such, two microbenchmarks are presented, one for Go and one for \uC, to contrast this feature.
-Given the differences in features, polymorphism, and expressibility between @waituntil@ and @select@, and @_Select@, the aim of the microbenchmarking in this chapter is to show that these implementations lie in the same realm of performance, not to pick a winner.
+Given the differences in features, polymorphism, and expressibility between @waituntil@, \lstinline[language=Go]{select}, and \uC \lstinline[language=uC++]{_Select}, the aim of the microbenchmarking in this chapter is to show that these implementations lie in the same realm of performance, not to pick a winner.

 \subsection{Channel Benchmark}

-The channel multiplexing microbenchmarks compare \CFA's @waituntil@ and Go's \lstinline[language=Go]{select}, where the resource being waited on is a set of channels.
-The basic structure of the microbenchmark has the number of cores split evenly between producer and consumer threads, \ie, with 8 cores there are 4 producer and 4 consumer threads.
-The number of resource clauses $C$ is also varied across 2, 4, and 8 clauses, where each clause has different channel that is waits on.
+The channel multiplexing benchmarks compare \CFA's @waituntil@ and Go's \lstinline[language=Go]{select}, where the resource being waited on is a set of channels.
+The basic structure of the benchmark has the number of cores split evenly between producer and consumer threads, \ie, with 8 cores there are 4 producer and 4 consumer threads.
+The number of resource clauses $C$ is also varied across 2, 4, and 8 clauses, where each clause has a different channel that it waits on.
 Each producer and consumer repeatedly waits to either produce or consume from one of the $C$ clauses and respective channels.
 For example, in \CFA syntax, the work loop in the consumer main with $C = 4$ clauses is:
…
 \end{cfa}
 A successful consumption is counted as a channel operation, and the throughput of these operations is measured over 10 seconds.
-The first microbenchmark measures throughput of the producers and consumer synchronously waiting on the channels and the second has the threads asynchronously wait on the channels.
+The first benchmark measures throughput of the producers and consumers synchronously waiting on the channels, and the second has the threads asynchronously wait on the channels using the Go @default@ and \CFA @else@ clause.
 The results are shown in Figures~\ref{f:select_contend_bench} and~\ref{f:select_spin_bench} respectively.

…
 \end{figure}

-Both Figures~\ref{f:select_contend_bench} and~\ref{f:select_spin_bench} have similar results when comparing @select@ and @waituntil@.
+Both Figures~\ref{f:select_contend_bench} and~\ref{f:select_spin_bench} have similar results when comparing \lstinline[language=Go]{select} and @waituntil@.
-In the AMD benchmarks, the performance is very similar as the number of cores scale.
+In the AMD benchmarks (left column), the performance is very similar as the number of cores scales.
-The AMD machine has been observed to have higher caching contention cost, which creates on a bottleneck on the channel locks, which results in similar scaling between \CFA and Go.
+The AMD machine has a high cache-contention cost because of its \emph{chiplet} L3 cache (\ie many L3 caches servicing a small number of cores), which creates a bottleneck on the channel locks and dominates the shape of the performance curve for both \CFA and Go.
-At low cores, Go has significantly better performance, which is likely due to an optimization in their scheduler.
+Hence, it is difficult to detect differences in the \gls{synch_multiplex}, except at low cores, where Go has significantly better performance, due to an optimization in its scheduler.
-Go heavily optimizes thread handoffs on their local run-queue, which can result in very good performance for low numbers of threads which are parking/unparking each other~\cite{go:sched}.
-In the Intel benchmarks, \CFA performs better than Go as the number of cores scale and as the number of clauses scale.
-This is likely due to Go's implementation choice of acquiring all channel locks when registering and unregistering channels on a @select@.
-Go then has to hold a lock for every channel, so it follows that this results in worse performance as the number of channels increase.
+Go heavily optimizes thread handoffs on the local run-queue, which can result in very good performance for low numbers of threads parking/unparking each other~\cite{go:sched}.
+In the Intel benchmarks (right column), \CFA performs better than Go as the number of cores scales past 2/4 and as the number of clauses increases.
+This difference is due to Go acquiring all channel locks when registering and unregistering channels on a \lstinline[language=Go]{select}.
+Go then holds a lock for every channel, resulting in worse performance as the number of channels increases.
 In \CFA, since races are consolidated without holding all locks, it scales much better both with cores and clauses since more work can occur in parallel.
-This scalability difference is more significant on the Intel machine than the AMD machine since the Intel machine has been observed to have lower cache contention costs.
-
-The Go approach of holding all internal channel locks in the select has some additional drawbacks.
-This approach results in some pathological cases where Go's system throughput on channels can greatly suffer.
-Consider the case where there are two channels, @A@ and @B@.
-There are both a producer thread and a consumer thread, @P1@ and @C1@, selecting both @A@ and @B@.
-Additionally, there is another producer and another consumer thread, @P2@ and @C2@, that are both operating solely on @B@.
-Compared to \CFA this setup results in significantly worse performance since @P2@ and @C2@ cannot operate in parallel with @P1@ and @C1@ due to all locks being acquired.
-This case may not be as pathological as it may seem.
-If the set of channels belonging to a select have channels that overlap with the set of another select, they lose the ability to operate on their select in parallel.
-The implementation in \CFA only ever holds a single lock at a time, resulting in better locking granularity.
+This scalability difference is more significant on the Intel machine than the AMD machine since the Intel machine has lower cache-contention costs.
+
+The Go approach of holding all internal channel-locks in the \lstinline[language=Go]{select} has additional drawbacks.
+There are pathological cases where Go's throughput has significant jitter.
+Consider a producer and consumer thread, @P1@ and @C1@, selecting from both channels @A@ and @B@.
+\begin{cquote}
+\begin{tabular}{@{}ll@{}}
+@P1@ & @C1@ \\
+\begin{cfa}
+waituntil( A << i ); or waituntil( B << i );
+\end{cfa}
+&
+\begin{cfa}
+waituntil( val << A ); or waituntil( val << B );
+\end{cfa}
+\end{tabular}
+\end{cquote}
+Additionally, there is another producer and consumer thread, @P2@ and @C2@, operating solely on @B@.
+\begin{cquote}
+\begin{tabular}{@{}ll@{}}
+@P2@ & @C2@ \\
+\begin{cfa}
+B << val;
+\end{cfa}
+&
+\begin{cfa}
+val << B;
+\end{cfa}
+\end{tabular}
+\end{cquote}
+In Go, this setup results in significantly worse performance since @P2@ and @C2@ cannot operate in parallel with @P1@ and @C1@ due to all locks being acquired.
+Interestingly, this case may not be as pathological as it seems.
+If the set of channels belonging to a select overlaps with the set of another select, these statements lose the ability to operate in parallel.
+The implementation in \CFA only holds a single lock at a time, resulting in better locking granularity, and hence, more parallelism.
 Comparison of this pathological case is shown in Table~\ref{t:pathGo}.
 The AMD results highlight the worst-case scenario for Go since contention is more costly on this machine than the Intel machine.
…
 \setlength{\tabcolsep}{5pt}

-\caption{Throughput (channel operations per second) of \CFA and Go for a pathologically bad case for contention in Go's select implementation}
+\caption{Throughput (channel operations per second) of \CFA and Go for a pathological case for contention in Go's select implementation}
 \label{t:pathGo}
-\begin{tabular}{*{5}{r|}r}
-& \multicolumn{1}{c|}{\CFA} & \multicolumn{1}{c@{}}{Go} \\
+\begin{tabular}{r|r|r}
+& \multicolumn{1}{c|}{\CFA} & \multicolumn{1}{c}{Go} \\
 \hline
 AMD & \input{data/nasus_Order} \\
…
 Another difference between Go and \CFA is the order of clause selection when multiple clauses are available.
-Go "randomly" selects a clause, but \CFA chooses the clause in the order they are listed~\cite{go:select}.
-This \CFA design decision allows users to set implicit priorities, which can result in more predictable behaviour, and even better performance in certain cases, such as the case shown in Table~\ref{t:pathGo}.
-If \CFA didn't have priorities, the performance difference in Table~\ref{t:pathGo} would be less significant since @P1@ and @C1@ would try to compete to operate on @B@ more often with random selection.
+Go \emph{randomly} selects a clause~\cite{go:select}, but \CFA chooses in the order clauses are listed.
+This \CFA design decision allows users to set implicit priorities, which can result in more predictable behaviour and even better performance.
+In the previous example, threads @P1@ and @C1@ prioritize channel @A@ in the @waituntil@, which can reduce contention for threads @P2@ and @C2@ accessing channel @B@.
+If \CFA did not have priorities, the performance difference in Table~\ref{t:pathGo} would be significantly less due to extra contention on channel @B@.

 \subsection{Future Benchmark}
-The future benchmark compares \CFA's @waituntil@ with \uC's @_Select@, with both utilities waiting on futures.
-Both \CFA's @waituntil@ and \uC's @_Select@ have very similar semantics, however @_Select@ can only wait on futures, whereas the @waituntil@ is polymorphic.
-They both support @and@ and @or@ operators, but the underlying implementation of the operators differs between @waituntil@ and @_Select@.
-The @waituntil@ statement checks for statement completion using a predicate function, whereas the @_Select@ statement maintains a tree that represents the state of the internal predicate.
+
+The future benchmark compares \CFA's @waituntil@ with \uC's \lstinline[language=uC++]{_Select}, with both utilities waiting on futures.
+While both statements have very similar semantics, supporting @and@ and @or@ operators, \lstinline[language=uC++]{_Select} can only wait on futures, whereas the @waituntil@ is polymorphic.
+As such, the underlying implementation of the operators differs between @waituntil@ and \lstinline[language=uC++]{_Select}.
+The @waituntil@ statement checks for statement completion using a predicate function, whereas the \lstinline[language=uC++]{_Select} statement maintains a tree that represents the state of the internal predicate.

 \begin{figure}
…
 \end{figure}

-This microbenchmark aims to measure the impact of various predicates on the performance of the @waituntil@ and @_Select@ statements.
-This benchmark and section does not try to directly compare the @waituntil@ and @_Select@ statements since the performance of futures in \CFA and \uC differ by a significant margin, making them incomparable.
-Results of this benchmark are shown in Figure~\ref{f:futurePerf}.
-Each set of columns is marked with a name representing the predicate for that set of columns.
-The predicate name and corresponding @waituntil@ statement is shown below:
-
-\begin{cfa}
-#ifdef OR
+This benchmark aims to indirectly measure the impact of various predicates on the performance of the @waituntil@ and \lstinline[language=uC++]{_Select} statements.
+The benchmark is indirect since the performance of futures in \CFA and \uC differs by a significant margin.
+The experiment has a server, which cycles fulfilling three futures, @A@, @B@, and @C@, and a client, which waits for these futures to be fulfilled using four different kinds of predicates given in \CFA:
+\begin{cquote}
+\begin{tabular}{@{}l|l@{}}
+OR & AND \\
+\hline
+\begin{cfa}
 waituntil( A ) { get( A ); }
 or waituntil( B ) { get( B ); }
 or waituntil( C ) { get( C ); }
-#endif
-#ifdef AND
+\end{cfa}
+&
+\begin{cfa}
 waituntil( A ) { get( A ); }
 and waituntil( B ) { get( B ); }
 and waituntil( C ) { get( C ); }
-#endif
-#ifdef ANDOR
+\end{cfa}
+\\
+\multicolumn{2}{@{}c@{}}{} \\
+AND-OR & OR-AND \\
+\hline
+\begin{cfa}
 waituntil( A ) { get( A ); }
 and waituntil( B ) { get( B ); }
 or waituntil( C ) { get( C ); }
-#endif
-#ifdef ORAND
-(waituntil( A ) { get( A ); }
-or waituntil( B ) { get( B ); }) // brackets create higher precedence for or
+\end{cfa}
+&
+\begin{cfa}
+@(@ waituntil( A ) { get( A ); }
+or waituntil( B ) { get( B ); } @)@
 and waituntil( C ) { get( C ); }
-#endif
-\end{cfa}
-
-In Figure~\ref{f:futurePerf}, the @OR@ column for \CFA is more performant than the other \CFA predicates, likely due to the special-casing of @waituntil@ statements with only @or@ operators.
-For both \uC and \CFA the @AND@ column is the least performant, which is expected since all three futures need to be fulfilled for each statement completion, unlike any of the other operators.
+\end{cfa}
+\end{tabular}
+\end{cquote}
+The server and client use a low-cost synchronization after each fulfillment, so the server does not race ahead of the client.
+
+Results of this benchmark are shown in Figure~\ref{f:futurePerf}.
+Each pair of bars is marked with the predicate name for that experiment, and the value at the top of each bar is the standard deviation.
+In detail, \uC results are lower in all cases due to the performance difference between futures and the more complex \gls{synch_multiplex} implementation.
+However, the bars for both systems have similar height patterns across the experiments.
+The @OR@ column for \CFA is more performant than the other \CFA predicates, due to the special-casing of @waituntil@ statements with only @or@ operators.
+For both \uC and \CFA, the @AND@ experiment is the least performant, which is expected since all three futures need to be fulfilled for each statement completion.
 Interestingly, \CFA has lower variation across predicates on the AMD (excluding the special OR case), whereas \uC has lower variation on the Intel.
+Given the differences in semantics and implementation between \uC and \CFA, this test only illustrates the overall costs among the different kinds of predicates.