Timestamp:
Jun 27, 2023, 4:45:40 PM (12 months ago)
Author:
caparsons <caparson@…>
Branches:
master
Children:
a1f0cb6
Parents:
917e1fd
Message:

first draft of full waituntil chapter and conclusion chapter. Lots of graph/plotting utilities cleanup. Reran all CFA actor benchmarks after recent changes. Small changes to actor.tex in performance section

File:
1 edited

  • doc/theses/colby_parsons_MMAth/text/waituntil.tex

    r917e1fd r14e1053  
    8080This enables fully expressive \gls{synch_multiplex} predicates.
    8181
    82 There are many other languages that provide \gls{synch_multiplex}, including Rust's @select!@ over futures~\cite{rust:select}, OCaml's @select@ over channels~\cite{ocaml:channe}, and C++14's @when_any@ over futures~\cite{cpp:whenany}.
     82There are many other languages that provide \gls{synch_multiplex}, including Rust's @select!@ over futures~\cite{rust:select}, OCaml's @select@ over channels~\cite{ocaml:channel}, and C++14's @when_any@ over futures~\cite{cpp:whenany}.
    8383Note that while C++14 and Rust provide \gls{synch_multiplex}, their implementations leave much to be desired, as both rely on busy-wait polling to wait on multiple resources.
    8484
     
    9999All of the \gls{synch_multiplex} features mentioned so far are monomorphic, each supporting only one kind of resource to wait on: select(2) supports file descriptors, Go's select supports channel operations, \uC's select supports futures, and Ada's select supports monitor method calls.
    100100The waituntil statement in \CFA is polymorphic and provides \gls{synch_multiplex} over any objects that satisfy the trait in Figure~\ref{f:wu_trait}.
     101No other language provides a synchronous multiplexing tool polymorphic over resources like \CFA's waituntil.
     102All others tie themselves to a specific type of resource.
    101103
    102104\begin{figure}
     
    370372
    371373\subsection{Channel Benchmark}
    372 The channel microbenchmark compares \CFA's waituntil and Go's select, where the resource being waited on is a set of channels.
    373 
    374 %C_TODO explain benchmark
    375 
    376 %C_TODO show results
    377 
    378 %C_TODO discuss results
     374The channel multiplexing microbenchmarks compare \CFA's waituntil and Go's select, where the resource being waited on is a set of channels.
     375The basic structure of the microbenchmark has the number of cores split evenly between producer and consumer threads, \ie, with 8 cores there would be 4 producer threads and 4 consumer threads.
     376The number of clauses @C@ is also varied, with results shown with 2, 4, and 8 clauses.
     377Each clause has a respective channel that it operates on.
     378Each producer and consumer repeatedly waits to either produce to or consume from one of the @C@ channels via its respective clause.
     379An example, in \CFA syntax, of the work loop in the consumer main with @C = 4@ clauses follows.
     380
     381\begin{cfa}
     382    for (;;)
     383        waituntil( val << chans[0] ) {} or waituntil( val << chans[1] ) {}
     384        or waituntil( val << chans[2] ) {} or waituntil( val << chans[3] ) {}
     385\end{cfa}
     386A successful consumption is counted as a channel operation, and the throughput of these operations is measured over 10 seconds.
     387The first microbenchmark measures the throughput of the producers and consumers synchronously waiting on the channels, and the second has the threads asynchronously wait on the channels.
     388The results are shown in Figures~\ref{f:select_contend_bench} and~\ref{f:select_spin_bench} respectively.
     389
     390\begin{figure}
     391        \centering
     392    \captionsetup[subfloat]{labelfont=footnotesize,textfont=footnotesize}
     393        \subfloat[AMD]{
     394                \resizebox{0.5\textwidth}{!}{\input{figures/nasus_Contend_2.pgf}}
     395        }
     396        \subfloat[Intel]{
     397                \resizebox{0.5\textwidth}{!}{\input{figures/pyke_Contend_2.pgf}}
     398        }
     399    \bigskip
     400
     401        \subfloat[AMD]{
     402                \resizebox{0.5\textwidth}{!}{\input{figures/nasus_Contend_4.pgf}}
     403        }
     404        \subfloat[Intel]{
     405                \resizebox{0.5\textwidth}{!}{\input{figures/pyke_Contend_4.pgf}}
     406        }
     407    \bigskip
     408
     409        \subfloat[AMD]{
     410                \resizebox{0.5\textwidth}{!}{\input{figures/nasus_Contend_8.pgf}}
     411        }
     412        \subfloat[Intel]{
     413                \resizebox{0.5\textwidth}{!}{\input{figures/pyke_Contend_8.pgf}}
     414        }
     415        \caption{The channel synchronous multiplexing benchmark comparing Go select and \CFA waituntil statement throughput (higher is better).}
     416        \label{f:select_contend_bench}
     417\end{figure}
     418
     419\begin{figure}
     420        \centering
     421    \captionsetup[subfloat]{labelfont=footnotesize,textfont=footnotesize}
     422        \subfloat[AMD]{
     423                \resizebox{0.5\textwidth}{!}{\input{figures/nasus_Spin_2.pgf}}
     424        }
     425        \subfloat[Intel]{
     426                \resizebox{0.5\textwidth}{!}{\input{figures/pyke_Spin_2.pgf}}
     427        }
     428    \bigskip
     429
     430        \subfloat[AMD]{
     431                \resizebox{0.5\textwidth}{!}{\input{figures/nasus_Spin_4.pgf}}
     432        }
     433        \subfloat[Intel]{
     434                \resizebox{0.5\textwidth}{!}{\input{figures/pyke_Spin_4.pgf}}
     435        }
     436    \bigskip
     437
     438        \subfloat[AMD]{
     439                \resizebox{0.5\textwidth}{!}{\input{figures/nasus_Spin_8.pgf}}
     440        }
     441        \subfloat[Intel]{
     442                \resizebox{0.5\textwidth}{!}{\input{figures/pyke_Spin_8.pgf}}
     443        }
     444        \caption{The asynchronous multiplexing channel benchmark comparing Go select and \CFA waituntil statement throughput (higher is better).}
     445        \label{f:select_spin_bench}
     446\end{figure}
     447
     448Both Figures~\ref{f:select_contend_bench} and~\ref{f:select_spin_bench} have similar results when comparing @select@ and @waituntil@.
     449In the AMD benchmarks, the performance is very similar as the number of cores scales.
     450The AMD machine has been observed to have a higher cache contention cost, which creates a bottleneck on the channel locks and results in similar scaling between \CFA and Go.
     451At low core counts, Go has significantly better performance, likely due to an optimization in its scheduler.
     452Go heavily optimizes thread handoffs on its local runqueue, which can result in very good performance for low numbers of threads that park/unpark each other~\cite{go:sched}.
     453In the Intel benchmarks, \CFA performs better than Go as the number of cores and the number of clauses scale.
     454This is likely due to Go's implementation choice of acquiring all channel locks when registering and unregistering channels on a @select@.
     455Go must hold a lock for every channel, so performance worsens as the number of channels increases.
     456In \CFA, since races are consolidated without holding all locks, the implementation scales much better with both cores and clauses, as more work can occur in parallel.
     457This scalability difference is more significant on the Intel machine than the AMD machine since the Intel machine has been observed to have lower cache contention costs.
     458
     459The Go approach of holding all internal channel locks in the select has additional drawbacks.
     460It results in pathological cases where Go's throughput on channels can greatly suffer.
     461Consider the case where there are two channels, @A@ and @B@.
     462There are both a producer thread and a consumer thread, @P1@ and @C1@, selecting both @A@ and @B@.
     463Additionally, there is another producer and another consumer thread, @P2@ and @C2@, that are both operating solely on @B@.
     464Compared to \CFA, this setup results in significantly worse performance since @P2@ and @C2@ cannot operate in parallel with @P1@ and @C1@ due to all locks being acquired.
     465This case may not be as contrived as it seems.
     466If the set of channels belonging to one select overlaps with the set of another select, those selects lose the ability to operate in parallel.
     467The implementation in \CFA only ever holds a single lock at a time, resulting in better locking granularity.
     468Comparison of this pathological case is shown in Table~\ref{t:pathGo}.
     469The AMD results highlight the worst-case scenario for Go, since contention is more costly on this machine than on the Intel machine.
     470
     471\begin{table}[t]
     472\centering
     473\setlength{\extrarowheight}{2pt}
     474\setlength{\tabcolsep}{5pt}
     475
     476\caption{Throughput (channel operations per second) of \CFA and Go for a pathologically bad case for contention in Go's select implementation.}
     477\label{t:pathGo}
     478\begin{tabular}{*{5}{r|}r}
     479    & \multicolumn{1}{c|}{\CFA} & \multicolumn{1}{c@{}}{Go} \\
     480    \hline
     481    AMD         & \input{data/nasus_Order} \\
     482    \hline
     483    Intel       & \input{data/pyke_Order}
     484\end{tabular}
     485\end{table}
     486
     487Another difference between Go and \CFA is the order of clause selection when multiple clauses are available.
     488Go ``randomly'' selects a clause~\cite{go:select}, whereas \CFA chooses clauses in the order they are listed.
     489This \CFA design decision allows users to set implicit priorities, which can result in more predictable behaviour, and even better performance in certain cases, such as the case shown in Table~\ref{t:pathGo}.
     490If \CFA did not have priorities, the performance difference in Table~\ref{t:pathGo} would be less significant, since @P1@ and @C1@ would compete to operate on @B@ more often under random selection.
    379491
    380492\subsection{Future Benchmark}
    381493The future benchmark compares \CFA's waituntil with \uC's @_Select@, with both utilities waiting on futures.
    382 
    383 %C_TODO explain benchmark
    384 
    385 %C_TODO show results
    386 
    387 %C_TODO discuss results
      494Both \CFA's @waituntil@ and \uC's @_Select@ have very similar semantics; however, @_Select@ can only wait on futures, whereas @waituntil@ is polymorphic.
     495They both support @and@ and @or@ operators, but the underlying implementation of the operators differs between @waituntil@ and @_Select@.
     496The @waituntil@ statement checks for statement completion using a predicate function, whereas the @_Select@ statement maintains a tree that represents the state of the internal predicate.
     497
     498\begin{figure}
     499        \centering
     500        \subfloat[AMD Future Synchronization Benchmark]{
     501                \resizebox{0.5\textwidth}{!}{\input{figures/nasus_Future.pgf}}
     502                \label{f:futureAMD}
     503        }
     504        \subfloat[Intel Future Synchronization Benchmark]{
     505                \resizebox{0.5\textwidth}{!}{\input{figures/pyke_Future.pgf}}
     506                \label{f:futureIntel}
     507        }
     508        \caption{\CFA waituntil and \uC \_Select statement throughput synchronizing on a set of futures with varying wait predicates (higher is better).}
     510        \label{f:futurePerf}
     511\end{figure}
     512
     513This microbenchmark aims to measure the impact of various predicates on the performance of the @waituntil@ and @_Select@ statements.
      514This benchmark and section do not directly compare the @waituntil@ and @_Select@ statements, since the performance of futures in \CFA and \uC differs by a significant margin, making them incomparable.
     515Results of this benchmark are shown in Figure~\ref{f:futurePerf}.
     516Each set of columns is marked with a name representing the predicate for that set of columns.
      517The predicate names and corresponding @waituntil@ statements are shown below:
     518
     519\begin{cfa}
     520#ifdef OR
     521waituntil( A ) { get( A ); }
     522or waituntil( B ) { get( B ); }
     523or waituntil( C ) { get( C ); }
     524#endif
     525#ifdef AND
     526waituntil( A ) { get( A ); }
     527and waituntil( B ) { get( B ); }
     528and waituntil( C ) { get( C ); }
     529#endif
     530#ifdef ANDOR
     531waituntil( A ) { get( A ); }
     532and waituntil( B ) { get( B ); }
     533or waituntil( C ) { get( C ); }
     534#endif
     535#ifdef ORAND
     536(waituntil( A ) { get( A ); }
      537or waituntil( B ) { get( B ); }) // parentheses give the or higher precedence
     538and waituntil( C ) { get( C ); }
     539#endif
     540\end{cfa}
     541
     542In Figure~\ref{f:futurePerf}, the @OR@ column for \CFA is more performant than the other \CFA predicates, likely due to the special-casing of waituntil statements with only @or@ operators.
      543For both \uC and \CFA, the @AND@ column is the least performant, which is expected since all three futures need to be fulfilled for each statement completion, unlike the other predicates.
      544Interestingly, \CFA has lower variation across predicates on the AMD machine (excluding the special @OR@ case), whereas \uC has lower variation on the Intel machine.