Ignore:
Timestamp:
Jul 29, 2019, 1:47:10 PM (5 years ago)
Author:
Peter A. Buhr <pabuhr@…>
Branches:
ADT, arm-eh, ast-experimental, enum, forall-pointer-decay, jacob/cs343-translation, jenkins-sandbox, master, new-ast, new-ast-unique-expr, pthread-emulation, qualifiedEnum
Children:
1d76f8a4, 2385236
Parents:
be53b87
Message:

make Dave Dice changes to concurrency paper

File:
1 edited

Legend:

Unmodified
Added
Removed
  • doc/papers/concurrency/Paper.tex

    rbe53b87 r5453237  
    307307In many ways, \CFA is to C as Scala~\cite{Scala} is to Java, providing a \emph{research vehicle} for new typing and control-flow capabilities on top of a highly popular programming language allowing immediate dissemination.
    308308Within the \CFA framework, new control-flow features are created from scratch because ISO \Celeven defines only a subset of the \CFA extensions, where the overlapping features are concurrency~\cite[\S~7.26]{C11}.
    309 However, \Celeven concurrency is largely wrappers for a subset of the pthreads library~\cite{Butenhof97,Pthreads}, and \Celeven and pthreads concurrency is simple, based on thread fork/join in a function and a few locks, which is low-level and error-prone;
     309However, \Celeven concurrency is largely wrappers for a subset of the pthreads library~\cite{Butenhof97,Pthreads}, and \Celeven and pthreads concurrency is simple, based on thread fork/join in a function and mutex/condition locks, which is low-level and error-prone;
    310310no high-level language concurrency features are defined.
    311 Interestingly, almost a decade after publication of the \Celeven standard, neither gcc-8, clang-9 nor msvc-19 (most recent versions) support the \Celeven include @threads.h@, indicating little interest in the C11 concurrency approach.
     311Interestingly, almost a decade after publication of the \Celeven standard, neither gcc-8, clang-9 nor msvc-19 (most recent versions) support the \Celeven include @threads.h@, indicating little interest in the C11 concurrency approach (possibly because the effort to add concurrency to \CC).
    312312Finally, while the \Celeven standard does not state a threading model, the historical association with pthreads suggests implementations would adopt kernel-level threading (1:1)~\cite{ThreadModel}.
    313313
     
    333333
    334334Finally, it is important for a language to provide safety over performance \emph{as the default}, allowing careful reduction of safety for performance when necessary.
    335 Two concurrency violations of this philosophy are \emph{spurious wakeup} (random wakeup~\cite[\S~8]{Buhr05a}) and \emph{barging} (signals-as-hints~\cite[\S~8]{Buhr05a}), where one is a consequence of the other, \ie once there is spurious wakeup, signals-as-hints follow.
     335Two concurrency violations of this philosophy are \emph{spurious wakeup} (random wakeup~\cite[\S~8]{Buhr05a}) and \emph{barging}\footnote{
     336The notion of competitive succession instead of direct handoff, \ie a lock owner releases the lock and an arriving thread acquires it ahead of preexisting waiter threads.
     337} (signals-as-hints~\cite[\S~8]{Buhr05a}), where one is a consequence of the other, \ie once there is spurious wakeup, signals-as-hints follow.
    336338However, spurious wakeup is \emph{not} a foundational concurrency property~\cite[\S~8]{Buhr05a}, it is a performance design choice.
    337339Similarly, signals-as-hints are often a performance decision.
     
    351353We present comparative examples so the reader can judge if the \CFA control-flow extensions are better and safer than those in other concurrent, imperative programming languages, and perform experiments to show the \CFA runtime is competitive with other similar mechanisms.
    352354The main contributions of this work are:
    353 \begin{itemize}
     355\begin{itemize}[topsep=3pt,itemsep=1pt]
    354356\item
    355357language-level generators, coroutines and user-level threading, which respect the expectations of C programmers.
     
    370372\end{itemize}
    371373
     374Section~\ref{s:StatefulFunction} begins advanced control by introducing sequential functions that retain data and execution state between calls, which produces constructs @generator@ and @coroutine@.
     375Section~\ref{s:Concurrency} begins concurrency, or how to create (fork) and destroy (join) a thread, which produces the @thread@ construct.
     376Section~\ref{s:MutualExclusionSynchronization} discusses the two mechanisms to restricted nondeterminism when controlling shared access to resources (mutual exclusion) and timing relationships among threads (synchronization).
     377Section~\ref{s:Monitor} shows how both mutual exclusion and synchronization are safely embedded in the @monitor@ and @thread@ constructs.
     378Section~\ref{s:CFARuntimeStructure} describes the large-scale mechanism to structure (cluster) threads and virtual processors (kernel threads).
     379Section~\ref{s:Performance} uses a series of microbenchmarks to compare \CFA threading with pthreads, Java OpenJDK-9, Go 1.12.6 and \uC 7.0.0.
     380
    372381
    373382\section{Stateful Function}
     383\label{s:StatefulFunction}
    374384
    375385The stateful function is an old idea~\cite{Conway63,Marlin80} that is new again~\cite{C++20Coroutine19}, where execution is temporarily suspended and later resumed, \eg plugin, device driver, finite-state machine.
     
    617627Figure~\ref{f:CFibonacciSim} shows the C implementation of the \CFA generator only needs one additional field, @next@, to handle retention of execution state.
    618628The computed @goto@ at the start of the generator main, which branches after the previous suspend, adds very little cost to the resume call.
    619 Finally, an explicit generator type provides both design and performance benefits, such as multiple type-safe interface functions taking and returning arbitrary types.
     629Finally, an explicit generator type provides both design and performance benefits, such as multiple type-safe interface functions taking and returning arbitrary types.\footnote{
     630The \CFA operator syntax uses \lstinline|?| to denote operands, which allows precise definitions for pre, post, and infix operators, \eg \lstinline|++?|, \lstinline|?++|, and \lstinline|?+?|, in addition \lstinline|?\{\}| denotes a constructor, as in \lstinline|foo `f` = `\{`...`\}`|, \lstinline|^?\{\}| denotes a destructor, and \lstinline|?()| is \CC function call \lstinline|operator()|.
     631}%
    620632\begin{cfa}
    621633int ?()( Fib & fib ) { return `resume( fib )`.fn; } $\C[3.9in]{// function-call interface}$
     
    15111523
    15121524\section{Mutual Exclusion / Synchronization}
     1525\label{s:MutualExclusionSynchronization}
    15131526
    15141527Unrestricted nondeterminism is meaningless as there is no way to know when the result is completed without synchronization.
     
    15511564higher-level mechanisms often simplify usage by adding better coupling between synchronization and data, \eg receive-specific versus receive-any thread in message passing or offering specialized solutions, \eg barrier lock.
    15521565Often synchronization is used to order access to a critical section, \eg ensuring a waiting writer thread enters the critical section before a calling reader thread.
    1553 If the calling reader is scheduled before the waiting writer, the reader has \newterm{barged}.
     1566If the calling reader is scheduled before the waiting writer, the reader has barged.
    15541567Barging can result in staleness/freshness problems, where a reader barges ahead of a writer and reads temporally stale data, or a writer barges ahead of another writer overwriting data with a fresh value preventing the previous value from ever being read (lost computation).
    15551568Preventing or detecting barging is an involved challenge with low-level locks, which is made easier through higher-level constructs.
     
    21202133
    21212134
    2122 \subsection{Extended \protect\lstinline@waitfor@}
    2123 
    2124 Figure~\ref{f:ExtendedWaitfor} show the extended form of the @waitfor@ statement to conditionally accept one of a group of mutex functions, with an optional statement to be performed \emph{after} the mutex function finishes.
     2135\subsection{\texorpdfstring{Extended \protect\lstinline@waitfor@}{Extended waitfor}}
     2136
     2137Figure~\ref{f:ExtendedWaitfor} shows the extended form of the @waitfor@ statement to conditionally accept one of a group of mutex functions, with an optional statement to be performed \emph{after} the mutex function finishes.
    21252138For a @waitfor@ clause to be executed, its @when@ must be true and an outstanding call to its corresponding member(s) must exist.
    21262139The \emph{conditional-expression} of a @when@ may call a function, but the function must not block or context switch.
     
    21312144Hence, the terminating @else@ clause allows a conditional attempt to accept a call without blocking.
    21322145If both @timeout@ and @else@ clause are present, the @else@ must be conditional, or the @timeout@ is never triggered.
     2146There is also a traditional future wait queue (not shown) (\eg Microsoft (@WaitForMultipleObjects@)), to wait for a specified number of future elements in the queue.
    21332147
    21342148\begin{figure}
     
    23552369
    23562370
    2357 \subsection{\protect\lstinline@mutex@ Threads}
     2371\subsection{\texorpdfstring{\protect\lstinline@mutex@ Threads}{mutex Threads}}
    23582372
    23592373Threads in \CFA can also be monitors to allow \emph{direct communication} among threads, \ie threads can have mutex functions that are called by other threads.
     
    24992513\renewcommand{\arraystretch}{1.25}
    25002514%\setlength{\tabcolsep}{5pt}
    2501 \begin{tabular}{c|c|l|l}
    2502 \multicolumn{2}{c|}{object properties} & \multicolumn{2}{c}{mutual exclusion} \\
     2515\begin{tabular}{c|c||l|l}
     2516\multicolumn{2}{c||}{object properties} & \multicolumn{2}{c}{mutual exclusion} \\
    25032517\hline
    25042518thread  & stateful                              & \multicolumn{1}{c|}{No} & \multicolumn{1}{c}{Yes} \\
     
    26052619
    26062620
    2607 \section{\protect\CFA Runtime Structure}
     2621\section{Runtime Structure}
    26082622\label{s:CFARuntimeStructure}
    26092623
     
    27092723
    27102724\section{Performance}
    2711 \label{results}
     2725\label{s:Performance}
    27122726
    27132727To verify the implementation of the \CFA runtime, a series of microbenchmarks are performed comparing \CFA with pthreads, Java OpenJDK-9, Go 1.12.6 and \uC 7.0.0.
     
    27152729The benchmark computer is an AMD Opteron\texttrademark\ 6380 NUMA 64-core, 8 socket, 2.5 GHz processor, running Ubuntu 16.04.6 LTS, and \CFA/\uC are compiled with gcc 6.5.
    27162730
    2717 All benchmarks are run using the following harness.
     2731All benchmarks are run using the following harness. (The Java harness is augmented to circumvent JIT issues.)
    27182732\begin{cfa}
    27192733unsigned int N = 10_000_000;
     
    27542768\begin{tabular}[t]{@{}r*{3}{D{.}{.}{5.2}}@{}}
    27552769\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
    2756 \CFA Coroutine Lazy             & 14.3          & 14.3          & 0.32          \\
    2757 \CFA Coroutine Eager    & 522.8         & 525.3         & 5.81          \\
    2758 \CFA Thread                             & 1257.8        & 1291.2        & 86.19         \\
    2759 \uC Coroutine                   & 92.2          & 91.4          & 1.58          \\
    2760 \uC Thread                              & 499.5         & 500.1         & 5.67          \\
    2761 Goroutine                               & 4397.0        & 4362.8        & 390.77        \\
    2762 Java Thread                             & 107405.0      & 107794.8      & 1601.33       \\
    2763 % Qthreads                              & 159.9         & 159.6         & 0.73          \\
    2764 Pthreads                                & 32920.9       & 32882.7       & 213.55
     2770\CFA Coroutine Lazy             & 13.2          & 13.1          & 0.44          \\
     2771\CFA Coroutine Eager    & 531.3         & 536.0         & 26.54         \\
     2772\CFA Thread                             & 2074.9        & 2066.5        & 170.76        \\
     2773\uC Coroutine                   & 89.6          & 90.5          & 1.83          \\
     2774\uC Thread                              & 528.2         & 528.5         & 4.94          \\
     2775Goroutine                               & 4068.0        & 4113.1        & 414.55        \\
     2776Java Thread                             & 103848.5      & 104295.4      & 2637.57       \\
     2777Pthreads                                & 33112.6       & 33127.1       & 165.90
     2778\end{tabular}
     2779\end{multicols}
     2780
     2781
     2782\paragraph{Context-Switching}
     2783
     2784In procedural programming, the cost of a function call is important as modularization (refactoring) increases.
     2785(In many cases, a compiler inlines function calls to eliminate this cost.)
     2786Similarly, when modularization extends to coroutines/tasks, the time for a context switch becomes a relevant factor.
     2787The coroutine test is from resumer to suspender and from suspender to resumer, which is two context switches.
     2788The thread test is using yield to enter and return from the runtime kernel, which is two context switches.
     2789The difference in performance between coroutine and thread context-switch is the cost of scheduling for threads, whereas coroutines are self-scheduling.
     2790Figure~\ref{f:ctx-switch} only shows the \CFA code for coroutines/threads (other systems are similar) with all results in Table~\ref{tab:ctx-switch}.
     2791
     2792\begin{multicols}{2}
     2793\lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
     2794\begin{cfa}[aboveskip=0pt,belowskip=0pt]
     2795@coroutine@ C {} c;
     2796void main( C & ) { for ( ;; ) { @suspend;@ } }
     2797int main() { // coroutine test
     2798        BENCH( for ( N ) { @resume( c );@ } )
     2799        sout | result`ns;
     2800}
     2801int main() { // task test
     2802        BENCH( for ( N ) { @yield();@ } )
     2803        sout | result`ns;
     2804}
     2805\end{cfa}
     2806\captionof{figure}{\CFA context-switch benchmark}
     2807\label{f:ctx-switch}
     2808
     2809\columnbreak
     2810
     2811\vspace*{-16pt}
     2812\captionof{table}{Context switch comparison (nanoseconds)}
     2813\label{tab:ctx-switch}
     2814\begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
     2815\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
     2816C function              & 1.8   & 1.8   & 0.01  \\
     2817\CFA generator  & 2.4   & 2.2   & 0.25  \\
     2818\CFA Coroutine  & 36.2  & 36.2  & 0.25  \\
     2819\CFA Thread             & 93.2  & 93.5  & 2.09  \\
     2820\uC Coroutine   & 52.0  & 52.1  & 0.51  \\
     2821\uC Thread              & 96.2  & 96.3  & 0.58  \\
     2822Goroutine               & 141.0 & 141.3 & 3.39  \\
     2823Java Thread             & 374.0 & 375.8 & 10.38 \\
     2824Pthreads Thread & 361.0 & 365.3 & 13.19
     2825\end{tabular}
     2826\end{multicols}
     2827
     2828
     2829\paragraph{Mutual-Exclusion}
     2830
     2831Uncontented mutual exclusion, which frequently occurs, is measured by entering/leaving a critical section.
     2832For monitors, entering and leaving a monitor function is measured.
     2833To put the results in context, the cost of entering a non-inline function and the cost of acquiring and releasing a @pthread_mutex@ lock is also measured.
     2834Figure~\ref{f:mutex} shows the code for \CFA with all results in Table~\ref{tab:mutex}.
     2835Note, the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects.
     2836
     2837\begin{multicols}{2}
     2838\lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
     2839\begin{cfa}
     2840@monitor@ M {} m1/*, m2, m3, m4*/;
     2841void __attribute__((noinline))
     2842do_call( M & @mutex m/*, m2, m3, m4*/@ ) {}
     2843int main() {
     2844        BENCH(
     2845                for( N ) do_call( m1/*, m2, m3, m4*/ );
     2846        )
     2847        sout | result`ns;
     2848}
     2849\end{cfa}
     2850\captionof{figure}{\CFA acquire/release mutex benchmark}
     2851\label{f:mutex}
     2852
     2853\columnbreak
     2854
     2855\vspace*{-16pt}
     2856\captionof{table}{Mutex comparison (nanoseconds)}
     2857\label{tab:mutex}
     2858\begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
     2859\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
     2860test and test-and-test lock             & 19.1  & 18.9  & 0.40  \\
     2861\CFA @mutex@ function, 1 arg.   & 45.9  & 46.6  & 1.45  \\
     2862\CFA @mutex@ function, 2 arg.   & 105.0 & 104.7 & 3.08  \\
     2863\CFA @mutex@ function, 4 arg.   & 165.0 & 167.6 & 5.65  \\
     2864\uC @monitor@ member rtn.               & 54.0  & 53.7  & 0.82  \\
     2865Java synchronized method                & 31.0  & 31.1  & 0.50  \\
     2866Pthreads Mutex Lock                             & 33.6  & 32.6  & 1.14
     2867\end{tabular}
     2868\end{multicols}
     2869
     2870
     2871\paragraph{External Scheduling}
     2872
     2873External scheduling is measured using a cycle of two threads calling and accepting the call using the @waitfor@ statement.
     2874Figure~\ref{f:ext-sched} shows the code for \CFA, with results in Table~\ref{tab:ext-sched}.
     2875Note, the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects.
     2876
     2877\begin{multicols}{2}
     2878\lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
     2879\vspace*{-16pt}
     2880\begin{cfa}
     2881volatile int go = 0;
     2882@monitor@ M {} m;
     2883thread T {};
     2884void __attribute__((noinline))
     2885do_call( M & @mutex@ ) {}
     2886void main( T & ) {
     2887        while ( go == 0 ) { yield(); }
     2888        while ( go == 1 ) { do_call( m ); }
     2889}
     2890int __attribute__((noinline))
     2891do_wait( M & @mutex@ m ) {
     2892        go = 1; // continue other thread
     2893        BENCH( for ( N ) { @waitfor( do_call, m );@ } )
     2894        go = 0; // stop other thread
     2895        sout | result`ns;
     2896}
     2897int main() {
     2898        T t;
     2899        do_wait( m );
     2900}
     2901\end{cfa}
     2902\captionof{figure}{\CFA external-scheduling benchmark}
     2903\label{f:ext-sched}
     2904
     2905\columnbreak
     2906
     2907\vspace*{-16pt}
     2908\captionof{table}{External-scheduling comparison (nanoseconds)}
     2909\label{tab:ext-sched}
     2910\begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
     2911\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
     2912\CFA @waitfor@, 1 @monitor@     & 376.4 & 376.8 & 7.63  \\
     2913\CFA @waitfor@, 2 @monitor@     & 491.4 & 492.0 & 13.31 \\
     2914\CFA @waitfor@, 4 @monitor@     & 681.0 & 681.7 & 19.10 \\
     2915\uC @_Accept@                           & 331.1 & 331.4 & 2.66
    27652916\end{tabular}
    27662917\end{multicols}
     
    28102961\begin{tabular}{@{}r*{3}{D{.}{.}{5.2}}@{}}
    28112962\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
    2812 \CFA @signal@, 1 @monitor@      & 367.0         & 371.5         & 17.34         \\
    2813 \CFA @signal@, 2 @monitor@      & 477.2         & 478.6         & 8.31          \\
    2814 \CFA @signal@, 4 @monitor@      & 725.8         & 734.0         & 17.98         \\
    2815 \uC @signal@                            & 322.8         & 323.0         & 3.64          \\
    2816 Java @notify@                           & 16520.0       & 20096.7       & 9378.53       \\
    2817 Pthreads Cond. Variable         & 4931.3        & 5057.0        & 326.80
    2818 \end{tabular}
    2819 \end{multicols}
    2820 
    2821 
    2822 \paragraph{External Scheduling}
    2823 
    2824 External scheduling is measured using a cycle of two threads calling and accepting the call using the @waitfor@ statement.
    2825 Figure~\ref{f:ext-sched} shows the code for \CFA, with results in Table~\ref{tab:ext-sched}.
    2826 Note, the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects.
    2827 
    2828 \begin{multicols}{2}
    2829 \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
    2830 \vspace*{-16pt}
    2831 \begin{cfa}
    2832 volatile int go = 0;
    2833 @monitor@ M {} m;
    2834 thread T {};
    2835 void __attribute__((noinline))
    2836 do_call( M & @mutex@ ) {}
    2837 void main( T & ) {
    2838         while ( go == 0 ) { yield(); }
    2839         while ( go == 1 ) { do_call( m ); }
    2840 }
    2841 int __attribute__((noinline))
    2842 do_wait( M & @mutex@ m ) {
    2843         go = 1; // continue other thread
    2844         BENCH( for ( N ) { @waitfor( do_call, m );@ } )
    2845         go = 0; // stop other thread
    2846         sout | result`ns;
    2847 }
    2848 int main() {
    2849         T t;
    2850         do_wait( m );
    2851 }
    2852 \end{cfa}
    2853 \captionof{figure}{\CFA external-scheduling benchmark}
    2854 \label{f:ext-sched}
    2855 
    2856 \columnbreak
    2857 
    2858 \vspace*{-16pt}
    2859 \captionof{table}{External-scheduling comparison (nanoseconds)}
    2860 \label{tab:ext-sched}
    2861 \begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
    2862 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
    2863 \CFA @waitfor@, 1 @monitor@     & 366.7         & 369.5 & 7.52  \\
    2864 \CFA @waitfor@, 2 @monitor@     & 453.6         & 455.8 & 12.38 \\
    2865 \CFA @waitfor@, 4 @monitor@     & 671.6         & 672.4 & 14.16 \\
    2866 \uC @_Accept@                           & 336.0         & 335.8         & 3.22
    2867 \end{tabular}
    2868 \end{multicols}
    2869 
    2870 
    2871 \paragraph{Context-Switching}
    2872 
    2873 In procedural programming, the cost of a function call is important as modularization (refactoring) increases.
    2874 (In many cases, a compiler inlines function calls to eliminate this cost.)
    2875 Similarly, when modularization extends to coroutines/tasks, the time for a context switch becomes a relevant factor.
    2876 The coroutine test is from resumer to suspender and from suspender to resumer, which is two context switches.
    2877 The thread test is using yield to enter and return from the runtime kernel, which is two context switches.
    2878 The difference in performance between coroutine and thread context-switch is the cost of scheduling for threads, whereas coroutines are self-scheduling.
    2879 Figure~\ref{f:ctx-switch} only shows the \CFA code for coroutines/threads (other systems are similar) with all results in Table~\ref{tab:ctx-switch}.
    2880 
    2881 \begin{multicols}{2}
    2882 \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
    2883 \begin{cfa}[aboveskip=0pt,belowskip=0pt]
    2884 @coroutine@ C {} c;
    2885 void main( C & ) { for ( ;; ) { @suspend;@ } }
    2886 int main() { // coroutine test
    2887         BENCH( for ( N ) { @resume( c );@ } )
    2888         sout | result`ns;
    2889 }
    2890 int main() { // task test
    2891         BENCH( for ( N ) { @yield();@ } )
    2892         sout | result`ns;
    2893 }
    2894 \end{cfa}
    2895 \captionof{figure}{\CFA context-switch benchmark}
    2896 \label{f:ctx-switch}
    2897 
    2898 \columnbreak
    2899 
    2900 \vspace*{-16pt}
    2901 \captionof{table}{Context switch comparison (nanoseconds)}
    2902 \label{tab:ctx-switch}
    2903 \begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
    2904 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
    2905 C function              & 1.8           & 1.8   & 0             \\
    2906 \CFA generator  & 2.7           & 2.4   & 0.27  \\
    2907 \CFA Coroutine  & 37.8          & 37.7  & 0.22  \\
    2908 \CFA Thread             & 93.6          & 93.8  & 1.46  \\
    2909 \uC Coroutine   & 52.7          & 52.8  & 0.28  \\
    2910 \uC Thread              & 93.4          & 93.7  & 1.04  \\
    2911 Goroutine               & 140.0         & 139.7 & 2.93  \\
    2912 Java Thread             & 374.0         & 375.8 & 10.38 \\
    2913 % Qthreads Thread       & 159.5         & 159.3 & 0.71  \\
    2914 Pthreads Thread & 334.4         & 335.0 & 1.95  \\
    2915 \end{tabular}
    2916 \end{multicols}
    2917 
    2918 
    2919 \paragraph{Mutual-Exclusion}
    2920 
    2921 Uncontented mutual exclusion, which frequently occurs, is measured by entering/leaving a critical section.
    2922 For monitors, entering and leaving a monitor function is measured.
    2923 To put the results in context, the cost of entering a non-inline function and the cost of acquiring and releasing a @pthread_mutex@ lock is also measured.
    2924 Figure~\ref{f:mutex} shows the code for \CFA with all results in Table~\ref{tab:mutex}.
    2925 Note, the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects.
    2926 
    2927 \begin{multicols}{2}
    2928 \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
    2929 \begin{cfa}
    2930 @monitor@ M {} m1/*, m2, m3, m4*/;
    2931 void __attribute__((noinline))
    2932 do_call( M & @mutex m/*, m2, m3, m4*/@ ) {}
    2933 int main() {
    2934         BENCH(
    2935                 for( N ) do_call( m1/*, m2, m3, m4*/ );
    2936         )
    2937         sout | result`ns;
    2938 }
    2939 \end{cfa}
    2940 \captionof{figure}{\CFA acquire/release mutex benchmark}
    2941 \label{f:mutex}
    2942 
    2943 \columnbreak
    2944 
    2945 \vspace*{-16pt}
    2946 \captionof{table}{Mutex comparison (nanoseconds)}
    2947 \label{tab:mutex}
    2948 \begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
    2949 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
    2950 test and test-and-test lock             & 19.1  & 19.0  & 0.36  \\
    2951 \CFA @mutex@ function, 1 arg.   & 46.6  & 46.8  & 0.86  \\
    2952 \CFA @mutex@ function, 2 arg.   & 84.1  & 85.3  & 1.86  \\
    2953 \CFA @mutex@ function, 4 arg.   & 158.6 & 160.7 & 3.07  \\
    2954 \uC @monitor@ member rtn.               & 54.0  & 53.7  & 0.83  \\
    2955 Java synchronized method                & 27.0  & 27.1  & 0.25  \\
    2956 Pthreads Mutex Lock                             & 33.6  & 32.7  & 1.12
     2963\CFA @signal@, 1 @monitor@      & 372.6         & 374.3         & 14.17         \\
     2964\CFA @signal@, 2 @monitor@      & 492.7         & 494.1         & 12.99         \\
     2965\CFA @signal@, 4 @monitor@      & 749.4         & 750.4         & 24.74         \\
     2966\uC @signal@                            & 320.5         & 321.0         & 3.36          \\
     2967Java @notify@                           & 10160.5       & 10169.4       & 267.71        \\
     2968Pthreads Cond. Variable         & 4949.6        & 5065.2        & 363
    29572969\end{tabular}
    29582970\end{multicols}
Note: See TracChangeset for help on using the changeset viewer.