Changeset 5453237
- Timestamp:
- Jul 29, 2019, 1:47:10 PM (5 years ago)
- Branches:
- ADT, arm-eh, ast-experimental, enum, forall-pointer-decay, jacob/cs343-translation, jenkins-sandbox, master, new-ast, new-ast-unique-expr, pthread-emulation, qualifiedEnum
- Children:
- 1d76f8a4, 2385236
- Parents:
- be53b87
- File:
-
- 1 edited
Legend:
- Unmodified
- Added
- Removed
-
doc/papers/concurrency/Paper.tex
rbe53b87 r5453237 307 307 In many ways, \CFA is to C as Scala~\cite{Scala} is to Java, providing a \emph{research vehicle} for new typing and control-flow capabilities on top of a highly popular programming language allowing immediate dissemination. 308 308 Within the \CFA framework, new control-flow features are created from scratch because ISO \Celeven defines only a subset of the \CFA extensions, where the overlapping features are concurrency~\cite[\S~7.26]{C11}. 309 However, \Celeven concurrency is largely wrappers for a subset of the pthreads library~\cite{Butenhof97,Pthreads}, and \Celeven and pthreads concurrency is simple, based on thread fork/join in a function and a fewlocks, which is low-level and error-prone;309 However, \Celeven concurrency is largely wrappers for a subset of the pthreads library~\cite{Butenhof97,Pthreads}, and \Celeven and pthreads concurrency is simple, based on thread fork/join in a function and mutex/condition locks, which is low-level and error-prone; 310 310 no high-level language concurrency features are defined. 311 Interestingly, almost a decade after publication of the \Celeven standard, neither gcc-8, clang-9 nor msvc-19 (most recent versions) support the \Celeven include @threads.h@, indicating little interest in the C11 concurrency approach .311 Interestingly, almost a decade after publication of the \Celeven standard, neither gcc-8, clang-9 nor msvc-19 (most recent versions) support the \Celeven include @threads.h@, indicating little interest in the C11 concurrency approach (possibly because the effort to add concurrency to \CC). 312 312 Finally, while the \Celeven standard does not state a threading model, the historical association with pthreads suggests implementations would adopt kernel-level threading (1:1)~\cite{ThreadModel}. 313 313 … … 333 333 334 334 Finally, it is important for a language to provide safety over performance \emph{as the default}, allowing careful reduction of safety for performance when necessary. 335 Two concurrency violations of this philosophy are \emph{spurious wakeup} (random wakeup~\cite[\S~8]{Buhr05a}) and \emph{barging} (signals-as-hints~\cite[\S~8]{Buhr05a}), where one is a consequence of the other, \ie once there is spurious wakeup, signals-as-hints follow. 335 Two concurrency violations of this philosophy are \emph{spurious wakeup} (random wakeup~\cite[\S~8]{Buhr05a}) and \emph{barging}\footnote{ 336 The notion of competitive succession instead of direct handoff, \ie a lock owner releases the lock and an arriving thread acquires it ahead of preexisting waiter threads. 337 } (signals-as-hints~\cite[\S~8]{Buhr05a}), where one is a consequence of the other, \ie once there is spurious wakeup, signals-as-hints follow. 336 338 However, spurious wakeup is \emph{not} a foundational concurrency property~\cite[\S~8]{Buhr05a}, it is a performance design choice. 337 339 Similarly, signals-as-hints are often a performance decision. … … 351 353 We present comparative examples so the reader can judge if the \CFA control-flow extensions are better and safer than those in other concurrent, imperative programming languages, and perform experiments to show the \CFA runtime is competitive with other similar mechanisms. 352 354 The main contributions of this work are: 353 \begin{itemize} 355 \begin{itemize}[topsep=3pt,itemsep=1pt] 354 356 \item 355 357 language-level generators, coroutines and user-level threading, which respect the expectations of C programmers. … … 370 372 \end{itemize} 371 373 374 Section~\ref{s:StatefulFunction} begins advanced control by introducing sequential functions that retain data and execution state between calls, which produces constructs @generator@ and @coroutine@. 375 Section~\ref{s:Concurrency} begins concurrency, or how to create (fork) and destroy (join) a thread, which produces the @thread@ construct. 376 Section~\ref{s:MutualExclusionSynchronization} discusses the two mechanisms to restricted nondeterminism when controlling shared access to resources (mutual exclusion) and timing relationships among threads (synchronization). 377 Section~\ref{s:Monitor} shows how both mutual exclusion and synchronization are safely embedded in the @monitor@ and @thread@ constructs. 378 Section~\ref{s:CFARuntimeStructure} describes the large-scale mechanism to structure (cluster) threads and virtual processors (kernel threads). 379 Section~\ref{s:Performance} uses a series of microbenchmarks to compare \CFA threading with pthreads, Java OpenJDK-9, Go 1.12.6 and \uC 7.0.0. 380 372 381 373 382 \section{Stateful Function} 383 \label{s:StatefulFunction} 374 384 375 385 The stateful function is an old idea~\cite{Conway63,Marlin80} that is new again~\cite{C++20Coroutine19}, where execution is temporarily suspended and later resumed, \eg plugin, device driver, finite-state machine. … … 617 627 Figure~\ref{f:CFibonacciSim} shows the C implementation of the \CFA generator only needs one additional field, @next@, to handle retention of execution state. 618 628 The computed @goto@ at the start of the generator main, which branches after the previous suspend, adds very little cost to the resume call. 619 Finally, an explicit generator type provides both design and performance benefits, such as multiple type-safe interface functions taking and returning arbitrary types. 629 Finally, an explicit generator type provides both design and performance benefits, such as multiple type-safe interface functions taking and returning arbitrary types.\footnote{ 630 The \CFA operator syntax uses \lstinline|?| to denote operands, which allows precise definitions for pre, post, and infix operators, \eg \lstinline|++?|, \lstinline|?++|, and \lstinline|?+?|, in addition \lstinline|?\{\}| denotes a constructor, as in \lstinline|foo `f` = `\{`...`\}`|, \lstinline|^?\{\}| denotes a destructor, and \lstinline|?()| is \CC function call \lstinline|operator()|. 631 }% 620 632 \begin{cfa} 621 633 int ?()( Fib & fib ) { return `resume( fib )`.fn; } $\C[3.9in]{// function-call interface}$ … … 1511 1523 1512 1524 \section{Mutual Exclusion / Synchronization} 1525 \label{s:MutualExclusionSynchronization} 1513 1526 1514 1527 Unrestricted nondeterminism is meaningless as there is no way to know when the result is completed without synchronization. … … 1551 1564 higher-level mechanisms often simplify usage by adding better coupling between synchronization and data, \eg receive-specific versus receive-any thread in message passing or offering specialized solutions, \eg barrier lock. 1552 1565 Often synchronization is used to order access to a critical section, \eg ensuring a waiting writer thread enters the critical section before a calling reader thread. 1553 If the calling reader is scheduled before the waiting writer, the reader has \newterm{barged}.1566 If the calling reader is scheduled before the waiting writer, the reader has barged. 1554 1567 Barging can result in staleness/freshness problems, where a reader barges ahead of a writer and reads temporally stale data, or a writer barges ahead of another writer overwriting data with a fresh value preventing the previous value from ever being read (lost computation). 1555 1568 Preventing or detecting barging is an involved challenge with low-level locks, which is made easier through higher-level constructs. … … 2120 2133 2121 2134 2122 \subsection{ Extended \protect\lstinline@waitfor@}2123 2124 Figure~\ref{f:ExtendedWaitfor} show the extended form of the @waitfor@ statement to conditionally accept one of a group of mutex functions, with an optional statement to be performed \emph{after} the mutex function finishes.2135 \subsection{\texorpdfstring{Extended \protect\lstinline@waitfor@}{Extended waitfor}} 2136 2137 Figure~\ref{f:ExtendedWaitfor} shows the extended form of the @waitfor@ statement to conditionally accept one of a group of mutex functions, with an optional statement to be performed \emph{after} the mutex function finishes. 2125 2138 For a @waitfor@ clause to be executed, its @when@ must be true and an outstanding call to its corresponding member(s) must exist. 2126 2139 The \emph{conditional-expression} of a @when@ may call a function, but the function must not block or context switch. … … 2131 2144 Hence, the terminating @else@ clause allows a conditional attempt to accept a call without blocking. 2132 2145 If both @timeout@ and @else@ clause are present, the @else@ must be conditional, or the @timeout@ is never triggered. 2146 There is also a traditional future wait queue (not shown) (\eg Microsoft (@WaitForMultipleObjects@)), to wait for a specified number of future elements in the queue. 2133 2147 2134 2148 \begin{figure} … … 2355 2369 2356 2370 2357 \subsection{\ protect\lstinline@mutex@ Threads}2371 \subsection{\texorpdfstring{\protect\lstinline@mutex@ Threads}{mutex Threads}} 2358 2372 2359 2373 Threads in \CFA can also be monitors to allow \emph{direct communication} among threads, \ie threads can have mutex functions that are called by other threads. … … 2499 2513 \renewcommand{\arraystretch}{1.25} 2500 2514 %\setlength{\tabcolsep}{5pt} 2501 \begin{tabular}{c|c| l|l}2502 \multicolumn{2}{c| }{object properties} & \multicolumn{2}{c}{mutual exclusion} \\2515 \begin{tabular}{c|c||l|l} 2516 \multicolumn{2}{c||}{object properties} & \multicolumn{2}{c}{mutual exclusion} \\ 2503 2517 \hline 2504 2518 thread & stateful & \multicolumn{1}{c|}{No} & \multicolumn{1}{c}{Yes} \\ … … 2605 2619 2606 2620 2607 \section{ \protect\CFARuntime Structure}2621 \section{Runtime Structure} 2608 2622 \label{s:CFARuntimeStructure} 2609 2623 … … 2709 2723 2710 2724 \section{Performance} 2711 \label{ results}2725 \label{s:Performance} 2712 2726 2713 2727 To verify the implementation of the \CFA runtime, a series of microbenchmarks are performed comparing \CFA with pthreads, Java OpenJDK-9, Go 1.12.6 and \uC 7.0.0. … … 2715 2729 The benchmark computer is an AMD Opteron\texttrademark\ 6380 NUMA 64-core, 8 socket, 2.5 GHz processor, running Ubuntu 16.04.6 LTS, and \CFA/\uC are compiled with gcc 6.5. 2716 2730 2717 All benchmarks are run using the following harness. 2731 All benchmarks are run using the following harness. (The Java harness is augmented to circumvent JIT issues.) 2718 2732 \begin{cfa} 2719 2733 unsigned int N = 10_000_000; … … 2754 2768 \begin{tabular}[t]{@{}r*{3}{D{.}{.}{5.2}}@{}} 2755 2769 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\ 2756 \CFA Coroutine Lazy & 14.3 & 14.3 & 0.32 \\ 2757 \CFA Coroutine Eager & 522.8 & 525.3 & 5.81 \\ 2758 \CFA Thread & 1257.8 & 1291.2 & 86.19 \\ 2759 \uC Coroutine & 92.2 & 91.4 & 1.58 \\ 2760 \uC Thread & 499.5 & 500.1 & 5.67 \\ 2761 Goroutine & 4397.0 & 4362.8 & 390.77 \\ 2762 Java Thread & 107405.0 & 107794.8 & 1601.33 \\ 2763 % Qthreads & 159.9 & 159.6 & 0.73 \\ 2764 Pthreads & 32920.9 & 32882.7 & 213.55 2770 \CFA Coroutine Lazy & 13.2 & 13.1 & 0.44 \\ 2771 \CFA Coroutine Eager & 531.3 & 536.0 & 26.54 \\ 2772 \CFA Thread & 2074.9 & 2066.5 & 170.76 \\ 2773 \uC Coroutine & 89.6 & 90.5 & 1.83 \\ 2774 \uC Thread & 528.2 & 528.5 & 4.94 \\ 2775 Goroutine & 4068.0 & 4113.1 & 414.55 \\ 2776 Java Thread & 103848.5 & 104295.4 & 2637.57 \\ 2777 Pthreads & 33112.6 & 33127.1 & 165.90 2778 \end{tabular} 2779 \end{multicols} 2780 2781 2782 \paragraph{Context-Switching} 2783 2784 In procedural programming, the cost of a function call is important as modularization (refactoring) increases. 2785 (In many cases, a compiler inlines function calls to eliminate this cost.) 2786 Similarly, when modularization extends to coroutines/tasks, the time for a context switch becomes a relevant factor. 2787 The coroutine test is from resumer to suspender and from suspender to resumer, which is two context switches. 2788 The thread test is using yield to enter and return from the runtime kernel, which is two context switches. 2789 The difference in performance between coroutine and thread context-switch is the cost of scheduling for threads, whereas coroutines are self-scheduling. 2790 Figure~\ref{f:ctx-switch} only shows the \CFA code for coroutines/threads (other systems are similar) with all results in Table~\ref{tab:ctx-switch}. 2791 2792 \begin{multicols}{2} 2793 \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}} 2794 \begin{cfa}[aboveskip=0pt,belowskip=0pt] 2795 @coroutine@ C {} c; 2796 void main( C & ) { for ( ;; ) { @suspend;@ } } 2797 int main() { // coroutine test 2798 BENCH( for ( N ) { @resume( c );@ } ) 2799 sout | result`ns; 2800 } 2801 int main() { // task test 2802 BENCH( for ( N ) { @yield();@ } ) 2803 sout | result`ns; 2804 } 2805 \end{cfa} 2806 \captionof{figure}{\CFA context-switch benchmark} 2807 \label{f:ctx-switch} 2808 2809 \columnbreak 2810 2811 \vspace*{-16pt} 2812 \captionof{table}{Context switch comparison (nanoseconds)} 2813 \label{tab:ctx-switch} 2814 \begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}} 2815 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\ 2816 C function & 1.8 & 1.8 & 0.01 \\ 2817 \CFA generator & 2.4 & 2.2 & 0.25 \\ 2818 \CFA Coroutine & 36.2 & 36.2 & 0.25 \\ 2819 \CFA Thread & 93.2 & 93.5 & 2.09 \\ 2820 \uC Coroutine & 52.0 & 52.1 & 0.51 \\ 2821 \uC Thread & 96.2 & 96.3 & 0.58 \\ 2822 Goroutine & 141.0 & 141.3 & 3.39 \\ 2823 Java Thread & 374.0 & 375.8 & 10.38 \\ 2824 Pthreads Thread & 361.0 & 365.3 & 13.19 2825 \end{tabular} 2826 \end{multicols} 2827 2828 2829 \paragraph{Mutual-Exclusion} 2830 2831 Uncontented mutual exclusion, which frequently occurs, is measured by entering/leaving a critical section. 2832 For monitors, entering and leaving a monitor function is measured. 2833 To put the results in context, the cost of entering a non-inline function and the cost of acquiring and releasing a @pthread_mutex@ lock is also measured. 2834 Figure~\ref{f:mutex} shows the code for \CFA with all results in Table~\ref{tab:mutex}. 2835 Note, the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects. 2836 2837 \begin{multicols}{2} 2838 \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}} 2839 \begin{cfa} 2840 @monitor@ M {} m1/*, m2, m3, m4*/; 2841 void __attribute__((noinline)) 2842 do_call( M & @mutex m/*, m2, m3, m4*/@ ) {} 2843 int main() { 2844 BENCH( 2845 for( N ) do_call( m1/*, m2, m3, m4*/ ); 2846 ) 2847 sout | result`ns; 2848 } 2849 \end{cfa} 2850 \captionof{figure}{\CFA acquire/release mutex benchmark} 2851 \label{f:mutex} 2852 2853 \columnbreak 2854 2855 \vspace*{-16pt} 2856 \captionof{table}{Mutex comparison (nanoseconds)} 2857 \label{tab:mutex} 2858 \begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}} 2859 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\ 2860 test and test-and-test lock & 19.1 & 18.9 & 0.40 \\ 2861 \CFA @mutex@ function, 1 arg. & 45.9 & 46.6 & 1.45 \\ 2862 \CFA @mutex@ function, 2 arg. & 105.0 & 104.7 & 3.08 \\ 2863 \CFA @mutex@ function, 4 arg. & 165.0 & 167.6 & 5.65 \\ 2864 \uC @monitor@ member rtn. & 54.0 & 53.7 & 0.82 \\ 2865 Java synchronized method & 31.0 & 31.1 & 0.50 \\ 2866 Pthreads Mutex Lock & 33.6 & 32.6 & 1.14 2867 \end{tabular} 2868 \end{multicols} 2869 2870 2871 \paragraph{External Scheduling} 2872 2873 External scheduling is measured using a cycle of two threads calling and accepting the call using the @waitfor@ statement. 2874 Figure~\ref{f:ext-sched} shows the code for \CFA, with results in Table~\ref{tab:ext-sched}. 2875 Note, the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects. 2876 2877 \begin{multicols}{2} 2878 \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}} 2879 \vspace*{-16pt} 2880 \begin{cfa} 2881 volatile int go = 0; 2882 @monitor@ M {} m; 2883 thread T {}; 2884 void __attribute__((noinline)) 2885 do_call( M & @mutex@ ) {} 2886 void main( T & ) { 2887 while ( go == 0 ) { yield(); } 2888 while ( go == 1 ) { do_call( m ); } 2889 } 2890 int __attribute__((noinline)) 2891 do_wait( M & @mutex@ m ) { 2892 go = 1; // continue other thread 2893 BENCH( for ( N ) { @waitfor( do_call, m );@ } ) 2894 go = 0; // stop other thread 2895 sout | result`ns; 2896 } 2897 int main() { 2898 T t; 2899 do_wait( m ); 2900 } 2901 \end{cfa} 2902 \captionof{figure}{\CFA external-scheduling benchmark} 2903 \label{f:ext-sched} 2904 2905 \columnbreak 2906 2907 \vspace*{-16pt} 2908 \captionof{table}{External-scheduling comparison (nanoseconds)} 2909 \label{tab:ext-sched} 2910 \begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}} 2911 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\ 2912 \CFA @waitfor@, 1 @monitor@ & 376.4 & 376.8 & 7.63 \\ 2913 \CFA @waitfor@, 2 @monitor@ & 491.4 & 492.0 & 13.31 \\ 2914 \CFA @waitfor@, 4 @monitor@ & 681.0 & 681.7 & 19.10 \\ 2915 \uC @_Accept@ & 331.1 & 331.4 & 2.66 2765 2916 \end{tabular} 2766 2917 \end{multicols} … … 2810 2961 \begin{tabular}{@{}r*{3}{D{.}{.}{5.2}}@{}} 2811 2962 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\ 2812 \CFA @signal@, 1 @monitor@ & 367.0 & 371.5 & 17.34 \\ 2813 \CFA @signal@, 2 @monitor@ & 477.2 & 478.6 & 8.31 \\ 2814 \CFA @signal@, 4 @monitor@ & 725.8 & 734.0 & 17.98 \\ 2815 \uC @signal@ & 322.8 & 323.0 & 3.64 \\ 2816 Java @notify@ & 16520.0 & 20096.7 & 9378.53 \\ 2817 Pthreads Cond. Variable & 4931.3 & 5057.0 & 326.80 2818 \end{tabular} 2819 \end{multicols} 2820 2821 2822 \paragraph{External Scheduling} 2823 2824 External scheduling is measured using a cycle of two threads calling and accepting the call using the @waitfor@ statement. 2825 Figure~\ref{f:ext-sched} shows the code for \CFA, with results in Table~\ref{tab:ext-sched}. 2826 Note, the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects. 2827 2828 \begin{multicols}{2} 2829 \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}} 2830 \vspace*{-16pt} 2831 \begin{cfa} 2832 volatile int go = 0; 2833 @monitor@ M {} m; 2834 thread T {}; 2835 void __attribute__((noinline)) 2836 do_call( M & @mutex@ ) {} 2837 void main( T & ) { 2838 while ( go == 0 ) { yield(); } 2839 while ( go == 1 ) { do_call( m ); } 2840 } 2841 int __attribute__((noinline)) 2842 do_wait( M & @mutex@ m ) { 2843 go = 1; // continue other thread 2844 BENCH( for ( N ) { @waitfor( do_call, m );@ } ) 2845 go = 0; // stop other thread 2846 sout | result`ns; 2847 } 2848 int main() { 2849 T t; 2850 do_wait( m ); 2851 } 2852 \end{cfa} 2853 \captionof{figure}{\CFA external-scheduling benchmark} 2854 \label{f:ext-sched} 2855 2856 \columnbreak 2857 2858 \vspace*{-16pt} 2859 \captionof{table}{External-scheduling comparison (nanoseconds)} 2860 \label{tab:ext-sched} 2861 \begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}} 2862 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\ 2863 \CFA @waitfor@, 1 @monitor@ & 366.7 & 369.5 & 7.52 \\ 2864 \CFA @waitfor@, 2 @monitor@ & 453.6 & 455.8 & 12.38 \\ 2865 \CFA @waitfor@, 4 @monitor@ & 671.6 & 672.4 & 14.16 \\ 2866 \uC @_Accept@ & 336.0 & 335.8 & 3.22 2867 \end{tabular} 2868 \end{multicols} 2869 2870 2871 \paragraph{Context-Switching} 2872 2873 In procedural programming, the cost of a function call is important as modularization (refactoring) increases. 2874 (In many cases, a compiler inlines function calls to eliminate this cost.) 2875 Similarly, when modularization extends to coroutines/tasks, the time for a context switch becomes a relevant factor. 2876 The coroutine test is from resumer to suspender and from suspender to resumer, which is two context switches. 2877 The thread test is using yield to enter and return from the runtime kernel, which is two context switches. 2878 The difference in performance between coroutine and thread context-switch is the cost of scheduling for threads, whereas coroutines are self-scheduling. 2879 Figure~\ref{f:ctx-switch} only shows the \CFA code for coroutines/threads (other systems are similar) with all results in Table~\ref{tab:ctx-switch}. 2880 2881 \begin{multicols}{2} 2882 \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}} 2883 \begin{cfa}[aboveskip=0pt,belowskip=0pt] 2884 @coroutine@ C {} c; 2885 void main( C & ) { for ( ;; ) { @suspend;@ } } 2886 int main() { // coroutine test 2887 BENCH( for ( N ) { @resume( c );@ } ) 2888 sout | result`ns; 2889 } 2890 int main() { // task test 2891 BENCH( for ( N ) { @yield();@ } ) 2892 sout | result`ns; 2893 } 2894 \end{cfa} 2895 \captionof{figure}{\CFA context-switch benchmark} 2896 \label{f:ctx-switch} 2897 2898 \columnbreak 2899 2900 \vspace*{-16pt} 2901 \captionof{table}{Context switch comparison (nanoseconds)} 2902 \label{tab:ctx-switch} 2903 \begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}} 2904 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\ 2905 C function & 1.8 & 1.8 & 0 \\ 2906 \CFA generator & 2.7 & 2.4 & 0.27 \\ 2907 \CFA Coroutine & 37.8 & 37.7 & 0.22 \\ 2908 \CFA Thread & 93.6 & 93.8 & 1.46 \\ 2909 \uC Coroutine & 52.7 & 52.8 & 0.28 \\ 2910 \uC Thread & 93.4 & 93.7 & 1.04 \\ 2911 Goroutine & 140.0 & 139.7 & 2.93 \\ 2912 Java Thread & 374.0 & 375.8 & 10.38 \\ 2913 % Qthreads Thread & 159.5 & 159.3 & 0.71 \\ 2914 Pthreads Thread & 334.4 & 335.0 & 1.95 \\ 2915 \end{tabular} 2916 \end{multicols} 2917 2918 2919 \paragraph{Mutual-Exclusion} 2920 2921 Uncontented mutual exclusion, which frequently occurs, is measured by entering/leaving a critical section. 2922 For monitors, entering and leaving a monitor function is measured. 2923 To put the results in context, the cost of entering a non-inline function and the cost of acquiring and releasing a @pthread_mutex@ lock is also measured. 2924 Figure~\ref{f:mutex} shows the code for \CFA with all results in Table~\ref{tab:mutex}. 2925 Note, the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects. 2926 2927 \begin{multicols}{2} 2928 \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}} 2929 \begin{cfa} 2930 @monitor@ M {} m1/*, m2, m3, m4*/; 2931 void __attribute__((noinline)) 2932 do_call( M & @mutex m/*, m2, m3, m4*/@ ) {} 2933 int main() { 2934 BENCH( 2935 for( N ) do_call( m1/*, m2, m3, m4*/ ); 2936 ) 2937 sout | result`ns; 2938 } 2939 \end{cfa} 2940 \captionof{figure}{\CFA acquire/release mutex benchmark} 2941 \label{f:mutex} 2942 2943 \columnbreak 2944 2945 \vspace*{-16pt} 2946 \captionof{table}{Mutex comparison (nanoseconds)} 2947 \label{tab:mutex} 2948 \begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}} 2949 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\ 2950 test and test-and-test lock & 19.1 & 19.0 & 0.36 \\ 2951 \CFA @mutex@ function, 1 arg. & 46.6 & 46.8 & 0.86 \\ 2952 \CFA @mutex@ function, 2 arg. & 84.1 & 85.3 & 1.86 \\ 2953 \CFA @mutex@ function, 4 arg. & 158.6 & 160.7 & 3.07 \\ 2954 \uC @monitor@ member rtn. & 54.0 & 53.7 & 0.83 \\ 2955 Java synchronized method & 27.0 & 27.1 & 0.25 \\ 2956 Pthreads Mutex Lock & 33.6 & 32.7 & 1.12 2963 \CFA @signal@, 1 @monitor@ & 372.6 & 374.3 & 14.17 \\ 2964 \CFA @signal@, 2 @monitor@ & 492.7 & 494.1 & 12.99 \\ 2965 \CFA @signal@, 4 @monitor@ & 749.4 & 750.4 & 24.74 \\ 2966 \uC @signal@ & 320.5 & 321.0 & 3.36 \\ 2967 Java @notify@ & 10160.5 & 10169.4 & 267.71 \\ 2968 Pthreads Cond. Variable & 4949.6 & 5065.2 & 363 2957 2969 \end{tabular} 2958 2970 \end{multicols}
Note: See TracChangeset
for help on using the changeset viewer.