Changes in / [3253c32:b58affe7]


Ignore:
File:
1 edited

Legend:

Unmodified
Added
Removed
  • doc/papers/concurrency/Paper.tex

    r3253c32 rb58affe7  
    27052705\label{results}
    27062706
    2707 To verify the implementation of the \CFA runtime, a series of microbenchmarks are performed comparing \CFA with pthreads, Java OpenJDK-9, Go 1.12.6 and \uC 7.0.0.
    2708 For comparison, the package must be multi-processor (M:N), which excludes libdill/libmil~\cite{libdill} (M:1)), and use a shared-memory programming model, \eg not message passing.
    2709 The benchmark computer is an AMD Opteron\texttrademark\ 6380 NUMA 64-core, 8 socket, 2.5 GHz processor, running Ubuntu 16.04.6 LTS, and \CFA/\uC are compiled with gcc 6.5.
     2707To verify the implementation of the \CFA runtime, a series of microbenchmarks are performed comparing \CFA with Java OpenJDK-9, Go 1.9.2 and \uC 7.0.0.
     2708The benchmark computer is an AMD Opteron\texttrademark\ 6380 NUMA 64-core, 8 socket, 2.5 GHz processor, running Ubuntu 16.04.6 LTS, and \uC/\CFA are compiled with gcc 6.5.
     2709
     2710\begin{comment}
     2711\begin{table}
     2712\centering
     2713\caption{Experiment environment}
     2714\label{t:machine}
     2715
     2716\begin{tabular}{|l|r||l|r|}
     2717\hline
     2718Architecture            & x86\_64                               & NUMA node(s)  & 8 \\
     2719\hline
     2720CPU op-mode(s)          & 32-bit, 64-bit                & Model name    & AMD Opteron\texttrademark\ Processor 6380 \\
     2721\hline
     2722Byte Order                      & Little Endian                 & CPU Freq              & 2.5 GHz \\
     2723\hline
     2724CPU(s)                          & 64                                    & L1d cache     & 16 KiB \\
     2725\hline
     2726Thread(s) per core      & 2                                     & L1i cache     & 64 KiB \\
     2727\hline
     2728Core(s) per socket      & 8                                     & L2 cache              & 2048 KiB \\
     2729\hline
     2730Socket(s)                       & 4                                     & L3 cache              & 6144 KiB \\
     2731\hline
     2732\hline
     2733Operating system        & Ubuntu 16.04.3 LTS    & Kernel                & Linux 4.4-97-generic \\
     2734\hline
     2735gcc                                     & 6.3                                   & \CFA                  & 1.0.0 \\
     2736\hline
     2737Java                            & OpenJDK-9                     & Go                    & 1.9.2 \\
     2738\hline
     2739\end{tabular}
     2740\end{table}
     2741\end{comment}
    27102742
    27112743All benchmarks are run using the following harness.
     
    27172749Each benchmark is performed @N@ times, where @N@ varies depending on the benchmark;
    27182750the total time is divided by @N@ to obtain the average time for a benchmark.
    2719 Each benchmark experiment is run 31 times.
    27202751All omitted tests for other languages are functionally identical to the \CFA tests and available online~\cite{CforallBenchMarks}.
    27212752
     
    27482779\begin{tabular}[t]{@{}r*{3}{D{.}{.}{5.2}}@{}}
    27492780\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
    2750 \CFA Coroutine Lazy             & 14.3          & 14.3          & 0.32          \\
    2751 \CFA Coroutine Eager    & 2203.7        & 2205.6        & 26.03         \\
    2752 \CFA Thread                             & 1257.8        & 1291.2        & 86.19         \\
    2753 \uC Coroutine                   & 92.2          & 91.4          & 1.58          \\
    2754 \uC Thread                              & 499.5         & 500.1         & 5.67          \\
    2755 Goroutine                               & 4397.0        & 4362.8        & 390.77        \\
    2756 Java Thread                             & 107405.0      & 107794.8      & 1601.33       \\
    2757 % Qthreads                              & 159.9         & 159.6         & 0.73          \\
    2758 Pthreads                                & 32920.9       & 32882.7       & 213.55
     2781Pthreads                                & 28091         & 28073.39      & 163.1         \\
     2782\CFA Coroutine Lazy             & 6                     & 6.07          & 0.26          \\
     2783\CFA Coroutine Eager    & 520           & 520.61        & 2.04          \\
     2784\CFA Thread                             & 2032          & 2016.29       & 112.07        \\
     2785\uC Coroutine                   & 106           & 107.36        & 1.47          \\
     2786\uC Thread                              & 536.5         & 537.07        & 4.64          \\
     2787Goroutine                               & 3103          & 3086.29       & 90.25         \\
     2788Java Thread                             & 103416.5      & 103732.29     & 1137          \\
     2789\end{tabular}
     2790\end{multicols}
     2791
     2792
     2793\paragraph{Context-Switching}
     2794
     2795In procedural programming, the cost of a function call is important as modularization (refactoring) increases.
     2796(In many cases, a compiler inlines function calls to eliminate this cost.)
     2797Similarly, when modularization extends to coroutines/tasks, the time for a context switch becomes a relevant factor.
     2798The coroutine test is from resumer to suspender and from suspender to resumer, which is two context switches.
     2799The thread test is using yield to enter and return from the runtime kernel, which is two context switches.
     2800The difference in performance between coroutine and thread context-switch is the cost of scheduling for threads, whereas coroutines are self-scheduling.
     2801Figure~\ref{f:ctx-switch} only shows the \CFA code for coroutines/threads (other systems are similar) with all results in Table~\ref{tab:ctx-switch}.
     2802
     2803\begin{multicols}{2}
     2804\lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
     2805\begin{cfa}[aboveskip=0pt,belowskip=0pt]
     2806@coroutine@ C {} c;
     2807void main( C & ) { for ( ;; ) { @suspend;@ } }
     2808int main() { // coroutine test
     2809        BENCH( for ( N ) { @resume( c );@ } )
     2810        sout | result`ns;
     2811}
     2812int main() { // task test
     2813        BENCH( for ( N ) { @yield();@ } )
     2814        sout | result`ns;
     2815}
     2816\end{cfa}
     2817\captionof{figure}{\CFA context-switch benchmark}
     2818\label{f:ctx-switch}
     2819
     2820\columnbreak
     2821
     2822\vspace*{-16pt}
     2823\captionof{table}{Context switch comparison (nanoseconds)}
     2824\label{tab:ctx-switch}
     2825\begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
     2826\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
     2827C function              & 2                     & 2             & 0             \\
     2828\CFA generator  & 2                     & 2             & 0             \\
     2829\CFA Coroutine  & 49    & 48.68         & 0.47  \\
     2830\CFA Thread             & 105   & 105.57        & 1.37  \\
     2831\uC Coroutine   & 44    & 44            & 0             \\
     2832\uC Thread              & 100   & 99.29         & 0.96  \\
     2833Goroutine               & 145   & 147.25        & 4.15  \\
     2834Java Thread             & 373.5 & 375.14        & 8.72  \\
     2835Pthreads Thread & 333.5 & 332.96        & 4.1
     2836\end{tabular}
     2837\end{multicols}
     2838
     2839
     2840\paragraph{Mutual-Exclusion}
     2841
     2842Uncontented mutual exclusion, which occurs frequently, is measured by entering/leaving a critical section.
     2843For monitors, entering and leaving a monitor function is measured.
     2844To put the results in context, the cost of entering a non-inline function and the cost of acquiring and releasing a @pthread_mutex@ lock is also measured.
     2845Figure~\ref{f:mutex} shows the code for \CFA with all results in Table~\ref{tab:mutex}.
     2846Note, the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects.
     2847
     2848\begin{multicols}{2}
     2849\lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
     2850\begin{cfa}
     2851@monitor@ M {} m1/*, m2, m3, m4*/;
     2852void __attribute__((noinline))
     2853do_call( M & @mutex m/*, m2, m3, m4*/@ ) {}
     2854int main() {
     2855        BENCH(
     2856                for( N ) do_call( m1/*, m2, m3, m4*/ );
     2857        )
     2858        sout | result`ns;
     2859}
     2860\end{cfa}
     2861\captionof{figure}{\CFA acquire/release mutex benchmark}
     2862\label{f:mutex}
     2863
     2864\columnbreak
     2865
     2866\vspace*{-16pt}
     2867\captionof{table}{Mutex comparison (nanoseconds)}
     2868\label{tab:mutex}
     2869\begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
     2870\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
     2871test and test-and-test lock             & 26            & 26    & 0             \\
     2872Pthreads Mutex Lock                             & 31            & 31.71 & 0.97  \\
     2873\uC @monitor@ member rtn.               & 31            & 31    & 0             \\
     2874\CFA @mutex@ function, 1 arg.   & 46            & 46.68 & 0.93  \\
     2875\CFA @mutex@ function, 2 arg.   & 84            & 85.36 & 1.99  \\
     2876\CFA @mutex@ function, 4 arg.   & 158           & 161   & 4.22  \\
     2877Java synchronized method                & 27.5          & 29.79 & 2.93
    27592878\end{tabular}
    27602879\end{multicols}
     
    28042923\begin{tabular}{@{}r*{3}{D{.}{.}{5.2}}@{}}
    28052924\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
    2806 \CFA @signal@, 1 @monitor@      & 367.0         & 371.5         & 17.34         \\
    2807 \CFA @signal@, 2 @monitor@      & 477.2         & 478.6         & 8.31          \\
    2808 \CFA @signal@, 4 @monitor@      & 725.8         & 734.0         & 17.98         \\
    2809 \uC @signal@                            & 322.8         & 323.0         & 3.64          \\
    2810 Java @notify@                           & 16520.0       & 20096.7       & 9378.53       \\
    2811 Pthreads Cond. Variable         & 4931.3        & 5057.0        & 326.80
     2925Pthreads Cond. Variable         & 6005          & 5681.43       & 835.45        \\
     2926\uC @signal@                            & 324           & 325.54        & 3.02          \\
     2927\CFA @signal@, 1 @monitor@      & 368.5         & 370.61        & 4.77          \\
     2928\CFA @signal@, 2 @monitor@      & 467           & 470.5         & 6.79          \\
     2929\CFA @signal@, 4 @monitor@      & 700.5         & 702.46        & 7.23          \\
     2930Java @notify@                           & 15471         & 172511        & 5689
    28122931\end{tabular}
    28132932\end{multicols}
     
    28552974\begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
    28562975\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
    2857 \CFA @waitfor@, 1 @monitor@     & 366.7         & 369.5 & 7.52  \\
    2858 \CFA @waitfor@, 2 @monitor@     & 453.6         & 455.8 & 12.38 \\
    2859 \CFA @waitfor@, 4 @monitor@     & 671.6         & 672.4 & 14.16 \\
    2860 \uC @_Accept@                           & 336.0         & 335.8         & 3.22
    2861 \end{tabular}
    2862 \end{multicols}
    2863 
    2864 
    2865 \paragraph{Context-Switching}
    2866 
    2867 In procedural programming, the cost of a function call is important as modularization (refactoring) increases.
    2868 (In many cases, a compiler inlines function calls to eliminate this cost.)
    2869 Similarly, when modularization extends to coroutines/tasks, the time for a context switch becomes a relevant factor.
    2870 The coroutine test is from resumer to suspender and from suspender to resumer, which is two context switches.
    2871 The thread test is using yield to enter and return from the runtime kernel, which is two context switches.
    2872 The difference in performance between coroutine and thread context-switch is the cost of scheduling for threads, whereas coroutines are self-scheduling.
    2873 Figure~\ref{f:ctx-switch} only shows the \CFA code for coroutines/threads (other systems are similar) with all results in Table~\ref{tab:ctx-switch}.
    2874 
    2875 \begin{multicols}{2}
    2876 \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
    2877 \begin{cfa}[aboveskip=0pt,belowskip=0pt]
    2878 @coroutine@ C {} c;
    2879 void main( C & ) { for ( ;; ) { @suspend;@ } }
    2880 int main() { // coroutine test
    2881         BENCH( for ( N ) { @resume( c );@ } )
    2882         sout | result`ns;
    2883 }
    2884 int main() { // task test
    2885         BENCH( for ( N ) { @yield();@ } )
    2886         sout | result`ns;
    2887 }
    2888 \end{cfa}
    2889 \captionof{figure}{\CFA context-switch benchmark}
    2890 \label{f:ctx-switch}
    2891 
    2892 \columnbreak
    2893 
    2894 \vspace*{-16pt}
    2895 \captionof{table}{Context switch comparison (nanoseconds)}
    2896 \label{tab:ctx-switch}
    2897 \begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
    2898 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
    2899 C function              & 1.8           & 1.8   & 0             \\
    2900 \CFA generator  & 2.7           & 2.4   & 0.27  \\
    2901 \CFA Coroutine  & 37.8          & 37.7  & 0.22  \\
    2902 \CFA Thread             & 93.6          & 93.8  & 1.46  \\
    2903 \uC Coroutine   & 52.7          & 52.8  & 0.28  \\
    2904 \uC Thread              & 93.4          & 93.7  & 1.04  \\
    2905 Goroutine               & 140.0         & 139.7 & 2.93  \\
    2906 Java Thread             & 374.0         & 375.8 & 10.38 \\
    2907 % Qthreads Thread       & 159.5         & 159.3 & 0.71  \\
    2908 Pthreads Thread & 334.4         & 335.0 & 1.95  \\
    2909 \end{tabular}
    2910 \end{multicols}
    2911 
    2912 
    2913 \paragraph{Mutual-Exclusion}
    2914 
    2915 Uncontented mutual exclusion, which occurs frequently, is measured by entering/leaving a critical section.
    2916 For monitors, entering and leaving a monitor function is measured.
    2917 To put the results in context, the cost of entering a non-inline function and the cost of acquiring and releasing a @pthread_mutex@ lock is also measured.
    2918 Figure~\ref{f:mutex} shows the code for \CFA with all results in Table~\ref{tab:mutex}.
    2919 Note, the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects.
    2920 
    2921 \begin{multicols}{2}
    2922 \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
    2923 \begin{cfa}
    2924 @monitor@ M {} m1/*, m2, m3, m4*/;
    2925 void __attribute__((noinline))
    2926 do_call( M & @mutex m/*, m2, m3, m4*/@ ) {}
    2927 int main() {
    2928         BENCH(
    2929                 for( N ) do_call( m1/*, m2, m3, m4*/ );
    2930         )
    2931         sout | result`ns;
    2932 }
    2933 \end{cfa}
    2934 \captionof{figure}{\CFA acquire/release mutex benchmark}
    2935 \label{f:mutex}
    2936 
    2937 \columnbreak
    2938 
    2939 \vspace*{-16pt}
    2940 \captionof{table}{Mutex comparison (nanoseconds)}
    2941 \label{tab:mutex}
    2942 \begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
    2943 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
    2944 test and test-and-test lock             & 19.1  & 19.0  & 0.36  \\
    2945 \CFA @mutex@ function, 1 arg.   & 46.6  & 46.8  & 0.86  \\
    2946 \CFA @mutex@ function, 2 arg.   & 84.1  & 85.3  & 1.86  \\
    2947 \CFA @mutex@ function, 4 arg.   & 158.6 & 160.7 & 3.07  \\
    2948 \uC @monitor@ member rtn.               & 54.0  & 53.7  & 0.83  \\
    2949 Java synchronized method                & 27.0  & 27.1  & 0.25  \\
    2950 Pthreads Mutex Lock                             & 33.6  & 32.7  & 1.12
     2976\uC @_Accept@                           & 358           & 359.11        & 2.53          \\
     2977\CFA @waitfor@, 1 @monitor@     & 359           & 360.93        & 4.07          \\
     2978\CFA @waitfor@, 2 @monitor@     & 450           & 449.39        & 6.62          \\
     2979\CFA @waitfor@, 4 @monitor@     & 652           & 655.64        & 7.73
    29512980\end{tabular}
    29522981\end{multicols}
Note: See TracChangeset for help on using the changeset viewer.