Changes in / [3253c32:b58affe7]
- File:
-
- 1 edited
-
doc/papers/concurrency/Paper.tex (modified) (5 diffs)
Legend:
- Unmodified
- Added
- Removed
-
doc/papers/concurrency/Paper.tex
r3253c32 rb58affe7 2705 2705 \label{results} 2706 2706 2707 To verify the implementation of the \CFA runtime, a series of microbenchmarks are performed comparing \CFA with pthreads, Java OpenJDK-9, Go 1.12.6 and \uC 7.0.0. 2708 For comparison, the package must be multi-processor (M:N), which excludes libdill/libmil~\cite{libdill} (M:1)), and use a shared-memory programming model, \eg not message passing. 2709 The benchmark computer is an AMD Opteron\texttrademark\ 6380 NUMA 64-core, 8 socket, 2.5 GHz processor, running Ubuntu 16.04.6 LTS, and \CFA/\uC are compiled with gcc 6.5. 2707 To verify the implementation of the \CFA runtime, a series of microbenchmarks are performed comparing \CFA with Java OpenJDK-9, Go 1.9.2 and \uC 7.0.0. 2708 The benchmark computer is an AMD Opteron\texttrademark\ 6380 NUMA 64-core, 8 socket, 2.5 GHz processor, running Ubuntu 16.04.6 LTS, and \uC/\CFA are compiled with gcc 6.5. 2709 2710 \begin{comment} 2711 \begin{table} 2712 \centering 2713 \caption{Experiment environment} 2714 \label{t:machine} 2715 2716 \begin{tabular}{|l|r||l|r|} 2717 \hline 2718 Architecture & x86\_64 & NUMA node(s) & 8 \\ 2719 \hline 2720 CPU op-mode(s) & 32-bit, 64-bit & Model name & AMD Opteron\texttrademark\ Processor 6380 \\ 2721 \hline 2722 Byte Order & Little Endian & CPU Freq & 2.5 GHz \\ 2723 \hline 2724 CPU(s) & 64 & L1d cache & 16 KiB \\ 2725 \hline 2726 Thread(s) per core & 2 & L1i cache & 64 KiB \\ 2727 \hline 2728 Core(s) per socket & 8 & L2 cache & 2048 KiB \\ 2729 \hline 2730 Socket(s) & 4 & L3 cache & 6144 KiB \\ 2731 \hline 2732 \hline 2733 Operating system & Ubuntu 16.04.3 LTS & Kernel & Linux 4.4-97-generic \\ 2734 \hline 2735 gcc & 6.3 & \CFA & 1.0.0 \\ 2736 \hline 2737 Java & OpenJDK-9 & Go & 1.9.2 \\ 2738 \hline 2739 \end{tabular} 2740 \end{table} 2741 \end{comment} 2710 2742 2711 2743 All benchmarks are run using the following harness. … … 2717 2749 Each benchmark is performed @N@ times, where @N@ varies depending on the benchmark; 2718 2750 the total time is divided by @N@ to obtain the average time for a benchmark. 2719 Each benchmark experiment is run 31 times.2720 2751 All omitted tests for other languages are functionally identical to the \CFA tests and available online~\cite{CforallBenchMarks}. 2721 2752 … … 2748 2779 \begin{tabular}[t]{@{}r*{3}{D{.}{.}{5.2}}@{}} 2749 2780 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\ 2750 \CFA Coroutine Lazy & 14.3 & 14.3 & 0.32 \\ 2751 \CFA Coroutine Eager & 2203.7 & 2205.6 & 26.03 \\ 2752 \CFA Thread & 1257.8 & 1291.2 & 86.19 \\ 2753 \uC Coroutine & 92.2 & 91.4 & 1.58 \\ 2754 \uC Thread & 499.5 & 500.1 & 5.67 \\ 2755 Goroutine & 4397.0 & 4362.8 & 390.77 \\ 2756 Java Thread & 107405.0 & 107794.8 & 1601.33 \\ 2757 % Qthreads & 159.9 & 159.6 & 0.73 \\ 2758 Pthreads & 32920.9 & 32882.7 & 213.55 2781 Pthreads & 28091 & 28073.39 & 163.1 \\ 2782 \CFA Coroutine Lazy & 6 & 6.07 & 0.26 \\ 2783 \CFA Coroutine Eager & 520 & 520.61 & 2.04 \\ 2784 \CFA Thread & 2032 & 2016.29 & 112.07 \\ 2785 \uC Coroutine & 106 & 107.36 & 1.47 \\ 2786 \uC Thread & 536.5 & 537.07 & 4.64 \\ 2787 Goroutine & 3103 & 3086.29 & 90.25 \\ 2788 Java Thread & 103416.5 & 103732.29 & 1137 \\ 2789 \end{tabular} 2790 \end{multicols} 2791 2792 2793 \paragraph{Context-Switching} 2794 2795 In procedural programming, the cost of a function call is important as modularization (refactoring) increases. 2796 (In many cases, a compiler inlines function calls to eliminate this cost.) 2797 Similarly, when modularization extends to coroutines/tasks, the time for a context switch becomes a relevant factor. 2798 The coroutine test is from resumer to suspender and from suspender to resumer, which is two context switches. 2799 The thread test is using yield to enter and return from the runtime kernel, which is two context switches. 2800 The difference in performance between coroutine and thread context-switch is the cost of scheduling for threads, whereas coroutines are self-scheduling. 2801 Figure~\ref{f:ctx-switch} only shows the \CFA code for coroutines/threads (other systems are similar) with all results in Table~\ref{tab:ctx-switch}. 2802 2803 \begin{multicols}{2} 2804 \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}} 2805 \begin{cfa}[aboveskip=0pt,belowskip=0pt] 2806 @coroutine@ C {} c; 2807 void main( C & ) { for ( ;; ) { @suspend;@ } } 2808 int main() { // coroutine test 2809 BENCH( for ( N ) { @resume( c );@ } ) 2810 sout | result`ns; 2811 } 2812 int main() { // task test 2813 BENCH( for ( N ) { @yield();@ } ) 2814 sout | result`ns; 2815 } 2816 \end{cfa} 2817 \captionof{figure}{\CFA context-switch benchmark} 2818 \label{f:ctx-switch} 2819 2820 \columnbreak 2821 2822 \vspace*{-16pt} 2823 \captionof{table}{Context switch comparison (nanoseconds)} 2824 \label{tab:ctx-switch} 2825 \begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}} 2826 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\ 2827 C function & 2 & 2 & 0 \\ 2828 \CFA generator & 2 & 2 & 0 \\ 2829 \CFA Coroutine & 49 & 48.68 & 0.47 \\ 2830 \CFA Thread & 105 & 105.57 & 1.37 \\ 2831 \uC Coroutine & 44 & 44 & 0 \\ 2832 \uC Thread & 100 & 99.29 & 0.96 \\ 2833 Goroutine & 145 & 147.25 & 4.15 \\ 2834 Java Thread & 373.5 & 375.14 & 8.72 \\ 2835 Pthreads Thread & 333.5 & 332.96 & 4.1 2836 \end{tabular} 2837 \end{multicols} 2838 2839 2840 \paragraph{Mutual-Exclusion} 2841 2842 Uncontented mutual exclusion, which occurs frequently, is measured by entering/leaving a critical section. 2843 For monitors, entering and leaving a monitor function is measured. 2844 To put the results in context, the cost of entering a non-inline function and the cost of acquiring and releasing a @pthread_mutex@ lock is also measured. 2845 Figure~\ref{f:mutex} shows the code for \CFA with all results in Table~\ref{tab:mutex}. 2846 Note, the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects. 2847 2848 \begin{multicols}{2} 2849 \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}} 2850 \begin{cfa} 2851 @monitor@ M {} m1/*, m2, m3, m4*/; 2852 void __attribute__((noinline)) 2853 do_call( M & @mutex m/*, m2, m3, m4*/@ ) {} 2854 int main() { 2855 BENCH( 2856 for( N ) do_call( m1/*, m2, m3, m4*/ ); 2857 ) 2858 sout | result`ns; 2859 } 2860 \end{cfa} 2861 \captionof{figure}{\CFA acquire/release mutex benchmark} 2862 \label{f:mutex} 2863 2864 \columnbreak 2865 2866 \vspace*{-16pt} 2867 \captionof{table}{Mutex comparison (nanoseconds)} 2868 \label{tab:mutex} 2869 \begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}} 2870 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\ 2871 test and test-and-test lock & 26 & 26 & 0 \\ 2872 Pthreads Mutex Lock & 31 & 31.71 & 0.97 \\ 2873 \uC @monitor@ member rtn. & 31 & 31 & 0 \\ 2874 \CFA @mutex@ function, 1 arg. & 46 & 46.68 & 0.93 \\ 2875 \CFA @mutex@ function, 2 arg. & 84 & 85.36 & 1.99 \\ 2876 \CFA @mutex@ function, 4 arg. & 158 & 161 & 4.22 \\ 2877 Java synchronized method & 27.5 & 29.79 & 2.93 2759 2878 \end{tabular} 2760 2879 \end{multicols} … … 2804 2923 \begin{tabular}{@{}r*{3}{D{.}{.}{5.2}}@{}} 2805 2924 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\ 2806 \CFA @signal@, 1 @monitor@ & 367.0 & 371.5 & 17.34\\2807 \ CFA @signal@, 2 @monitor@ & 477.2 & 478.6 & 8.31\\2808 \CFA @signal@, 4 @monitor@ & 725.8 & 734.0 & 17.98\\2809 \ uC @signal@ & 322.8 & 323.0 & 3.64\\2810 Java @notify@ & 16520.0 & 20096.7 & 9378.53\\2811 Pthreads Cond. Variable & 4931.3 & 5057.0 & 326.80 2925 Pthreads Cond. Variable & 6005 & 5681.43 & 835.45 \\ 2926 \uC @signal@ & 324 & 325.54 & 3.02 \\ 2927 \CFA @signal@, 1 @monitor@ & 368.5 & 370.61 & 4.77 \\ 2928 \CFA @signal@, 2 @monitor@ & 467 & 470.5 & 6.79 \\ 2929 \CFA @signal@, 4 @monitor@ & 700.5 & 702.46 & 7.23 \\ 2930 Java @notify@ & 15471 & 172511 & 5689 2812 2931 \end{tabular} 2813 2932 \end{multicols} … … 2855 2974 \begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}} 2856 2975 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\ 2857 \CFA @waitfor@, 1 @monitor@ & 366.7 & 369.5 & 7.52 \\ 2858 \CFA @waitfor@, 2 @monitor@ & 453.6 & 455.8 & 12.38 \\ 2859 \CFA @waitfor@, 4 @monitor@ & 671.6 & 672.4 & 14.16 \\ 2860 \uC @_Accept@ & 336.0 & 335.8 & 3.22 2861 \end{tabular} 2862 \end{multicols} 2863 2864 2865 \paragraph{Context-Switching} 2866 2867 In procedural programming, the cost of a function call is important as modularization (refactoring) increases. 2868 (In many cases, a compiler inlines function calls to eliminate this cost.) 2869 Similarly, when modularization extends to coroutines/tasks, the time for a context switch becomes a relevant factor. 2870 The coroutine test is from resumer to suspender and from suspender to resumer, which is two context switches. 2871 The thread test is using yield to enter and return from the runtime kernel, which is two context switches. 2872 The difference in performance between coroutine and thread context-switch is the cost of scheduling for threads, whereas coroutines are self-scheduling. 2873 Figure~\ref{f:ctx-switch} only shows the \CFA code for coroutines/threads (other systems are similar) with all results in Table~\ref{tab:ctx-switch}. 2874 2875 \begin{multicols}{2} 2876 \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}} 2877 \begin{cfa}[aboveskip=0pt,belowskip=0pt] 2878 @coroutine@ C {} c; 2879 void main( C & ) { for ( ;; ) { @suspend;@ } } 2880 int main() { // coroutine test 2881 BENCH( for ( N ) { @resume( c );@ } ) 2882 sout | result`ns; 2883 } 2884 int main() { // task test 2885 BENCH( for ( N ) { @yield();@ } ) 2886 sout | result`ns; 2887 } 2888 \end{cfa} 2889 \captionof{figure}{\CFA context-switch benchmark} 2890 \label{f:ctx-switch} 2891 2892 \columnbreak 2893 2894 \vspace*{-16pt} 2895 \captionof{table}{Context switch comparison (nanoseconds)} 2896 \label{tab:ctx-switch} 2897 \begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}} 2898 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\ 2899 C function & 1.8 & 1.8 & 0 \\ 2900 \CFA generator & 2.7 & 2.4 & 0.27 \\ 2901 \CFA Coroutine & 37.8 & 37.7 & 0.22 \\ 2902 \CFA Thread & 93.6 & 93.8 & 1.46 \\ 2903 \uC Coroutine & 52.7 & 52.8 & 0.28 \\ 2904 \uC Thread & 93.4 & 93.7 & 1.04 \\ 2905 Goroutine & 140.0 & 139.7 & 2.93 \\ 2906 Java Thread & 374.0 & 375.8 & 10.38 \\ 2907 % Qthreads Thread & 159.5 & 159.3 & 0.71 \\ 2908 Pthreads Thread & 334.4 & 335.0 & 1.95 \\ 2909 \end{tabular} 2910 \end{multicols} 2911 2912 2913 \paragraph{Mutual-Exclusion} 2914 2915 Uncontented mutual exclusion, which occurs frequently, is measured by entering/leaving a critical section. 2916 For monitors, entering and leaving a monitor function is measured. 2917 To put the results in context, the cost of entering a non-inline function and the cost of acquiring and releasing a @pthread_mutex@ lock is also measured. 2918 Figure~\ref{f:mutex} shows the code for \CFA with all results in Table~\ref{tab:mutex}. 2919 Note, the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects. 2920 2921 \begin{multicols}{2} 2922 \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}} 2923 \begin{cfa} 2924 @monitor@ M {} m1/*, m2, m3, m4*/; 2925 void __attribute__((noinline)) 2926 do_call( M & @mutex m/*, m2, m3, m4*/@ ) {} 2927 int main() { 2928 BENCH( 2929 for( N ) do_call( m1/*, m2, m3, m4*/ ); 2930 ) 2931 sout | result`ns; 2932 } 2933 \end{cfa} 2934 \captionof{figure}{\CFA acquire/release mutex benchmark} 2935 \label{f:mutex} 2936 2937 \columnbreak 2938 2939 \vspace*{-16pt} 2940 \captionof{table}{Mutex comparison (nanoseconds)} 2941 \label{tab:mutex} 2942 \begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}} 2943 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\ 2944 test and test-and-test lock & 19.1 & 19.0 & 0.36 \\ 2945 \CFA @mutex@ function, 1 arg. & 46.6 & 46.8 & 0.86 \\ 2946 \CFA @mutex@ function, 2 arg. & 84.1 & 85.3 & 1.86 \\ 2947 \CFA @mutex@ function, 4 arg. & 158.6 & 160.7 & 3.07 \\ 2948 \uC @monitor@ member rtn. & 54.0 & 53.7 & 0.83 \\ 2949 Java synchronized method & 27.0 & 27.1 & 0.25 \\ 2950 Pthreads Mutex Lock & 33.6 & 32.7 & 1.12 2976 \uC @_Accept@ & 358 & 359.11 & 2.53 \\ 2977 \CFA @waitfor@, 1 @monitor@ & 359 & 360.93 & 4.07 \\ 2978 \CFA @waitfor@, 2 @monitor@ & 450 & 449.39 & 6.62 \\ 2979 \CFA @waitfor@, 4 @monitor@ & 652 & 655.64 & 7.73 2951 2980 \end{tabular} 2952 2981 \end{multicols}
Note:
See TracChangeset
for help on using the changeset viewer.