Changeset d052a2c
- Timestamp: Oct 6, 2020, 12:05:02 PM (4 years ago)
- Branches: ADT, arm-eh, ast-experimental, enum, forall-pointer-decay, jacob/cs343-translation, master, new-ast-unique-expr, pthread-emulation, qualifiedEnum
- Children: c6391e6
- Parents: 6b6b9ba
- File: 1 edited
doc/papers/concurrency/Paper.tex
--- doc/papers/concurrency/Paper.tex (r6b6b9ba)
+++ doc/papers/concurrency/Paper.tex (rd052a2c)
@@ -224 +224 @@
 {}
 \lstnewenvironment{C++}[1][] % use C++ style
-{\lstset{language=C++,moredelim=**[is][\protect\color{red}]{`}{`},#1}\lstset{#1}}
+{\lstset{language=C++,moredelim=**[is][\protect\color{red}]{`}{`}}\lstset{#1}}
 {}
 \lstnewenvironment{uC++}[1][]
-{\lstset{language=uC++,moredelim=**[is][\protect\color{red}]{`}{`},#1}\lstset{#1}}
+{\lstset{language=uC++,moredelim=**[is][\protect\color{red}]{`}{`}}\lstset{#1}}
 {}
 \lstnewenvironment{Go}[1][]
-{\lstset{language=Golang,moredelim=**[is][\protect\color{red}]{`}{`},#1}\lstset{#1}}
+{\lstset{language=Golang,moredelim=**[is][\protect\color{red}]{`}{`}}\lstset{#1}}
 {}
 \lstnewenvironment{python}[1][]
-{\lstset{language=python,moredelim=**[is][\protect\color{red}]{`}{`},#1}\lstset{#1}}
+{\lstset{language=python,moredelim=**[is][\protect\color{red}]{`}{`}}\lstset{#1}}
 {}
 \lstnewenvironment{java}[1][]
-{\lstset{language=java,moredelim=**[is][\protect\color{red}]{`}{`},#1}\lstset{#1}}
+{\lstset{language=java,moredelim=**[is][\protect\color{red}]{`}{`}}\lstset{#1}}
 {}
@@ -285 +285 @@
 \begin{document}
-\linenumbers % comment out to turn off line numbering
+%\linenumbers % comment out to turn off line numbering
 
 \maketitle
@@ -2896 +2896 @@
 \label{s:RuntimeStructureCluster}
 
-A \newterm{cluster} is a collection of user and kernel threads, where the kernel threads run the user threads from the cluster's ready queue, and the operating system runs the kernel threads on the processors from its ready queue.
+A \newterm{cluster} is a collection of user and kernel threads, where the kernel threads run the user threads from the cluster's ready queue, and the operating system runs the kernel threads on the processors from its ready queue~\cite{Buhr90a}.
 The term \newterm{virtual processor} is introduced as a synonym for kernel thread to disambiguate between user and kernel thread.
 From the language perspective, a virtual processor is an actual processor (core).
@@ -2992 +2992 @@
 \end{cfa}
 where CPU time in nanoseconds is from the appropriate language clock.
-Each benchmark is performed @N@ times, where @N@ is selected so the benchmark runs in the range of 2--20 seconds for the specific programming language.
+Each benchmark is performed @N@ times, where @N@ is selected so the benchmark runs in the range of 2--20 seconds for the specific programming language;
+each @N@ appears after the experiment name in the following tables.
 The total time is divided by @N@ to obtain the average time for a benchmark.
 Each benchmark experiment is run 13 times and the average appears in the table.
+For languages with a runtime JIT (Java, Node.js, Python), a single half-hour long experiment is run to check stability;
+all long-experiment results are statistically equivalent, \ie median/average/standard-deviation correlate with the short-experiment results, indicating the short experiments reached a steady state.
 All omitted tests for other languages are functionally identical to the \CFA tests and available online~\cite{CforallConcurrentBenchmarks}.
 % tar --exclude-ignore=exclude -cvhf benchmark.tar benchmark
@@ -3006 +3009 @@
 
 \begin{multicols}{2}
-\lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
-\begin{cfa}
-@coroutine@ MyCoroutine {};
+\begin{cfa}[xleftmargin=0pt]
+`coroutine` MyCoroutine {};
 void ?{}( MyCoroutine & this ) {
 #ifdef EAGER
@@ -3016 +3018 @@
 void main( MyCoroutine & ) {}
 int main() {
-	BENCH( for ( N ) { @MyCoroutine c;@ } )
+	BENCH( for ( N ) { `MyCoroutine c;` } )
 	sout | result;
 }
@@ -3030 +3032 @@
 
 \begin{tabular}[t]{@{}r*{3}{D{.}{.}{5.2}}@{}}
-\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
-\CFA generator & 0.6 & 0.6 & 0.0 \\
-\CFA coroutine lazy & 13.4 & 13.1 & 0.5 \\
-\CFA coroutine eager & 144.7 & 143.9 & 1.5 \\
-\CFA thread & 466.4 & 468.0 & 11.3 \\
-\uC coroutine & 155.6 & 155.7 & 1.7 \\
-\uC thread & 523.4 & 523.9 & 7.7 \\
-Python generator & 123.2 & 124.3 & 4.1 \\
-Node.js generator & 33.4 & 33.5 & 0.3 \\
-Goroutine thread & 751.0 & 750.5 & 3.1 \\
-Rust tokio thread & 1860.0 & 1881.1 & 37.6 \\
-Rust thread & 53801.0 & 53896.8 & 274.9 \\
-Java thread (10 000) & 119256.0 & 119679.2 & 2244.0 \\
-Java thread (1 000 000) & 123100.0 & 123052.5 & 751.6 \\
-Pthreads thread & 31465.5 & 31419.5 & 140.4
+\multicolumn{1}{@{}r}{N\hspace*{10pt}} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
+\CFA generator (1B) & 0.6 & 0.6 & 0.0 \\
+\CFA coroutine lazy (100M) & 13.4 & 13.1 & 0.5 \\
+\CFA coroutine eager (10M) & 144.7 & 143.9 & 1.5 \\
+\CFA thread (10M) & 466.4 & 468.0 & 11.3 \\
+\uC coroutine (10M) & 155.6 & 155.7 & 1.7 \\
+\uC thread (10M) & 523.4 & 523.9 & 7.7 \\
+Python generator (10M) & 123.2 & 124.3 & 4.1 \\
+Node.js generator (10M) & 33.4 & 33.5 & 0.3 \\
+Goroutine thread (10M) & 751.0 & 750.5 & 3.1 \\
+Rust tokio thread (10M) & 1860.0 & 1881.1 & 37.6 \\
+Rust thread (250K) & 53801.0 & 53896.8 & 274.9 \\
+Java thread (250K) & 119256.0 & 119679.2 & 2244.0 \\
+% Java thread (1 000 000) & 123100.0 & 123052.5 & 751.6 \\
+Pthreads thread (250K) & 31465.5 & 31419.5 & 140.4
 \end{tabular}
 \end{multicols}
@@ -3053 +3055 @@
 
 Internal scheduling is measured using a cycle of two threads signalling and waiting.
 Figure~\ref{f:schedint} shows the code for \CFA, with results in Table~\ref{t:schedint}.
-Note, the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects.
-Java scheduling is significantly greater because the benchmark explicitly creates multiple threads in order to prevent the JIT from making the program sequential, \ie removing all locking.
+Note, the \CFA incremental cost for bulk acquire is a fixed cost for small numbers of mutex objects.
+User-level threading has one kernel thread, eliminating contention between the threads (direct handoff of the kernel thread).
+Kernel-level threading has two kernel threads allowing some contention.
 
 \begin{multicols}{2}
-\lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
-\begin{cfa}
+\setlength{\tabcolsep}{3pt}
+\begin{cfa}[xleftmargin=0pt]
 volatile int go = 0;
-@condition c;@
-@monitor@ M {} m1/*, m2, m3, m4*/;
-void call( M & @mutex p1/*, p2, p3, p4*/@ ) {
-	@signal( c );@
-}
-void wait( M & @mutex p1/*, p2, p3, p4*/@ ) {
+`condition c;`
+`monitor` M {} m1/*, m2, m3, m4*/;
+void call( M & `mutex p1/*, p2, p3, p4*/` ) {
+	`signal( c );`
+}
+void wait( M & `mutex p1/*, p2, p3, p4*/` ) {
 	go = 1; // continue other thread
-	for ( N ) { @wait( c );@ } );
+	for ( N ) { `wait( c );` } );
 }
 thread T {};
@@ -3093 +3096 @@
 \begin{tabular}{@{}r*{3}{D{.}{.}{5.2}}@{}}
-\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
-\CFA @signal@, 1 monitor & 364.4 & 364.2 & 4.4 \\
-\CFA @signal@, 2 monitor & 484.4 & 483.9 & 8.8 \\
-\CFA @signal@, 4 monitor & 709.1 & 707.7 & 15.0 \\
-\uC @signal@ monitor & 328.3 & 327.4 & 2.4 \\
-Rust cond. variable & 7514.0 & 7437.4 & 397.2 \\
-Java @notify@ monitor (1 000 000) & 8717.0 & 8774.1 & 471.8 \\
-Java @notify@ monitor (100 000 000) & 8634.0 & 8683.5 & 330.5 \\
-Pthreads cond. variable & 5553.7 & 5576.1 & 345.6
+\multicolumn{1}{@{}r}{N\hspace*{10pt}} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
+\CFA @signal@, 1 monitor (10M) & 364.4 & 364.2 & 4.4 \\
+\CFA @signal@, 2 monitor (10M) & 484.4 & 483.9 & 8.8 \\
+\CFA @signal@, 4 monitor (10M) & 709.1 & 707.7 & 15.0 \\
+\uC @signal@ monitor (10M) & 328.3 & 327.4 & 2.4 \\
+Rust cond. variable (1M) & 7514.0 & 7437.4 & 397.2 \\
+Java @notify@ monitor (1M) & 8717.0 & 8774.1 & 471.8 \\
+% Java @notify@ monitor (100 000 000) & 8634.0 & 8683.5 & 330.5 \\
+Pthreads cond. variable (1M) & 5553.7 & 5576.1 & 345.6
 \end{tabular}
 \end{multicols}
@@ -3109 +3112 @@
 External scheduling is measured using a cycle of two threads calling and accepting the call using the @waitfor@ statement.
 Figure~\ref{f:schedext} shows the code for \CFA with results in Table~\ref{t:schedext}.
-Note, the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects.
+Note, the \CFA incremental cost for bulk acquire is a fixed cost for small numbers of mutex objects.
 
 \begin{multicols}{2}
-\lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
+\setlength{\tabcolsep}{5pt}
 \vspace*{-16pt}
-\begin{cfa}
-@monitor@ M {} m1/*, m2, m3, m4*/;
-void call( M & @mutex p1/*, p2, p3, p4*/@ ) {}
-void wait( M & @mutex p1/*, p2, p3, p4*/@ ) {
-	for ( N ) { @waitfor( call : p1/*, p2, p3, p4*/ );@ }
+\begin{cfa}[xleftmargin=0pt]
+`monitor` M {} m1/*, m2, m3, m4*/;
+void call( M & `mutex p1/*, p2, p3, p4*/` ) {}
+void wait( M & `mutex p1/*, p2, p3, p4*/` ) {
+	for ( N ) { `waitfor( call : p1/*, p2, p3, p4*/ );` }
 }
 thread T {};
@@ -3135 +3138 @@
 \columnbreak
 
-\vspace*{-16pt}
+\vspace*{-18pt}
 \captionof{table}{External-scheduling comparison (nanoseconds)}
 \label{t:schedext}
 \begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
-\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
-\CFA @waitfor@, 1 monitor & 367.1 & 365.3 & 5.0 \\
-\CFA @waitfor@, 2 monitor & 463.0 & 464.6 & 7.1 \\
-\CFA @waitfor@, 4 monitor & 689.6 & 696.2 & 21.5 \\
-\uC \lstinline[language=uC++]|_Accept| monitor & 328.2 & 329.1 & 3.4 \\
-Go \lstinline[language=Golang]|select| channel & 365.0 & 365.5 & 1.2
+\multicolumn{1}{@{}r}{N\hspace*{10pt}} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
+\CFA @waitfor@, 1 monitor (10M) & 367.1 & 365.3 & 5.0 \\
+\CFA @waitfor@, 2 monitor (10M) & 463.0 & 464.6 & 7.1 \\
+\CFA @waitfor@, 4 monitor (10M) & 689.6 & 696.2 & 21.5 \\
+\uC \lstinline[language=uC++]|_Accept| monitor (10M) & 328.2 & 329.1 & 3.4 \\
+Go \lstinline[language=Golang]|select| channel (10M) & 365.0 & 365.5 & 1.2
 \end{tabular}
 \end{multicols}
@@ -3158 +3161 @@
 \begin{multicols}{2}
-\lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
-\begin{cfa}
-@monitor@ M {} m1/*, m2, m3, m4*/;
-call( M & @mutex p1/*, p2, p3, p4*/@ ) {}
+\setlength{\tabcolsep}{3pt}
+\begin{cfa}[xleftmargin=0pt]
+`monitor` M {} m1/*, m2, m3, m4*/;
+call( M & `mutex p1/*, p2, p3, p4*/` ) {}
 int main() {
 	BENCH( for( N ) call( m1/*, m2, m3, m4*/ ); )
@@ -3175 +3178 @@
 \label{t:mutex}
 \begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
-\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
-test-and-test-set lock & 19.1 & 18.9 & 0.4 \\
-\CFA @mutex@ function, 1 arg. & 48.3 & 47.8 & 0.9 \\
-\CFA @mutex@ function, 2 arg. & 86.7 & 87.6 & 1.9 \\
-\CFA @mutex@ function, 4 arg. & 173.4 & 169.4 & 5.9 \\
-\uC @monitor@ member rtn. & 54.8 & 54.8 & 0.1 \\
-Goroutine mutex lock & 34.0 & 34.0 & 0.0 \\
-Rust mutex lock & 33.0 & 33.2 & 0.8 \\
-Java synchronized method (100 000 000) & 31.0 & 30.9 & 0.5 \\
-Java synchronized method (10 000 000 000) & 31.0 & 30.2 & 0.9 \\
-Pthreads mutex Lock & 31.0 & 31.1 & 0.4
+\multicolumn{1}{@{}r}{N\hspace*{10pt}} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
+test-and-test-set lock (50M) & 19.1 & 18.9 & 0.4 \\
+\CFA @mutex@ function, 1 arg. (50M) & 48.3 & 47.8 & 0.9 \\
+\CFA @mutex@ function, 2 arg. (50M) & 86.7 & 87.6 & 1.9 \\
+\CFA @mutex@ function, 4 arg. (50M) & 173.4 & 169.4 & 5.9 \\
+\uC @monitor@ member rtn. (50M) & 54.8 & 54.8 & 0.1 \\
+Goroutine mutex lock (50M) & 34.0 & 34.0 & 0.0 \\
+Rust mutex lock (50M) & 33.0 & 33.2 & 0.8 \\
+Java synchronized method (50M) & 31.0 & 30.9 & 0.5 \\
+% Java synchronized method (10 000 000 000) & 31.0 & 30.2 & 0.9 \\
+Pthreads mutex Lock (50M) & 31.0 & 31.1 & 0.4
 \end{tabular}
 \end{multicols}
@@ -3215 +3218 @@
 \begin{multicols}{2}
-\lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
-\begin{cfa}[aboveskip=0pt,belowskip=0pt]
-@coroutine@ C {};
-void main( C & ) { for () { @suspend;@ } }
+\begin{cfa}[xleftmargin=0pt]
+`coroutine` C {};
+void main( C & ) { for () { `suspend;` } }
 int main() { // coroutine test
 	C c;
-	BENCH( for ( N ) { @resume( c );@ } )
+	BENCH( for ( N ) { `resume( c );` } )
 	sout | result;
 }
 int main() { // thread test
-	BENCH( for ( N ) { @yield();@ } )
+	BENCH( for ( N ) { `yield();` } )
 	sout | result;
 }
@@ -3237 +3239 @@
 \label{t:ctx-switch}
 \begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
-\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
-C function & 1.8 & 1.8 & 0.0 \\
-\CFA generator & 1.8 & 2.0 & 0.3 \\
-\CFA coroutine & 32.5 & 32.9 & 0.8 \\
-\CFA thread & 93.8 & 93.6 & 2.2 \\
-\uC coroutine & 50.3 & 50.3 & 0.2 \\
-\uC thread & 97.3 & 97.4 & 1.0 \\
-Python generator & 40.9 & 41.3 & 1.5 \\
-Node.js await & 1852.2 & 1854.7 & 16.4 \\
-Node.js generator & 33.3 & 33.4 & 0.3 \\
-Goroutine thread & 143.0 & 143.3 & 1.1 \\
-Rust async await & 32.0 & 32.0 & 0.0 \\
-Rust tokio thread & 143.0 & 143.0 & 1.7 \\
-Rust thread & 332.0 & 331.4 & 2.4 \\
-Java thread (100 000) & 405.0 & 415.0 & 17.6 \\
-Java thread (100 000 000) & 413.0 & 414.2 & 6.2 \\
-Java thread (5 000 000 000) & 415.0 & 415.2 & 6.1 \\
-Pthreads thread & 334.3 & 335.2 & 3.9
+\multicolumn{1}{@{}r}{N\hspace*{10pt}} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
+C function (10B) & 1.8 & 1.8 & 0.0 \\
+\CFA generator (5B) & 1.8 & 2.0 & 0.3 \\
+\CFA coroutine (100M) & 32.5 & 32.9 & 0.8 \\
+\CFA thread (100M) & 93.8 & 93.6 & 2.2 \\
+\uC coroutine (100M) & 50.3 & 50.3 & 0.2 \\
+\uC thread (100M) & 97.3 & 97.4 & 1.0 \\
+Python generator (100M) & 40.9 & 41.3 & 1.5 \\
+Node.js await (5M) & 1852.2 & 1854.7 & 16.4 \\
+Node.js generator (100M) & 33.3 & 33.4 & 0.3 \\
+Goroutine thread (100M) & 143.0 & 143.3 & 1.1 \\
+Rust async await (100M) & 32.0 & 32.0 & 0.0 \\
+Rust tokio thread (100M) & 143.0 & 143.0 & 1.7 \\
+Rust thread (25M) & 332.0 & 331.4 & 2.4 \\
+Java thread (100M) & 405.0 & 415.0 & 17.6 \\
+% Java thread (100 000 000) & 413.0 & 414.2 & 6.2 \\
+% Java thread (5 000 000 000) & 415.0 & 415.2 & 6.1 \\
+Pthreads thread (25M) & 334.3 & 335.2 & 3.9
 \end{tabular}
 \end{multicols}
@@ -3263 +3265 @@
 Languages using 1:1 threading based on pthreads can at best meet or exceed, due to language overhead, the pthread results.
 Note, pthreads has a fast zero-contention mutex lock checked in user space.
-Languages with M:N threading have better performance than 1:1 because there is no operating-system interactions.
+Languages with M:N threading have better performance than 1:1 because there is no operating-system interactions (context-switching or locking).
+As well, for locking experiments, M:N threading has less contention if only one kernel thread is used.
 Languages with stackful coroutines have higher cost than stackless coroutines because of stack allocation and context switching;
 however, stackful \uC and \CFA coroutines have approximately the same performance as stackless Python and Node.js generators.
 The \CFA stackless generator is approximately 25 times faster for suspend/resume and 200 times faster for creation than stackless Python and Node.js generators.
+The Node.js context-switch is costly when asynchronous await must enter the event engine because a promise is not fulfilled.
+Finally, the benchmark results correlate across programming languages with and without JIT, indicating the JIT has completed any runtime optimizations.
@@ -3325 +3330 @@
 The authors recognize the design assistance of Aaron Moss, Rob Schluntz, Andrew Beach, and Michael Brooks; David Dice for commenting and helping with the Java benchmarks; and Gregor Richards for helping with the Node.js benchmarks.
-This research is funded by a grant from Waterloo-Huawei (\url{http://www.huawei.com}) Joint Innovation Lab. %, and Peter Buhr is partially funded by the Natural Sciences and Engineering Research Council of Canada.
+This research is funded by the NSERC/Waterloo-Huawei (\url{http://www.huawei.com}) Joint Innovation Lab. %, and Peter Buhr is partially funded by the Natural Sciences and Engineering Research Council of Canada.
 
 {%