Changeset d052a2c


Ignore:
Timestamp:
Oct 6, 2020, 12:05:02 PM (13 months ago)
Author:
Peter A. Buhr <pabuhr@…>
Branches:
arm-eh, jacob/cs343-translation, master, new-ast-unique-expr
Children:
c6391e6
Parents:
6b6b9ba
Message:

hopefully final changes for paper

File:
1 edited

Legend:

Unmodified
Added
Removed
  • doc/papers/concurrency/Paper.tex

    r6b6b9ba rd052a2c  
    224224{}
    225225\lstnewenvironment{C++}[1][]                            % use C++ style
    226 {\lstset{language=C++,moredelim=**[is][\protect\color{red}]{`}{`},#1}\lstset{#1}}
     226{\lstset{language=C++,moredelim=**[is][\protect\color{red}]{`}{`}}\lstset{#1}}
    227227{}
    228228\lstnewenvironment{uC++}[1][]
    229 {\lstset{language=uC++,moredelim=**[is][\protect\color{red}]{`}{`},#1}\lstset{#1}}
     229{\lstset{language=uC++,moredelim=**[is][\protect\color{red}]{`}{`}}\lstset{#1}}
    230230{}
    231231\lstnewenvironment{Go}[1][]
    232 {\lstset{language=Golang,moredelim=**[is][\protect\color{red}]{`}{`},#1}\lstset{#1}}
     232{\lstset{language=Golang,moredelim=**[is][\protect\color{red}]{`}{`}}\lstset{#1}}
    233233{}
    234234\lstnewenvironment{python}[1][]
    235 {\lstset{language=python,moredelim=**[is][\protect\color{red}]{`}{`},#1}\lstset{#1}}
     235{\lstset{language=python,moredelim=**[is][\protect\color{red}]{`}{`}}\lstset{#1}}
    236236{}
    237237\lstnewenvironment{java}[1][]
    238 {\lstset{language=java,moredelim=**[is][\protect\color{red}]{`}{`},#1}\lstset{#1}}
     238{\lstset{language=java,moredelim=**[is][\protect\color{red}]{`}{`}}\lstset{#1}}
    239239{}
    240240
     
    284284
    285285\begin{document}
    286 \linenumbers                            % comment out to turn off line numbering
     286%\linenumbers                           % comment out to turn off line numbering
    287287
    288288\maketitle
     
    28962896\label{s:RuntimeStructureCluster}
    28972897
    2898 A \newterm{cluster} is a collection of user and kernel threads, where the kernel threads run the user threads from the cluster's ready queue, and the operating system runs the kernel threads on the processors from its ready queue.
     2898A \newterm{cluster} is a collection of user and kernel threads, where the kernel threads run the user threads from the cluster's ready queue, and the operating system runs the kernel threads on the processors from its ready queue~\cite{Buhr90a}.
    28992899The term \newterm{virtual processor} is introduced as a synonym for kernel thread to disambiguate between user and kernel thread.
    29002900From the language perspective, a virtual processor is an actual processor (core).
     
    29922992\end{cfa}
    29932993where CPU time in nanoseconds is from the appropriate language clock.
    2994 Each benchmark is performed @N@ times, where @N@ is selected so the benchmark runs in the range of 2--20 seconds for the specific programming language.
     2994Each benchmark is performed @N@ times, where @N@ is selected so the benchmark runs in the range of 2--20 seconds for the specific programming language;
     2995each @N@ appears after the experiment name in the following tables.
    29952996The total time is divided by @N@ to obtain the average time for a benchmark.
    29962997Each benchmark experiment is run 13 times and the average appears in the table.
     2998For languages with a runtime JIT (Java, Node.js, Python), a single half-hour long experiment is run to check stability;
     2999all long-experiment results are statistically equivalent, \ie median/average/standard-deviation correlate with the short-experiment results, indicating the short experiments reached a steady state.
    29973000All omitted tests for other languages are functionally identical to the \CFA tests and available online~\cite{CforallConcurrentBenchmarks}.
    29983001% tar --exclude-ignore=exclude -cvhf benchmark.tar benchmark
     
    30063009
    30073010\begin{multicols}{2}
    3008 \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
    3009 \begin{cfa}
    3010 @coroutine@ MyCoroutine {};
     3011\begin{cfa}[xleftmargin=0pt]
     3012`coroutine` MyCoroutine {};
    30113013void ?{}( MyCoroutine & this ) {
    30123014#ifdef EAGER
     
    30163018void main( MyCoroutine & ) {}
    30173019int main() {
    3018         BENCH( for ( N ) { @MyCoroutine c;@ } )
     3020        BENCH( for ( N ) { `MyCoroutine c;` } )
    30193021        sout | result;
    30203022}
     
    30303032
    30313033\begin{tabular}[t]{@{}r*{3}{D{.}{.}{5.2}}@{}}
    3032 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
    3033 \CFA generator                  & 0.6           & 0.6           & 0.0           \\
    3034 \CFA coroutine lazy             & 13.4          & 13.1          & 0.5           \\
    3035 \CFA coroutine eager    & 144.7         & 143.9         & 1.5           \\
    3036 \CFA thread                             & 466.4         & 468.0         & 11.3          \\
    3037 \uC coroutine                   & 155.6         & 155.7         & 1.7           \\
    3038 \uC thread                              & 523.4         & 523.9         & 7.7           \\
    3039 Python generator                & 123.2         & 124.3         & 4.1           \\
    3040 Node.js generator               & 33.4          & 33.5          & 0.3           \\
    3041 Goroutine thread                & 751.0         & 750.5         & 3.1           \\
    3042 Rust tokio thread               & 1860.0        & 1881.1        & 37.6          \\
    3043 Rust thread                             & 53801.0       & 53896.8       & 274.9         \\
    3044 Java thread (   10 000)         & 119256.0      & 119679.2      & 2244.0        \\
    3045 Java thread (1 000 000)         & 123100.0      & 123052.5      & 751.6         \\
    3046 Pthreads thread                 & 31465.5       & 31419.5       & 140.4
     3034\multicolumn{1}{@{}r}{N\hspace*{10pt}} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
     3035\CFA generator (1B)                     & 0.6           & 0.6           & 0.0           \\
     3036\CFA coroutine lazy     (100M)  & 13.4          & 13.1          & 0.5           \\
     3037\CFA coroutine eager (10M)      & 144.7         & 143.9         & 1.5           \\
     3038\CFA thread (10M)                       & 466.4         & 468.0         & 11.3          \\
     3039\uC coroutine (10M)                     & 155.6         & 155.7         & 1.7           \\
     3040\uC thread (10M)                        & 523.4         & 523.9         & 7.7           \\
     3041Python generator (10M)          & 123.2         & 124.3         & 4.1           \\
     3042Node.js generator (10M)         & 33.4          & 33.5          & 0.3           \\
     3043Goroutine thread (10M)          & 751.0         & 750.5         & 3.1           \\
     3044Rust tokio thread (10M)         & 1860.0        & 1881.1        & 37.6          \\
     3045Rust thread     (250K)                  & 53801.0       & 53896.8       & 274.9         \\
     3046Java thread (250K)                      & 119256.0      & 119679.2      & 2244.0        \\
     3047% Java thread (1 000 000)               & 123100.0      & 123052.5      & 751.6         \\
     3048Pthreads thread (250K)          & 31465.5       & 31419.5       & 140.4
    30473049\end{tabular}
    30483050\end{multicols}
     
    30533055Internal scheduling is measured using a cycle of two threads signalling and waiting.
    30543056Figure~\ref{f:schedint} shows the code for \CFA, with results in Table~\ref{t:schedint}.
    3055 Note, the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects.
    3056 Java scheduling is significantly greater because the benchmark explicitly creates multiple threads in order to prevent the JIT from making the program sequential, \ie removing all locking.
     3057Note, the \CFA incremental cost for bulk acquire is a fixed cost for small numbers of mutex objects.
     3058User-level threading has one kernel thread, eliminating contention between the threads (direct handoff of the kernel thread).
     3059Kernel-level threading has two kernel threads allowing some contention.
    30573060
    30583061\begin{multicols}{2}
    3059 \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
    3060 \begin{cfa}
     3062\setlength{\tabcolsep}{3pt}
     3063\begin{cfa}[xleftmargin=0pt]
    30613064volatile int go = 0;
    3062 @condition c;@
    3063 @monitor@ M {} m1/*, m2, m3, m4*/;
    3064 void call( M & @mutex p1/*, p2, p3, p4*/@ ) {
    3065         @signal( c );@
    3066 }
    3067 void wait( M & @mutex p1/*, p2, p3, p4*/@ ) {
     3065`condition c;`
     3066`monitor` M {} m1/*, m2, m3, m4*/;
     3067void call( M & `mutex p1/*, p2, p3, p4*/` ) {
     3068        `signal( c );`
     3069}
     3070void wait( M & `mutex p1/*, p2, p3, p4*/` ) {
    30683071        go = 1; // continue other thread
    3069         for ( N ) { @wait( c );@ } );
     3072        for ( N ) { `wait( c );` } );
    30703073}
    30713074thread T {};
     
    30923095
    30933096\begin{tabular}{@{}r*{3}{D{.}{.}{5.2}}@{}}
    3094 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
    3095 \CFA @signal@, 1 monitor        & 364.4         & 364.2         & 4.4           \\
    3096 \CFA @signal@, 2 monitor        & 484.4         & 483.9         & 8.8           \\
    3097 \CFA @signal@, 4 monitor        & 709.1         & 707.7         & 15.0          \\
    3098 \uC @signal@ monitor            & 328.3         & 327.4         & 2.4           \\
    3099 Rust cond. variable                     & 7514.0        & 7437.4        & 397.2         \\
    3100 Java @notify@ monitor (  1 000 000)             & 8717.0        & 8774.1        & 471.8         \\
    3101 Java @notify@ monitor (100 000 000)             & 8634.0        & 8683.5        & 330.5         \\
    3102 Pthreads cond. variable         & 5553.7        & 5576.1        & 345.6
     3097\multicolumn{1}{@{}r}{N\hspace*{10pt}} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
     3098\CFA @signal@, 1 monitor (10M)  & 364.4         & 364.2         & 4.4           \\
     3099\CFA @signal@, 2 monitor (10M)  & 484.4         & 483.9         & 8.8           \\
     3100\CFA @signal@, 4 monitor (10M)  & 709.1         & 707.7         & 15.0          \\
     3101\uC @signal@ monitor (10M)              & 328.3         & 327.4         & 2.4           \\
     3102Rust cond. variable     (1M)            & 7514.0        & 7437.4        & 397.2         \\
     3103Java @notify@ monitor (1M)              & 8717.0        & 8774.1        & 471.8         \\
     3104% Java @notify@ monitor (100 000 000)           & 8634.0        & 8683.5        & 330.5         \\
     3105Pthreads cond. variable (1M)    & 5553.7        & 5576.1        & 345.6
    31033106\end{tabular}
    31043107\end{multicols}
     
    31093112External scheduling is measured using a cycle of two threads calling and accepting the call using the @waitfor@ statement.
    31103113Figure~\ref{f:schedext} shows the code for \CFA with results in Table~\ref{t:schedext}.
    3111 Note, the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects.
     3114Note, the \CFA incremental cost for bulk acquire is a fixed cost for small numbers of mutex objects.
    31123115
    31133116\begin{multicols}{2}
    3114 \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
     3117\setlength{\tabcolsep}{5pt}
    31153118\vspace*{-16pt}
    3116 \begin{cfa}
    3117 @monitor@ M {} m1/*, m2, m3, m4*/;
    3118 void call( M & @mutex p1/*, p2, p3, p4*/@ ) {}
    3119 void wait( M & @mutex p1/*, p2, p3, p4*/@ ) {
    3120         for ( N ) { @waitfor( call : p1/*, p2, p3, p4*/ );@ }
     3119\begin{cfa}[xleftmargin=0pt]
     3120`monitor` M {} m1/*, m2, m3, m4*/;
     3121void call( M & `mutex p1/*, p2, p3, p4*/` ) {}
     3122void wait( M & `mutex p1/*, p2, p3, p4*/` ) {
     3123        for ( N ) { `waitfor( call : p1/*, p2, p3, p4*/ );` }
    31213124}
    31223125thread T {};
     
    31353138\columnbreak
    31363139
    3137 \vspace*{-16pt}
     3140\vspace*{-18pt}
    31383141\captionof{table}{External-scheduling comparison (nanoseconds)}
    31393142\label{t:schedext}
    31403143\begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
    3141 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
    3142 \CFA @waitfor@, 1 monitor       & 367.1 & 365.3 & 5.0   \\
    3143 \CFA @waitfor@, 2 monitor       & 463.0 & 464.6 & 7.1   \\
    3144 \CFA @waitfor@, 4 monitor       & 689.6 & 696.2 & 21.5  \\
    3145 \uC \lstinline[language=uC++]|_Accept| monitor  & 328.2 & 329.1 & 3.4   \\
    3146 Go \lstinline[language=Golang]|select| channel  & 365.0 & 365.5 & 1.2
     3144\multicolumn{1}{@{}r}{N\hspace*{10pt}} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
     3145\CFA @waitfor@, 1 monitor (10M) & 367.1 & 365.3 & 5.0   \\
     3146\CFA @waitfor@, 2 monitor (10M) & 463.0 & 464.6 & 7.1   \\
     3147\CFA @waitfor@, 4 monitor (10M) & 689.6 & 696.2 & 21.5  \\
     3148\uC \lstinline[language=uC++]|_Accept| monitor (10M)    & 328.2 & 329.1 & 3.4   \\
     3149Go \lstinline[language=Golang]|select| channel (10M)    & 365.0 & 365.5 & 1.2
    31473150\end{tabular}
    31483151\end{multicols}
     
    31573160
    31583161\begin{multicols}{2}
    3159 \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
    3160 \begin{cfa}
    3161 @monitor@ M {} m1/*, m2, m3, m4*/;
    3162 call( M & @mutex p1/*, p2, p3, p4*/@ ) {}
     3162\setlength{\tabcolsep}{3pt}
     3163\begin{cfa}[xleftmargin=0pt]
     3164`monitor` M {} m1/*, m2, m3, m4*/;
     3165call( M & `mutex p1/*, p2, p3, p4*/` ) {}
    31633166int main() {
    31643167        BENCH( for( N ) call( m1/*, m2, m3, m4*/ ); )
     
    31753178\label{t:mutex}
    31763179\begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
    3177 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
    3178 test-and-test-set lock                  & 19.1  & 18.9  & 0.4   \\
    3179 \CFA @mutex@ function, 1 arg.   & 48.3  & 47.8  & 0.9   \\
    3180 \CFA @mutex@ function, 2 arg.   & 86.7  & 87.6  & 1.9   \\
    3181 \CFA @mutex@ function, 4 arg.   & 173.4 & 169.4 & 5.9   \\
    3182 \uC @monitor@ member rtn.               & 54.8  & 54.8  & 0.1   \\
    3183 Goroutine mutex lock                    & 34.0  & 34.0  & 0.0   \\
    3184 Rust mutex lock                                 & 33.0  & 33.2  & 0.8   \\
    3185 Java synchronized method (   100 000 000)               & 31.0  & 30.9  & 0.5   \\
    3186 Java synchronized method (10 000 000 000)               & 31.0 & 30.2 & 0.9 \\
    3187 Pthreads mutex Lock                             & 31.0  & 31.1  & 0.4
     3180\multicolumn{1}{@{}r}{N\hspace*{10pt}} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
     3181test-and-test-set lock (50M)            & 19.1  & 18.9  & 0.4   \\
     3182\CFA @mutex@ function, 1 arg. (50M)     & 48.3  & 47.8  & 0.9   \\
     3183\CFA @mutex@ function, 2 arg. (50M)     & 86.7  & 87.6  & 1.9   \\
     3184\CFA @mutex@ function, 4 arg. (50M)     & 173.4 & 169.4 & 5.9   \\
     3185\uC @monitor@ member rtn. (50M)         & 54.8  & 54.8  & 0.1   \\
     3186Goroutine mutex lock (50M)                      & 34.0  & 34.0  & 0.0   \\
     3187Rust mutex lock (50M)                           & 33.0  & 33.2  & 0.8   \\
     3188Java synchronized method (50M)          & 31.0  & 30.9  & 0.5   \\
     3189% Java synchronized method (10 000 000 000)             & 31.0 & 30.2 & 0.9 \\
     3190Pthreads mutex Lock (50M)                       & 31.0  & 31.1  & 0.4
    31883191\end{tabular}
    31893192\end{multicols}
     
    32143217
    32153218\begin{multicols}{2}
    3216 \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
    3217 \begin{cfa}[aboveskip=0pt,belowskip=0pt]
    3218 @coroutine@ C {};
    3219 void main( C & ) { for () { @suspend;@ } }
     3219\begin{cfa}[xleftmargin=0pt]
     3220`coroutine` C {};
     3221void main( C & ) { for () { `suspend;` } }
    32203222int main() { // coroutine test
    32213223        C c;
    3222         BENCH( for ( N ) { @resume( c );@ } )
     3224        BENCH( for ( N ) { `resume( c );` } )
    32233225        sout | result;
    32243226}
    32253227int main() { // thread test
    3226         BENCH( for ( N ) { @yield();@ } )
     3228        BENCH( for ( N ) { `yield();` } )
    32273229        sout | result;
    32283230}
     
    32373239\label{t:ctx-switch}
    32383240\begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
    3239 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
    3240 C function                      & 1.8           & 1.8           & 0.0   \\
    3241 \CFA generator          & 1.8           & 2.0           & 0.3   \\
    3242 \CFA coroutine          & 32.5          & 32.9          & 0.8   \\
    3243 \CFA thread                     & 93.8          & 93.6          & 2.2   \\
    3244 \uC coroutine           & 50.3          & 50.3          & 0.2   \\
    3245 \uC thread                      & 97.3          & 97.4          & 1.0   \\
    3246 Python generator        & 40.9          & 41.3          & 1.5   \\
    3247 Node.js await           & 1852.2        & 1854.7        & 16.4  \\
    3248 Node.js generator       & 33.3          & 33.4          & 0.3   \\
    3249 Goroutine thread        & 143.0         & 143.3         & 1.1   \\
    3250 Rust async await        & 32.0          & 32.0          & 0.0   \\
    3251 Rust tokio thread       & 143.0         & 143.0         & 1.7   \\
    3252 Rust thread                     & 332.0         & 331.4         & 2.4   \\
    3253 Java thread     (      100 000)         & 405.0         & 415.0         & 17.6  \\
    3254 Java thread (  100 000 000)                     & 413.0 & 414.2 & 6.2 \\
    3255 Java thread (5 000 000 000)                     & 415.0 & 415.2 & 6.1 \\
    3256 Pthreads thread         & 334.3         & 335.2         & 3.9
     3241\multicolumn{1}{@{}r}{N\hspace*{10pt}} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
     3242C function (10B)                        & 1.8           & 1.8           & 0.0   \\
     3243\CFA generator (5B)                     & 1.8           & 2.0           & 0.3   \\
     3244\CFA coroutine (100M)           & 32.5          & 32.9          & 0.8   \\
     3245\CFA thread (100M)                      & 93.8          & 93.6          & 2.2   \\
     3246\uC coroutine (100M)            & 50.3          & 50.3          & 0.2   \\
     3247\uC thread (100M)                       & 97.3          & 97.4          & 1.0   \\
     3248Python generator (100M)         & 40.9          & 41.3          & 1.5   \\
     3249Node.js await (5M)                      & 1852.2        & 1854.7        & 16.4  \\
     3250Node.js generator (100M)        & 33.3          & 33.4          & 0.3   \\
     3251Goroutine thread (100M)         & 143.0         & 143.3         & 1.1   \\
     3252Rust async await (100M)         & 32.0          & 32.0          & 0.0   \\
     3253Rust tokio thread (100M)        & 143.0         & 143.0         & 1.7   \\
     3254Rust thread (25M)                       & 332.0         & 331.4         & 2.4   \\
     3255Java thread (100M)                      & 405.0         & 415.0         & 17.6  \\
     3256% Java thread (  100 000 000)                   & 413.0 & 414.2 & 6.2 \\
     3257% Java thread (5 000 000 000)                   & 415.0 & 415.2 & 6.1 \\
     3258Pthreads thread (25M)           & 334.3         & 335.2         & 3.9
    32573259\end{tabular}
    32583260\end{multicols}
     
    32633265Languages using 1:1 threading based on pthreads can at best meet or exceed, due to language overhead, the pthread results.
    32643266Note, pthreads has a fast zero-contention mutex lock checked in user space.
    3265 Languages with M:N threading have better performance than 1:1 because there is no operating-system interactions.
     3267Languages with M:N threading have better performance than 1:1 because there is no operating-system interactions (context-switching or locking).
     3268As well, for locking experiments, M:N threading has less contention if only one kernel thread is used.
    32663269Languages with stackful coroutines have higher cost than stackless coroutines because of stack allocation and context switching;
    32673270however, stackful \uC and \CFA coroutines have approximately the same performance as stackless Python and Node.js generators.
    32683271The \CFA stackless generator is approximately 25 times faster for suspend/resume and 200 times faster for creation than stackless Python and Node.js generators.
     3272The Node.js context-switch is costly when asynchronous await must enter the event engine because a promise is not fulfilled.
     3273Finally, the benchmark results correlate across programming languages with and without JIT, indicating the JIT has completed any runtime optimizations.
    32693274
    32703275
     
    33243329
    33253330The authors recognize the design assistance of Aaron Moss, Rob Schluntz, Andrew Beach, and Michael Brooks; David Dice for commenting and helping with the Java benchmarks; and Gregor Richards for helping with the Node.js benchmarks.
    3326 This research is funded by a grant from Waterloo-Huawei (\url{http://www.huawei.com}) Joint Innovation Lab. %, and Peter Buhr is partially funded by the Natural Sciences and Engineering Research Council of Canada.
     3331This research is funded by the NSERC/Waterloo-Huawei (\url{http://www.huawei.com}) Joint Innovation Lab. %, and Peter Buhr is partially funded by the Natural Sciences and Engineering Research Council of Canada.
    33273332
    33283333{%
Note: See TracChangeset for help on using the changeset viewer.