Changeset 41b8ea4 for doc/papers


Timestamp:
Oct 7, 2020, 5:10:45 PM
Author:
Fangren Yu <f37yu@…>
Branches:
ADT, arm-eh, ast-experimental, enum, forall-pointer-decay, jacob/cs343-translation, master, new-ast-unique-expr, pthread-emulation, qualifiedEnum
Children:
490fb92e, 69c5c00
Parents:
2fb35df (diff), 848439f (diff)
Note: this is a merge changeset, the changes displayed below correspond to the merge itself.
Use the (diff) links above to see all the changes relative to each parent.
Message:

Merge branch 'master' of plg.uwaterloo.ca:software/cfa/cfa-cc into master

Location:
doc/papers/concurrency
Files:
1 added
3 edited

  • doc/papers/concurrency/Paper.tex

    r2fb35df r41b8ea4  
    224224{}
    225225\lstnewenvironment{C++}[1][]                            % use C++ style
    226 {\lstset{language=C++,moredelim=**[is][\protect\color{red}]{`}{`},#1}\lstset{#1}}
     226{\lstset{language=C++,moredelim=**[is][\protect\color{red}]{`}{`}}\lstset{#1}}
    227227{}
    228228\lstnewenvironment{uC++}[1][]
    229 {\lstset{language=uC++,moredelim=**[is][\protect\color{red}]{`}{`},#1}\lstset{#1}}
     229{\lstset{language=uC++,moredelim=**[is][\protect\color{red}]{`}{`}}\lstset{#1}}
    230230{}
    231231\lstnewenvironment{Go}[1][]
    232 {\lstset{language=Golang,moredelim=**[is][\protect\color{red}]{`}{`},#1}\lstset{#1}}
     232{\lstset{language=Golang,moredelim=**[is][\protect\color{red}]{`}{`}}\lstset{#1}}
    233233{}
    234234\lstnewenvironment{python}[1][]
    235 {\lstset{language=python,moredelim=**[is][\protect\color{red}]{`}{`},#1}\lstset{#1}}
     235{\lstset{language=python,moredelim=**[is][\protect\color{red}]{`}{`}}\lstset{#1}}
    236236{}
    237237\lstnewenvironment{java}[1][]
    238 {\lstset{language=java,moredelim=**[is][\protect\color{red}]{`}{`},#1}\lstset{#1}}
     238{\lstset{language=java,moredelim=**[is][\protect\color{red}]{`}{`}}\lstset{#1}}
    239239{}
    240240
     
    284284
    285285\begin{document}
    286 \linenumbers                            % comment out to turn off line numbering
     286%\linenumbers                           % comment out to turn off line numbering
    287287
    288288\maketitle
     
    28962896\label{s:RuntimeStructureCluster}
    28972897
    2898 A \newterm{cluster} is a collection of user and kernel threads, where the kernel threads run the user threads from the cluster's ready queue, and the operating system runs the kernel threads on the processors from its ready queue.
     2898A \newterm{cluster} is a collection of user and kernel threads, where the kernel threads run the user threads from the cluster's ready queue, and the operating system runs the kernel threads on the processors from its ready queue~\cite{Buhr90a}.
    28992899The term \newterm{virtual processor} is introduced as a synonym for kernel thread to disambiguate between user and kernel thread.
    29002900From the language perspective, a virtual processor is an actual processor (core).
     
    29922992\end{cfa}
    29932993where CPU time in nanoseconds is from the appropriate language clock.
    2994 Each benchmark is performed @N@ times, where @N@ is selected so the benchmark runs in the range of 2--20 seconds for the specific programming language.
     2994Each benchmark is performed @N@ times, where @N@ is selected so the benchmark runs in the range of 2--20 seconds for the specific programming language;
     2995each @N@ appears after the experiment name in the following tables.
    29952996The total time is divided by @N@ to obtain the average time for a benchmark.
    29962997Each benchmark experiment is run 13 times and the average appears in the table.
     2998For languages with a runtime JIT (Java, Node.js, Python), a single half-hour long experiment is run to check stability;
     2999all long-experiment results are statistically equivalent, \ie median/average/standard-deviation correlate with the short-experiment results, indicating the short experiments reached a steady state.
    29973000All omitted tests for other languages are functionally identical to the \CFA tests and available online~\cite{CforallConcurrentBenchmarks}.
    2998 % tar --exclude-ignore=exclude -cvhf benchmark.tar benchmark
    2999 % cp -p benchmark.tar /u/cforall/public_html/doc/concurrent_benchmark.tar
    30003001
    30013002\paragraph{Creation}
     
    30063007
    30073008\begin{multicols}{2}
    3008 \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
    3009 \begin{cfa}
    3010 @coroutine@ MyCoroutine {};
     3009\begin{cfa}[xleftmargin=0pt]
     3010`coroutine` MyCoroutine {};
    30113011void ?{}( MyCoroutine & this ) {
    30123012#ifdef EAGER
     
    30163016void main( MyCoroutine & ) {}
    30173017int main() {
    3018         BENCH( for ( N ) { @MyCoroutine c;@ } )
     3018        BENCH( for ( N ) { `MyCoroutine c;` } )
    30193019        sout | result;
    30203020}
     
    30303030
    30313031\begin{tabular}[t]{@{}r*{3}{D{.}{.}{5.2}}@{}}
    3032 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
    3033 \CFA generator                  & 0.6           & 0.6           & 0.0           \\
    3034 \CFA coroutine lazy             & 13.4          & 13.1          & 0.5           \\
    3035 \CFA coroutine eager    & 144.7         & 143.9         & 1.5           \\
    3036 \CFA thread                             & 466.4         & 468.0         & 11.3          \\
    3037 \uC coroutine                   & 155.6         & 155.7         & 1.7           \\
    3038 \uC thread                              & 523.4         & 523.9         & 7.7           \\
    3039 Python generator                & 123.2         & 124.3         & 4.1           \\
    3040 Node.js generator               & 33.4          & 33.5          & 0.3           \\
    3041 Goroutine thread                & 751.0         & 750.5         & 3.1           \\
    3042 Rust tokio thread               & 1860.0        & 1881.1        & 37.6          \\
    3043 Rust thread                             & 53801.0       & 53896.8       & 274.9         \\
    3044 Java thread (   10 000)         & 119256.0      & 119679.2      & 2244.0        \\
    3045 Java thread (1 000 000)         & 123100.0      & 123052.5      & 751.6         \\
    3046 Pthreads thread                 & 31465.5       & 31419.5       & 140.4
     3032\multicolumn{1}{@{}r}{N\hspace*{10pt}} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
     3033\CFA generator (1B)                     & 0.6           & 0.6           & 0.0           \\
     3034\CFA coroutine lazy     (100M)  & 13.4          & 13.1          & 0.5           \\
     3035\CFA coroutine eager (10M)      & 144.7         & 143.9         & 1.5           \\
     3036\CFA thread (10M)                       & 466.4         & 468.0         & 11.3          \\
     3037\uC coroutine (10M)                     & 155.6         & 155.7         & 1.7           \\
     3038\uC thread (10M)                        & 523.4         & 523.9         & 7.7           \\
     3039Python generator (10M)          & 123.2         & 124.3         & 4.1           \\
     3040Node.js generator (10M)         & 33.4          & 33.5          & 0.3           \\
     3041Goroutine thread (10M)          & 751.0         & 750.5         & 3.1           \\
     3042Rust tokio thread (10M)         & 1860.0        & 1881.1        & 37.6          \\
     3043Rust thread     (250K)                  & 53801.0       & 53896.8       & 274.9         \\
     3044Java thread (250K)                      & 119256.0      & 119679.2      & 2244.0        \\
     3045% Java thread (1 000 000)               & 123100.0      & 123052.5      & 751.6         \\
     3046Pthreads thread (250K)          & 31465.5       & 31419.5       & 140.4
    30473047\end{tabular}
    30483048\end{multicols}
     
    30533053Internal scheduling is measured using a cycle of two threads signalling and waiting.
    30543054Figure~\ref{f:schedint} shows the code for \CFA, with results in Table~\ref{t:schedint}.
    3055 Note, the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects.
    3056 Java scheduling is significantly greater because the benchmark explicitly creates multiple threads in order to prevent the JIT from making the program sequential, \ie removing all locking.
     3055Note, the \CFA incremental cost for bulk acquire is a fixed cost for small numbers of mutex objects.
     3056User-level threading has one kernel thread, eliminating contention between the threads (direct handoff of the kernel thread).
     3057Kernel-level threading has two kernel threads allowing some contention.
    30573058
    30583059\begin{multicols}{2}
    3059 \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
    3060 \begin{cfa}
     3060\setlength{\tabcolsep}{3pt}
     3061\begin{cfa}[xleftmargin=0pt]
    30613062volatile int go = 0;
    3062 @condition c;@
    3063 @monitor@ M {} m1/*, m2, m3, m4*/;
    3064 void call( M & @mutex p1/*, p2, p3, p4*/@ ) {
    3065         @signal( c );@
    3066 }
    3067 void wait( M & @mutex p1/*, p2, p3, p4*/@ ) {
     3063`condition c;`
     3064`monitor` M {} m1/*, m2, m3, m4*/;
     3065void call( M & `mutex p1/*, p2, p3, p4*/` ) {
     3066        `signal( c );`
     3067}
     3068void wait( M & `mutex p1/*, p2, p3, p4*/` ) {
    30683069        go = 1; // continue other thread
    3069         for ( N ) { @wait( c );@ } );
     3070        for ( N ) { `wait( c );` } );
    30703071}
    30713072thread T {};
     
    30923093
    30933094\begin{tabular}{@{}r*{3}{D{.}{.}{5.2}}@{}}
    3094 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
    3095 \CFA @signal@, 1 monitor        & 364.4         & 364.2         & 4.4           \\
    3096 \CFA @signal@, 2 monitor        & 484.4         & 483.9         & 8.8           \\
    3097 \CFA @signal@, 4 monitor        & 709.1         & 707.7         & 15.0          \\
    3098 \uC @signal@ monitor            & 328.3         & 327.4         & 2.4           \\
    3099 Rust cond. variable                     & 7514.0        & 7437.4        & 397.2         \\
    3100 Java @notify@ monitor (  1 000 000)             & 8717.0        & 8774.1        & 471.8         \\
    3101 Java @notify@ monitor (100 000 000)             & 8634.0        & 8683.5        & 330.5         \\
    3102 Pthreads cond. variable         & 5553.7        & 5576.1        & 345.6
     3095\multicolumn{1}{@{}r}{N\hspace*{10pt}} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
     3096\CFA @signal@, 1 monitor (10M)  & 364.4         & 364.2         & 4.4           \\
     3097\CFA @signal@, 2 monitor (10M)  & 484.4         & 483.9         & 8.8           \\
     3098\CFA @signal@, 4 monitor (10M)  & 709.1         & 707.7         & 15.0          \\
     3099\uC @signal@ monitor (10M)              & 328.3         & 327.4         & 2.4           \\
     3100Rust cond. variable     (1M)            & 7514.0        & 7437.4        & 397.2         \\
     3101Java @notify@ monitor (1M)              & 8717.0        & 8774.1        & 471.8         \\
     3102% Java @notify@ monitor (100 000 000)           & 8634.0        & 8683.5        & 330.5         \\
     3103Pthreads cond. variable (1M)    & 5553.7        & 5576.1        & 345.6
    31033104\end{tabular}
    31043105\end{multicols}
     
    31093110External scheduling is measured using a cycle of two threads calling and accepting the call using the @waitfor@ statement.
    31103111Figure~\ref{f:schedext} shows the code for \CFA with results in Table~\ref{t:schedext}.
    3111 Note, the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects.
     3112Note, the \CFA incremental cost for bulk acquire is a fixed cost for small numbers of mutex objects.
    31123113
    31133114\begin{multicols}{2}
    3114 \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
     3115\setlength{\tabcolsep}{5pt}
    31153116\vspace*{-16pt}
    3116 \begin{cfa}
    3117 @monitor@ M {} m1/*, m2, m3, m4*/;
    3118 void call( M & @mutex p1/*, p2, p3, p4*/@ ) {}
    3119 void wait( M & @mutex p1/*, p2, p3, p4*/@ ) {
    3120         for ( N ) { @waitfor( call : p1/*, p2, p3, p4*/ );@ }
     3117\begin{cfa}[xleftmargin=0pt]
     3118`monitor` M {} m1/*, m2, m3, m4*/;
     3119void call( M & `mutex p1/*, p2, p3, p4*/` ) {}
     3120void wait( M & `mutex p1/*, p2, p3, p4*/` ) {
     3121        for ( N ) { `waitfor( call : p1/*, p2, p3, p4*/ );` }
    31213122}
    31223123thread T {};
     
    31353136\columnbreak
    31363137
    3137 \vspace*{-16pt}
     3138\vspace*{-18pt}
    31383139\captionof{table}{External-scheduling comparison (nanoseconds)}
    31393140\label{t:schedext}
    31403141\begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
    3141 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
    3142 \CFA @waitfor@, 1 monitor       & 367.1 & 365.3 & 5.0   \\
    3143 \CFA @waitfor@, 2 monitor       & 463.0 & 464.6 & 7.1   \\
    3144 \CFA @waitfor@, 4 monitor       & 689.6 & 696.2 & 21.5  \\
    3145 \uC \lstinline[language=uC++]|_Accept| monitor  & 328.2 & 329.1 & 3.4   \\
    3146 Go \lstinline[language=Golang]|select| channel  & 365.0 & 365.5 & 1.2
     3142\multicolumn{1}{@{}r}{N\hspace*{10pt}} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
     3143\CFA @waitfor@, 1 monitor (10M) & 367.1 & 365.3 & 5.0   \\
     3144\CFA @waitfor@, 2 monitor (10M) & 463.0 & 464.6 & 7.1   \\
     3145\CFA @waitfor@, 4 monitor (10M) & 689.6 & 696.2 & 21.5  \\
     3146\uC \lstinline[language=uC++]|_Accept| monitor (10M)    & 328.2 & 329.1 & 3.4   \\
     3147Go \lstinline[language=Golang]|select| channel (10M)    & 365.0 & 365.5 & 1.2
    31473148\end{tabular}
    31483149\end{multicols}
     
    31573158
    31583159\begin{multicols}{2}
    3159 \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
    3160 \begin{cfa}
    3161 @monitor@ M {} m1/*, m2, m3, m4*/;
    3162 call( M & @mutex p1/*, p2, p3, p4*/@ ) {}
     3160\setlength{\tabcolsep}{3pt}
     3161\begin{cfa}[xleftmargin=0pt]
     3162`monitor` M {} m1/*, m2, m3, m4*/;
     3163call( M & `mutex p1/*, p2, p3, p4*/` ) {}
    31633164int main() {
    31643165        BENCH( for( N ) call( m1/*, m2, m3, m4*/ ); )
     
    31753176\label{t:mutex}
    31763177\begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
    3177 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
    3178 test-and-test-set lock                  & 19.1  & 18.9  & 0.4   \\
    3179 \CFA @mutex@ function, 1 arg.   & 48.3  & 47.8  & 0.9   \\
    3180 \CFA @mutex@ function, 2 arg.   & 86.7  & 87.6  & 1.9   \\
    3181 \CFA @mutex@ function, 4 arg.   & 173.4 & 169.4 & 5.9   \\
    3182 \uC @monitor@ member rtn.               & 54.8  & 54.8  & 0.1   \\
    3183 Goroutine mutex lock                    & 34.0  & 34.0  & 0.0   \\
    3184 Rust mutex lock                                 & 33.0  & 33.2  & 0.8   \\
    3185 Java synchronized method (   100 000 000)               & 31.0  & 30.9  & 0.5   \\
    3186 Java synchronized method (10 000 000 000)               & 31.0 & 30.2 & 0.9 \\
    3187 Pthreads mutex Lock                             & 31.0  & 31.1  & 0.4
     3178\multicolumn{1}{@{}r}{N\hspace*{10pt}} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
     3179test-and-test-set lock (50M)            & 19.1  & 18.9  & 0.4   \\
     3180\CFA @mutex@ function, 1 arg. (50M)     & 48.3  & 47.8  & 0.9   \\
     3181\CFA @mutex@ function, 2 arg. (50M)     & 86.7  & 87.6  & 1.9   \\
     3182\CFA @mutex@ function, 4 arg. (50M)     & 173.4 & 169.4 & 5.9   \\
     3183\uC @monitor@ member rtn. (50M)         & 54.8  & 54.8  & 0.1   \\
     3184Goroutine mutex lock (50M)                      & 34.0  & 34.0  & 0.0   \\
     3185Rust mutex lock (50M)                           & 33.0  & 33.2  & 0.8   \\
     3186Java synchronized method (50M)          & 31.0  & 30.9  & 0.5   \\
     3187% Java synchronized method (10 000 000 000)             & 31.0 & 30.2 & 0.9 \\
     3188Pthreads mutex Lock (50M)                       & 31.0  & 31.1  & 0.4
    31883189\end{tabular}
    31893190\end{multicols}
     
    32143215
    32153216\begin{multicols}{2}
    3216 \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
    3217 \begin{cfa}[aboveskip=0pt,belowskip=0pt]
    3218 @coroutine@ C {};
    3219 void main( C & ) { for () { @suspend;@ } }
     3217\begin{cfa}[xleftmargin=0pt]
     3218`coroutine` C {};
     3219void main( C & ) { for () { `suspend;` } }
    32203220int main() { // coroutine test
    32213221        C c;
    3222         BENCH( for ( N ) { @resume( c );@ } )
     3222        BENCH( for ( N ) { `resume( c );` } )
    32233223        sout | result;
    32243224}
    32253225int main() { // thread test
    3226         BENCH( for ( N ) { @yield();@ } )
     3226        BENCH( for ( N ) { `yield();` } )
    32273227        sout | result;
    32283228}
     
    32373237\label{t:ctx-switch}
    32383238\begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
    3239 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
    3240 C function                      & 1.8           & 1.8           & 0.0   \\
    3241 \CFA generator          & 1.8           & 2.0           & 0.3   \\
    3242 \CFA coroutine          & 32.5          & 32.9          & 0.8   \\
    3243 \CFA thread                     & 93.8          & 93.6          & 2.2   \\
    3244 \uC coroutine           & 50.3          & 50.3          & 0.2   \\
    3245 \uC thread                      & 97.3          & 97.4          & 1.0   \\
    3246 Python generator        & 40.9          & 41.3          & 1.5   \\
    3247 Node.js await           & 1852.2        & 1854.7        & 16.4  \\
    3248 Node.js generator       & 33.3          & 33.4          & 0.3   \\
    3249 Goroutine thread        & 143.0         & 143.3         & 1.1   \\
    3250 Rust async await        & 32.0          & 32.0          & 0.0   \\
    3251 Rust tokio thread       & 143.0         & 143.0         & 1.7   \\
    3252 Rust thread                     & 332.0         & 331.4         & 2.4   \\
    3253 Java thread     (      100 000)         & 405.0         & 415.0         & 17.6  \\
    3254 Java thread (  100 000 000)                     & 413.0 & 414.2 & 6.2 \\
    3255 Java thread (5 000 000 000)                     & 415.0 & 415.2 & 6.1 \\
    3256 Pthreads thread         & 334.3         & 335.2         & 3.9
     3239\multicolumn{1}{@{}r}{N\hspace*{10pt}} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
     3240C function (10B)                        & 1.8           & 1.8           & 0.0   \\
     3241\CFA generator (5B)                     & 1.8           & 2.0           & 0.3   \\
     3242\CFA coroutine (100M)           & 32.5          & 32.9          & 0.8   \\
     3243\CFA thread (100M)                      & 93.8          & 93.6          & 2.2   \\
     3244\uC coroutine (100M)            & 50.3          & 50.3          & 0.2   \\
     3245\uC thread (100M)                       & 97.3          & 97.4          & 1.0   \\
     3246Python generator (100M)         & 40.9          & 41.3          & 1.5   \\
     3247Node.js await (5M)                      & 1852.2        & 1854.7        & 16.4  \\
     3248Node.js generator (100M)        & 33.3          & 33.4          & 0.3   \\
     3249Goroutine thread (100M)         & 143.0         & 143.3         & 1.1   \\
     3250Rust async await (100M)         & 32.0          & 32.0          & 0.0   \\
     3251Rust tokio thread (100M)        & 143.0         & 143.0         & 1.7   \\
     3252Rust thread (25M)                       & 332.0         & 331.4         & 2.4   \\
     3253Java thread (100M)                      & 405.0         & 415.0         & 17.6  \\
     3254% Java thread (  100 000 000)                   & 413.0 & 414.2 & 6.2 \\
     3255% Java thread (5 000 000 000)                   & 415.0 & 415.2 & 6.1 \\
     3256Pthreads thread (25M)           & 334.3         & 335.2         & 3.9
    32573257\end{tabular}
    32583258\end{multicols}
     
    32633263Languages using 1:1 threading based on pthreads can at best meet or exceed, due to language overhead, the pthread results.
    32643264Note, pthreads has a fast zero-contention mutex lock checked in user space.
    3265 Languages with M:N threading have better performance than 1:1 because there is no operating-system interactions.
     3265Languages with M:N threading have better performance than 1:1 because there are no operating-system interactions (context-switching or locking).
     3266As well, for locking experiments, M:N threading has less contention if only one kernel thread is used.
    32663267Languages with stackful coroutines have higher cost than stackless coroutines because of stack allocation and context switching;
    32673268however, stackful \uC and \CFA coroutines have approximately the same performance as stackless Python and Node.js generators.
    32683269The \CFA stackless generator is approximately 25 times faster for suspend/resume and 200 times faster for creation than stackless Python and Node.js generators.
     3270The Node.js context-switch is costly when asynchronous await must enter the event engine because a promise is not fulfilled.
     3271Finally, the benchmark results correlate across programming languages with and without JIT, indicating the JIT has completed any runtime optimizations.
    32693272
    32703273
     
    33243327
    33253328The authors recognize the design assistance of Aaron Moss, Rob Schluntz, Andrew Beach, and Michael Brooks; David Dice for commenting and helping with the Java benchmarks; and Gregor Richards for helping with the Node.js benchmarks.
    3326 This research is funded by a grant from Waterloo-Huawei (\url{http://www.huawei.com}) Joint Innovation Lab. %, and Peter Buhr is partially funded by the Natural Sciences and Engineering Research Council of Canada.
     3329This research is funded by the NSERC/Waterloo-Huawei (\url{http://www.huawei.com}) Joint Innovation Lab. %, and Peter Buhr is partially funded by the Natural Sciences and Engineering Research Council of Canada.
    33273330
    33283331{%
  • doc/papers/concurrency/annex/local.bib

    r2fb35df r41b8ea4  
    5959@manual{Cpp-Transactions,
    6060        keywords        = {C++, Transactional Memory},
    61         title           = {Technical Specification for C++ Extensions for Transactional Memory},
     61        title           = {Tech. Spec. for C++ Extensions for Transactional Memory},
    6262        organization= {International Standard ISO/IEC TS 19841:2015 },
    6363        publisher   = {American National Standards Institute},
  • doc/papers/concurrency/mail2

    r2fb35df r41b8ea4  
    959959Software: Practice and Experience Editorial Office
    960960
     961
     962
     963Date: Wed, 2 Sep 2020 20:55:34 +0000
     964From: Richard Jones <onbehalfof@manuscriptcentral.com>
     965Reply-To: R.E.Jones@kent.ac.uk
     966To: tdelisle@uwaterloo.ca, pabuhr@uwaterloo.ca
     967Subject: Software: Practice and Experience - Decision on Manuscript ID
     968 SPE-19-0219.R2
     969
     97002-Sep-2020
     971
     972Dear Dr Buhr,
     973
     974Many thanks for submitting SPE-19-0219.R2 entitled "Advanced Control-flow and Concurrency in Cforall" to Software: Practice and Experience. The paper has now been reviewed and the comments of the referees are included at the bottom of this letter. I apologise for the length of time it has taken to get these.
     975
     976Both reviewers consider this paper to be close to acceptance. However, before I can accept this paper, I would like you address the comments of Reviewer 2, particularly with regard to the description of the adaptation of the Java harness to deal with warmup. I would expect to see a convincing argument that the computation has reached a steady state. I would also like you to provide the values for N for each benchmark run. This should be very straightforward for you to do. There are a couple of papers on steady state that you may wish to consult (though I am certainly not pushing my own work).
     977
     9781) Barrett, Edd; Bolz-Tereick, Carl Friedrich; Killick, Rebecca; Mount, Sarah and Tratt, Laurence. Virtual Machine Warmup Blows Hot and Cold. OOPSLA 2017. https://doi.org/10.1145/3133876
     979Virtual Machines (VMs) with Just-In-Time (JIT) compilers are traditionally thought to execute programs in two phases: the initial warmup phase determines which parts of a program would most benefit from dynamic compilation, before JIT compiling those parts into machine code; subsequently the program is said to be at a steady state of peak performance. Measurement methodologies almost always discard data collected during the warmup phase such that reported measurements focus entirely on peak performance. We introduce a fully automated statistical approach, based on changepoint analysis, which allows us to determine if a program has reached a steady state and, if so, whether that represents peak performance or not. Using this, we show that even when run in the most controlled of circumstances, small, deterministic, widely studied microbenchmarks often fail to reach a steady state of peak performance on a variety of common VMs. Repeating our experiment on 3 different machines, we found that at most 43.5% of pairs consistently reach a steady state of peak performance.
     980
     9812) Kalibera, Tomas and Jones, Richard. Rigorous Benchmarking in Reasonable Time. ISMM  2013. https://doi.org/10.1145/2555670.2464160
     982Experimental evaluation is key to systems research. Because modern systems are complex and non-deterministic, good experimental methodology demands that researchers account for uncertainty. To obtain valid results, they are expected to run many iterations of benchmarks, invoke virtual machines (VMs) several times, or even rebuild VM or benchmark binaries more than once. All this repetition costs time to complete experiments. Currently, many evaluations give up on sufficient repetition or rigorous statistical methods, or even run benchmarks only in training sizes. The results reported often lack proper variation estimates and, when a small difference between two systems is reported, some are simply unreliable.In contrast, we provide a statistically rigorous methodology for repetition and summarising results that makes efficient use of experimentation time. Time efficiency comes from two key observations. First, a given benchmark on a given platform is typically prone to much less non-determinism than the common worst-case of published corner-case studies. Second, repetition is most needed where most uncertainty arises (whether between builds, between executions or between iterations). We capture experimentation cost with a novel mathematical model, which we use to identify the number of repetitions at each level of an experiment necessary and sufficient to obtain a given level of precision.We present our methodology as a cookbook that guides researchers on the number of repetitions they should run to obtain reliable results. We also show how to present results with an effect size confidence interval. As an example, we show how to use our methodology to conduct throughput experiments with the DaCapo and SPEC CPU benchmarks on three recent platforms.
     983
     984You have 42 days from the date of this email to submit your revision. If you are unable to complete the revision within this time, please contact me to request a short extension.
     985
     986You can upload your revised manuscript and submit it through your Author Center. Log into https://mc.manuscriptcentral.com/spe and enter your Author Center, where you will find your manuscript title listed under "Manuscripts with Decisions".
     987
     988When submitting your revised manuscript, you will be able to respond to the comments made by the referee(s) in the space provided.  You can use this space to document any changes you make to the original manuscript.
     989
     990If you would like help with English language editing, or other article preparation support, Wiley Editing Services offers expert help with English Language Editing, as well as translation, manuscript formatting, and figure formatting at www.wileyauthors.com/eeo/preparation. You can also check out our resources for Preparing Your Article for general guidance about writing and preparing your manuscript at www.wileyauthors.com/eeo/prepresources.
     991 
     992Once again, thank you for submitting your manuscript to Software: Practice and Experience. I look forward to receiving your revision.
     993
     994Sincerely,
     995Richard
     996
     997Prof. Richard Jones
     998Editor, Software: Practice and Experience
     999R.E.Jones@kent.ac.uk
     1000
     1001Referee(s)' Comments to Author:
     1002
     1003Reviewing: 1
     1004
     1005Comments to the Author
     1006Overall, I felt that this draft was an improvement on previous drafts and I don't have further changes to request.
     1007
     1008I appreciated the new language to clarify the relationship of external and internal scheduling, for example, as well as the new measurements of Rust tokio. Also, while I still believe that the choice between thread/generator/coroutine and so forth could be made crisper and clearer, the current draft of Section 2 did seem adequate to me in terms of specifying the considerations that users would have to take into account to make the choice.
     1009
     1010
     1011Reviewing: 2
     1012
     1013Comments to the Author
     1014First: let me apologise for the delay on this review. I'll blame the global pandemic combined with my institution's senior management's counterproductive decisions for taking up most of my time and all of my energy.
     1015
     1016At this point, reading the responses, I think we've been around the course enough times that further iteration is unlikely to really improve the paper any further, so I'm happy to recommend acceptance.    My main comments are that there were some good points in the responses to *all* the reviews and I strongly encourage the authors to incorporate those discursive responses into the final paper so they may benefit readers as well as reviewers.   I agree with the recommendations of reviewer #2 that the paper could usefully be split in to two, which I think I made to a previous revision, but I'm happy to leave that decision to the Editor.
     1017
     1018Finally, the paper needs to describe how the Java harness was adapted to deal with warmup; why the computation has warmed up and reached a steady state - similarly for js and Python. The tables should also give the "N" chosen for each benchmark run.
     1019 
     1020minor points
     1021* don't start sentences with "However"
     1022* most downloaded isn't an "Award"
     1023
     1024
     1025
     1026Date: Thu, 1 Oct 2020 05:34:29 +0000
     1027From: Richard Jones <onbehalfof@manuscriptcentral.com>
     1028Reply-To: R.E.Jones@kent.ac.uk
     1029To: pabuhr@uwaterloo.ca
     1030Subject: Revision reminder - SPE-19-0219.R2
     1031
     103201-Oct-2020
     1033
     1034Dear Dr Buhr
     1035
     1036SPE-19-0219.R2
     1037
     1038This is a reminder that your opportunity to revise and re-submit your manuscript will expire 14 days from now. If you require more time please contact me directly and I may grant an extension to this deadline, otherwise the option to submit a revision online, will not be available.
     1039
     1040If your article is of potential interest to the general public, (which means it must be timely, groundbreaking, interesting and impact on everyday society) then please e-mail ejp@wiley.co.uk explaining the public interest side of the research. Wiley will then investigate the potential for undertaking a global press campaign on the article.
     1041
     1042I look forward to receiving your revision.
     1043
     1044Sincerely,
     1045
     1046Prof. Richard Jones
     1047Editor, Software: Practice and Experience
     1048
     1049https://mc.manuscriptcentral.com/spe
     1050
     1051
     1052
     1053Date: Tue, 6 Oct 2020 15:29:41 +0000
     1054From: Mayank Roy Chowdhury <onbehalfof@manuscriptcentral.com>
     1055Reply-To: speoffice@wiley.com
     1056To: tdelisle@uwaterloo.ca, pabuhr@uwaterloo.ca
     1057Subject: SPE-19-0219.R3 successfully submitted
     1058
     105906-Oct-2020
     1060
     1061Dear Dr Buhr,
     1062
     1063Your manuscript entitled "Advanced Control-flow and Concurrency in Cforall" has been successfully submitted online and is presently being given full consideration for publication in Software: Practice and Experience.
     1064
     1065Your manuscript number is SPE-19-0219.R3.  Please mention this number in all future correspondence regarding this submission.
     1066
     1067You can view the status of your manuscript at any time by checking your Author Center after logging into https://mc.manuscriptcentral.com/spe.  If you have difficulty using this site, please click the 'Get Help Now' link at the top right corner of the site.
     1068
     1069
     1070Thank you for submitting your manuscript to Software: Practice and Experience.
     1071
     1072Sincerely,
     1073
     1074Software: Practice and Experience Editorial Office
     1075