Changeset fe9cf9e


Ignore:
Timestamp:
Jun 4, 2020, 9:17:02 AM (4 years ago)
Author:
Peter A. Buhr <pabuhr@…>
Branches:
ADT, arm-eh, ast-experimental, enum, forall-pointer-decay, jacob/cs343-translation, master, new-ast, new-ast-unique-expr, pthread-emulation, qualifiedEnum
Children:
aabb846
Parents:
04b4a71
Message:

fixed a few small wording problems

File:
1 edited

Legend:

Unmodified
Added
Removed
  • doc/papers/concurrency/Paper.tex

    r04b4a71 rfe9cf9e  
    269269
    270270\abstract[Summary]{
    271 \CFA is a polymorphic, non-object-oriented, concurrent, backwards-compatible extension of the C programming language.
     271\CFA is a polymorphic, non-object-oriented, concurrent, backwards compatible extension of the C programming language.
    272272This paper discusses the design philosophy and implementation of its advanced control-flow and concurrent/parallel features, along with the supporting runtime written in \CFA.
    273273These features are created from scratch as ISO C has only low-level and/or unimplemented concurrency, so C programmers continue to rely on library approaches like pthreads.
     
    301301The TIOBE index~\cite{TIOBE} for May 2020 ranks the top five \emph{popular} programming languages as C 17\%, Java 16\%, Python 9\%, \CC 6\%, and \Csharp 4\% = 52\%, and over the past 30 years, C has always ranked either first or second in popularity.}
    302302allowing immediate dissemination.
    303 This paper discusses the design philosophy and implementation of advanced language-level control-flow and concurrent/parallel features in \CFA and its runtime, which is written entirely in \CFA.
    304 The \CFA control-flow framework extends ISO \Celeven~\cite{C11} with new call/return and concurrent/parallel control-flow.
     303This paper discusses the design philosophy and implementation of \CFA's advanced control-flow and concurrent/parallel features, along with the supporting runtime written in \CFA.
    305304
    306305% The call/return extensions retain state between callee and caller versus losing the callee's state on return;
    307306% the concurrency extensions allow high-level management of threads.
    308307
     308The \CFA control-flow framework extends ISO \Celeven~\cite{C11} with new call/return and concurrent/parallel control-flow.
    309309Call/return control-flow with argument and parameter passing appeared in the first programming languages.
    310 Over the past 50 years, call/return has been augmented with features like static/dynamic call, exceptions (multi-level return) and generators/coroutines (retain state between calls).
     310Over the past 50 years, call/return has been augmented with features like static and dynamic call, exceptions (multi-level return) and generators/coroutines (see Section~\ref{s:StatefulFunction}).
    311311While \CFA has mechanisms for dynamic call (algebraic effects~\cite{Zhang19}) and exceptions\footnote{
    312312\CFA exception handling will be presented in a separate paper.
     
    314314\newterm{Coroutining} was introduced by Conway~\cite{Conway63} (1963), discussed by Knuth~\cite[\S~1.4.2]{Knuth73V1}, implemented in Simula67~\cite{Simula67}, formalized by Marlin~\cite{Marlin80}, and is now popular and appears in old and new programming languages: CLU~\cite{CLU}, \Csharp~\cite{Csharp}, Ruby~\cite{Ruby}, Python~\cite{Python}, JavaScript~\cite{JavaScript}, Lua~\cite{Lua}, \CCtwenty~\cite{C++20Coroutine19}.
    315315Coroutining is sequential execution requiring direct handoff among coroutines, \ie only the programmer is controlling execution order.
    316 If coroutines transfer to an internal event-engine for scheduling the next coroutines, the program transitions into the realm of concurrency~\cite[\S~3]{Buhr05a}.
     316If coroutines transfer to an internal event-engine for scheduling the next coroutines (as in async-await), the program transitions into the realm of concurrency~\cite[\S~3]{Buhr05a}.
    317317Coroutines are only a stepping stone towards concurrency where the commonality is that coroutines and threads retain state between calls.
    318318
     
    324324Kernel threading was chosen, largely because of its simplicity and fit with the simpler operating systems and hardware architectures at the time, which gave it a performance advantage~\cite{Drepper03}.
    325325Libraries like pthreads were developed for C, and the Solaris operating-system switched from user (JDK 1.1~\cite{JDK1.1}) to kernel threads.
    326 As a result, many current languages implementations adopt the 1:1 kernel-threading model, like Java (Scala), Objective-C~\cite{obj-c-book}, \CCeleven~\cite{C11}, C\#~\cite{Csharp} and Rust~\cite{Rust}, with a variety of presentation mechanisms.
     326As a result, many languages adopt the 1:1 kernel-threading model, like Java (Scala), Objective-C~\cite{obj-c-book}, \CCeleven~\cite{C11}, C\#~\cite{Csharp} and Rust~\cite{Rust}, with a variety of presentation mechanisms.
    327327From 2000 onwards, several language implementations have championed the M:N user-threading model, like Go~\cite{Go}, Erlang~\cite{Erlang}, Haskell~\cite{Haskell}, D~\cite{D}, and \uC~\cite{uC++,uC++book}, including putting green threads back into Java~\cite{Quasar}, and many user-threading libraries have appeared~\cite{Qthreads,MPC,Marcel}.
    328328The main argument for user-level threading is that it is lighter weight than kernel threading because locking and context switching do not cross the kernel boundary, so there is less restriction on programming styles that encourages large numbers of threads performing medium-sized work to facilitate load balancing by the runtime~\cite{Verch12}.
    329329As well, user-threading facilitates a simpler concurrency approach using thread objects that leverage sequential patterns versus events with call-backs~\cite{Adya02,vonBehren03}.
    330 Finally, performant user-threading implementations, both time and space, meet or exceed direct kernel-threading implementations, while achieving the programming advantages of high concurrency levels and safety.
     330Finally, performant user-threading implementations, both in time and space, meet or exceed direct kernel-threading implementations, while achieving the programming advantages of high concurrency levels and safety.
    331331
    332332A further effort over the past two decades is the development of language memory models to deal with the conflict between language features and compiler/hardware optimizations, \eg some language features are unsafe in the presence of aggressive sequential optimizations~\cite{Buhr95a,Boehm05}.
     
    338338
    339339Finally, it is important for a language to provide safety over performance \emph{as the default}, allowing careful reduction of safety for performance when necessary.
    340 Two concurrency violations of this philosophy are \emph{spurious or random wakeup}~\cite[\S~9]{Buhr05a}) and \emph{barging}\footnote{
     340Two concurrency violations of this philosophy are \emph{spurious} or \emph{random wakeup}~\cite[\S~9]{Buhr05a}, and \emph{barging}\footnote{
    341341Barging is competitive succession instead of direct handoff, \ie after a lock is released both arriving and preexisting waiter threads compete to acquire the lock.
    342342Hence, an arriving thread can temporally \emph{barge} ahead of threads already waiting for an event, which can repeat indefinitely leading to starvation of waiter threads.
     
    386386Section~\ref{s:Monitor} shows how both mutual exclusion and synchronization are safely embedded in the @monitor@ and @thread@ constructs.
    387387Section~\ref{s:CFARuntimeStructure} describes the large-scale mechanism to structure threads and virtual processors (kernel threads).
    388 Section~\ref{s:Performance} uses a series of microbenchmarks to compare \CFA threading with pthreads, Java 11.0.6, Go 1.12.6, Rust 1.37.0, Python 3.7.6, Node.js 12.14.1, and \uC 7.0.0.
     388Section~\ref{s:Performance} uses microbenchmarks to compare \CFA threading with pthreads, Java 11.0.6, Go 1.12.6, Rust 1.37.0, Python 3.7.6, Node.js 12.14.1, and \uC 7.0.0.
    389389
    390390
     
    395395To this end, the control-flow features created for \CFA are based on the fundamental properties of any language with function-stack control-flow (see also \uC~\cite[pp.~140-142]{uC++}).
    396396The fundamental properties are execution state, thread, and mutual-exclusion/synchronization (MES).
    397 These independent properties can be used to compose different language features, forming a compositional hierarchy, where the combination of all three is the most advanced feature, called a thread/task/process.
     397These independent properties can be used to compose different language features, forming a compositional hierarchy, where the combination of all three is the most advanced feature, called a thread.
    398398While it is possible for a language to only provide threads for composing programs~\cite{Hermes90}, this unnecessarily complicates and makes inefficient solutions to certain classes of problems.
    399399As is shown, each of the non-rejected composed language features solves a particular set of problems, and hence, has a defensible position in a programming language.
     
    405405is the state information needed by a control-flow feature to initialize, manage compute data and execution location(s), and de-initialize, \eg calling a function initializes a stack frame including contained objects with constructors, manages local data in blocks and return locations during calls, and de-initializes the frame by running any object destructors and management operations.
    406406State is retained in fixed-sized aggregate structures (objects) and dynamic-sized stack(s), often allocated in the heap(s) managed by the runtime system.
    407 The lifetime of the state varies with the control-flow feature, where longer life-time and dynamic size provide greater power but also increase usage complexity and cost.
     407The lifetime of state varies with the control-flow feature, where longer life-time and dynamic size provide greater power but also increase usage complexity and cost.
    408408Control-flow transfers among execution states in multiple ways, such as function call, context switch, asynchronous await, etc.
    409409Because the programming language determines what constitutes an execution state, implicitly manages this state, and defines movement mechanisms among states, execution state is an elementary property of the semantics of a programming language.
     
    411411
    412412\item[\newterm{threading}:]
    413 is execution of code that occurs independently of other execution, \ie the execution resulting from a thread is sequential.
     413is execution of code that occurs independently of other execution, where an individual thread's execution is sequential.
    414414Multiple threads provide \emph{concurrent execution};
    415415concurrent execution becomes parallel when run on multiple processing units, \eg hyper-threading, cores, or sockets.
    416 There must be language mechanisms to create, block and unblock, and join with a thread.
     416There must be language mechanisms to create, block and unblock, and join with a thread, even if the mechanism is indirect.
    417417
    418418\item[\newterm{MES}:]
     
    421421Limiting MES, \eg no access to shared data, results in contrived solutions and inefficiency on multi-core von Neumann computers where shared memory is a foundational aspect of its design.
    422422\end{description}
    423 These properties are fundamental because they cannot be built from existing language features, \eg a basic programming language like C99~\cite{C99} cannot create new control-flow features, concurrency, or provide MES using atomic hardware mechanisms.
     423These properties are fundamental because they cannot be built from existing language features, \eg a basic programming language like C99~\cite{C99} cannot create new control-flow features, concurrency, or provide MES without atomic hardware mechanisms.
    424424
    425425
     
    432432Table~\ref{t:ExecutionPropertyComposition} shows all combinations of the three fundamental execution properties available to language designers.
    433433(When doing combination case-analysis, not all combinations are meaningful.)
    434 The combinations of state, thread, and mutual exclusion compose a hierarchy of control-flow features all of which have appeared in prior programming languages, where each of these languages have found the feature useful.
     434The combinations of state, thread, and MES compose a hierarchy of control-flow features all of which have appeared in prior programming languages, where each of these languages have found the feature useful.
    435435To understand the table, it is important to review the basic von Neumann execution requirement of at least one thread and execution state providing some form of call stack.
    436436For table entries missing these minimal components, the property is borrowed from the invoker (caller).
     
    468468A @mutex@ structure, often called a \newterm{monitor}, provides a high-level interface for race-free access of shared data in concurrent programming-languages.
    469469Case 3 is case 1 where the structure can implicitly retain execution state and access functions use this execution state to resume/suspend across \emph{callers}, but resume/suspend does not retain a function's local state.
    470 A stackless structure, often called a \newterm{generator} or \emph{iterator}, is \newterm{stackless} because it still borrow the caller's stack and thread, where the stack is used only to preserve state across its callees not callers.
     470A stackless structure, often called a \newterm{generator} or \emph{iterator}, is \newterm{stackless} because it still borrow the caller's stack and thread, but the stack is used only to preserve state across its callees not callers.
    471471Generators provide the first step toward directly solving problems like finite-state machines that retain data and execution state between calls, whereas normal functions restart on each call.
    472472Case 4 is cases 2 and 3 with thread safety during execution of the generator's access functions.
     
    480480Cases 11 and 12 are a stackful thread with and without safe access to shared state.
    481481A thread is the language mechanism to start another thread of control in a program with growable execution state for call/return execution.
    482 In general, more execution properties increase the cost of creation and execution along with complexity of usage.
     482In general, language constructs with more execution properties increase the cost of creation and execution along with complexity of usage.
    483483
    484484Given the execution-properties taxonomy, programmers now ask three basic questions: is state necessary across callers and how much, is a separate thread necessary, is thread safety necessary.
     
    594594&
    595595\begin{cfa}
    596 void * rtn( void * arg ) { ... }
     596void * `rtn`( void * arg ) { ... }
    597597int i = 3, rc;
    598598pthread_t t; $\C{// thread id}$
     
    30373037\end{multicols}
    30383038
     3039\vspace*{-10pt}
     3040\paragraph{Internal Scheduling}
     3041
     3042Internal scheduling is measured using a cycle of two threads signalling and waiting.
     3043Figure~\ref{f:schedint} shows the code for \CFA, with results in Table~\ref{t:schedint}.
     3044Note, the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects.
     3045Java scheduling is significantly greater because the benchmark explicitly creates multiple thread in order to prevent the JIT from making the program sequential, \ie removing all locking.
     3046
     3047\begin{multicols}{2}
     3048\lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
     3049\begin{cfa}
     3050volatile int go = 0;
     3051@condition c;@
     3052@monitor@ M {} m1/*, m2, m3, m4*/;
     3053void call( M & @mutex p1/*, p2, p3, p4*/@ ) {
     3054        @signal( c );@
     3055}
     3056void wait( M & @mutex p1/*, p2, p3, p4*/@ ) {
     3057        go = 1; // continue other thread
     3058        for ( N ) { @wait( c );@ } );
     3059}
     3060thread T {};
     3061void main( T & ) {
     3062        while ( go == 0 ) { yield(); } // waiter must start first
     3063        BENCH( for ( N ) { call( m1/*, m2, m3, m4*/ ); } )
     3064        sout | result;
     3065}
     3066int main() {
     3067        T t;
     3068        wait( m1/*, m2, m3, m4*/ );
     3069}
     3070\end{cfa}
     3071\vspace*{-8pt}
     3072\captionof{figure}{\CFA Internal-scheduling benchmark}
     3073\label{f:schedint}
     3074
     3075\columnbreak
     3076
     3077\vspace*{-16pt}
     3078\captionof{table}{Internal-scheduling comparison (nanoseconds)}
     3079\label{t:schedint}
     3080\bigskip
     3081
     3082\begin{tabular}{@{}r*{3}{D{.}{.}{5.2}}@{}}
     3083\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
     3084\CFA @signal@, 1 monitor        & 364.4         & 364.2         & 4.4           \\
     3085\CFA @signal@, 2 monitor        & 484.4         & 483.9         & 8.8           \\
     3086\CFA @signal@, 4 monitor        & 709.1         & 707.7         & 15.0          \\
     3087\uC @signal@ monitor            & 328.3         & 327.4         & 2.4           \\
     3088Rust cond. variable                     & 7514.0        & 7437.4        & 397.2         \\
     3089Java @notify@ monitor           & 9623.0        & 9654.6        & 236.2         \\
     3090Pthreads cond. variable         & 5553.7        & 5576.1        & 345.6
     3091\end{tabular}
     3092\end{multicols}
     3093
     3094
     3095\paragraph{External Scheduling}
     3096
     3097External scheduling is measured using a cycle of two threads calling and accepting the call using the @waitfor@ statement.
     3098Figure~\ref{f:schedext} shows the code for \CFA with results in Table~\ref{t:schedext}.
     3099Note, the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects.
     3100
     3101\begin{multicols}{2}
     3102\lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
     3103\vspace*{-16pt}
     3104\begin{cfa}
     3105@monitor@ M {} m1/*, m2, m3, m4*/;
     3106void call( M & @mutex p1/*, p2, p3, p4*/@ ) {}
     3107void wait( M & @mutex p1/*, p2, p3, p4*/@ ) {
     3108        for ( N ) { @waitfor( call : p1/*, p2, p3, p4*/ );@ }
     3109}
     3110thread T {};
     3111void main( T & ) {
     3112        BENCH( for ( N ) { call( m1/*, m2, m3, m4*/ ); } )
     3113        sout | result;
     3114}
     3115int main() {
     3116        T t;
     3117        wait( m1/*, m2, m3, m4*/ );
     3118}
     3119\end{cfa}
     3120\captionof{figure}{\CFA external-scheduling benchmark}
     3121\label{f:schedext}
     3122
     3123\columnbreak
     3124
     3125\vspace*{-16pt}
     3126\captionof{table}{External-scheduling comparison (nanoseconds)}
     3127\label{t:schedext}
     3128\begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
     3129\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
     3130\CFA @waitfor@, 1 monitor       & 367.1 & 365.3 & 5.0   \\
     3131\CFA @waitfor@, 2 monitor       & 463.0 & 464.6 & 7.1   \\
     3132\CFA @waitfor@, 4 monitor       & 689.6 & 696.2 & 21.5  \\
     3133\uC \lstinline[language=uC++]|_Accept| monitor  & 328.2 & 329.1 & 3.4   \\
     3134Go \lstinline[language=Golang]|select| channel  & 365.0 & 365.5 & 1.2
     3135\end{tabular}
     3136\end{multicols}
     3137
     3138\paragraph{Mutual-Exclusion}
     3139
     3140Uncontented mutual exclusion, which frequently occurs, is measured by entering and leaving a critical section.
     3141For monitors, entering and leaving a mutex function is measured, otherwise the language-appropriate mutex-lock is measured.
     3142For comparison, a spinning (versus blocking) test-and-test-set lock is presented.
     3143Figure~\ref{f:mutex} shows the code for \CFA with results in Table~\ref{t:mutex}.
     3144Note the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects.
     3145
     3146\begin{multicols}{2}
     3147\lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
     3148\begin{cfa}
     3149@monitor@ M {} m1/*, m2, m3, m4*/;
     3150call( M & @mutex p1/*, p2, p3, p4*/@ ) {}
     3151int main() {
     3152        BENCH( for( N ) call( m1/*, m2, m3, m4*/ ); )
     3153        sout | result;
     3154}
     3155\end{cfa}
     3156\captionof{figure}{\CFA acquire/release mutex benchmark}
     3157\label{f:mutex}
     3158
     3159\columnbreak
     3160
     3161\vspace*{-16pt}
     3162\captionof{table}{Mutex comparison (nanoseconds)}
     3163\label{t:mutex}
     3164\begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
     3165\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
     3166test-and-test-set lock                  & 19.1  & 18.9  & 0.4   \\
     3167\CFA @mutex@ function, 1 arg.   & 48.3  & 47.8  & 0.9   \\
     3168\CFA @mutex@ function, 2 arg.   & 86.7  & 87.6  & 1.9   \\
     3169\CFA @mutex@ function, 4 arg.   & 173.4 & 169.4 & 5.9   \\
     3170\uC @monitor@ member rtn.               & 54.8  & 54.8  & 0.1   \\
     3171Goroutine mutex lock                    & 34.0  & 34.0  & 0.0   \\
     3172Rust mutex lock                                 & 33.0  & 33.2  & 0.8   \\
     3173Java synchronized method                & 31.0  & 31.0  & 0.0   \\
     3174Pthreads mutex Lock                             & 31.0  & 31.1  & 0.4
     3175\end{tabular}
     3176\end{multicols}
     3177
    30393178\paragraph{Context Switching}
    30403179
     
    31043243\end{multicols}
    31053244
    3106 \vspace*{-10pt}
    3107 \paragraph{Internal Scheduling}
    3108 
    3109 Internal scheduling is measured using a cycle of two threads signalling and waiting.
    3110 Figure~\ref{f:schedint} shows the code for \CFA, with results in Table~\ref{t:schedint}.
    3111 Note, the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects.
    3112 Java scheduling is significantly greater because the benchmark explicitly creates multiple thread in order to prevent the JIT from making the program sequential, \ie removing all locking.
    3113 
    3114 \begin{multicols}{2}
    3115 \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
    3116 \begin{cfa}
    3117 volatile int go = 0;
    3118 @condition c;@
    3119 @monitor@ M {} m1/*, m2, m3, m4*/;
    3120 void call( M & @mutex p1/*, p2, p3, p4*/@ ) {
    3121         @signal( c );@
    3122 }
    3123 void wait( M & @mutex p1/*, p2, p3, p4*/@ ) {
    3124         go = 1; // continue other thread
    3125         for ( N ) { @wait( c );@ } );
    3126 }
    3127 thread T {};
    3128 void main( T & ) {
    3129         while ( go == 0 ) { yield(); } // waiter must start first
    3130         BENCH( for ( N ) { call( m1/*, m2, m3, m4*/ ); } )
    3131         sout | result;
    3132 }
    3133 int main() {
    3134         T t;
    3135         wait( m1/*, m2, m3, m4*/ );
    3136 }
    3137 \end{cfa}
    3138 \vspace*{-8pt}
    3139 \captionof{figure}{\CFA Internal-scheduling benchmark}
    3140 \label{f:schedint}
    3141 
    3142 \columnbreak
    3143 
    3144 \vspace*{-16pt}
    3145 \captionof{table}{Internal-scheduling comparison (nanoseconds)}
    3146 \label{t:schedint}
    3147 \bigskip
    3148 
    3149 \begin{tabular}{@{}r*{3}{D{.}{.}{5.2}}@{}}
    3150 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
    3151 \CFA @signal@, 1 monitor        & 364.4         & 364.2         & 4.4           \\
    3152 \CFA @signal@, 2 monitor        & 484.4         & 483.9         & 8.8           \\
    3153 \CFA @signal@, 4 monitor        & 709.1         & 707.7         & 15.0          \\
    3154 \uC @signal@ monitor            & 328.3         & 327.4         & 2.4           \\
    3155 Rust cond. variable                     & 7514.0        & 7437.4        & 397.2         \\
    3156 Java @notify@ monitor           & 9623.0        & 9654.6        & 236.2         \\
    3157 Pthreads cond. variable         & 5553.7        & 5576.1        & 345.6
    3158 \end{tabular}
    3159 \end{multicols}
    3160 
    3161 
    3162 \paragraph{External Scheduling}
    3163 
    3164 External scheduling is measured using a cycle of two threads calling and accepting the call using the @waitfor@ statement.
    3165 Figure~\ref{f:schedext} shows the code for \CFA with results in Table~\ref{t:schedext}.
    3166 Note, the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects.
    3167 
    3168 \begin{multicols}{2}
    3169 \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
    3170 \vspace*{-16pt}
    3171 \begin{cfa}
    3172 @monitor@ M {} m1/*, m2, m3, m4*/;
    3173 void call( M & @mutex p1/*, p2, p3, p4*/@ ) {}
    3174 void wait( M & @mutex p1/*, p2, p3, p4*/@ ) {
    3175         for ( N ) { @waitfor( call : p1/*, p2, p3, p4*/ );@ }
    3176 }
    3177 thread T {};
    3178 void main( T & ) {
    3179         BENCH( for ( N ) { call( m1/*, m2, m3, m4*/ ); } )
    3180         sout | result;
    3181 }
    3182 int main() {
    3183         T t;
    3184         wait( m1/*, m2, m3, m4*/ );
    3185 }
    3186 \end{cfa}
    3187 \captionof{figure}{\CFA external-scheduling benchmark}
    3188 \label{f:schedext}
    3189 
    3190 \columnbreak
    3191 
    3192 \vspace*{-16pt}
    3193 \captionof{table}{External-scheduling comparison (nanoseconds)}
    3194 \label{t:schedext}
    3195 \begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
    3196 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
    3197 \CFA @waitfor@, 1 monitor       & 367.1 & 365.3 & 5.0   \\
    3198 \CFA @waitfor@, 2 monitor       & 463.0 & 464.6 & 7.1   \\
    3199 \CFA @waitfor@, 4 monitor       & 689.6 & 696.2 & 21.5  \\
    3200 \uC \lstinline[language=uC++]|_Accept| monitor  & 328.2 & 329.1 & 3.4   \\
    3201 Go \lstinline[language=Golang]|select| channel  & 365.0 & 365.5 & 1.2
    3202 \end{tabular}
    3203 \end{multicols}
    3204 
    3205 \paragraph{Mutual-Exclusion}
    3206 
    3207 Uncontented mutual exclusion, which frequently occurs, is measured by entering and leaving a critical section.
    3208 For monitors, entering and leaving a mutex function is measured, otherwise the language-appropriate mutex-lock is measured.
    3209 For comparison, a spinning (versus blocking) test-and-test-set lock is presented.
    3210 Figure~\ref{f:mutex} shows the code for \CFA with results in Table~\ref{t:mutex}.
    3211 Note the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects.
    3212 
    3213 \begin{multicols}{2}
    3214 \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
    3215 \begin{cfa}
    3216 @monitor@ M {} m1/*, m2, m3, m4*/;
    3217 call( M & @mutex p1/*, p2, p3, p4*/@ ) {}
    3218 int main() {
    3219         BENCH( for( N ) call( m1/*, m2, m3, m4*/ ); )
    3220         sout | result;
    3221 }
    3222 \end{cfa}
    3223 \captionof{figure}{\CFA acquire/release mutex benchmark}
    3224 \label{f:mutex}
    3225 
    3226 \columnbreak
    3227 
    3228 \vspace*{-16pt}
    3229 \captionof{table}{Mutex comparison (nanoseconds)}
    3230 \label{t:mutex}
    3231 \begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
    3232 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
    3233 test-and-test-set lock                  & 19.1  & 18.9  & 0.4   \\
    3234 \CFA @mutex@ function, 1 arg.   & 48.3  & 47.8  & 0.9   \\
    3235 \CFA @mutex@ function, 2 arg.   & 86.7  & 87.6  & 1.9   \\
    3236 \CFA @mutex@ function, 4 arg.   & 173.4 & 169.4 & 5.9   \\
    3237 \uC @monitor@ member rtn.               & 54.8  & 54.8  & 0.1   \\
    3238 Goroutine mutex lock                    & 34.0  & 34.0  & 0.0   \\
    3239 Rust mutex lock                                 & 33.0  & 33.2  & 0.8   \\
    3240 Java synchronized method                & 31.0  & 31.0  & 0.0   \\
    3241 Pthreads mutex Lock                             & 31.0  & 31.1  & 0.4
    3242 \end{tabular}
    3243 \end{multicols}
    3244 
    32453245
    32463246\subsection{Discussion}
Note: See TracChangeset for help on using the changeset viewer.