Changeset 8f079f0 for doc/papers


Ignore:
Timestamp:
Jun 23, 2019, 11:41:59 PM (5 years ago)
Author:
Peter A. Buhr <pabuhr@…>
Branches:
ADT, arm-eh, ast-experimental, enum, forall-pointer-decay, jacob/cs343-translation, jenkins-sandbox, master, new-ast, new-ast-unique-expr, pthread-emulation, qualifiedEnum
Children:
b60ed54
Parents:
f2f22e3
Message:

Thierry changes to draft

Location:
doc/papers
Files:
3 edited

Legend:

Unmodified
Added
Removed
  • doc/papers/AMA/AMA-stix/ama/WileyNJD-v2.cls

    rf2f22e3 r8f079f0  
    24442444     \@afterheading}
    24452445
    2446 \renewcommand\section{\@startsection{section}{1}{\z@}{-27pt \@plus -2pt \@minus -2pt}{12\p@}{\sectionfont}}%
    2447 \renewcommand\subsection{\@startsection{subsection}{2}{\z@}{-23pt \@plus -2pt \@minus -2pt}{5\p@}{\subsectionfont}}%
     2446\renewcommand\section{\@startsection{section}{1}{\z@}{-25pt \@plus -2pt \@minus -2pt}{12\p@}{\sectionfont}}%
     2447\renewcommand\subsection{\@startsection{subsection}{2}{\z@}{-22pt \@plus -2pt \@minus -2pt}{5\p@}{\subsectionfont}}%
    24482448\renewcommand\subsubsection{\@startsection{subsubsection}{3}{\z@}{-20pt \@plus -2pt \@minus -2pt}{2\p@}{\subsubsectionfont}}%
    24492449%
  • doc/papers/concurrency/Paper.tex

    rf2f22e3 r8f079f0  
    157157                __float80, float80, __float128, float128, forall, ftype, generator, _Generic, _Imaginary, __imag, __imag__,
    158158                inline, __inline, __inline__, __int128, int128, __label__, monitor, mutex, _Noreturn, one_t, or,
    159                 otype, restrict, __restrict, __restrict__, __signed, __signed__, _Static_assert, suspend, thread,
     159                otype, restrict, __restrict, __restrict__, __signed, __signed__, _Static_assert, thread,
    160160                _Thread_local, throw, throwResume, timeout, trait, try, ttype, typeof, __typeof, __typeof__,
    161161                virtual, __volatile, __volatile__, waitfor, when, with, zero_t},
     
    303303However, \Celeven concurrency is largely wrappers for a subset of the pthreads library~\cite{Butenhof97,Pthreads}, and \Celeven and pthreads concurrency is simple, based on thread fork/join in a function and a few locks, which is low-level and error prone;
    304304no high-level language concurrency features are defined.
    305 Interestingly, almost a decade after publication of the \Celeven standard, neither gcc-8, clang-8 nor msvc-19 (most recent versions) support the \Celeven include @threads.h@, indicating little interest in the C11 concurrency approach.
     305Interestingly, almost a decade after publication of the \Celeven standard, neither gcc-8, clang-9 nor msvc-19 (most recent versions) support the \Celeven include @threads.h@, indicating little interest in the C11 concurrency approach.
    306306Finally, while the \Celeven standard does not state a threading model, the historical association with pthreads suggests implementations would adopt kernel-level threading (1:1)~\cite{ThreadModel}.
    307307
     
    313313From 2000 onwards, languages like Go~\cite{Go}, Erlang~\cite{Erlang}, Haskell~\cite{Haskell}, D~\cite{D}, and \uC~\cite{uC++,uC++book} have championed the M:N user-threading model, and many user-threading libraries have appeared~\cite{Qthreads,MPC,Marcel}, including putting green threads back into Java~\cite{Quasar}.
    314314The main argument for user-level threading is that they are lighter weight than kernel threads (locking and context switching do not cross the kernel boundary), so there is less restriction on programming styles that encourage large numbers of threads performing medium work-units to facilitate load balancing by the runtime~\cite{Verch12}.
    315 As well, user-threading facilitates a simpler concurrency approach using thread objects that leverage sequential patterns versus events with call-backs~\cite{vonBehren03}.
     315As well, user-threading facilitates a simpler concurrency approach using thread objects that leverage sequential patterns versus events with call-backs~\cite{Adya02,vonBehren03}.
    316316Finally, performant user-threading implementations (both time and space) meet or exceed direct kernel-threading implementations, while achieving the programming advantages of high concurrency levels and safety.
    317317
     
    613613Finally, an explicit generator type provides both design and performance benefits, such as multiple type-safe interface functions taking and returning arbitrary types.
    614614\begin{cfa}
    615 int ?()( Fib & fib ) with( fib ) { return `resume( fib )`.fn; }   // function-call interface
    616 int ?()( Fib & fib, int N ) with( fib ) { for ( N - 1 ) `fib()`; return `fib()`; }   // use simple interface
    617 double ?()( Fib & fib ) with( fib ) { return (int)`fib()` / 3.14159; } // cast prevents recursive call
    618 sout | (int)f1() | (double)f1() | f2( 2 );   // simple interface, cast selects call based on return type, step 2 values
     615int ?()( Fib & fib ) { return `resume( fib )`.fn; } $\C[3.9in]{// function-call interface}$
     616int ?()( Fib & fib, int N ) { for ( N - 1 ) `fib()`; return `fib()`; } $\C{// use function-call interface to skip N values}$
     617double ?()( Fib & fib ) { return (int)`fib()` / 3.14159; } $\C{// different return type, cast prevents recursive call}\CRT$
     618sout | (int)f1() | (double)f1() | f2( 2 ); // alternative interface, cast selects call based on return type, step 2 values
    619619\end{cfa}
    620620Now, the generator can be a separately-compiled opaque-type only accessed through its interface functions.
     
    628628With respect to safety, we believe static analysis can discriminate local state from temporary variables in a generator, \ie variable usage spanning @suspend@, and generate a compile-time error.
    629629Finally, our current experience is that most generator problems have simple data state, including local state, but complex execution state, so the burden of creating the generator type is small.
    630 As well, C programmers are not afraid with this kind of semantic programming requirement, if it results in very small, fast generators.
     630As well, C programmers are not afraid of this kind of semantic programming requirement, if it results in very small, fast generators.
    631631
    632632Figure~\ref{f:CFAFormatGen} shows an asymmetric \newterm{input generator}, @Fmt@, for restructuring text into groups of characters of fixed-size blocks, \ie the input on the left is reformatted into the output on the right, where newlines are ignored.
     
    796796This semantics is basically a tail-call optimization, which compilers already perform.
    797797The example shows the assembly code to undo the generator's entry code before the direct jump.
    798 This assembly code depends on what entry code is generated, specifically if there are local variables and the level of optimization.
     798This assembly code depends on what entry code is generated, specifically if there are local variables, and the level of optimization.
    799799To provide this new calling convention requires a mechanism built into the compiler, which is beyond the scope of \CFA at this time.
    800800Nevertheless, it is possible to hand generate any symmetric generators for proof of concept and performance testing.
    801 A compiler could also eliminate other artifacts in the generator simulation to further increase performance.
     801A compiler could also eliminate other artifacts in the generator simulation to further increase performance, \eg LLVM has various coroutine support~\cite{CoroutineTS}, and \CFA can leverage this support should it fork @clang@.
    802802
    803803\begin{figure}
     
    12131213Hence, a compromise solution is necessary that works for asymmetric (acyclic) and symmetric (cyclic) coroutines.
    12141214
    1215 Our solution for coroutine termination works well for the most common asymmetric and symmetric coroutine usage-patterns.
     1215Our solution is to context switch back to the first resumer (starter) once the coroutine ends.
     1216This semantics works well for the most common asymmetric and symmetric coroutine usage-patterns.
    12161217For asymmetric coroutines, it is common for the first resumer (starter) coroutine to be the only resumer.
    12171218All previous generators converted to coroutines have this property.
     
    12451246
    12461247
    1247 \subsection{(Generator) Coroutine Implementation}
     1248\subsection{Generator / Coroutine Implementation}
    12481249
    12491250A significant implementation challenge for generators/coroutines (and threads in Section~\ref{s:threads}) is adding extra fields to the custom types and related functions, \eg inserting code after/before the coroutine constructor/destructor and @main@ to create/initialize/de-initialize/destroy any extra fields, \eg stack.
     
    12541255class myCoroutine inherits baseCoroutine { ... }
    12551256\end{cfa}
    1256 The problem is that the programming language and its tool chain, \eg debugger, @valgrind@, need to understand @baseCoroutine@ because it infers special property, so type @baseCoroutine@ becomes a de facto keyword and all types inheriting from it are implicitly custom types.
    1257 As well, some special properties are not handled by existing language semantics, \eg the execution of constructors/destructors is in the wrong order to implicitly start threads because the thread must start \emph{after} all constructors as it relies on a completely initialized object, but the inherited constructor runs \emph{before} the derived.
     1257% The problem is that the programming language and its tool chain, \eg debugger, @valgrind@, need to understand @baseCoroutine@ because it infers special property, so type @baseCoroutine@ becomes a de facto keyword and all types inheriting from it are implicitly custom types.
     1258The problem is that some special properties are not handled by existing language semantics, \eg the execution of constructors/destructors is in the wrong order to implicitly start threads because the thread must start \emph{after} all constructors as it relies on a completely initialized object, but the inherited constructor runs \emph{before} the derived.
    12581259Alternatives, such as explicitly starting threads as in Java, are repetitive and forgetting to call start is a common source of errors.
    12591260An alternative is composition:
     
    12671268However, there is nothing preventing wrong placement or multiple declarations.
    12681269
    1269 \CFA custom types make any special properties explicit to the language and its tool chain, \eg the language code-generator knows where to inject code and when it is unsafe to perform certain optimizations, and IDEs using simple parsing can find and manipulate types with special properties.
     1270\CFA custom types make any special properties explicit to the language and its tool chain, \eg the language code-generator knows where to inject code
     1271% and when it is unsafe to perform certain optimizations,
     1272and IDEs using simple parsing can find and manipulate types with special properties.
    12701273The downside of this approach is that it makes custom types a special case in the language.
    12711274Users wanting to extend custom types or build their own can only do so in ways offered by the language.
     
    12821285\end{cfa}
    12831286Note, copying generators/coroutines/threads is not meaningful.
    1284 For example, a coroutine retains its last resumer and suspends back to it;
    1285 having a copy also suspend back to the same resumer is undefined semantics.
     1287For example, both the resumer and suspender descriptors can have bi-directional pointers;
     1288copying these coroutines does not update the internal pointers so behaviour of both copies would be difficult to understand.
    12861289Furthermore, two coroutines cannot logically execute on the same stack.
    12871290A deep coroutine copy, which copies the stack, is also meaningless in an unmanaged language (no garbage collection), like C, because the stack may contain pointers to object within it that require updating for the copy.
    12881291The \CFA @dtype@ property provides no \emph{implicit} copying operations and the @is_coroutine@ trait provides no \emph{explicit} copying operations, so all coroutines must be passed by reference (pointer).
    1289 The function definitions ensures there is a statically-typed @main@ function that is the starting point (first stack frame) of a coroutine, and a mechanism to get (read) the currently executing coroutine handle.
     1292The function definitions ensures there is a statically-typed @main@ function that is the starting point (first stack frame) of a coroutine, and a mechanism to get (read) the coroutine descriptor from its handle.
    12901293The @main@ function has no return value or additional parameters because the coroutine type allows an arbitrary number of interface functions with corresponding arbitrary typed input/output values versus fixed ones.
    12911294The advantage of this approach is that users can easily create different types of coroutines, \eg changing the memory layout of a coroutine is trivial when implementing the @get_coroutine@ function, and possibly redefining \textsf{suspend} and @resume@.
     
    14931496\end{tabular}
    14941497\end{cquote}
    1495 Like coroutines, the @dtype@ property prevents \emph{implicit} copy operations and the @is_coroutine@ trait provides no \emph{explicit} copy operations, so threads must be passed by reference (pointer).
    1496 Similarly, the function definitions ensures there is a statically-typed @main@ function that is the thread starting point (first stack frame), a mechanism to get (read) the currently executing thread handle, and a special destructor to prevent deallocation while the thread is executing.
     1498Like coroutines, the @dtype@ property prevents \emph{implicit} copy operations and the @is_thread@ trait provides no \emph{explicit} copy operations, so threads must be passed by reference (pointer).
     1499Similarly, the function definitions ensures there is a statically-typed @main@ function that is the thread starting point (first stack frame), a mechanism to get (read) the thread descriptor from its handle, and a special destructor to prevent deallocation while the thread is executing.
    14971500(The qualifier @mutex@ for the destructor parameter is discussed in Section~\ref{s:Monitor}.)
    14981501The difference between the coroutine and thread is that a coroutine borrows a thread from its caller, so the first thread resuming a coroutine creates the coroutine's stack and starts running the coroutine main on the stack;
     
    16171620% Copying a lock is insecure because it is possible to copy an open lock and then use the open copy when the original lock is closed to simultaneously access the shared data.
    16181621% Copying a monitor is secure because both the lock and shared data are copies, but copying the shared data is meaningless because it no longer represents a unique entity.
    1619 Similarly, the function definitions ensures there is a mechanism to get (read) the currently executing monitor handle, and a special destructor to prevent deallocation if a thread using the shared data.
     1622Similarly, the function definitions ensures there is a mechanism to get (read) the monitor descriptor from its handle, and a special destructor to prevent deallocation if a thread using the shared data.
    16201623The custom monitor type also inserts any locks needed to implement the mutual exclusion semantics.
    16211624
     
    16561659called \newterm{bulk acquire}.
    16571660\CFA guarantees acquisition order is consistent across calls to @mutex@ functions using the same monitors as arguments, so acquiring multiple monitors is safe from deadlock.
    1658 Figure~\ref{f:BankTransfer} shows a trivial solution to the bank transfer problem, where two resources must be locked simultaneously, using \CFA monitors with implicit locking and \CC with explicit locking.
     1661Figure~\ref{f:BankTransfer} shows a trivial solution to the bank transfer problem~\cite{BankTransfer}, where two resources must be locked simultaneously, using \CFA monitors with implicit locking and \CC with explicit locking.
    16591662A \CFA programmer only has to manage when to acquire mutual exclusion;
    16601663a \CC programmer must select the correct lock and acquisition mechanism from a panoply of locking options.
     
    18001803Figure~\ref{f:MonitorScheduling} shows general internal/external scheduling (for the bounded-buffer example in Figure~\ref{f:InternalExternalScheduling}).
    18011804External calling threads block on the calling queue, if the monitor is occupied, otherwise they enter in FIFO order.
    1802 Internal threads block on condition queues via @wait@ and they reenter from the condition in FIFO order, or they block on urgent via @signal_block@ or @waitfor@ and reenter implicit when the monitor becomes empty, \ie, the thread in the monitor exits or waits.
     1805Internal threads block on condition queues via @wait@ and reenter from the condition in FIFO order.
     1806Alternatively, internal threads block on urgent from the @signal_block@ or @waitfor@, and reenter implicitly when the monitor becomes empty, \ie, the thread in the monitor exits or waits.
    18031807
    18041808There are three signalling mechanisms to unblock waiting threads to enter the monitor.
    1805 Note, signalling cannot have the signaller and signalled thread in the monitor simultaneously because of the mutual exclusion so only one can proceed.
     1809Note, signalling cannot have the signaller and signalled thread in the monitor simultaneously because of the mutual exclusion, so either the signaller or signallee can proceed.
    18061810For internal scheduling, threads are unblocked from condition queues using @signal@, where the signallee is moved to urgent and the signaller continues (solid line).
    18071811Multiple signals move multiple signallees to urgent, until the condition is empty.
     
    18431847It is common to declare condition variables as monitor fields to prevent shared access, hence no locking is required for access as the conditions are protected by the monitor lock.
    18441848In \CFA, a condition variable can be created/stored independently.
    1845 To still prevent expensive locking on access, a condition variable is tied to a \emph{group} of monitors on first use, called \newterm{branding}, resulting in a low-cost boolen test to detect sharing from other monitors.
     1849% To still prevent expensive locking on access, a condition variable is tied to a \emph{group} of monitors on first use, called \newterm{branding}, resulting in a low-cost boolean test to detect sharing from other monitors.
    18461850
    18471851% Signalling semantics cannot have the signaller and signalled thread in the monitor simultaneously, which means:
     
    18521856% The signalling thread continues and the signalled thread is marked for urgent unblocking at the next scheduling point (exit/wait).
    18531857% \item
    1854 % The signalling thread blocks but is marked for urgrent unblocking at the next scheduling point and the signalled thread continues.
     1858% The signalling thread blocks but is marked for urgent unblocking at the next scheduling point and the signalled thread continues.
    18551859% \end{enumerate}
    18561860% The first approach is too restrictive, as it precludes solving a reasonable class of problems, \eg dating service (see Figure~\ref{f:DatingService}).
     
    19611965External scheduling is controlled by the @waitfor@ statement, which atomically blocks the calling thread, releases the monitor lock, and restricts the function calls that can next acquire mutual exclusion.
    19621966If the buffer is full, only calls to @remove@ can acquire the buffer, and if the buffer is empty, only calls to @insert@ can acquire the buffer.
    1963 Calls threads to functions that are currently excluded block outside of (external to) the monitor on the calling queue, versus blocking on condition queues inside of (internal to) the monitor.
     1967Threads calling excluded functions block outside of (external to) the monitor on the calling queue, versus blocking on condition queues inside of (internal to) the monitor.
    19641968Figure~\ref{f:RWExt} shows a readers/writer lock written using external scheduling, where a waiting reader detects a writer using the resource and restricts further calls until the writer exits by calling @EndWrite@.
    19651969The writer does a similar action for each reader or writer using the resource.
     
    20762080For @wait( e )@, the default semantics is to atomically block the signaller and release all acquired mutex parameters, \ie @wait( e, m1, m2 )@.
    20772081To override the implicit multi-monitor wait, specific mutex parameter(s) can be specified, \eg @wait( e, m1 )@.
    2078 Wait statically verifies the released monitors are the acquired mutex-parameters so unconditional release is safe.
     2082Wait cannot statically verifies the released monitors are the acquired mutex-parameters without disallowing separately compiled helper functions calling @wait@.
    20792083While \CC supports bulk locking, @wait@ only accepts a single lock for a condition variable, so bulk locking with condition variables is asymmetric.
    20802084Finally, a signaller,
     
    20882092Similarly, for @waitfor( rtn )@, the default semantics is to atomically block the acceptor and release all acquired mutex parameters, \ie @waitfor( rtn, m1, m2 )@.
    20892093To override the implicit multi-monitor wait, specific mutex parameter(s) can be specified, \eg @waitfor( rtn, m1 )@.
    2090 @waitfor@ statically verifies the released monitors are the same as the acquired mutex-parameters of the given function or function pointer.
    2091 To statically verify the released monitors match with the accepted function's mutex parameters, the function (pointer) prototype must be accessible.
     2094@waitfor@ does statically verify the monitor types passed are the same as the acquired mutex-parameters of the given function or function pointer, hence the function (pointer) prototype must be accessible.
    20922095% When an overloaded function appears in an @waitfor@ statement, calls to any function with that name are accepted.
    20932096% The rationale is that members with the same name should perform a similar function, and therefore, all should be eligible to accept a call.
     
    21482151The right example accepts either @mem1@ or @mem2@ if @C1@ and @C2@ are true.
    21492152
    2150 An interesting use of @waitfor@ is accepting the @mutex@ destructor to know when an object is deallocated.
    2151 \begin{cfa}
    2152 void insert( Buffer(T) & mutex buffer, T elem ) with( buffer ) {
    2153         if ( count == 10 )
    2154                 waitfor( remove, buffer ) {
    2155                         // insert elem into buffer
    2156                 } or `waitfor( ^?{}, buffer )` throw insertFail;
    2157 }
    2158 \end{cfa}
    2159 When the buffer is deallocated, the current waiter is unblocked and informed, so it can perform an appropriate action.
    2160 However, the basic @waitfor@ semantics do not support this functionality, since using an object after its destructor is called is undefined.
    2161 Therefore, to make this useful capability work, the semantics for accepting the destructor is the same as @signal@, \ie the call to the destructor is placed on the urgent queue and the acceptor continues execution, which throws an exception to the acceptor and then the caller is unblocked from the urgent queue to deallocate the object.
    2162 Accepting the destructor is the idiomatic way to terminate a thread in \CFA.
     2153An interesting use of @waitfor@ is accepting the @mutex@ destructor to know when an object is deallocated, \eg assume the bounded buffer is restructred from a monitor to a thread with the following @main@.
     2154\begin{cfa}
     2155void main( Buffer(T) & buffer ) with(buffer) {
     2156        for () {
     2157                `waitfor( ^?{}, buffer )` break;
     2158                or when ( count != 20 ) waitfor( insert, buffer ) { ... }
     2159                or when ( count != 0 ) waitfor( remove, buffer ) { ... }
     2160        }
     2161        // clean up
     2162}
     2163\end{cfa}
     2164When the program main deallocates the buffer, it first calls the buffer's destructor, which is accepted, the destructor runs, and the buffer is deallocated.
     2165However, the buffer thread cannot continue after the destructor call because the object is gone;
     2166hence, clean up in @main@ cannot occur, which means destructors for local objects are not run.
     2167To make this useful capability work, the semantics for accepting the destructor is the same as @signal@, \ie the destructor call is placed on urgent and the acceptor continues execution, which ends the loop, cleans up, and the thread terminates.
     2168Then, the destructor caller unblocks from urgent to deallocate the object.
     2169Accepting the destructor is the idiomatic way in \CFA to terminate a thread performing direct communication.
    21632170
    21642171
     
    23572364
    23582365struct Msg { int i, j; };
    2359 thread Gortn { int i;  float f;  Msg m; };
    2360 void mem1( Gortn & mutex gortn, int i ) { gortn.i = i; }
    2361 void mem2( Gortn & mutex gortn, float f ) { gortn.f = f; }
    2362 void mem3( Gortn & mutex gortn, Msg m ) { gortn.m = m; }
    2363 void ^?{}( Gortn & mutex ) {}
    2364 
    2365 void main( Gortn & gortn ) with( gortn ) {  // thread starts
     2366thread GoRtn { int i;  float f;  Msg m; };
     2367void mem1( GoRtn & mutex gortn, int i ) { gortn.i = i; }
     2368void mem2( GoRtn & mutex gortn, float f ) { gortn.f = f; }
     2369void mem3( GoRtn & mutex gortn, Msg m ) { gortn.m = m; }
     2370void ^?{}( GoRtn & mutex ) {}
     2371
     2372void main( GoRtn & gortn ) with( gortn ) {  // thread starts
    23662373
    23672374        for () {
     
    23762383}
    23772384int main() {
    2378         Gortn gortn; $\C[2.0in]{// start thread}$
     2385        GoRtn gortn; $\C[2.0in]{// start thread}$
    23792386        `mem1( gortn, 0 );` $\C{// different calls}\CRT$
    23802387        `mem2( gortn, 2.5 );`
     
    25342541% However, preemption is necessary for fairness and to reduce tail-latency.
    25352542% For concurrency that relies on spinning, if all cores spin the system is livelocked, whereas preemption breaks the livelock.
    2536 %
    2537 %
    2538 % \subsection{Thread Pools}
    2539 %
    2540 % In contrast to direct threading is indirect \newterm{thread pools}, \eg Java @executor@, where small jobs (work units) are inserted into a work pool for execution.
    2541 % If the jobs are dependent, \ie interact, there is an implicit/explicit dependency graph that ties them together.
    2542 % While removing direct concurrency, and hence the amount of context switching, thread pools significantly limit the interaction that can occur among jobs.
    2543 % Indeed, jobs should not block because that also blocks the underlying thread, which effectively means the CPU utilization, and therefore throughput, suffers.
    2544 % While it is possible to tune the thread pool with sufficient threads, it becomes difficult to obtain high throughput and good core utilization as job interaction increases.
    2545 % As well, concurrency errors return, which threads pools are suppose to mitigate.
     2543
     2544
     2545\begin{comment}
     2546\subsection{Thread Pools}
     2547
     2548In contrast to direct threading is indirect \newterm{thread pools}, \eg Java @executor@, where small jobs (work units) are inserted into a work pool for execution.
     2549If the jobs are dependent, \ie interact, there is an implicit/explicit dependency graph that ties them together.
     2550While removing direct concurrency, and hence the amount of context switching, thread pools significantly limit the interaction that can occur among jobs.
     2551Indeed, jobs should not block because that also blocks the underlying thread, which effectively means the CPU utilization, and therefore throughput, suffers.
     2552While it is possible to tune the thread pool with sufficient threads, it becomes difficult to obtain high throughput and good core utilization as job interaction increases.
     2553As well, concurrency errors return, which threads pools are suppose to mitigate.
     2554
     2555\begin{figure}
     2556\centering
     2557\begin{tabular}{@{}l|l@{}}
     2558\begin{cfa}
     2559struct Adder {
     2560    int * row, cols;
     2561};
     2562int operator()() {
     2563        subtotal = 0;
     2564        for ( int c = 0; c < cols; c += 1 )
     2565                subtotal += row[c];
     2566        return subtotal;
     2567}
     2568void ?{}( Adder * adder, int row[$\,$], int cols, int & subtotal ) {
     2569        adder.[rows, cols, subtotal] = [rows, cols, subtotal];
     2570}
     2571
     2572
     2573
     2574
     2575\end{cfa}
     2576&
     2577\begin{cfa}
     2578int main() {
     2579        const int rows = 10, cols = 10;
     2580        int matrix[rows][cols], subtotals[rows], total = 0;
     2581        // read matrix
     2582        Executor executor( 4 ); // kernel threads
     2583        Adder * adders[rows];
     2584        for ( r; rows ) { // send off work for executor
     2585                adders[r] = new( matrix[r], cols, &subtotal[r] );
     2586                executor.send( *adders[r] );
     2587        }
     2588        for ( r; rows ) {       // wait for results
     2589                delete( adders[r] );
     2590                total += subtotals[r];
     2591        }
     2592        sout | total;
     2593}
     2594\end{cfa}
     2595\end{tabular}
     2596\caption{Executor}
     2597\end{figure}
     2598\end{comment}
    25462599
    25472600
     
    25672620The purpose of a cluster is to control the amount of parallelism that is possible among threads, plus scheduling and other execution defaults.
    25682621The default cluster-scheduler is single-queue multi-server, which provides automatic load-balancing of threads on processors.
    2569 However, the scheduler is pluggable, supporting alternative schedulers, such as multi-queue multi-server, with work-stealing/sharing across the virtual processors.
     2622However, the design allows changing the scheduler, \eg multi-queue multi-server with work-stealing/sharing across the virtual processors.
    25702623If several clusters exist, both threads and virtual processors, can be explicitly migrated from one cluster to another.
    25712624No automatic load balancing among clusters is performed by \CFA.
     
    25742627The user cluster is created to contain the application user-threads.
    25752628Having all threads execute on the one cluster often maximizes utilization of processors, which minimizes runtime.
    2576 However, because of limitations of the underlying operating system, heterogeneous hardware, or scheduling requirements (real-time), multiple clusters are sometimes necessary.
     2629However, because of limitations of scheduling requirements (real-time), NUMA architecture, heterogeneous hardware, or issues with the underlying operating system, multiple clusters are sometimes necessary.
    25772630
    25782631
     
    26182671\subsection{Preemption}
    26192672
    2620 Nondeterministic preemption provides fairness from long running threads, and forces concurrent programmers to write more robust programs, rather than relying on section of code between cooperative scheduling to be atomic.
    2621 A separate reason for not supporting preemption is that it significantly complicates the runtime system.
     2673Nondeterministic preemption provides fairness from long running threads, and forces concurrent programmers to write more robust programs, rather than relying on code between cooperative scheduling to be atomic.
     2674This atomic reliance can fail on multi-core machines, because execution across cores is nondeterministic.
     2675A different reason for not supporting preemption is that it significantly complicates the runtime system, \eg Microsoft runtime does not support interrupts and on Linux systems, interrupts are complex (see below).
    26222676Preemption is normally handled by setting a count-down timer on each virtual processor.
    26232677When the timer expires, an interrupt is delivered, and the interrupt handler resets the count-down timer, and if the virtual processor is executing in user code, the signal handler performs a user-level context-switch, or if executing in the language runtime-kernel, the preemption is ignored or rolled forward to the point where the runtime kernel context switches back to user code.
     
    26282682Because preemption frequency is usually long (1 millisecond) performance cost is negligible.
    26292683
    2630 However, on current Linux systems:
     2684Linux switched a decade ago from specific to arbitrary process signal-delivery for applications with multiple kernel threads.
    26312685\begin{cquote}
    26322686A process-directed signal may be delivered to any one of the threads that does not currently have the signal blocked.
     
    26342688SIGNAL(7) - Linux Programmer's Manual
    26352689\end{cquote}
    2636 Hence, the timer-expiry signal, which is generated \emph{externally} by the Linux kernel to the Linux process, is delivered to any of its Linux subprocesses (kernel threads).
    2637 To ensure each virtual processor receives its own preemption signals, a discrete-event simulation is run on a special virtual processor, and only it sets and receives timer events.
     2690Hence, the timer-expiry signal, which is generated \emph{externally} by the Linux kernel to an application, is delivered to any of its Linux subprocesses (kernel threads).
     2691To ensure each virtual processor receives a preemption signal, a discrete-event simulation is run on a special virtual processor, and only it sets and receives timer events.
    26382692Virtual processors register an expiration time with the discrete-event simulator, which is inserted in sorted order.
    26392693The simulation sets the count-down timer to the value at the head of the event list, and when the timer expires, all events less than or equal to the current time are processed.
     
    26522706
    26532707To verify the implementation of the \CFA runtime, a series of microbenchmarks are performed comparing \CFA with Java OpenJDK-9, Go 1.9.2 and \uC 7.0.0.
    2654 The benchmark computer is an AMD Opteron\texttrademark\ 6380 NUMA 64-core, 8 socket, 2.5 GHz processor, running Ubuntu 16.04.3 LTS and \uC and \CFA are compiled with gcc 6.3.
     2708The benchmark computer is an AMD Opteron\texttrademark\ 6380 NUMA 64-core, 8 socket, 2.5 GHz processor, running Ubuntu 16.04.6 LTS, and \uC/\CFA are compiled with gcc 6.5.
    26552709
    26562710\begin{comment}
     
    27072761\lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
    27082762\begin{cfa}
    2709 thread MyThread {};
    2710 void main( MyThread & ) {}
     2763@thread@ MyThread {};
     2764void @main@( MyThread & ) {}
    27112765int main() {
    27122766        BENCH( for ( N ) { @MyThread m;@ } )
     
    27502804\lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
    27512805\begin{cfa}[aboveskip=0pt,belowskip=0pt]
    2752 coroutine C {} c;
     2806@coroutine@ C {} c;
    27532807void main( C & ) { for ( ;; ) { @suspend;@ } }
    27542808int main() { // coroutine test
     
    27712825\begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
    27722826\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
    2773 Kernel Thread   & 333.5 & 332.96        & 4.1   \\
     2827C function              & 2                     & 2             & 0             \\
     2828\CFA generator  & 2                     & 2             & 0             \\
    27742829\CFA Coroutine  & 49    & 48.68         & 0.47  \\
    27752830\CFA Thread             & 105   & 105.57        & 1.37  \\
     
    27772832\uC Thread              & 100   & 99.29         & 0.96  \\
    27782833Goroutine               & 145   & 147.25        & 4.15  \\
    2779 Java Thread             & 373.5 & 375.14        & 8.72
     2834Java Thread             & 373.5 & 375.14        & 8.72  \\
     2835Pthreads Thread & 333.5 & 332.96        & 4.1
    27802836\end{tabular}
    27812837\end{multicols}
     
    27932849\lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
    27942850\begin{cfa}
    2795 monitor M {} m1/*, m2, m3, m4*/;
     2851@monitor@ M {} m1/*, m2, m3, m4*/;
    27962852void __attribute__((noinline))
    2797 do_call( M & mutex m/*, m2, m3, m4*/ ) {}
     2853do_call( M & @mutex m/*, m2, m3, m4*/@ ) {}
    27982854int main() {
    27992855        BENCH(
    2800                 for( N ) @do_call( m1/*, m2, m3, m4*/ );@
     2856                for( N ) do_call( m1/*, m2, m3, m4*/ );
    28012857        )
    28022858        sout | result`ns;
     
    28132869\begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
    28142870\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
    2815 C function                                              & 2                     & 2             & 0             \\
    2816 FetchAdd + FetchSub                             & 26            & 26    & 0             \\
     2871test and test-and-test lock             & 26            & 26    & 0             \\
    28172872Pthreads Mutex Lock                             & 31            & 31.71 & 0.97  \\
    28182873\uC @monitor@ member rtn.               & 31            & 31    & 0             \\
     
    28202875\CFA @mutex@ function, 2 arg.   & 84            & 85.36 & 1.99  \\
    28212876\CFA @mutex@ function, 4 arg.   & 158           & 161   & 4.22  \\
    2822 Java synchronized function              & 27.5          & 29.79 & 2.93
     2877Java synchronized method                & 27.5          & 29.79 & 2.93
    28232878\end{tabular}
    28242879\end{multicols}
     
    28362891\begin{cfa}
    28372892volatile int go = 0;
    2838 monitor M { condition c; } m;
     2893@monitor@ M { @condition c;@ } m;
    28392894void __attribute__((noinline))
    2840 do_call( M & mutex a1 ) { @signal( c );@ }
     2895do_call( M & @mutex@ a1 ) { @signal( c );@ }
    28412896thread T {};
    28422897void main( T & this ) {
     
    28692924\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
    28702925Pthreads Cond. Variable         & 6005          & 5681.43       & 835.45        \\
    2871 \uC @signal@                            & 324           & 325.54        & 3,02          \\
     2926\uC @signal@                            & 324           & 325.54        & 3.02          \\
    28722927\CFA @signal@, 1 @monitor@      & 368.5         & 370.61        & 4.77          \\
    28732928\CFA @signal@, 2 @monitor@      & 467           & 470.5         & 6.79          \\
     
    28892944\begin{cfa}
    28902945volatile int go = 0;
    2891 monitor M {} m;
     2946@monitor@ M {} m;
    28922947thread T {};
    28932948void __attribute__((noinline))
    2894 do_call( M & mutex ) {}
     2949do_call( M & @mutex@ ) {}
    28952950void main( T & ) {
    28962951        while ( go == 0 ) { yield(); }
    2897         while ( go == 1 ) { @do_call( m );@ }
     2952        while ( go == 1 ) { do_call( m ); }
    28982953}
    28992954int __attribute__((noinline))
    2900 do_wait( M & mutex m ) {
     2955do_wait( M & @mutex@ m ) {
    29012956        go = 1; // continue other thread
    29022957        BENCH( for ( N ) { @waitfor( do_call, m );@ } )
  • doc/papers/concurrency/annex/local.bib

    rf2f22e3 r8f079f0  
    6666}
    6767
    68 @article{BankTransfer,
     68@misc{BankTransfer,
    6969        key     = {Bank Transfer},
    7070        keywords        = {Bank Transfer},
    7171        title   = {Bank Account Transfer Problem},
    72         publisher       = {Wiki Wiki Web},
    73         address = {http://wiki.c2.com},
     72        howpublished    = {Wiki Wiki Web, \url{http://wiki.c2.com/?BankAccountTransferProblem}},
    7473        year            = 2010
    7574}
Note: See TracChangeset for help on using the changeset viewer.