Ignore:
File:
1 edited

Legend:

Unmodified
Added
Removed
  • doc/papers/concurrency/Paper.tex

    r8f079f0 r4487667  
    157157                __float80, float80, __float128, float128, forall, ftype, generator, _Generic, _Imaginary, __imag, __imag__,
    158158                inline, __inline, __inline__, __int128, int128, __label__, monitor, mutex, _Noreturn, one_t, or,
    159                 otype, restrict, __restrict, __restrict__, __signed, __signed__, _Static_assert, thread,
     159                otype, restrict, __restrict, __restrict__, __signed, __signed__, _Static_assert, suspend, thread,
    160160                _Thread_local, throw, throwResume, timeout, trait, try, ttype, typeof, __typeof, __typeof__,
    161161                virtual, __volatile, __volatile__, waitfor, when, with, zero_t},
     
    303303However, \Celeven concurrency is largely wrappers for a subset of the pthreads library~\cite{Butenhof97,Pthreads}, and \Celeven and pthreads concurrency is simple, based on thread fork/join in a function and a few locks, which is low-level and error prone;
    304304no high-level language concurrency features are defined.
    305 Interestingly, almost a decade after publication of the \Celeven standard, neither gcc-8, clang-9 nor msvc-19 (most recent versions) support the \Celeven include @threads.h@, indicating little interest in the C11 concurrency approach.
     305Interestingly, almost a decade after publication of the \Celeven standard, neither gcc-8, clang-8 nor msvc-19 (most recent versions) support the \Celeven include @threads.h@, indicating little interest in the C11 concurrency approach.
    306306Finally, while the \Celeven standard does not state a threading model, the historical association with pthreads suggests implementations would adopt kernel-level threading (1:1)~\cite{ThreadModel}.
    307307
     
    313313From 2000 onwards, languages like Go~\cite{Go}, Erlang~\cite{Erlang}, Haskell~\cite{Haskell}, D~\cite{D}, and \uC~\cite{uC++,uC++book} have championed the M:N user-threading model, and many user-threading libraries have appeared~\cite{Qthreads,MPC,Marcel}, including putting green threads back into Java~\cite{Quasar}.
    314314The main argument for user-level threading is that they are lighter weight than kernel threads (locking and context switching do not cross the kernel boundary), so there is less restriction on programming styles that encourage large numbers of threads performing medium work-units to facilitate load balancing by the runtime~\cite{Verch12}.
    315 As well, user-threading facilitates a simpler concurrency approach using thread objects that leverage sequential patterns versus events with call-backs~\cite{Adya02,vonBehren03}.
     315As well, user-threading facilitates a simpler concurrency approach using thread objects that leverage sequential patterns versus events with call-backs~\cite{vonBehren03}.
    316316Finally, performant user-threading implementations (both time and space) meet or exceed direct kernel-threading implementations, while achieving the programming advantages of high concurrency levels and safety.
    317317
     
    613613Finally, an explicit generator type provides both design and performance benefits, such as multiple type-safe interface functions taking and returning arbitrary types.
    614614\begin{cfa}
    615 int ?()( Fib & fib ) { return `resume( fib )`.fn; } $\C[3.9in]{// function-call interface}$
    616 int ?()( Fib & fib, int N ) { for ( N - 1 ) `fib()`; return `fib()`; } $\C{// use function-call interface to skip N values}$
    617 double ?()( Fib & fib ) { return (int)`fib()` / 3.14159; } $\C{// different return type, cast prevents recursive call}\CRT$
    618 sout | (int)f1() | (double)f1() | f2( 2 ); // alternative interface, cast selects call based on return type, step 2 values
     615int ?()( Fib & fib ) with( fib ) { return `resume( fib )`.fn; }   // function-call interface
     616int ?()( Fib & fib, int N ) with( fib ) { for ( N - 1 ) `fib()`; return `fib()`; }   // use simple interface
     617double ?()( Fib & fib ) with( fib ) { return (int)`fib()` / 3.14159; } // cast prevents recursive call
     618sout | (int)f1() | (double)f1() | f2( 2 );   // simple interface, cast selects call based on return type, step 2 values
    619619\end{cfa}
    620620Now, the generator can be a separately-compiled opaque-type only accessed through its interface functions.
     
    628628With respect to safety, we believe static analysis can discriminate local state from temporary variables in a generator, \ie variable usage spanning @suspend@, and generate a compile-time error.
    629629Finally, our current experience is that most generator problems have simple data state, including local state, but complex execution state, so the burden of creating the generator type is small.
    630 As well, C programmers are not afraid of this kind of semantic programming requirement, if it results in very small, fast generators.
     630As well, C programmers are not afraid with this kind of semantic programming requirement, if it results in very small, fast generators.
    631631
    632632Figure~\ref{f:CFAFormatGen} shows an asymmetric \newterm{input generator}, @Fmt@, for restructuring text into groups of characters of fixed-size blocks, \ie the input on the left is reformatted into the output on the right, where newlines are ignored.
     
    796796This semantics is basically a tail-call optimization, which compilers already perform.
    797797The example shows the assembly code to undo the generator's entry code before the direct jump.
    798 This assembly code depends on what entry code is generated, specifically if there are local variables, and the level of optimization.
     798This assembly code depends on what entry code is generated, specifically if there are local variables and the level of optimization.
    799799To provide this new calling convention requires a mechanism built into the compiler, which is beyond the scope of \CFA at this time.
    800800Nevertheless, it is possible to hand generate any symmetric generators for proof of concept and performance testing.
    801 A compiler could also eliminate other artifacts in the generator simulation to further increase performance, \eg LLVM has various coroutine support~\cite{CoroutineTS}, and \CFA can leverage this support should it fork @clang@.
     801A compiler could also eliminate other artifacts in the generator simulation to further increase performance.
    802802
    803803\begin{figure}
     
    12131213Hence, a compromise solution is necessary that works for asymmetric (acyclic) and symmetric (cyclic) coroutines.
    12141214
    1215 Our solution is to context switch back to the first resumer (starter) once the coroutine ends.
    1216 This semantics works well for the most common asymmetric and symmetric coroutine usage-patterns.
     1215Our solution for coroutine termination works well for the most common asymmetric and symmetric coroutine usage-patterns.
    12171216For asymmetric coroutines, it is common for the first resumer (starter) coroutine to be the only resumer.
    12181217All previous generators converted to coroutines have this property.
     
    12461245
    12471246
    1248 \subsection{Generator / Coroutine Implementation}
     1247\subsection{(Generator) Coroutine Implementation}
    12491248
    12501249A significant implementation challenge for generators/coroutines (and threads in Section~\ref{s:threads}) is adding extra fields to the custom types and related functions, \eg inserting code after/before the coroutine constructor/destructor and @main@ to create/initialize/de-initialize/destroy any extra fields, \eg stack.
     
    12551254class myCoroutine inherits baseCoroutine { ... }
    12561255\end{cfa}
    1257 % The problem is that the programming language and its tool chain, \eg debugger, @valgrind@, need to understand @baseCoroutine@ because it infers special property, so type @baseCoroutine@ becomes a de facto keyword and all types inheriting from it are implicitly custom types.
    1258 The problem is that some special properties are not handled by existing language semantics, \eg the execution of constructors/destructors is in the wrong order to implicitly start threads because the thread must start \emph{after} all constructors as it relies on a completely initialized object, but the inherited constructor runs \emph{before} the derived.
     1256The problem is that the programming language and its tool chain, \eg debugger, @valgrind@, need to understand @baseCoroutine@ because it infers special property, so type @baseCoroutine@ becomes a de facto keyword and all types inheriting from it are implicitly custom types.
     1257As well, some special properties are not handled by existing language semantics, \eg the execution of constructors/destructors is in the wrong order to implicitly start threads because the thread must start \emph{after} all constructors as it relies on a completely initialized object, but the inherited constructor runs \emph{before} the derived.
    12591258Alternatives, such as explicitly starting threads as in Java, are repetitive and forgetting to call start is a common source of errors.
    12601259An alternative is composition:
     
    12681267However, there is nothing preventing wrong placement or multiple declarations.
    12691268
    1270 \CFA custom types make any special properties explicit to the language and its tool chain, \eg the language code-generator knows where to inject code
    1271 % and when it is unsafe to perform certain optimizations,
    1272 and IDEs using simple parsing can find and manipulate types with special properties.
     1269\CFA custom types make any special properties explicit to the language and its tool chain, \eg the language code-generator knows where to inject code and when it is unsafe to perform certain optimizations, and IDEs using simple parsing can find and manipulate types with special properties.
    12731270The downside of this approach is that it makes custom types a special case in the language.
    12741271Users wanting to extend custom types or build their own can only do so in ways offered by the language.
     
    12851282\end{cfa}
    12861283Note, copying generators/coroutines/threads is not meaningful.
    1287 For example, both the resumer and suspender descriptors can have bi-directional pointers;
    1288 copying these coroutines does not update the internal pointers so behaviour of both copies would be difficult to understand.
     1284For example, a coroutine retains its last resumer and suspends back to it;
     1285having a copy also suspend back to the same resumer is undefined semantics.
    12891286Furthermore, two coroutines cannot logically execute on the same stack.
    12901287A deep coroutine copy, which copies the stack, is also meaningless in an unmanaged language (no garbage collection), like C, because the stack may contain pointers to object within it that require updating for the copy.
    12911288The \CFA @dtype@ property provides no \emph{implicit} copying operations and the @is_coroutine@ trait provides no \emph{explicit} copying operations, so all coroutines must be passed by reference (pointer).
    1292 The function definitions ensures there is a statically-typed @main@ function that is the starting point (first stack frame) of a coroutine, and a mechanism to get (read) the coroutine descriptor from its handle.
     1289The function definitions ensures there is a statically-typed @main@ function that is the starting point (first stack frame) of a coroutine, and a mechanism to get (read) the currently executing coroutine handle.
    12931290The @main@ function has no return value or additional parameters because the coroutine type allows an arbitrary number of interface functions with corresponding arbitrary typed input/output values versus fixed ones.
    12941291The advantage of this approach is that users can easily create different types of coroutines, \eg changing the memory layout of a coroutine is trivial when implementing the @get_coroutine@ function, and possibly redefining \textsf{suspend} and @resume@.
     
    14961493\end{tabular}
    14971494\end{cquote}
    1498 Like coroutines, the @dtype@ property prevents \emph{implicit} copy operations and the @is_thread@ trait provides no \emph{explicit} copy operations, so threads must be passed by reference (pointer).
    1499 Similarly, the function definitions ensures there is a statically-typed @main@ function that is the thread starting point (first stack frame), a mechanism to get (read) the thread descriptor from its handle, and a special destructor to prevent deallocation while the thread is executing.
     1495Like coroutines, the @dtype@ property prevents \emph{implicit} copy operations and the @is_coroutine@ trait provides no \emph{explicit} copy operations, so threads must be passed by reference (pointer).
     1496Similarly, the function definitions ensures there is a statically-typed @main@ function that is the thread starting point (first stack frame), a mechanism to get (read) the currently executing thread handle, and a special destructor to prevent deallocation while the thread is executing.
    15001497(The qualifier @mutex@ for the destructor parameter is discussed in Section~\ref{s:Monitor}.)
    15011498The difference between the coroutine and thread is that a coroutine borrows a thread from its caller, so the first thread resuming a coroutine creates the coroutine's stack and starts running the coroutine main on the stack;
     
    16201617% Copying a lock is insecure because it is possible to copy an open lock and then use the open copy when the original lock is closed to simultaneously access the shared data.
    16211618% Copying a monitor is secure because both the lock and shared data are copies, but copying the shared data is meaningless because it no longer represents a unique entity.
    1622 Similarly, the function definitions ensures there is a mechanism to get (read) the monitor descriptor from its handle, and a special destructor to prevent deallocation if a thread using the shared data.
     1619Similarly, the function definitions ensures there is a mechanism to get (read) the currently executing monitor handle, and a special destructor to prevent deallocation if a thread using the shared data.
    16231620The custom monitor type also inserts any locks needed to implement the mutual exclusion semantics.
    16241621
     
    16591656called \newterm{bulk acquire}.
    16601657\CFA guarantees acquisition order is consistent across calls to @mutex@ functions using the same monitors as arguments, so acquiring multiple monitors is safe from deadlock.
    1661 Figure~\ref{f:BankTransfer} shows a trivial solution to the bank transfer problem~\cite{BankTransfer}, where two resources must be locked simultaneously, using \CFA monitors with implicit locking and \CC with explicit locking.
     1658Figure~\ref{f:BankTransfer} shows a trivial solution to the bank transfer problem, where two resources must be locked simultaneously, using \CFA monitors with implicit locking and \CC with explicit locking.
    16621659A \CFA programmer only has to manage when to acquire mutual exclusion;
    16631660a \CC programmer must select the correct lock and acquisition mechanism from a panoply of locking options.
     
    18031800Figure~\ref{f:MonitorScheduling} shows general internal/external scheduling (for the bounded-buffer example in Figure~\ref{f:InternalExternalScheduling}).
    18041801External calling threads block on the calling queue, if the monitor is occupied, otherwise they enter in FIFO order.
    1805 Internal threads block on condition queues via @wait@ and reenter from the condition in FIFO order.
    1806 Alternatively, internal threads block on urgent from the @signal_block@ or @waitfor@, and reenter implicitly when the monitor becomes empty, \ie, the thread in the monitor exits or waits.
     1802Internal threads block on condition queues via @wait@ and they reenter from the condition in FIFO order, or they block on urgent via @signal_block@ or @waitfor@ and reenter implicit when the monitor becomes empty, \ie, the thread in the monitor exits or waits.
    18071803
    18081804There are three signalling mechanisms to unblock waiting threads to enter the monitor.
    1809 Note, signalling cannot have the signaller and signalled thread in the monitor simultaneously because of the mutual exclusion, so either the signaller or signallee can proceed.
     1805Note, signalling cannot have the signaller and signalled thread in the monitor simultaneously because of the mutual exclusion so only one can proceed.
    18101806For internal scheduling, threads are unblocked from condition queues using @signal@, where the signallee is moved to urgent and the signaller continues (solid line).
    18111807Multiple signals move multiple signallees to urgent, until the condition is empty.
     
    18471843It is common to declare condition variables as monitor fields to prevent shared access, hence no locking is required for access as the conditions are protected by the monitor lock.
    18481844In \CFA, a condition variable can be created/stored independently.
    1849 % To still prevent expensive locking on access, a condition variable is tied to a \emph{group} of monitors on first use, called \newterm{branding}, resulting in a low-cost boolean test to detect sharing from other monitors.
     1845To still prevent expensive locking on access, a condition variable is tied to a \emph{group} of monitors on first use, called \newterm{branding}, resulting in a low-cost boolen test to detect sharing from other monitors.
    18501846
    18511847% Signalling semantics cannot have the signaller and signalled thread in the monitor simultaneously, which means:
     
    18561852% The signalling thread continues and the signalled thread is marked for urgent unblocking at the next scheduling point (exit/wait).
    18571853% \item
    1858 % The signalling thread blocks but is marked for urgent unblocking at the next scheduling point and the signalled thread continues.
     1854% The signalling thread blocks but is marked for urgrent unblocking at the next scheduling point and the signalled thread continues.
    18591855% \end{enumerate}
    18601856% The first approach is too restrictive, as it precludes solving a reasonable class of problems, \eg dating service (see Figure~\ref{f:DatingService}).
     
    19651961External scheduling is controlled by the @waitfor@ statement, which atomically blocks the calling thread, releases the monitor lock, and restricts the function calls that can next acquire mutual exclusion.
    19661962If the buffer is full, only calls to @remove@ can acquire the buffer, and if the buffer is empty, only calls to @insert@ can acquire the buffer.
    1967 Threads calling excluded functions block outside of (external to) the monitor on the calling queue, versus blocking on condition queues inside of (internal to) the monitor.
     1963Calls threads to functions that are currently excluded block outside of (external to) the monitor on the calling queue, versus blocking on condition queues inside of (internal to) the monitor.
    19681964Figure~\ref{f:RWExt} shows a readers/writer lock written using external scheduling, where a waiting reader detects a writer using the resource and restricts further calls until the writer exits by calling @EndWrite@.
    19691965The writer does a similar action for each reader or writer using the resource.
     
    20802076For @wait( e )@, the default semantics is to atomically block the signaller and release all acquired mutex parameters, \ie @wait( e, m1, m2 )@.
    20812077To override the implicit multi-monitor wait, specific mutex parameter(s) can be specified, \eg @wait( e, m1 )@.
    2082 Wait cannot statically verifies the released monitors are the acquired mutex-parameters without disallowing separately compiled helper functions calling @wait@.
     2078Wait statically verifies the released monitors are the acquired mutex-parameters so unconditional release is safe.
    20832079While \CC supports bulk locking, @wait@ only accepts a single lock for a condition variable, so bulk locking with condition variables is asymmetric.
    20842080Finally, a signaller,
     
    20922088Similarly, for @waitfor( rtn )@, the default semantics is to atomically block the acceptor and release all acquired mutex parameters, \ie @waitfor( rtn, m1, m2 )@.
    20932089To override the implicit multi-monitor wait, specific mutex parameter(s) can be specified, \eg @waitfor( rtn, m1 )@.
    2094 @waitfor@ does statically verify the monitor types passed are the same as the acquired mutex-parameters of the given function or function pointer, hence the function (pointer) prototype must be accessible.
     2090@waitfor@ statically verifies the released monitors are the same as the acquired mutex-parameters of the given function or function pointer.
     2091To statically verify the released monitors match with the accepted function's mutex parameters, the function (pointer) prototype must be accessible.
    20952092% When an overloaded function appears in an @waitfor@ statement, calls to any function with that name are accepted.
    20962093% The rationale is that members with the same name should perform a similar function, and therefore, all should be eligible to accept a call.
     
    21512148The right example accepts either @mem1@ or @mem2@ if @C1@ and @C2@ are true.
    21522149
    2153 An interesting use of @waitfor@ is accepting the @mutex@ destructor to know when an object is deallocated, \eg assume the bounded buffer is restructred from a monitor to a thread with the following @main@.
    2154 \begin{cfa}
    2155 void main( Buffer(T) & buffer ) with(buffer) {
    2156         for () {
    2157                 `waitfor( ^?{}, buffer )` break;
    2158                 or when ( count != 20 ) waitfor( insert, buffer ) { ... }
    2159                 or when ( count != 0 ) waitfor( remove, buffer ) { ... }
    2160         }
    2161         // clean up
    2162 }
    2163 \end{cfa}
    2164 When the program main deallocates the buffer, it first calls the buffer's destructor, which is accepted, the destructor runs, and the buffer is deallocated.
    2165 However, the buffer thread cannot continue after the destructor call because the object is gone;
    2166 hence, clean up in @main@ cannot occur, which means destructors for local objects are not run.
    2167 To make this useful capability work, the semantics for accepting the destructor is the same as @signal@, \ie the destructor call is placed on urgent and the acceptor continues execution, which ends the loop, cleans up, and the thread terminates.
    2168 Then, the destructor caller unblocks from urgent to deallocate the object.
    2169 Accepting the destructor is the idiomatic way in \CFA to terminate a thread performing direct communication.
     2150An interesting use of @waitfor@ is accepting the @mutex@ destructor to know when an object is deallocated.
     2151\begin{cfa}
     2152void insert( Buffer(T) & mutex buffer, T elem ) with( buffer ) {
     2153        if ( count == 10 )
     2154                waitfor( remove, buffer ) {
     2155                        // insert elem into buffer
     2156                } or `waitfor( ^?{}, buffer )` throw insertFail;
     2157}
     2158\end{cfa}
     2159When the buffer is deallocated, the current waiter is unblocked and informed, so it can perform an appropriate action.
     2160However, the basic @waitfor@ semantics do not support this functionality, since using an object after its destructor is called is undefined.
     2161Therefore, to make this useful capability work, the semantics for accepting the destructor is the same as @signal@, \ie the call to the destructor is placed on the urgent queue and the acceptor continues execution, which throws an exception to the acceptor and then the caller is unblocked from the urgent queue to deallocate the object.
     2162Accepting the destructor is the idiomatic way to terminate a thread in \CFA.
    21702163
    21712164
     
    23642357
    23652358struct Msg { int i, j; };
    2366 thread GoRtn { int i;  float f;  Msg m; };
    2367 void mem1( GoRtn & mutex gortn, int i ) { gortn.i = i; }
    2368 void mem2( GoRtn & mutex gortn, float f ) { gortn.f = f; }
    2369 void mem3( GoRtn & mutex gortn, Msg m ) { gortn.m = m; }
    2370 void ^?{}( GoRtn & mutex ) {}
    2371 
    2372 void main( GoRtn & gortn ) with( gortn ) {  // thread starts
     2359thread Gortn { int i;  float f;  Msg m; };
     2360void mem1( Gortn & mutex gortn, int i ) { gortn.i = i; }
     2361void mem2( Gortn & mutex gortn, float f ) { gortn.f = f; }
     2362void mem3( Gortn & mutex gortn, Msg m ) { gortn.m = m; }
     2363void ^?{}( Gortn & mutex ) {}
     2364
     2365void main( Gortn & gortn ) with( gortn ) {  // thread starts
    23732366
    23742367        for () {
     
    23832376}
    23842377int main() {
    2385         GoRtn gortn; $\C[2.0in]{// start thread}$
     2378        Gortn gortn; $\C[2.0in]{// start thread}$
    23862379        `mem1( gortn, 0 );` $\C{// different calls}\CRT$
    23872380        `mem2( gortn, 2.5 );`
     
    25412534% However, preemption is necessary for fairness and to reduce tail-latency.
    25422535% For concurrency that relies on spinning, if all cores spin the system is livelocked, whereas preemption breaks the livelock.
    2543 
    2544 
    2545 \begin{comment}
    2546 \subsection{Thread Pools}
    2547 
    2548 In contrast to direct threading is indirect \newterm{thread pools}, \eg Java @executor@, where small jobs (work units) are inserted into a work pool for execution.
    2549 If the jobs are dependent, \ie interact, there is an implicit/explicit dependency graph that ties them together.
    2550 While removing direct concurrency, and hence the amount of context switching, thread pools significantly limit the interaction that can occur among jobs.
    2551 Indeed, jobs should not block because that also blocks the underlying thread, which effectively means the CPU utilization, and therefore throughput, suffers.
    2552 While it is possible to tune the thread pool with sufficient threads, it becomes difficult to obtain high throughput and good core utilization as job interaction increases.
    2553 As well, concurrency errors return, which threads pools are suppose to mitigate.
    2554 
    2555 \begin{figure}
    2556 \centering
    2557 \begin{tabular}{@{}l|l@{}}
    2558 \begin{cfa}
    2559 struct Adder {
    2560     int * row, cols;
    2561 };
    2562 int operator()() {
    2563         subtotal = 0;
    2564         for ( int c = 0; c < cols; c += 1 )
    2565                 subtotal += row[c];
    2566         return subtotal;
    2567 }
    2568 void ?{}( Adder * adder, int row[$\,$], int cols, int & subtotal ) {
    2569         adder.[rows, cols, subtotal] = [rows, cols, subtotal];
    2570 }
    2571 
    2572 
    2573 
    2574 
    2575 \end{cfa}
    2576 &
    2577 \begin{cfa}
    2578 int main() {
    2579         const int rows = 10, cols = 10;
    2580         int matrix[rows][cols], subtotals[rows], total = 0;
    2581         // read matrix
    2582         Executor executor( 4 ); // kernel threads
    2583         Adder * adders[rows];
    2584         for ( r; rows ) { // send off work for executor
    2585                 adders[r] = new( matrix[r], cols, &subtotal[r] );
    2586                 executor.send( *adders[r] );
    2587         }
    2588         for ( r; rows ) {       // wait for results
    2589                 delete( adders[r] );
    2590                 total += subtotals[r];
    2591         }
    2592         sout | total;
    2593 }
    2594 \end{cfa}
    2595 \end{tabular}
    2596 \caption{Executor}
    2597 \end{figure}
    2598 \end{comment}
     2536%
     2537%
     2538% \subsection{Thread Pools}
     2539%
     2540% In contrast to direct threading is indirect \newterm{thread pools}, \eg Java @executor@, where small jobs (work units) are inserted into a work pool for execution.
     2541% If the jobs are dependent, \ie interact, there is an implicit/explicit dependency graph that ties them together.
     2542% While removing direct concurrency, and hence the amount of context switching, thread pools significantly limit the interaction that can occur among jobs.
     2543% Indeed, jobs should not block because that also blocks the underlying thread, which effectively means the CPU utilization, and therefore throughput, suffers.
     2544% While it is possible to tune the thread pool with sufficient threads, it becomes difficult to obtain high throughput and good core utilization as job interaction increases.
     2545% As well, concurrency errors return, which threads pools are suppose to mitigate.
    25992546
    26002547
     
    26202567The purpose of a cluster is to control the amount of parallelism that is possible among threads, plus scheduling and other execution defaults.
    26212568The default cluster-scheduler is single-queue multi-server, which provides automatic load-balancing of threads on processors.
    2622 However, the design allows changing the scheduler, \eg multi-queue multi-server with work-stealing/sharing across the virtual processors.
     2569However, the scheduler is pluggable, supporting alternative schedulers, such as multi-queue multi-server, with work-stealing/sharing across the virtual processors.
    26232570If several clusters exist, both threads and virtual processors, can be explicitly migrated from one cluster to another.
    26242571No automatic load balancing among clusters is performed by \CFA.
     
    26272574The user cluster is created to contain the application user-threads.
    26282575Having all threads execute on the one cluster often maximizes utilization of processors, which minimizes runtime.
    2629 However, because of limitations of scheduling requirements (real-time), NUMA architecture, heterogeneous hardware, or issues with the underlying operating system, multiple clusters are sometimes necessary.
     2576However, because of limitations of the underlying operating system, heterogeneous hardware, or scheduling requirements (real-time), multiple clusters are sometimes necessary.
    26302577
    26312578
     
    26712618\subsection{Preemption}
    26722619
    2673 Nondeterministic preemption provides fairness from long running threads, and forces concurrent programmers to write more robust programs, rather than relying on code between cooperative scheduling to be atomic.
    2674 This atomic reliance can fail on multi-core machines, because execution across cores is nondeterministic.
    2675 A different reason for not supporting preemption is that it significantly complicates the runtime system, \eg Microsoft runtime does not support interrupts and on Linux systems, interrupts are complex (see below).
     2620Nondeterministic preemption provides fairness from long running threads, and forces concurrent programmers to write more robust programs, rather than relying on section of code between cooperative scheduling to be atomic.
     2621A separate reason for not supporting preemption is that it significantly complicates the runtime system.
    26762622Preemption is normally handled by setting a count-down timer on each virtual processor.
    26772623When the timer expires, an interrupt is delivered, and the interrupt handler resets the count-down timer, and if the virtual processor is executing in user code, the signal handler performs a user-level context-switch, or if executing in the language runtime-kernel, the preemption is ignored or rolled forward to the point where the runtime kernel context switches back to user code.
     
    26822628Because preemption frequency is usually long (1 millisecond) performance cost is negligible.
    26832629
    2684 Linux switched a decade ago from specific to arbitrary process signal-delivery for applications with multiple kernel threads.
     2630However, on current Linux systems:
    26852631\begin{cquote}
    26862632A process-directed signal may be delivered to any one of the threads that does not currently have the signal blocked.
     
    26882634SIGNAL(7) - Linux Programmer's Manual
    26892635\end{cquote}
    2690 Hence, the timer-expiry signal, which is generated \emph{externally} by the Linux kernel to an application, is delivered to any of its Linux subprocesses (kernel threads).
    2691 To ensure each virtual processor receives a preemption signal, a discrete-event simulation is run on a special virtual processor, and only it sets and receives timer events.
     2636Hence, the timer-expiry signal, which is generated \emph{externally} by the Linux kernel to the Linux process, is delivered to any of its Linux subprocesses (kernel threads).
     2637To ensure each virtual processor receives its own preemption signals, a discrete-event simulation is run on a special virtual processor, and only it sets and receives timer events.
    26922638Virtual processors register an expiration time with the discrete-event simulator, which is inserted in sorted order.
    26932639The simulation sets the count-down timer to the value at the head of the event list, and when the timer expires, all events less than or equal to the current time are processed.
     
    27062652
    27072653To verify the implementation of the \CFA runtime, a series of microbenchmarks are performed comparing \CFA with Java OpenJDK-9, Go 1.9.2 and \uC 7.0.0.
    2708 The benchmark computer is an AMD Opteron\texttrademark\ 6380 NUMA 64-core, 8 socket, 2.5 GHz processor, running Ubuntu 16.04.6 LTS, and \uC/\CFA are compiled with gcc 6.5.
     2654The benchmark computer is an AMD Opteron\texttrademark\ 6380 NUMA 64-core, 8 socket, 2.5 GHz processor, running Ubuntu 16.04.3 LTS and \uC and \CFA are compiled with gcc 6.3.
    27092655
    27102656\begin{comment}
     
    27612707\lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
    27622708\begin{cfa}
    2763 @thread@ MyThread {};
    2764 void @main@( MyThread & ) {}
     2709thread MyThread {};
     2710void main( MyThread & ) {}
    27652711int main() {
    27662712        BENCH( for ( N ) { @MyThread m;@ } )
     
    28042750\lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
    28052751\begin{cfa}[aboveskip=0pt,belowskip=0pt]
    2806 @coroutine@ C {} c;
     2752coroutine C {} c;
    28072753void main( C & ) { for ( ;; ) { @suspend;@ } }
    28082754int main() { // coroutine test
     
    28252771\begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
    28262772\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
    2827 C function              & 2                     & 2             & 0             \\
    2828 \CFA generator  & 2                     & 2             & 0             \\
     2773Kernel Thread   & 333.5 & 332.96        & 4.1   \\
    28292774\CFA Coroutine  & 49    & 48.68         & 0.47  \\
    28302775\CFA Thread             & 105   & 105.57        & 1.37  \\
     
    28322777\uC Thread              & 100   & 99.29         & 0.96  \\
    28332778Goroutine               & 145   & 147.25        & 4.15  \\
    2834 Java Thread             & 373.5 & 375.14        & 8.72  \\
    2835 Pthreads Thread & 333.5 & 332.96        & 4.1
     2779Java Thread             & 373.5 & 375.14        & 8.72
    28362780\end{tabular}
    28372781\end{multicols}
     
    28492793\lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
    28502794\begin{cfa}
    2851 @monitor@ M {} m1/*, m2, m3, m4*/;
     2795monitor M {} m1/*, m2, m3, m4*/;
    28522796void __attribute__((noinline))
    2853 do_call( M & @mutex m/*, m2, m3, m4*/@ ) {}
     2797do_call( M & mutex m/*, m2, m3, m4*/ ) {}
    28542798int main() {
    28552799        BENCH(
    2856                 for( N ) do_call( m1/*, m2, m3, m4*/ );
     2800                for( N ) @do_call( m1/*, m2, m3, m4*/ );@
    28572801        )
    28582802        sout | result`ns;
     
    28692813\begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
    28702814\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
    2871 test and test-and-test lock             & 26            & 26    & 0             \\
     2815C function                                              & 2                     & 2             & 0             \\
     2816FetchAdd + FetchSub                             & 26            & 26    & 0             \\
    28722817Pthreads Mutex Lock                             & 31            & 31.71 & 0.97  \\
    28732818\uC @monitor@ member rtn.               & 31            & 31    & 0             \\
     
    28752820\CFA @mutex@ function, 2 arg.   & 84            & 85.36 & 1.99  \\
    28762821\CFA @mutex@ function, 4 arg.   & 158           & 161   & 4.22  \\
    2877 Java synchronized method                & 27.5          & 29.79 & 2.93
     2822Java synchronized function              & 27.5          & 29.79 & 2.93
    28782823\end{tabular}
    28792824\end{multicols}
     
    28912836\begin{cfa}
    28922837volatile int go = 0;
    2893 @monitor@ M { @condition c;@ } m;
     2838monitor M { condition c; } m;
    28942839void __attribute__((noinline))
    2895 do_call( M & @mutex@ a1 ) { @signal( c );@ }
     2840do_call( M & mutex a1 ) { @signal( c );@ }
    28962841thread T {};
    28972842void main( T & this ) {
     
    29242869\multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
    29252870Pthreads Cond. Variable         & 6005          & 5681.43       & 835.45        \\
    2926 \uC @signal@                            & 324           & 325.54        & 3.02          \\
     2871\uC @signal@                            & 324           & 325.54        & 3,02          \\
    29272872\CFA @signal@, 1 @monitor@      & 368.5         & 370.61        & 4.77          \\
    29282873\CFA @signal@, 2 @monitor@      & 467           & 470.5         & 6.79          \\
     
    29442889\begin{cfa}
    29452890volatile int go = 0;
    2946 @monitor@ M {} m;
     2891monitor M {} m;
    29472892thread T {};
    29482893void __attribute__((noinline))
    2949 do_call( M & @mutex@ ) {}
     2894do_call( M & mutex ) {}
    29502895void main( T & ) {
    29512896        while ( go == 0 ) { yield(); }
    2952         while ( go == 1 ) { do_call( m ); }
     2897        while ( go == 1 ) { @do_call( m );@ }
    29532898}
    29542899int __attribute__((noinline))
    2955 do_wait( M & @mutex@ m ) {
     2900do_wait( M & mutex m ) {
    29562901        go = 1; // continue other thread
    29572902        BENCH( for ( N ) { @waitfor( do_call, m );@ } )
Note: See TracChangeset for help on using the changeset viewer.