Paper.tex [8f079f0:4487667]

doc/papers/concurrency/Paper.tex (modified) (40 diffs)

doc/papers/concurrency/Paper.tex

-              r8f079f0
+              r4487667
                 __float80, float80, __float128, float128, forall, ftype, generator, _Generic, _Imaginary, __imag, __imag__,
                 inline, __inline, __inline__, __int128, int128, __label__, monitor, mutex, _Noreturn, one_t, or,
                 otype, restrict, __restrict, __restrict__, __signed, __signed__, _Static_assert, thread,
+                otype, restrict, __restrict, __restrict__, __signed, __signed__, _Static_assert, suspend, thread,
                 _Thread_local, throw, throwResume, timeout, trait, try, ttype, typeof, __typeof, __typeof__,
                 virtual, __volatile, __volatile__, waitfor, when, with, zero_t},
 …
 However, \Celeven concurrency is largely wrappers for a subset of the pthreads library~\cite{Butenhof97,Pthreads}, and \Celeven and pthreads concurrency is simple, based on thread fork/join in a function and a few locks, which is low-level and error prone;
 no high-level language concurrency features are defined.
 Interestingly, almost a decade after publication of the \Celeven standard, neither gcc-8, clang-9 nor msvc-19 (most recent versions) support the \Celeven include @threads.h@, indicating little interest in the C11 concurrency approach.
+Interestingly, almost a decade after publication of the \Celeven standard, neither gcc-8, clang-8 nor msvc-19 (most recent versions) support the \Celeven include @threads.h@, indicating little interest in the C11 concurrency approach.
 Finally, while the \Celeven standard does not state a threading model, the historical association with pthreads suggests implementations would adopt kernel-level threading (1:1)~\cite{ThreadModel}.
 …
 From 2000 onwards, languages like Go~\cite{Go}, Erlang~\cite{Erlang}, Haskell~\cite{Haskell}, D~\cite{D}, and \uC~\cite{uC++,uC++book} have championed the M:N user-threading model, and many user-threading libraries have appeared~\cite{Qthreads,MPC,Marcel}, including putting green threads back into Java~\cite{Quasar}.
 The main argument for user-level threading is that they are lighter weight than kernel threads (locking and context switching do not cross the kernel boundary), so there is less restriction on programming styles that encourage large numbers of threads performing medium work-units to facilitate load balancing by the runtime~\cite{Verch12}.
 As well, user-threading facilitates a simpler concurrency approach using thread objects that leverage sequential patterns versus events with call-backs~\cite{Adya02,vonBehren03}.
+As well, user-threading facilitates a simpler concurrency approach using thread objects that leverage sequential patterns versus events with call-backs~\cite{vonBehren03}.
 Finally, performant user-threading implementations (both time and space) meet or exceed direct kernel-threading implementations, while achieving the programming advantages of high concurrency levels and safety.
 …
 Finally, an explicit generator type provides both design and performance benefits, such as multiple type-safe interface functions taking and returning arbitrary types.
 \begin{cfa}
 int ?()( Fib & fib ) { return `resume( fib )`.fn; } $\C[3.9in]{// function-call interface}$
 int ?()( Fib & fib, int N ) { for ( N - 1 ) `fib()`; return `fib()`; } $\C{// use function-call interface to skip N values}$
 double ?()( Fib & fib ) { return (int)`fib()` / 3.14159; } $\C{// different return type, cast prevents recursive call}\CRT$
 sout | (int)f1() | (double)f1() | f2( 2 ); // alternative interface, cast selects call based on return type, step 2 values
+int ?()( Fib & fib ) with( fib ) { return `resume( fib )`.fn; }   // function-call interface
+int ?()( Fib & fib, int N ) with( fib ) { for ( N - 1 ) `fib()`; return `fib()`; }   // use simple interface
+double ?()( Fib & fib ) with( fib ) { return (int)`fib()` / 3.14159; } // cast prevents recursive call
+sout | (int)f1() | (double)f1() | f2( 2 );   // simple interface, cast selects call based on return type, step 2 values
 \end{cfa}
 Now, the generator can be a separately-compiled opaque-type only accessed through its interface functions.
 …
 With respect to safety, we believe static analysis can discriminate local state from temporary variables in a generator, \ie variable usage spanning @suspend@, and generate a compile-time error.
 Finally, our current experience is that most generator problems have simple data state, including local state, but complex execution state, so the burden of creating the generator type is small.
 As well, C programmers are not afraid of this kind of semantic programming requirement, if it results in very small, fast generators.
+As well, C programmers are not afraid with this kind of semantic programming requirement, if it results in very small, fast generators.
 Figure~\ref{f:CFAFormatGen} shows an asymmetric \newterm{input generator}, @Fmt@, for restructuring text into groups of characters of fixed-size blocks, \ie the input on the left is reformatted into the output on the right, where newlines are ignored.
 …
 This semantics is basically a tail-call optimization, which compilers already perform.
 The example shows the assembly code to undo the generator's entry code before the direct jump.
 This assembly code depends on what entry code is generated, specifically if there are local variables, and the level of optimization.
+This assembly code depends on what entry code is generated, specifically if there are local variables and the level of optimization.
 To provide this new calling convention requires a mechanism built into the compiler, which is beyond the scope of \CFA at this time.
 Nevertheless, it is possible to hand generate any symmetric generators for proof of concept and performance testing.
 A compiler could also eliminate other artifacts in the generator simulation to further increase performance, \eg LLVM has various coroutine support~\cite{CoroutineTS}, and \CFA can leverage this support should it fork @clang@.
+A compiler could also eliminate other artifacts in the generator simulation to further increase performance.
 \begin{figure}
 …
 Hence, a compromise solution is necessary that works for asymmetric (acyclic) and symmetric (cyclic) coroutines.
+Our solution is to context switch back to the first resumer (starter) once the coroutine ends.
+This semantics works well for the most common asymmetric and symmetric coroutine usage-patterns.
+Our solution for coroutine termination works well for the most common asymmetric and symmetric coroutine usage-patterns.
 For asymmetric coroutines, it is common for the first resumer (starter) coroutine to be the only resumer.
 All previous generators converted to coroutines have this property.
 …
 \subsection{Generator / Coroutine Implementation}
+\subsection{(Generator) Coroutine Implementation}
 A significant implementation challenge for generators/coroutines (and threads in Section~\ref{s:threads}) is adding extra fields to the custom types and related functions, \eg inserting code after/before the coroutine constructor/destructor and @main@ to create/initialize/de-initialize/destroy any extra fields, \eg stack.
 …
 class myCoroutine inherits baseCoroutine { ... }
 \end{cfa}
 % The problem is that the programming language and its tool chain, \eg debugger, @valgrind@, need to understand @baseCoroutine@ because it infers special property, so type @baseCoroutine@ becomes a de facto keyword and all types inheriting from it are implicitly custom types.
 The problem is that some special properties are not handled by existing language semantics, \eg the execution of constructors/destructors is in the wrong order to implicitly start threads because the thread must start \emph{after} all constructors as it relies on a completely initialized object, but the inherited constructor runs \emph{before} the derived.
+The problem is that the programming language and its tool chain, \eg debugger, @valgrind@, need to understand @baseCoroutine@ because it infers special property, so type @baseCoroutine@ becomes a de facto keyword and all types inheriting from it are implicitly custom types.
+As well, some special properties are not handled by existing language semantics, \eg the execution of constructors/destructors is in the wrong order to implicitly start threads because the thread must start \emph{after} all constructors as it relies on a completely initialized object, but the inherited constructor runs \emph{before} the derived.
 Alternatives, such as explicitly starting threads as in Java, are repetitive and forgetting to call start is a common source of errors.
 An alternative is composition:
 …
 However, there is nothing preventing wrong placement or multiple declarations.
+\CFA custom types make any special properties explicit to the language and its tool chain, \eg the language code-generator knows where to inject code
+% and when it is unsafe to perform certain optimizations,
+and IDEs using simple parsing can find and manipulate types with special properties.
+\CFA custom types make any special properties explicit to the language and its tool chain, \eg the language code-generator knows where to inject code and when it is unsafe to perform certain optimizations, and IDEs using simple parsing can find and manipulate types with special properties.
 The downside of this approach is that it makes custom types a special case in the language.
 Users wanting to extend custom types or build their own can only do so in ways offered by the language.
 …
 \end{cfa}
 Note, copying generators/coroutines/threads is not meaningful.
 For example, both the resumer and suspender descriptors can have bi-directional pointers;
 copying these coroutines does not update the internal pointers so behaviour of both copies would be difficult to understand.
+For example, a coroutine retains its last resumer and suspends back to it;
+having a copy also suspend back to the same resumer is undefined semantics.
 Furthermore, two coroutines cannot logically execute on the same stack.
 A deep coroutine copy, which copies the stack, is also meaningless in an unmanaged language (no garbage collection), like C, because the stack may contain pointers to objects within it that require updating for the copy.
 The \CFA @dtype@ property provides no \emph{implicit} copying operations and the @is_coroutine@ trait provides no \emph{explicit} copying operations, so all coroutines must be passed by reference (pointer).
 The function definitions ensures there is a statically-typed @main@ function that is the starting point (first stack frame) of a coroutine, and a mechanism to get (read) the coroutine descriptor from its handle.
+The function definitions ensures there is a statically-typed @main@ function that is the starting point (first stack frame) of a coroutine, and a mechanism to get (read) the currently executing coroutine handle.
 The @main@ function has no return value or additional parameters because the coroutine type allows an arbitrary number of interface functions with corresponding arbitrary typed input/output values versus fixed ones.
 The advantage of this approach is that users can easily create different types of coroutines, \eg changing the memory layout of a coroutine is trivial when implementing the @get_coroutine@ function, and possibly redefining \textsf{suspend} and @resume@.
 …
 \end{tabular}
 \end{cquote}
 Like coroutines, the @dtype@ property prevents \emph{implicit} copy operations and the @is_thread@ trait provides no \emph{explicit} copy operations, so threads must be passed by reference (pointer).
 Similarly, the function definitions ensures there is a statically-typed @main@ function that is the thread starting point (first stack frame), a mechanism to get (read) the thread descriptor from its handle, and a special destructor to prevent deallocation while the thread is executing.
+Like coroutines, the @dtype@ property prevents \emph{implicit} copy operations and the @is_coroutine@ trait provides no \emph{explicit} copy operations, so threads must be passed by reference (pointer).
+Similarly, the function definitions ensures there is a statically-typed @main@ function that is the thread starting point (first stack frame), a mechanism to get (read) the currently executing thread handle, and a special destructor to prevent deallocation while the thread is executing.
 (The qualifier @mutex@ for the destructor parameter is discussed in Section~\ref{s:Monitor}.)
 The difference between the coroutine and thread is that a coroutine borrows a thread from its caller, so the first thread resuming a coroutine creates the coroutine's stack and starts running the coroutine main on the stack;
 …
 % Copying a lock is insecure because it is possible to copy an open lock and then use the open copy when the original lock is closed to simultaneously access the shared data.
 % Copying a monitor is secure because both the lock and shared data are copies, but copying the shared data is meaningless because it no longer represents a unique entity.
 Similarly, the function definitions ensures there is a mechanism to get (read) the monitor descriptor from its handle, and a special destructor to prevent deallocation if a thread using the shared data.
+Similarly, the function definitions ensures there is a mechanism to get (read) the currently executing monitor handle, and a special destructor to prevent deallocation if a thread using the shared data.
 The custom monitor type also inserts any locks needed to implement the mutual exclusion semantics.
 …
 called \newterm{bulk acquire}.
 \CFA guarantees acquisition order is consistent across calls to @mutex@ functions using the same monitors as arguments, so acquiring multiple monitors is safe from deadlock.
 Figure~\ref{f:BankTransfer} shows a trivial solution to the bank transfer problem~\cite{BankTransfer}, where two resources must be locked simultaneously, using \CFA monitors with implicit locking and \CC with explicit locking.
+Figure~\ref{f:BankTransfer} shows a trivial solution to the bank transfer problem, where two resources must be locked simultaneously, using \CFA monitors with implicit locking and \CC with explicit locking.
 A \CFA programmer only has to manage when to acquire mutual exclusion;
 a \CC programmer must select the correct lock and acquisition mechanism from a panoply of locking options.
 …
 Figure~\ref{f:MonitorScheduling} shows general internal/external scheduling (for the bounded-buffer example in Figure~\ref{f:InternalExternalScheduling}).
 External calling threads block on the calling queue, if the monitor is occupied, otherwise they enter in FIFO order.
+Internal threads block on condition queues via @wait@ and reenter from the condition in FIFO order.
+Alternatively, internal threads block on urgent from the @signal_block@ or @waitfor@, and reenter implicitly when the monitor becomes empty, \ie, the thread in the monitor exits or waits.
+Internal threads block on condition queues via @wait@ and they reenter from the condition in FIFO order, or they block on urgent via @signal_block@ or @waitfor@ and reenter implicit when the monitor becomes empty, \ie, the thread in the monitor exits or waits.
 There are three signalling mechanisms to unblock waiting threads to enter the monitor.
 Note, signalling cannot have the signaller and signalled thread in the monitor simultaneously because of the mutual exclusion, so either the signaller or signallee can proceed.
+Note, signalling cannot have the signaller and signalled thread in the monitor simultaneously because of the mutual exclusion so only one can proceed.
 For internal scheduling, threads are unblocked from condition queues using @signal@, where the signallee is moved to urgent and the signaller continues (solid line).
 Multiple signals move multiple signallees to urgent, until the condition is empty.
 …
 It is common to declare condition variables as monitor fields to prevent shared access, hence no locking is required for access as the conditions are protected by the monitor lock.
 In \CFA, a condition variable can be created/stored independently.
 % To still prevent expensive locking on access, a condition variable is tied to a \emph{group} of monitors on first use, called \newterm{branding}, resulting in a low-cost boolean test to detect sharing from other monitors.
+To still prevent expensive locking on access, a condition variable is tied to a \emph{group} of monitors on first use, called \newterm{branding}, resulting in a low-cost boolen test to detect sharing from other monitors.
 % Signalling semantics cannot have the signaller and signalled thread in the monitor simultaneously, which means:
 …
 % The signalling thread continues and the signalled thread is marked for urgent unblocking at the next scheduling point (exit/wait).
 % \item
 % The signalling thread blocks but is marked for urgent unblocking at the next scheduling point and the signalled thread continues.
+% The signalling thread blocks but is marked for urgrent unblocking at the next scheduling point and the signalled thread continues.
 % \end{enumerate}
 % The first approach is too restrictive, as it precludes solving a reasonable class of problems, \eg dating service (see Figure~\ref{f:DatingService}).
 …
 External scheduling is controlled by the @waitfor@ statement, which atomically blocks the calling thread, releases the monitor lock, and restricts the function calls that can next acquire mutual exclusion.
 If the buffer is full, only calls to @remove@ can acquire the buffer, and if the buffer is empty, only calls to @insert@ can acquire the buffer.
 Threads calling excluded functions block outside of (external to) the monitor on the calling queue, versus blocking on condition queues inside of (internal to) the monitor.
+Calls threads to functions that are currently excluded block outside of (external to) the monitor on the calling queue, versus blocking on condition queues inside of (internal to) the monitor.
 Figure~\ref{f:RWExt} shows a readers/writer lock written using external scheduling, where a waiting reader detects a writer using the resource and restricts further calls until the writer exits by calling @EndWrite@.
 The writer does a similar action for each reader or writer using the resource.
 …
 For @wait( e )@, the default semantics is to atomically block the signaller and release all acquired mutex parameters, \ie @wait( e, m1, m2 )@.
 To override the implicit multi-monitor wait, specific mutex parameter(s) can be specified, \eg @wait( e, m1 )@.
 Wait cannot statically verifies the released monitors are the acquired mutex-parameters without disallowing separately compiled helper functions calling @wait@.
+Wait statically verifies the released monitors are the acquired mutex-parameters so unconditional release is safe.
 While \CC supports bulk locking, @wait@ only accepts a single lock for a condition variable, so bulk locking with condition variables is asymmetric.
 Finally, a signaller,
 …
 Similarly, for @waitfor( rtn )@, the default semantics is to atomically block the acceptor and release all acquired mutex parameters, \ie @waitfor( rtn, m1, m2 )@.
 To override the implicit multi-monitor wait, specific mutex parameter(s) can be specified, \eg @waitfor( rtn, m1 )@.
+@waitfor@ does statically verify the monitor types passed are the same as the acquired mutex-parameters of the given function or function pointer, hence the function (pointer) prototype must be accessible.
+@waitfor@ statically verifies the released monitors are the same as the acquired mutex-parameters of the given function or function pointer.
+To statically verify the released monitors match with the accepted function's mutex parameters, the function (pointer) prototype must be accessible.
 % When an overloaded function appears in an @waitfor@ statement, calls to any function with that name are accepted.
 % The rationale is that members with the same name should perform a similar function, and therefore, all should be eligible to accept a call.
 …
 The right example accepts either @mem1@ or @mem2@ if @C1@ and @C2@ are true.
+An interesting use of @waitfor@ is accepting the @mutex@ destructor to know when an object is deallocated, \eg assume the bounded buffer is restructred from a monitor to a thread with the following @main@.
+\begin{cfa}
+void main( Buffer(T) & buffer ) with(buffer) {
+        for () {
+                `waitfor( ^?{}, buffer )` break;
+                or when ( count != 20 ) waitfor( insert, buffer ) { ... }
+                or when ( count != 0 ) waitfor( remove, buffer ) { ... }
+        }
+        // clean up
+}
+\end{cfa}
+When the program main deallocates the buffer, it first calls the buffer's destructor, which is accepted, the destructor runs, and the buffer is deallocated.
+However, the buffer thread cannot continue after the destructor call because the object is gone;
+hence, clean up in @main@ cannot occur, which means destructors for local objects are not run.
+To make this useful capability work, the semantics for accepting the destructor is the same as @signal@, \ie the destructor call is placed on urgent and the acceptor continues execution, which ends the loop, cleans up, and the thread terminates.
+Then, the destructor caller unblocks from urgent to deallocate the object.
+Accepting the destructor is the idiomatic way in \CFA to terminate a thread performing direct communication.
+An interesting use of @waitfor@ is accepting the @mutex@ destructor to know when an object is deallocated.
+\begin{cfa}
+void insert( Buffer(T) & mutex buffer, T elem ) with( buffer ) {
+        if ( count == 10 )
+                waitfor( remove, buffer ) {
+                        // insert elem into buffer
+                } or `waitfor( ^?{}, buffer )` throw insertFail;
+}
+\end{cfa}
+When the buffer is deallocated, the current waiter is unblocked and informed, so it can perform an appropriate action.
+However, the basic @waitfor@ semantics do not support this functionality, since using an object after its destructor is called is undefined.
+Therefore, to make this useful capability work, the semantics for accepting the destructor is the same as @signal@, \ie the call to the destructor is placed on the urgent queue and the acceptor continues execution, which throws an exception to the acceptor and then the caller is unblocked from the urgent queue to deallocate the object.
+Accepting the destructor is the idiomatic way to terminate a thread in \CFA.
 …
 struct Msg { int i, j; };
 thread GoRtn { int i;  float f;  Msg m; };
 void mem1( GoRtn & mutex gortn, int i ) { gortn.i = i; }
 void mem2( GoRtn & mutex gortn, float f ) { gortn.f = f; }
 void mem3( GoRtn & mutex gortn, Msg m ) { gortn.m = m; }
 void ^?{}( GoRtn & mutex ) {}
 void main( GoRtn & gortn ) with( gortn ) {  // thread starts
+thread Gortn { int i;  float f;  Msg m; };
+void mem1( Gortn & mutex gortn, int i ) { gortn.i = i; }
+void mem2( Gortn & mutex gortn, float f ) { gortn.f = f; }
+void mem3( Gortn & mutex gortn, Msg m ) { gortn.m = m; }
+void ^?{}( Gortn & mutex ) {}
+void main( Gortn & gortn ) with( gortn ) {  // thread starts
         for () {
 …
+}
 int main() {
         GoRtn gortn; $\C[2.0in]{// start thread}$
+        Gortn gortn; $\C[2.0in]{// start thread}$
         `mem1( gortn, 0 );` $\C{// different calls}\CRT$
         `mem2( gortn, 2.5 );`
 …
 % However, preemption is necessary for fairness and to reduce tail-latency.
 % For concurrency that relies on spinning, if all cores spin the system is livelocked, whereas preemption breaks the livelock.
+\begin{comment}
+\subsection{Thread Pools}
+In contrast to direct threading is indirect \newterm{thread pools}, \eg Java @executor@, where small jobs (work units) are inserted into a work pool for execution.
+If the jobs are dependent, \ie interact, there is an implicit/explicit dependency graph that ties them together.
+While removing direct concurrency, and hence the amount of context switching, thread pools significantly limit the interaction that can occur among jobs.
+Indeed, jobs should not block because that also blocks the underlying thread, which effectively means the CPU utilization, and therefore throughput, suffers.
+While it is possible to tune the thread pool with sufficient threads, it becomes difficult to obtain high throughput and good core utilization as job interaction increases.
+As well, concurrency errors return, which threads pools are suppose to mitigate.
+\begin{figure}
+\centering
+\begin{tabular}{@{}l|l@{}}
+\begin{cfa}
+struct Adder {
+    int * row, cols;
+};
+int operator()() {
+        subtotal = 0;
+        for ( int c = 0; c < cols; c += 1 )
+                subtotal += row[c];
+        return subtotal;
+}
+void ?{}( Adder * adder, int row[$\,$], int cols, int & subtotal ) {
+        adder.[rows, cols, subtotal] = [rows, cols, subtotal];
+}
+\end{cfa}
+&
+\begin{cfa}
+int main() {
+        const int rows = 10, cols = 10;
+        int matrix[rows][cols], subtotals[rows], total = 0;
+        // read matrix
+        Executor executor( 4 ); // kernel threads
+        Adder * adders[rows];
+        for ( r; rows ) { // send off work for executor
+                adders[r] = new( matrix[r], cols, &subtotal[r] );
+                executor.send( *adders[r] );
+        }
+        for ( r; rows ) {       // wait for results
+                delete( adders[r] );
+                total += subtotals[r];
+        }
+        sout | total;
+}
+\end{cfa}
+\end{tabular}
+\caption{Executor}
+\end{figure}
+\end{comment}
+%
+%
+% \subsection{Thread Pools}
+%
+% In contrast to direct threading is indirect \newterm{thread pools}, \eg Java @executor@, where small jobs (work units) are inserted into a work pool for execution.
+% If the jobs are dependent, \ie interact, there is an implicit/explicit dependency graph that ties them together.
+% While removing direct concurrency, and hence the amount of context switching, thread pools significantly limit the interaction that can occur among jobs.
+% Indeed, jobs should not block because that also blocks the underlying thread, which effectively means the CPU utilization, and therefore throughput, suffers.
+% While it is possible to tune the thread pool with sufficient threads, it becomes difficult to obtain high throughput and good core utilization as job interaction increases.
+% As well, concurrency errors return, which threads pools are suppose to mitigate.
 …
 The purpose of a cluster is to control the amount of parallelism that is possible among threads, plus scheduling and other execution defaults.
 The default cluster-scheduler is single-queue multi-server, which provides automatic load-balancing of threads on processors.
 However, the design allows changing the scheduler, \eg multi-queue multi-server with work-stealing/sharing across the virtual processors.
+However, the scheduler is pluggable, supporting alternative schedulers, such as multi-queue multi-server, with work-stealing/sharing across the virtual processors.
 If several clusters exist, both threads and virtual processors can be explicitly migrated from one cluster to another.
 No automatic load balancing among clusters is performed by \CFA.
 …
 The user cluster is created to contain the application user-threads.
 Having all threads execute on the one cluster often maximizes utilization of processors, which minimizes runtime.
 However, because of limitations of scheduling requirements (real-time), NUMA architecture, heterogeneous hardware, or issues with the underlying operating system, multiple clusters are sometimes necessary.
+However, because of limitations of the underlying operating system, heterogeneous hardware, or scheduling requirements (real-time), multiple clusters are sometimes necessary.
 …
 \subsection{Preemption}
+Nondeterministic preemption provides fairness from long running threads, and forces concurrent programmers to write more robust programs, rather than relying on code between cooperative scheduling to be atomic.
+This atomic reliance can fail on multi-core machines, because execution across cores is nondeterministic.
+A different reason for not supporting preemption is that it significantly complicates the runtime system, \eg Microsoft runtime does not support interrupts and on Linux systems, interrupts are complex (see below).
+Nondeterministic preemption provides fairness from long running threads, and forces concurrent programmers to write more robust programs, rather than relying on section of code between cooperative scheduling to be atomic.
+A separate reason for not supporting preemption is that it significantly complicates the runtime system.
 Preemption is normally handled by setting a count-down timer on each virtual processor.
 When the timer expires, an interrupt is delivered, and the interrupt handler resets the count-down timer, and if the virtual processor is executing in user code, the signal handler performs a user-level context-switch, or if executing in the language runtime-kernel, the preemption is ignored or rolled forward to the point where the runtime kernel context switches back to user code.
 …
 Because preemption frequency is usually long (1 millisecond) performance cost is negligible.
+Linux switched a decade ago from specific to arbitrary process signal-delivery for applications with multiple kernel threads.
+However, on current Linux systems:
 \begin{cquote}
 A process-directed signal may be delivered to any one of the threads that does not currently have the signal blocked.
 …
 SIGNAL(7) - Linux Programmer's Manual
 \end{cquote}
 Hence, the timer-expiry signal, which is generated \emph{externally} by the Linux kernel to an application, is delivered to any of its Linux subprocesses (kernel threads).
 To ensure each virtual processor receives a preemption signal, a discrete-event simulation is run on a special virtual processor, and only it sets and receives timer events.
+Hence, the timer-expiry signal, which is generated \emph{externally} by the Linux kernel to the Linux process, is delivered to any of its Linux subprocesses (kernel threads).
+To ensure each virtual processor receives its own preemption signals, a discrete-event simulation is run on a special virtual processor, and only it sets and receives timer events.
 Virtual processors register an expiration time with the discrete-event simulator, which is inserted in sorted order.
 The simulation sets the count-down timer to the value at the head of the event list, and when the timer expires, all events less than or equal to the current time are processed.
 …
 To verify the implementation of the \CFA runtime, a series of microbenchmarks are performed comparing \CFA with Java OpenJDK-9, Go 1.9.2 and \uC 7.0.0.
 The benchmark computer is an AMD Opteron\texttrademark\ 6380 NUMA 64-core, 8 socket, 2.5 GHz processor, running Ubuntu 16.04.6 LTS, and \uC/\CFA are compiled with gcc 6.5.
+The benchmark computer is an AMD Opteron\texttrademark\ 6380 NUMA 64-core, 8 socket, 2.5 GHz processor, running Ubuntu 16.04.3 LTS and \uC and \CFA are compiled with gcc 6.3.
 \begin{comment}
 …
 \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
 \begin{cfa}
 @thread@ MyThread {};
 void @main@( MyThread & ) {}
+thread MyThread {};
+void main( MyThread & ) {}
 int main() {
         BENCH( for ( N ) { @MyThread m;@ } )
 …
 \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
 \begin{cfa}[aboveskip=0pt,belowskip=0pt]
 @coroutine@ C {} c;
+coroutine C {} c;
 void main( C & ) { for ( ;; ) { @suspend;@ } }
 int main() { // coroutine test
 …
 \begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
+C function              & 2                     & 2             & 0             \\
+\CFA generator  & 2                     & 2             & 0             \\
+Kernel Thread   & 333.5 & 332.96        & 4.1   \\
 \CFA Coroutine  & 49    & 48.68         & 0.47  \\
 \CFA Thread             & 105   & 105.57        & 1.37  \\
 …
 \uC Thread              & 100   & 99.29         & 0.96  \\
 Goroutine               & 145   & 147.25        & 4.15  \\
+Java Thread             & 373.5 & 375.14        & 8.72  \\
+Pthreads Thread & 333.5 & 332.96        & 4.1
+Java Thread             & 373.5 & 375.14        & 8.72
 \end{tabular}
 \end{multicols}
 …
 \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}}
 \begin{cfa}
 @monitor@ M {} m1/*, m2, m3, m4*/;
+monitor M {} m1/*, m2, m3, m4*/;
 void __attribute__((noinline))
 do_call( M & @mutex m/*, m2, m3, m4*/@ ) {}
+do_call( M & mutex m/*, m2, m3, m4*/ ) {}
 int main() {
         BENCH(
                 for( N ) do_call( m1/*, m2, m3, m4*/ );
+                for( N ) @do_call( m1/*, m2, m3, m4*/ );@
+        )
         sout | result`ns;
 …
 \begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}}
 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
+test and test-and-test lock             & 26            & 26    & 0             \\
+C function                                              & 2                     & 2             & 0             \\
+FetchAdd + FetchSub                             & 26            & 26    & 0             \\
 Pthreads Mutex Lock                             & 31            & 31.71 & 0.97  \\
 \uC @monitor@ member rtn.               & 31            & 31    & 0             \\
 …
 \CFA @mutex@ function, 2 arg.   & 84            & 85.36 & 1.99  \\
 \CFA @mutex@ function, 4 arg.   & 158           & 161   & 4.22  \\
 Java synchronized method                & 27.5          & 29.79 & 2.93
+Java synchronized function              & 27.5          & 29.79 & 2.93
 \end{tabular}
 \end{multicols}
 …
 \begin{cfa}
 volatile int go = 0;
 @monitor@ M { @condition c;@ } m;
+monitor M { condition c; } m;
 void __attribute__((noinline))
 do_call( M & @mutex@ a1 ) { @signal( c );@ }
+do_call( M & mutex a1 ) { @signal( c );@ }
 thread T {};
 void main( T & this ) {
 …
 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\
 Pthreads Cond. Variable         & 6005          & 5681.43       & 835.45        \\
 \uC @signal@                            & 324           & 325.54        & 3.02          \\
+\uC @signal@                            & 324           & 325.54        & 3,02          \\
 \CFA @signal@, 1 @monitor@      & 368.5         & 370.61        & 4.77          \\
 \CFA @signal@, 2 @monitor@      & 467           & 470.5         & 6.79          \\
 …
 \begin{cfa}
 volatile int go = 0;
 @monitor@ M {} m;
+monitor M {} m;
 thread T {};
 void __attribute__((noinline))
 do_call( M & @mutex@ ) {}
+do_call( M & mutex ) {}
 void main( T & ) {
         while ( go == 0 ) { yield(); }
         while ( go == 1 ) { do_call( m ); }
+        while ( go == 1 ) { @do_call( m );@ }
+}
 int __attribute__((noinline))
 do_wait( M & @mutex@ m ) {
+do_wait( M & mutex m ) {
         go = 1; // continue other thread
         BENCH( for ( N ) { @waitfor( do_call, m );@ } )
