- File:
-
- 1 edited
-
doc/papers/concurrency/Paper.tex (modified) (40 diffs)
Legend:
- Unmodified
- Added
- Removed
-
doc/papers/concurrency/Paper.tex
r8f079f0 r4487667 157 157 __float80, float80, __float128, float128, forall, ftype, generator, _Generic, _Imaginary, __imag, __imag__, 158 158 inline, __inline, __inline__, __int128, int128, __label__, monitor, mutex, _Noreturn, one_t, or, 159 otype, restrict, __restrict, __restrict__, __signed, __signed__, _Static_assert, thread,159 otype, restrict, __restrict, __restrict__, __signed, __signed__, _Static_assert, suspend, thread, 160 160 _Thread_local, throw, throwResume, timeout, trait, try, ttype, typeof, __typeof, __typeof__, 161 161 virtual, __volatile, __volatile__, waitfor, when, with, zero_t}, … … 303 303 However, \Celeven concurrency is largely wrappers for a subset of the pthreads library~\cite{Butenhof97,Pthreads}, and \Celeven and pthreads concurrency is simple, based on thread fork/join in a function and a few locks, which is low-level and error prone; 304 304 no high-level language concurrency features are defined. 305 Interestingly, almost a decade after publication of the \Celeven standard, neither gcc-8, clang- 9nor msvc-19 (most recent versions) support the \Celeven include @threads.h@, indicating little interest in the C11 concurrency approach.305 Interestingly, almost a decade after publication of the \Celeven standard, neither gcc-8, clang-8 nor msvc-19 (most recent versions) support the \Celeven include @threads.h@, indicating little interest in the C11 concurrency approach. 306 306 Finally, while the \Celeven standard does not state a threading model, the historical association with pthreads suggests implementations would adopt kernel-level threading (1:1)~\cite{ThreadModel}. 307 307 … … 313 313 From 2000 onwards, languages like Go~\cite{Go}, Erlang~\cite{Erlang}, Haskell~\cite{Haskell}, D~\cite{D}, and \uC~\cite{uC++,uC++book} have championed the M:N user-threading model, and many user-threading libraries have appeared~\cite{Qthreads,MPC,Marcel}, including putting green threads back into Java~\cite{Quasar}. 314 314 The main argument for user-level threading is that they are lighter weight than kernel threads (locking and context switching do not cross the kernel boundary), so there is less restriction on programming styles that encourage large numbers of threads performing medium work-units to facilitate load balancing by the runtime~\cite{Verch12}. 315 As well, user-threading facilitates a simpler concurrency approach using thread objects that leverage sequential patterns versus events with call-backs~\cite{ Adya02,vonBehren03}.315 As well, user-threading facilitates a simpler concurrency approach using thread objects that leverage sequential patterns versus events with call-backs~\cite{vonBehren03}. 316 316 Finally, performant user-threading implementations (both time and space) meet or exceed direct kernel-threading implementations, while achieving the programming advantages of high concurrency levels and safety. 317 317 … … 613 613 Finally, an explicit generator type provides both design and performance benefits, such as multiple type-safe interface functions taking and returning arbitrary types. 614 614 \begin{cfa} 615 int ?()( Fib & fib ) { return `resume( fib )`.fn; } $\C[3.9in]{// function-call interface}$616 int ?()( Fib & fib, int N ) { for ( N - 1 ) `fib()`; return `fib()`; } $\C{// use function-call interface to skip N values}$617 double ?()( Fib & fib ) { return (int)`fib()` / 3.14159; } $\C{// different return type, cast prevents recursive call}\CRT$618 sout | (int)f1() | (double)f1() | f2( 2 ); // alternative interface, cast selects call based on return type, step 2 values615 int ?()( Fib & fib ) with( fib ) { return `resume( fib )`.fn; } // function-call interface 616 int ?()( Fib & fib, int N ) with( fib ) { for ( N - 1 ) `fib()`; return `fib()`; } // use simple interface 617 double ?()( Fib & fib ) with( fib ) { return (int)`fib()` / 3.14159; } // cast prevents recursive call 618 sout | (int)f1() | (double)f1() | f2( 2 ); // simple interface, cast selects call based on return type, step 2 values 619 619 \end{cfa} 620 620 Now, the generator can be a separately-compiled opaque-type only accessed through its interface functions. … … 628 628 With respect to safety, we believe static analysis can discriminate local state from temporary variables in a generator, \ie variable usage spanning @suspend@, and generate a compile-time error. 629 629 Finally, our current experience is that most generator problems have simple data state, including local state, but complex execution state, so the burden of creating the generator type is small. 630 As well, C programmers are not afraid ofthis kind of semantic programming requirement, if it results in very small, fast generators.630 As well, C programmers are not afraid with this kind of semantic programming requirement, if it results in very small, fast generators. 631 631 632 632 Figure~\ref{f:CFAFormatGen} shows an asymmetric \newterm{input generator}, @Fmt@, for restructuring text into groups of characters of fixed-size blocks, \ie the input on the left is reformatted into the output on the right, where newlines are ignored. … … 796 796 This semantics is basically a tail-call optimization, which compilers already perform. 797 797 The example shows the assembly code to undo the generator's entry code before the direct jump. 798 This assembly code depends on what entry code is generated, specifically if there are local variables ,and the level of optimization.798 This assembly code depends on what entry code is generated, specifically if there are local variables and the level of optimization. 799 799 To provide this new calling convention requires a mechanism built into the compiler, which is beyond the scope of \CFA at this time. 800 800 Nevertheless, it is possible to hand generate any symmetric generators for proof of concept and performance testing. 801 A compiler could also eliminate other artifacts in the generator simulation to further increase performance , \eg LLVM has various coroutine support~\cite{CoroutineTS}, and \CFA can leverage this support should it fork @clang@.801 A compiler could also eliminate other artifacts in the generator simulation to further increase performance. 802 802 803 803 \begin{figure} … … 1213 1213 Hence, a compromise solution is necessary that works for asymmetric (acyclic) and symmetric (cyclic) coroutines. 1214 1214 1215 Our solution is to context switch back to the first resumer (starter) once the coroutine ends. 1216 This semantics works well for the most common asymmetric and symmetric coroutine usage-patterns. 1215 Our solution for coroutine termination works well for the most common asymmetric and symmetric coroutine usage-patterns. 1217 1216 For asymmetric coroutines, it is common for the first resumer (starter) coroutine to be the only resumer. 1218 1217 All previous generators converted to coroutines have this property. … … 1246 1245 1247 1246 1248 \subsection{ Generator /Coroutine Implementation}1247 \subsection{(Generator) Coroutine Implementation} 1249 1248 1250 1249 A significant implementation challenge for generators/coroutines (and threads in Section~\ref{s:threads}) is adding extra fields to the custom types and related functions, \eg inserting code after/before the coroutine constructor/destructor and @main@ to create/initialize/de-initialize/destroy any extra fields, \eg stack. … … 1255 1254 class myCoroutine inherits baseCoroutine { ... } 1256 1255 \end{cfa} 1257 %The problem is that the programming language and its tool chain, \eg debugger, @valgrind@, need to understand @baseCoroutine@ because it infers special property, so type @baseCoroutine@ becomes a de facto keyword and all types inheriting from it are implicitly custom types.1258 The problem is thatsome special properties are not handled by existing language semantics, \eg the execution of constructors/destructors is in the wrong order to implicitly start threads because the thread must start \emph{after} all constructors as it relies on a completely initialized object, but the inherited constructor runs \emph{before} the derived.1256 The problem is that the programming language and its tool chain, \eg debugger, @valgrind@, need to understand @baseCoroutine@ because it infers special property, so type @baseCoroutine@ becomes a de facto keyword and all types inheriting from it are implicitly custom types. 1257 As well, some special properties are not handled by existing language semantics, \eg the execution of constructors/destructors is in the wrong order to implicitly start threads because the thread must start \emph{after} all constructors as it relies on a completely initialized object, but the inherited constructor runs \emph{before} the derived. 1259 1258 Alternatives, such as explicitly starting threads as in Java, are repetitive and forgetting to call start is a common source of errors. 1260 1259 An alternative is composition: … … 1268 1267 However, there is nothing preventing wrong placement or multiple declarations. 1269 1268 1270 \CFA custom types make any special properties explicit to the language and its tool chain, \eg the language code-generator knows where to inject code 1271 % and when it is unsafe to perform certain optimizations, 1272 and IDEs using simple parsing can find and manipulate types with special properties. 1269 \CFA custom types make any special properties explicit to the language and its tool chain, \eg the language code-generator knows where to inject code and when it is unsafe to perform certain optimizations, and IDEs using simple parsing can find and manipulate types with special properties. 1273 1270 The downside of this approach is that it makes custom types a special case in the language. 1274 1271 Users wanting to extend custom types or build their own can only do so in ways offered by the language. … … 1285 1282 \end{cfa} 1286 1283 Note, copying generators/coroutines/threads is not meaningful. 1287 For example, both the resumer and suspender descriptors can have bi-directional pointers;1288 copying these coroutines does not update the internal pointers so behaviour of both copies would be difficult to understand.1284 For example, a coroutine retains its last resumer and suspends back to it; 1285 having a copy also suspend back to the same resumer is undefined semantics. 1289 1286 Furthermore, two coroutines cannot logically execute on the same stack. 1290 1287 A deep coroutine copy, which copies the stack, is also meaningless in an unmanaged language (no garbage collection), like C, because the stack may contain pointers to object within it that require updating for the copy. 1291 1288 The \CFA @dtype@ property provides no \emph{implicit} copying operations and the @is_coroutine@ trait provides no \emph{explicit} copying operations, so all coroutines must be passed by reference (pointer). 1292 The function definitions ensures there is a statically-typed @main@ function that is the starting point (first stack frame) of a coroutine, and a mechanism to get (read) the c oroutine descriptor from itshandle.1289 The function definitions ensures there is a statically-typed @main@ function that is the starting point (first stack frame) of a coroutine, and a mechanism to get (read) the currently executing coroutine handle. 1293 1290 The @main@ function has no return value or additional parameters because the coroutine type allows an arbitrary number of interface functions with corresponding arbitrary typed input/output values versus fixed ones. 1294 1291 The advantage of this approach is that users can easily create different types of coroutines, \eg changing the memory layout of a coroutine is trivial when implementing the @get_coroutine@ function, and possibly redefining \textsf{suspend} and @resume@. … … 1496 1493 \end{tabular} 1497 1494 \end{cquote} 1498 Like coroutines, the @dtype@ property prevents \emph{implicit} copy operations and the @is_ thread@ trait provides no \emph{explicit} copy operations, so threads must be passed by reference (pointer).1499 Similarly, the function definitions ensures there is a statically-typed @main@ function that is the thread starting point (first stack frame), a mechanism to get (read) the thread descriptor from itshandle, and a special destructor to prevent deallocation while the thread is executing.1495 Like coroutines, the @dtype@ property prevents \emph{implicit} copy operations and the @is_coroutine@ trait provides no \emph{explicit} copy operations, so threads must be passed by reference (pointer). 1496 Similarly, the function definitions ensures there is a statically-typed @main@ function that is the thread starting point (first stack frame), a mechanism to get (read) the currently executing thread handle, and a special destructor to prevent deallocation while the thread is executing. 1500 1497 (The qualifier @mutex@ for the destructor parameter is discussed in Section~\ref{s:Monitor}.) 1501 1498 The difference between the coroutine and thread is that a coroutine borrows a thread from its caller, so the first thread resuming a coroutine creates the coroutine's stack and starts running the coroutine main on the stack; … … 1620 1617 % Copying a lock is insecure because it is possible to copy an open lock and then use the open copy when the original lock is closed to simultaneously access the shared data. 1621 1618 % Copying a monitor is secure because both the lock and shared data are copies, but copying the shared data is meaningless because it no longer represents a unique entity. 1622 Similarly, the function definitions ensures there is a mechanism to get (read) the monitor descriptor from itshandle, and a special destructor to prevent deallocation if a thread using the shared data.1619 Similarly, the function definitions ensures there is a mechanism to get (read) the currently executing monitor handle, and a special destructor to prevent deallocation if a thread using the shared data. 1623 1620 The custom monitor type also inserts any locks needed to implement the mutual exclusion semantics. 1624 1621 … … 1659 1656 called \newterm{bulk acquire}. 1660 1657 \CFA guarantees acquisition order is consistent across calls to @mutex@ functions using the same monitors as arguments, so acquiring multiple monitors is safe from deadlock. 1661 Figure~\ref{f:BankTransfer} shows a trivial solution to the bank transfer problem ~\cite{BankTransfer}, where two resources must be locked simultaneously, using \CFA monitors with implicit locking and \CC with explicit locking.1658 Figure~\ref{f:BankTransfer} shows a trivial solution to the bank transfer problem, where two resources must be locked simultaneously, using \CFA monitors with implicit locking and \CC with explicit locking. 1662 1659 A \CFA programmer only has to manage when to acquire mutual exclusion; 1663 1660 a \CC programmer must select the correct lock and acquisition mechanism from a panoply of locking options. … … 1803 1800 Figure~\ref{f:MonitorScheduling} shows general internal/external scheduling (for the bounded-buffer example in Figure~\ref{f:InternalExternalScheduling}). 1804 1801 External calling threads block on the calling queue, if the monitor is occupied, otherwise they enter in FIFO order. 1805 Internal threads block on condition queues via @wait@ and reenter from the condition in FIFO order. 1806 Alternatively, internal threads block on urgent from the @signal_block@ or @waitfor@, and reenter implicitly when the monitor becomes empty, \ie, the thread in the monitor exits or waits. 1802 Internal threads block on condition queues via @wait@ and they reenter from the condition in FIFO order, or they block on urgent via @signal_block@ or @waitfor@ and reenter implicit when the monitor becomes empty, \ie, the thread in the monitor exits or waits. 1807 1803 1808 1804 There are three signalling mechanisms to unblock waiting threads to enter the monitor. 1809 Note, signalling cannot have the signaller and signalled thread in the monitor simultaneously because of the mutual exclusion , so either the signaller or signallee can proceed.1805 Note, signalling cannot have the signaller and signalled thread in the monitor simultaneously because of the mutual exclusion so only one can proceed. 1810 1806 For internal scheduling, threads are unblocked from condition queues using @signal@, where the signallee is moved to urgent and the signaller continues (solid line). 1811 1807 Multiple signals move multiple signallees to urgent, until the condition is empty. … … 1847 1843 It is common to declare condition variables as monitor fields to prevent shared access, hence no locking is required for access as the conditions are protected by the monitor lock. 1848 1844 In \CFA, a condition variable can be created/stored independently. 1849 % To still prevent expensive locking on access, a condition variable is tied to a \emph{group} of monitors on first use, called \newterm{branding}, resulting in a low-cost boolean test to detect sharing from other monitors.1845 To still prevent expensive locking on access, a condition variable is tied to a \emph{group} of monitors on first use, called \newterm{branding}, resulting in a low-cost boolen test to detect sharing from other monitors. 1850 1846 1851 1847 % Signalling semantics cannot have the signaller and signalled thread in the monitor simultaneously, which means: … … 1856 1852 % The signalling thread continues and the signalled thread is marked for urgent unblocking at the next scheduling point (exit/wait). 1857 1853 % \item 1858 % The signalling thread blocks but is marked for urg ent unblocking at the next scheduling point and the signalled thread continues.1854 % The signalling thread blocks but is marked for urgrent unblocking at the next scheduling point and the signalled thread continues. 1859 1855 % \end{enumerate} 1860 1856 % The first approach is too restrictive, as it precludes solving a reasonable class of problems, \eg dating service (see Figure~\ref{f:DatingService}). … … 1965 1961 External scheduling is controlled by the @waitfor@ statement, which atomically blocks the calling thread, releases the monitor lock, and restricts the function calls that can next acquire mutual exclusion. 1966 1962 If the buffer is full, only calls to @remove@ can acquire the buffer, and if the buffer is empty, only calls to @insert@ can acquire the buffer. 1967 Threads calling excluded functionsblock outside of (external to) the monitor on the calling queue, versus blocking on condition queues inside of (internal to) the monitor.1963 Calls threads to functions that are currently excluded block outside of (external to) the monitor on the calling queue, versus blocking on condition queues inside of (internal to) the monitor. 1968 1964 Figure~\ref{f:RWExt} shows a readers/writer lock written using external scheduling, where a waiting reader detects a writer using the resource and restricts further calls until the writer exits by calling @EndWrite@. 1969 1965 The writer does a similar action for each reader or writer using the resource. … … 2080 2076 For @wait( e )@, the default semantics is to atomically block the signaller and release all acquired mutex parameters, \ie @wait( e, m1, m2 )@. 2081 2077 To override the implicit multi-monitor wait, specific mutex parameter(s) can be specified, \eg @wait( e, m1 )@. 2082 Wait cannot statically verifies the released monitors are the acquired mutex-parameters without disallowing separately compiled helper functions calling @wait@.2078 Wait statically verifies the released monitors are the acquired mutex-parameters so unconditional release is safe. 2083 2079 While \CC supports bulk locking, @wait@ only accepts a single lock for a condition variable, so bulk locking with condition variables is asymmetric. 2084 2080 Finally, a signaller, … … 2092 2088 Similarly, for @waitfor( rtn )@, the default semantics is to atomically block the acceptor and release all acquired mutex parameters, \ie @waitfor( rtn, m1, m2 )@. 2093 2089 To override the implicit multi-monitor wait, specific mutex parameter(s) can be specified, \eg @waitfor( rtn, m1 )@. 2094 @waitfor@ does statically verify the monitor types passed are the same as the acquired mutex-parameters of the given function or function pointer, hence the function (pointer) prototype must be accessible. 2090 @waitfor@ statically verifies the released monitors are the same as the acquired mutex-parameters of the given function or function pointer. 2091 To statically verify the released monitors match with the accepted function's mutex parameters, the function (pointer) prototype must be accessible. 2095 2092 % When an overloaded function appears in an @waitfor@ statement, calls to any function with that name are accepted. 2096 2093 % The rationale is that members with the same name should perform a similar function, and therefore, all should be eligible to accept a call. … … 2151 2148 The right example accepts either @mem1@ or @mem2@ if @C1@ and @C2@ are true. 2152 2149 2153 An interesting use of @waitfor@ is accepting the @mutex@ destructor to know when an object is deallocated, \eg assume the bounded buffer is restructred from a monitor to a thread with the following @main@. 2154 \begin{cfa} 2155 void main( Buffer(T) & buffer ) with(buffer) { 2156 for () { 2157 `waitfor( ^?{}, buffer )` break; 2158 or when ( count != 20 ) waitfor( insert, buffer ) { ... } 2159 or when ( count != 0 ) waitfor( remove, buffer ) { ... } 2160 } 2161 // clean up 2162 } 2163 \end{cfa} 2164 When the program main deallocates the buffer, it first calls the buffer's destructor, which is accepted, the destructor runs, and the buffer is deallocated. 2165 However, the buffer thread cannot continue after the destructor call because the object is gone; 2166 hence, clean up in @main@ cannot occur, which means destructors for local objects are not run. 2167 To make this useful capability work, the semantics for accepting the destructor is the same as @signal@, \ie the destructor call is placed on urgent and the acceptor continues execution, which ends the loop, cleans up, and the thread terminates. 2168 Then, the destructor caller unblocks from urgent to deallocate the object. 2169 Accepting the destructor is the idiomatic way in \CFA to terminate a thread performing direct communication. 2150 An interesting use of @waitfor@ is accepting the @mutex@ destructor to know when an object is deallocated. 2151 \begin{cfa} 2152 void insert( Buffer(T) & mutex buffer, T elem ) with( buffer ) { 2153 if ( count == 10 ) 2154 waitfor( remove, buffer ) { 2155 // insert elem into buffer 2156 } or `waitfor( ^?{}, buffer )` throw insertFail; 2157 } 2158 \end{cfa} 2159 When the buffer is deallocated, the current waiter is unblocked and informed, so it can perform an appropriate action. 2160 However, the basic @waitfor@ semantics do not support this functionality, since using an object after its destructor is called is undefined. 2161 Therefore, to make this useful capability work, the semantics for accepting the destructor is the same as @signal@, \ie the call to the destructor is placed on the urgent queue and the acceptor continues execution, which throws an exception to the acceptor and then the caller is unblocked from the urgent queue to deallocate the object. 2162 Accepting the destructor is the idiomatic way to terminate a thread in \CFA. 2170 2163 2171 2164 … … 2364 2357 2365 2358 struct Msg { int i, j; }; 2366 thread Go Rtn { int i; float f; Msg m; };2367 void mem1( Go Rtn & mutex gortn, int i ) { gortn.i = i; }2368 void mem2( Go Rtn & mutex gortn, float f ) { gortn.f = f; }2369 void mem3( Go Rtn & mutex gortn, Msg m ) { gortn.m = m; }2370 void ^?{}( Go Rtn & mutex ) {}2371 2372 void main( Go Rtn & gortn ) with( gortn ) { // thread starts2359 thread Gortn { int i; float f; Msg m; }; 2360 void mem1( Gortn & mutex gortn, int i ) { gortn.i = i; } 2361 void mem2( Gortn & mutex gortn, float f ) { gortn.f = f; } 2362 void mem3( Gortn & mutex gortn, Msg m ) { gortn.m = m; } 2363 void ^?{}( Gortn & mutex ) {} 2364 2365 void main( Gortn & gortn ) with( gortn ) { // thread starts 2373 2366 2374 2367 for () { … … 2383 2376 } 2384 2377 int main() { 2385 Go Rtn gortn; $\C[2.0in]{// start thread}$2378 Gortn gortn; $\C[2.0in]{// start thread}$ 2386 2379 `mem1( gortn, 0 );` $\C{// different calls}\CRT$ 2387 2380 `mem2( gortn, 2.5 );` … … 2541 2534 % However, preemption is necessary for fairness and to reduce tail-latency. 2542 2535 % For concurrency that relies on spinning, if all cores spin the system is livelocked, whereas preemption breaks the livelock. 2543 2544 2545 \begin{comment} 2546 \subsection{Thread Pools} 2547 2548 In contrast to direct threading is indirect \newterm{thread pools}, \eg Java @executor@, where small jobs (work units) are inserted into a work pool for execution. 2549 If the jobs are dependent, \ie interact, there is an implicit/explicit dependency graph that ties them together. 2550 While removing direct concurrency, and hence the amount of context switching, thread pools significantly limit the interaction that can occur among jobs. 2551 Indeed, jobs should not block because that also blocks the underlying thread, which effectively means the CPU utilization, and therefore throughput, suffers. 2552 While it is possible to tune the thread pool with sufficient threads, it becomes difficult to obtain high throughput and good core utilization as job interaction increases. 2553 As well, concurrency errors return, which threads pools are suppose to mitigate. 2554 2555 \begin{figure} 2556 \centering 2557 \begin{tabular}{@{}l|l@{}} 2558 \begin{cfa} 2559 struct Adder { 2560 int * row, cols; 2561 }; 2562 int operator()() { 2563 subtotal = 0; 2564 for ( int c = 0; c < cols; c += 1 ) 2565 subtotal += row[c]; 2566 return subtotal; 2567 } 2568 void ?{}( Adder * adder, int row[$\,$], int cols, int & subtotal ) { 2569 adder.[rows, cols, subtotal] = [rows, cols, subtotal]; 2570 } 2571 2572 2573 2574 2575 \end{cfa} 2576 & 2577 \begin{cfa} 2578 int main() { 2579 const int rows = 10, cols = 10; 2580 int matrix[rows][cols], subtotals[rows], total = 0; 2581 // read matrix 2582 Executor executor( 4 ); // kernel threads 2583 Adder * adders[rows]; 2584 for ( r; rows ) { // send off work for executor 2585 adders[r] = new( matrix[r], cols, &subtotal[r] ); 2586 executor.send( *adders[r] ); 2587 } 2588 for ( r; rows ) { // wait for results 2589 delete( adders[r] ); 2590 total += subtotals[r]; 2591 } 2592 sout | total; 2593 } 2594 \end{cfa} 2595 \end{tabular} 2596 \caption{Executor} 2597 \end{figure} 2598 \end{comment} 2536 % 2537 % 2538 % \subsection{Thread Pools} 2539 % 2540 % In contrast to direct threading is indirect \newterm{thread pools}, \eg Java @executor@, where small jobs (work units) are inserted into a work pool for execution. 2541 % If the jobs are dependent, \ie interact, there is an implicit/explicit dependency graph that ties them together. 2542 % While removing direct concurrency, and hence the amount of context switching, thread pools significantly limit the interaction that can occur among jobs. 2543 % Indeed, jobs should not block because that also blocks the underlying thread, which effectively means the CPU utilization, and therefore throughput, suffers. 2544 % While it is possible to tune the thread pool with sufficient threads, it becomes difficult to obtain high throughput and good core utilization as job interaction increases. 2545 % As well, concurrency errors return, which threads pools are suppose to mitigate. 2599 2546 2600 2547 … … 2620 2567 The purpose of a cluster is to control the amount of parallelism that is possible among threads, plus scheduling and other execution defaults. 2621 2568 The default cluster-scheduler is single-queue multi-server, which provides automatic load-balancing of threads on processors. 2622 However, the design allows changing the scheduler, \eg multi-queue multi-serverwith work-stealing/sharing across the virtual processors.2569 However, the scheduler is pluggable, supporting alternative schedulers, such as multi-queue multi-server, with work-stealing/sharing across the virtual processors. 2623 2570 If several clusters exist, both threads and virtual processors, can be explicitly migrated from one cluster to another. 2624 2571 No automatic load balancing among clusters is performed by \CFA. … … 2627 2574 The user cluster is created to contain the application user-threads. 2628 2575 Having all threads execute on the one cluster often maximizes utilization of processors, which minimizes runtime. 2629 However, because of limitations of scheduling requirements (real-time), NUMA architecture, heterogeneous hardware, or issues with the underlying operating system, multiple clusters are sometimes necessary.2576 However, because of limitations of the underlying operating system, heterogeneous hardware, or scheduling requirements (real-time), multiple clusters are sometimes necessary. 2630 2577 2631 2578 … … 2671 2618 \subsection{Preemption} 2672 2619 2673 Nondeterministic preemption provides fairness from long running threads, and forces concurrent programmers to write more robust programs, rather than relying on code between cooperative scheduling to be atomic. 2674 This atomic reliance can fail on multi-core machines, because execution across cores is nondeterministic. 2675 A different reason for not supporting preemption is that it significantly complicates the runtime system, \eg Microsoft runtime does not support interrupts and on Linux systems, interrupts are complex (see below). 2620 Nondeterministic preemption provides fairness from long running threads, and forces concurrent programmers to write more robust programs, rather than relying on section of code between cooperative scheduling to be atomic. 2621 A separate reason for not supporting preemption is that it significantly complicates the runtime system. 2676 2622 Preemption is normally handled by setting a count-down timer on each virtual processor. 2677 2623 When the timer expires, an interrupt is delivered, and the interrupt handler resets the count-down timer, and if the virtual processor is executing in user code, the signal handler performs a user-level context-switch, or if executing in the language runtime-kernel, the preemption is ignored or rolled forward to the point where the runtime kernel context switches back to user code. … … 2682 2628 Because preemption frequency is usually long (1 millisecond) performance cost is negligible. 2683 2629 2684 Linux switched a decade ago from specific to arbitrary process signal-delivery for applications with multiple kernel threads. 2630 However, on current Linux systems: 2685 2631 \begin{cquote} 2686 2632 A process-directed signal may be delivered to any one of the threads that does not currently have the signal blocked. … … 2688 2634 SIGNAL(7) - Linux Programmer's Manual 2689 2635 \end{cquote} 2690 Hence, the timer-expiry signal, which is generated \emph{externally} by the Linux kernel to an application, is delivered to any of its Linux subprocesses (kernel threads).2691 To ensure each virtual processor receives a preemption signal, a discrete-event simulation is run on a special virtual processor, and only it sets and receives timer events.2636 Hence, the timer-expiry signal, which is generated \emph{externally} by the Linux kernel to the Linux process, is delivered to any of its Linux subprocesses (kernel threads). 2637 To ensure each virtual processor receives its own preemption signals, a discrete-event simulation is run on a special virtual processor, and only it sets and receives timer events. 2692 2638 Virtual processors register an expiration time with the discrete-event simulator, which is inserted in sorted order. 2693 2639 The simulation sets the count-down timer to the value at the head of the event list, and when the timer expires, all events less than or equal to the current time are processed. … … 2706 2652 2707 2653 To verify the implementation of the \CFA runtime, a series of microbenchmarks are performed comparing \CFA with Java OpenJDK-9, Go 1.9.2 and \uC 7.0.0. 2708 The benchmark computer is an AMD Opteron\texttrademark\ 6380 NUMA 64-core, 8 socket, 2.5 GHz processor, running Ubuntu 16.04. 6 LTS, and \uC/\CFA are compiled with gcc 6.5.2654 The benchmark computer is an AMD Opteron\texttrademark\ 6380 NUMA 64-core, 8 socket, 2.5 GHz processor, running Ubuntu 16.04.3 LTS and \uC and \CFA are compiled with gcc 6.3. 2709 2655 2710 2656 \begin{comment} … … 2761 2707 \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}} 2762 2708 \begin{cfa} 2763 @thread@MyThread {};2764 void @main@( MyThread & ) {}2709 thread MyThread {}; 2710 void main( MyThread & ) {} 2765 2711 int main() { 2766 2712 BENCH( for ( N ) { @MyThread m;@ } ) … … 2804 2750 \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}} 2805 2751 \begin{cfa}[aboveskip=0pt,belowskip=0pt] 2806 @coroutine@C {} c;2752 coroutine C {} c; 2807 2753 void main( C & ) { for ( ;; ) { @suspend;@ } } 2808 2754 int main() { // coroutine test … … 2825 2771 \begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}} 2826 2772 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\ 2827 C function & 2 & 2 & 0 \\ 2828 \CFA generator & 2 & 2 & 0 \\ 2773 Kernel Thread & 333.5 & 332.96 & 4.1 \\ 2829 2774 \CFA Coroutine & 49 & 48.68 & 0.47 \\ 2830 2775 \CFA Thread & 105 & 105.57 & 1.37 \\ … … 2832 2777 \uC Thread & 100 & 99.29 & 0.96 \\ 2833 2778 Goroutine & 145 & 147.25 & 4.15 \\ 2834 Java Thread & 373.5 & 375.14 & 8.72 \\ 2835 Pthreads Thread & 333.5 & 332.96 & 4.1 2779 Java Thread & 373.5 & 375.14 & 8.72 2836 2780 \end{tabular} 2837 2781 \end{multicols} … … 2849 2793 \lstset{language=CFA,moredelim=**[is][\color{red}]{@}{@},deletedelim=**[is][]{`}{`}} 2850 2794 \begin{cfa} 2851 @monitor@M {} m1/*, m2, m3, m4*/;2795 monitor M {} m1/*, m2, m3, m4*/; 2852 2796 void __attribute__((noinline)) 2853 do_call( M & @mutex m/*, m2, m3, m4*/@) {}2797 do_call( M & mutex m/*, m2, m3, m4*/ ) {} 2854 2798 int main() { 2855 2799 BENCH( 2856 for( N ) do_call( m1/*, m2, m3, m4*/ );2800 for( N ) @do_call( m1/*, m2, m3, m4*/ );@ 2857 2801 ) 2858 2802 sout | result`ns; … … 2869 2813 \begin{tabular}{@{}r*{3}{D{.}{.}{3.2}}@{}} 2870 2814 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} &\multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\ 2871 test and test-and-test lock & 26 & 26 & 0 \\ 2815 C function & 2 & 2 & 0 \\ 2816 FetchAdd + FetchSub & 26 & 26 & 0 \\ 2872 2817 Pthreads Mutex Lock & 31 & 31.71 & 0.97 \\ 2873 2818 \uC @monitor@ member rtn. & 31 & 31 & 0 \\ … … 2875 2820 \CFA @mutex@ function, 2 arg. & 84 & 85.36 & 1.99 \\ 2876 2821 \CFA @mutex@ function, 4 arg. & 158 & 161 & 4.22 \\ 2877 Java synchronized method& 27.5 & 29.79 & 2.932822 Java synchronized function & 27.5 & 29.79 & 2.93 2878 2823 \end{tabular} 2879 2824 \end{multicols} … … 2891 2836 \begin{cfa} 2892 2837 volatile int go = 0; 2893 @monitor@ M { @condition c;@} m;2838 monitor M { condition c; } m; 2894 2839 void __attribute__((noinline)) 2895 do_call( M & @mutex@a1 ) { @signal( c );@ }2840 do_call( M & mutex a1 ) { @signal( c );@ } 2896 2841 thread T {}; 2897 2842 void main( T & this ) { … … 2924 2869 \multicolumn{1}{@{}c}{} & \multicolumn{1}{c}{Median} & \multicolumn{1}{c}{Average} & \multicolumn{1}{c@{}}{Std Dev} \\ 2925 2870 Pthreads Cond. Variable & 6005 & 5681.43 & 835.45 \\ 2926 \uC @signal@ & 324 & 325.54 & 3 .02 \\2871 \uC @signal@ & 324 & 325.54 & 3,02 \\ 2927 2872 \CFA @signal@, 1 @monitor@ & 368.5 & 370.61 & 4.77 \\ 2928 2873 \CFA @signal@, 2 @monitor@ & 467 & 470.5 & 6.79 \\ … … 2944 2889 \begin{cfa} 2945 2890 volatile int go = 0; 2946 @monitor@M {} m;2891 monitor M {} m; 2947 2892 thread T {}; 2948 2893 void __attribute__((noinline)) 2949 do_call( M & @mutex@) {}2894 do_call( M & mutex ) {} 2950 2895 void main( T & ) { 2951 2896 while ( go == 0 ) { yield(); } 2952 while ( go == 1 ) { do_call( m );}2897 while ( go == 1 ) { @do_call( m );@ } 2953 2898 } 2954 2899 int __attribute__((noinline)) 2955 do_wait( M & @mutex@m ) {2900 do_wait( M & mutex m ) { 2956 2901 go = 1; // continue other thread 2957 2902 BENCH( for ( N ) { @waitfor( do_call, m );@ } )
Note:
See TracChangeset
for help on using the changeset viewer.