Changeset 287da46
- Timestamp: Jun 28, 2018, 1:33:48 PM (7 years ago)
- Branches: ADT, aaron-thesis, arm-eh, ast-experimental, cleanup-dtors, deferred_resn, demangler, enum, forall-pointer-decay, jacob/cs343-translation, jenkins-sandbox, master, new-ast, new-ast-unique-expr, no_list, persistent-indexer, pthread-emulation, qualifiedEnum
- Children: 944ce47
- Parents: 64188cc
- Files: 1 edited
Legend:
- Unmodified: no marker
- Added: lines marked +
- Removed: lines marked -
doc/papers/concurrency/Paper.tex
r64188cc → r287da46

@@ -235 +235 @@
 \CFA is a modern, polymorphic, \emph{non-object-oriented} extension of the C programming language.
 This paper discusses the design of the concurrency and parallelism features in \CFA, and the concurrent runtime-system.
-These features are created from scratch as ISO C lacks concurrency, relying largely on the pthreads library.
+These features are created from scratch as ISO C lacks concurrency, relying largely on the pthreads library for concurrency.
 Coroutines and lightweight (user) threads are introduced into the language.
 In addition, monitors are added as a high-level mechanism for mutual exclusion and synchronization.
 A unique contribution is allowing multiple monitors to be safely acquired simultaneously.
 All features respect the expectations of C programmers, while being fully integrated with the \CFA polymorphic type-system and other language features.
-Finally, experimental results are presented to compare the performance of the new features with similar mechanisms in other concurrent programming-languages.
+Finally, experimental results show comparable performance of the new features with similar mechanisms in other concurrent programming-languages.
 }%

@@ -257 +257 @@
 While the simplest concurrency system is a thread and a lock, this low-level approach is hard to master.
 An easier approach for programmers is to support higher-level constructs as the basis of concurrency.
-Indeed, for highly productive concurrent programming, high-level approaches are much more popular~\cite{Hochstein05}.
-Examples of high-level approaches are jobs (tasks) based~\cite{TBB}, implicit threading~\cite{OpenMP}, monitors~\cite{Java}, channels~\cite{CSP,Go}, and message passing~\cite{Erlang,MPI}.
+Indeed, for highly-productive concurrent-programming, high-level approaches are much more popular~\cite{Hochstein05}.
+Examples of high-level approaches are jobs (thread pool)~\cite{TBB}, implicit threading~\cite{OpenMP}, monitors~\cite{Java}, channels~\cite{CSP,Go}, and message passing~\cite{Erlang,MPI}.

 The following terminology is used.
 A \newterm{thread} is a fundamental unit of execution that runs a sequence of code and requires a stack to maintain state.
-Multiple simultaneous threads give rise to \newterm{concurrency}, which requires locking to ensure safe communication and access to shared data.
-% Correspondingly, concurrency is defined as the concepts and challenges that occur when multiple independent (sharing memory, timing dependencies, \etc) concurrent threads are introduced.
+Multiple simultaneous threads give rise to \newterm{concurrency}, which requires locking to ensure access to shared data and safe communication.
 \newterm{Locking}, and by extension \newterm{locks}, are defined as a mechanism to prevent progress of threads to provide safety.
 \newterm{Parallelism} is running multiple threads simultaneously.
 Parallelism implies \emph{actual} simultaneous execution, where concurrency only requires \emph{apparent} simultaneous execution.
 As such, parallelism only affects performance, which is observed through differences in space and/or time at runtime.
-
 Hence, there are two problems to be solved: concurrency and parallelism.
 While these two concepts are often combined, they are distinct, requiring different tools~\cite[\S~2]{Buhr05a}.
@@ -274 +272 @@

 The proposed concurrency API is implemented in a dialect of C, called \CFA.
-The paper discusses how the language features are added to the \CFA translator with respect to parsing, semantic, and type checking, and the corresponding high-performance runtime-library to implement the concurrency features.
+The paper discusses how the language features are added to the \CFA translator with respect to parsing, semantics, and type checking, and the corresponding high-performance runtime-library to implement the concurrent features.

@@ -286 +284 @@
 Like C, the building blocks of \CFA are structures and routines.
 Virtually all of the code generated by the \CFA translator respects C memory layouts and calling conventions.
-While \CFA is not an object-oriented language, lacking the concept of a receiver (\eg @this@) and nominal inheritance-relationships, C does have a notion of objects: ``region of data storage in the execution environment, the contents of which can represent values''~\cite[3.15]{C11}.
-While some \CFA features are common in object-oriented programming-languages, they are an independent capability allowing \CFA to adopt them while retaining a procedural paradigm.
+While \CFA is not object oriented, lacking the concept of a receiver (\eg @this@) and nominal inheritance-relationships, C does have a notion of objects: ``region of data storage in the execution environment, the contents of which can represent values''~\cite[3.15]{C11}.
+While some \CFA features are common in object-oriented programming, they are independent capabilities allowing \CFA to adopt them while maintaining a procedural paradigm.

@@ -412 +410 @@
 Overloading is important for \CFA concurrency since the runtime system relies on creating different types to represent concurrency objects.
 Therefore, overloading eliminates long prefixes and other naming conventions to prevent name clashes.
-As seen in Section~\ref{basics}, routine @main@ is heavily overloaded.
+As seen in Section~\ref{s:Concurrency}, routine @main@ is heavily overloaded.
 For example, variable overloading is useful in the parallel semantics of the @with@ statement for fields with the same name:
 \begin{cfa}

@@ -481 +479 @@
 {
 	VLA x, y = { 20, 0x01 }, z = y; $\C{// z points to y}$
-	// x{}; y{ 20, 0x01 }; z{ z, y };
+	// $\LstCommentStyle{\color{red}\ \ \ x\{\};\ \ \ \ \ \ \ \ \ y\{ 20, 0x01 \};\ \ \ \ \ \ \ \ \ \ z\{ z, y \};\ \ \ \ \ \ \ implicit calls}$
 	^x{}; $\C{// deallocate x}$
 	x{}; $\C{// reallocate x}$
@@ -488 +486 @@
 	y{ x }; $\C{// reallocate y, points to x}$
 	x{}; $\C{// reallocate x, not pointing to y}$
-	// ^z{}; ^y{}; ^x{};
-}
+} // $\LstCommentStyle{\color{red}\^{}z\{\};\ \ \^{}y\{\};\ \ \^{}x\{\};\ \ \ implicit calls}$
 \end{cfa}
 Like \CC, construction is implicit on allocation (stack/heap) and destruction is implicit on deallocation.

@@ -537 +534 @@

 Assertions can be @otype@ or @dtype@.
-@otype@ refers to a ``complete'' object, \ie an object has a size, default constructor, copy constructor, destructor and an assignment operator.
-@dtype@ only guarantees an object has a size and alignment.
-
-Using the return type for discrimination, it is possible to write a type-safe @alloc@ based on the C @malloc@:
+@otype@ refers to a \emph{complete} object, \ie an object has a size, alignment, default constructor, copy constructor, destructor and an assignment operator.
+@dtype@ refers to an \emph{incomplete} object, \ie, an object only has a size and alignment.
+
+Using the return type for overload discrimination, it is possible to write a type-safe @alloc@ based on the C @malloc@:
 \begin{cfa}
 forall( dtype T | sized(T) ) T * alloc( void ) { return (T *)malloc( sizeof(T) ); }

@@ -550 +547 @@


-\section{Concurrency Basics}\label{basics}
+\section{Concurrency}
+\label{s:Concurrency}

 At its core, concurrency is based on multiple call-stacks and scheduling threads executing on these stacks.
 Multiple call stacks (or contexts) and a single thread of execution, called \newterm{coroutining}~\cite{Conway63,Marlin80}, does \emph{not} imply concurrency~\cite[\S~2]{Buhr05a}.
-In coroutining, the single thread is self-scheduling across the stacks, so execution is deterministic, \ie given fixed inputs, the execution path to the outputs is fixed and predictable.
+In coroutining, the single thread is self-scheduling across the stacks, so execution is deterministic, \ie the execution path from input to output is fixed and predictable.
 A \newterm{stackless} coroutine executes on the caller's stack~\cite{Python} but this approach is restrictive, \eg preventing modularization and supporting only iterator/generator-style programming;
 a \newterm{stackfull} coroutine executes on its own stack, allowing full generality.
-Only stackfull coroutines are a stepping-stone to concurrency.
-
-The transition to concurrency, even for execution with a single thread and multiple stacks, occurs when coroutines also context switch to a scheduling oracle, introducing non-determinism from the coroutine perspective~\cite[\S~3]{Buhr05a}.
+Only stackfull coroutines are a stepping stone to concurrency.
+
+The transition to concurrency, even for execution with a single thread and multiple stacks, occurs when coroutines also context switch to a \newterm{scheduling oracle}, introducing non-determinism from the coroutine perspective~\cite[\S~3]{Buhr05a}.
 Therefore, a minimal concurrency system is possible using coroutines (see Section \ref{coroutine}) in conjunction with a scheduler to decide where to context switch next.
 The resulting execution system now follows a cooperative threading-model, called \newterm{non-preemptive scheduling}.

@@ -591 +589 @@
 \subsection{Coroutines: A Stepping Stone}\label{coroutine}

-While the focus of this discussion is concurrency and parallelism, it is important to address coroutines, which are a significant building block of a concurrency system.
+While the focus of this discussion is concurrency and parallelism, it is important to address coroutines, which are a significant building block of a concurrency system (but not concurrent among themselves).
 Coroutines are generalized routines allowing execution to be temporarily suspended and later resumed.
 Hence, unlike a normal routine, a coroutine may not terminate when it returns to its caller, allowing it to be restarted with the values and execution location present at the point of suspension.

@@ -598 +596 @@
 Therefore, the core \CFA coroutine-API has two fundamental features: independent call-stacks and @suspend@/@resume@ operations.
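[Editorial sketch, not part of the changeset: distilling the two fundamental features into a minimal example; the coroutine type and printed strings are illustrative.]
\begin{cfa}
coroutine C {};                        $\C{// empty coroutine type}$
void main( C & c ) {                   $\C{// coroutine main, executes on c's stack}$
	sout | "first resume" | endl;
	suspend();                         $\C{// context switch back to last resumer}$
	sout | "second resume" | endl;
}                                      $\C{// return deallocates the coroutine stack}$
int main() {
	C c;                               $\C{// an object until its first resume}$
	resume( c );                       $\C{// allocate stack, cocall coroutine main}$
	resume( c );                       $\C{// restart after the suspend}$
}
\end{cfa}
The sketch anticipates the Fibonacci examples that follow, where the same lifecycle appears.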
@@ -599 +597 @@

-For example, a problem made easier with coroutines is unbounded generators, \eg generating an infinite sequence of Fibonacci numbers, where Figure~\ref{f:C-fibonacci} shows conventional approaches for writing a Fibonacci generator in C.
+For example, a problem made easier with coroutines is unbounded generators, \eg generating an infinite sequence of Fibonacci numbers
 \begin{displaymath}
 \mathsf{fib}(n) = \left \{
@@ -608 +606 @@
 \right.
 \end{displaymath}
-Figure~\ref{f:GlobalVariables} illustrates the following problems:
-unique unencapsulated global variables necessary to retain state between calls;
-only one Fibonacci generator;
-execution state must be explicitly retained via explicit state variables.
-Figure~\ref{f:ExternalState} addresses these issues:
-unencapsulated program global variables become encapsulated structure variables;
-unique global variables are replaced by multiple Fibonacci objects;
-explicit execution state is removed by precomputing the first two Fibonacci numbers and returning $\mathsf{fib}(n-2)$.
+where Figure~\ref{f:C-fibonacci} shows conventional approaches for writing a Fibonacci generator in C.
+Figure~\ref{f:GlobalVariables} illustrates the following problems: unique unencapsulated global variables necessary to retain state between calls, only one Fibonacci generator, and execution state must be explicitly retained via explicit state variables.
+Figure~\ref{f:ExternalState} addresses these issues: unencapsulated program global variables become encapsulated structure variables, unique global variables are replaced by multiple Fibonacci objects, and explicit execution state is removed by precomputing the first two Fibonacci numbers and returning $\mathsf{fib}(n-2)$.

 \begin{figure}

@@ -663 +656 @@
 \end{lrbox}

-\subfloat[3 States: global variables]{\usebox\myboxA}
+\subfloat[3 States: global variables]{\label{f:GlobalVariables}\usebox\myboxA}
 \qquad
-\subfloat[1 State: external variables]{\usebox\myboxB}
+\subfloat[1 State: external variables]{\label{f:ExternalState}\usebox\myboxB}
 \caption{C Fibonacci Implementations}
 \label{f:C-fibonacci}

@@ -723 +716 @@
 \subfloat[1 State, internal variables]{\label{f:Coroutine1State}\usebox\myboxB}
 \caption{\CFA Coroutine Fibonacci Implementations}
-\label{f:fibonacci-cfa}
+\label{f:cfa-fibonacci}
 \end{figure}

@@ -732 +725 @@
 The interface routine @next@, takes a Fibonacci instance and context switches to it using @resume@;
 on restart, the Fibonacci field, @fn@, contains the next value in the sequence, which is returned.
-The first @resume@ is special because it cocalls the coroutine at its coroutine main and allocates the stack;
+The first @resume@ is special because it allocates the coroutine stack and cocalls its coroutine main on that stack;
 when the coroutine main returns, its stack is deallocated.
 Hence, @Fib@ is an object at creation, transitions to a coroutine on its first resume, and transitions back to an object when the coroutine main finishes.

@@ -754 +747 @@
 \end{quote}
 The example takes advantage of resuming a coroutine in the constructor to prime the loops so the first character sent for formatting appears inside the nested loops.
-The destruction provides a newline if formatted text ends with a full line.
+The destruction provides a newline, if formatted text ends with a full line.
 Figure~\ref{f:CFmt} shows the C equivalent formatter, where the loops of the coroutine are flattened (linearized) and rechecked on each call because execution location is not retained between calls.
+(Linearized code is the bane of device drivers.)

 \begin{figure}

@@ -840 +834 @@

 The previous examples are \newterm{asymmetric (semi) coroutine}s because one coroutine always calls a resuming routine for another coroutine, and the resumed coroutine always suspends back to its last resumer, similar to call/return for normal routines.
-However, @resume@/@suspend@ context switch to existing stack-frames rather than create new ones so there is no stack growth.
-\newterm{Symmetric (full) coroutine}s have a coroutine call a resuming routine for another coroutine, which eventually forms a resuming-call cycle.
+However, @resume@ and @suspend@ context switch among existing stack-frames, rather than create new ones so there is no stack growth.
+\newterm{Symmetric (full) coroutine}s have a coroutine call to a resuming routine for another coroutine, and its coroutine main has a call to another resuming routine, which eventually forms a resuming-call cycle.
 (The trivial cycle is a coroutine resuming itself.)
 This control flow is similar to recursion for normal routines, but again there is no stack growth from the context switch.

@@ -923 +917 @@
 Figure~\ref{f:ProdCons} shows a producer/consumer symmetric-coroutine performing bi-directional communication.
 Since the solution involves a full-coroutining cycle, the program main creates one coroutine in isolation, passes this coroutine to its partner, and closes the cycle at the call to @start@.
-The @start@ routine communicates both the number of elements to be produced and the consumer into the producer's coroutine structure.
+The @start@ routine communicates both the number of elements to be produced and the consumer into the producer's coroutine-structure.
 Then the @resume@ to @prod@ creates @prod@'s stack with a frame for @prod@'s coroutine main at the top, and context switches to it.
 @prod@'s coroutine main starts, creates local variables that are retained between coroutine activations, and executes $N$ iterations, each generating two random values, calling the consumer to deliver the values, and printing the status returned from the consumer.

@@ -930 +924 @@
 For the first resume, @cons@'s stack is initialized, creating local variables retained between subsequent activations of the coroutine.
 The consumer iterates until the @done@ flag is set, prints the values delivered by the producer, increments status, and calls back to the producer via @payment@, and on return from @payment@, prints the receipt from the producer and increments @money@ (inflation).
-The call from the consumer to the @payment@ introduces the cycle between producer and consumer.
+The call from the consumer to @payment@ introduces the cycle between producer and consumer.
 When @payment@ is called, the consumer copies values into the producer's communication variable and a resume is executed.
 The context switch restarts the producer at the point where it was last context switched, so it continues in @delivery@ after the resume.
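[Editorial sketch, not part of the changeset: stepping back to the output formatter above, its intended use can be sketched as follows; the @Format@/@format@ names are hypothetical stand-ins for the figure's interface.]
\begin{cfa}
{
	Format fmt;                            $\C{// constructor resumes coroutine to prime the nested loops}$
	for ( unsigned int i = 0; i < 41; i += 1 ) {
		char ch = 'a' + i % 26;            $\C{// generate test characters}$
		format( fmt, ch );                 $\C{// deliver one character for grouping and printing}$
	}
}                                          $\C{// destructor flushes a final partial line}$
\end{cfa}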
@@ -955 +949 @@

 Object-oriented inheritance provides extra fields and code in a restricted context, but it requires programmers to explicitly perform the inheritance:
-\begin{cfa}
-struct mycoroutine $\textbf{\textsf{inherits}}$ baseCoroutine { ... }
+\begin{cfa}[morekeywords={class,inherits}]
+class mycoroutine inherits baseCoroutine { ... }
 \end{cfa}
 and the programming language (and possibly its tool set, \eg debugger) may need to understand @baseCoroutine@ because of the stack.

@@ -1090 +1084 @@
 \end{tabular}
 \end{cquote}
-(The qualifier @mutex@ for the destructor parameter is discussed in Section~\ref{s:Monitors}.)
+(The qualifier @mutex@ for the destructor parameter is discussed in Section~\ref{s:Monitor}.)
 Like a coroutine, the statically-typed @main@ routine is the starting point (first stack frame) of a user thread.
 The difference is that a coroutine borrows a thread from its caller, so the first thread resuming a coroutine creates an instance of @main@;

@@ -1171 +1165 @@
 The program uses heap-based threads because each thread needs different constructor values.
 (Python provides a simple iteration mechanism to initialize array elements to different values allowing stack allocation.)
-The allocation/deallocation pattern appears unusual because allocated objects are immediately deleted without any intervening code.
+The allocation/deallocation pattern appears unusual because allocated objects are immediately deallocated without any intervening code.
 However, for threads, the deletion provides implicit synchronization, which is the intervening code.
 While the subtotals are added in linear order rather than completion order, which slightly inhibits concurrency, the computation is restricted by the critical-path thread (\ie the thread that takes the longest), and so any inhibited concurrency is very small as totalling the subtotals is trivial.

@@ -1213 +1207 @@
 Since many deterministic challenges appear with the use of mutable shared state, some languages/libraries disallow it, \eg Erlang~\cite{Erlang}, Haskell~\cite{Haskell}, Akka~\cite{Akka} (Scala).
 In these paradigms, interaction among concurrent objects is performed by stateless message-passing~\cite{Thoth,Harmony,V-Kernel} or other paradigms closely related to networking concepts, \eg channels~\cite{CSP,Go}.
-However, in call/return-based languages, these approaches force a clear distinction, \ie introduce a new programming paradigm, between regular and concurrent computation, \eg routine call versus message passing.
+However, in call/return-based languages, these approaches force a clear distinction, \ie introduce a new programming paradigm between regular and concurrent computation, \eg routine call versus message passing.
 Hence, a programmer must learn and manipulate two sets of design patterns.
 While this distinction can be hidden away in library code, effective use of the library still has to take both paradigms into account.
 In contrast, approaches based on statefull models more closely resemble the standard call/return programming-model, resulting in a single programming paradigm.

-At the lowest level, concurrent control is implemented by atomic operations, upon which different kinds of locks mechanism are constructed, \eg semaphores~\cite{Dijkstra68b}, barriers, and path expressions~\cite{Campbell74}.
+At the lowest level, concurrent control is implemented by atomic operations, upon which different kinds of locking mechanisms are constructed, \eg semaphores~\cite{Dijkstra68b}, barriers, and path expressions~\cite{Campbell74}.
 However, for productivity it is always desirable to use the highest-level construct that provides the necessary efficiency~\cite{Hochstein05}.
 A newer approach for restricting non-determinism is transactional memory~\cite{Herlihy93}.

@@ -1259 +1253 @@


-\section{Monitors}
-\label{s:Monitors}
+\section{Monitor}
+\label{s:Monitor}

 A \textbf{monitor} is a set of routines that ensure mutual exclusion when accessing shared state.

@@ -1320 +1314 @@
 \end{cfa}

-Mandatory monitor qualifiers have the benefit of being self-documented, but requiring both @mutex@ and \lstinline[morekeywords=nomutex]@nomutex@ for all monitor parameters is redundant.
+Mandatory monitor qualifiers have the benefit of being self-documenting, but requiring both @mutex@ and \lstinline[morekeywords=nomutex]@nomutex@ for all monitor parameters is redundant.
 Instead, one of the qualifier semantics can be the default, and the other required.
-For example, assume the safe @mutex@ option for a monitor parameter because assuming \lstinline[morekeywords=nomutex]@nomutex@ may cause subtle errors.
-On the other hand, assuming \lstinline[morekeywords=nomutex]@nomutex@ is the \emph{normal} parameter behaviour, stating explicitly ``this parameter is not special''.
+For example, assume the safe @mutex@ qualifier for all monitor parameters because assuming \lstinline[morekeywords=nomutex]@nomutex@ may cause subtle errors.
+On the other hand, assuming \lstinline[morekeywords=nomutex]@nomutex@ is the \emph{normal} parameter behaviour, stating explicitly \emph{this parameter is not special}.
 Providing a default qualifier implies knowing whether a parameter is a monitor.
 Since \CFA relies heavily on traits as an abstraction mechanism, the distinction between a type that is a monitor and a type that looks like a monitor can become blurred.

@@ -1389 +1383 @@
 In the calls to @bar@ and @baz@, the monitors are acquired in opposite order.

-However, such use leads to lock acquiring order problems resulting in deadlock~\cite{Lister77}, where detecting it requires dynamically tracking of monitor calls, and dealing with it requires rollback semantics~\cite{Dice10}.
+However, such use leads to lock acquiring order problems resulting in deadlock~\cite{Lister77}, where detecting it requires dynamic tracking of monitor calls, and dealing with it requires rollback semantics~\cite{Dice10}.
 In \CFA, safety is guaranteed by using bulk acquire of all monitors to shared objects, whereas other monitor systems provide no aid.
 While \CFA provides only a partial solution, the \CFA partial solution handles many useful cases.

@@ -1445 +1439 @@
 \newterm{Internal scheduling} is characterized by each thread entering the monitor and making an individual decision about proceeding or blocking, while \newterm{external scheduling} is characterized by an entering thread making a decision about proceeding for itself and on behalf of other threads attempting entry.

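[Editorial sketch, not part of the changeset: before the scheduling details, the bulk-acquire semantics discussed above in a minimal form; the monitor and routine names are illustrative.]
\begin{cfa}
monitor M {};
void foo( M & mutex m1, M & mutex m2 ) {}  $\C{// acquires both monitors as one operation}$
int main() {
	M a, b;
	foo( a, b );                           $\C{// bulk acquire, no ordering deadlock}$
}
\end{cfa}
Because both monitors are acquired together, two threads calling @foo( a, b )@ and @foo( b, a )@ cannot deadlock, which is the hazard with nested individual acquisition.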
@@ -1447 +1441 @@
-Figure~\ref{f:BBInt} shows a \CFA bounded-buffer with internal scheduling, where producers/consumers enter the monitor, see the buffer is full/empty, and block on an appropriate condition lock, @full@/@empty@.
+Figure~\ref{f:BBInt} shows a \CFA generic bounded-buffer with internal scheduling, where producers/consumers enter the monitor, see the buffer is full/empty, and block on an appropriate condition lock, @full@/@empty@.
 The @wait@ routine atomically blocks the calling thread and implicitly releases the monitor lock(s) for all monitors in the routine's parameter list.
 The appropriate condition lock is signalled to unblock an opposite kind of thread after an element is inserted/removed from the buffer.

@@ -1459 +1453 @@
 The signalling thread blocks but is marked for urgent unblocking at the next scheduling point and the signalled thread continues.
 \end{enumerate}
-The first approach is too restrictive, as it precludes solving a reasonable class of problems, \eg dating service.
+The first approach is too restrictive, as it precludes solving a reasonable class of problems, \eg dating service (see Figure~\ref{f:DatingService}).
 \CFA supports the next two semantics as both are useful.
 Finally, while it is common to store a @condition@ as a field of the monitor, in \CFA, a @condition@ variable can be created/stored independently.
-Furthermore, a condition variable is tied to a \emph{group} of monitors on first use (called \newterm{branding}), which means that using internal scheduling with distinct sets of monitors requires one condition variable per set of monitors.
+Furthermore, a condition variable is tied to a \emph{group} of monitors on first use, called \newterm{branding}, which means that using internal scheduling with distinct sets of monitors requires one condition variable per set of monitors.

 \begin{figure}

@@ -1531 +1525 @@
 \end{figure}

-Figure~\ref{f:BBExt} shows a \CFA bounded-buffer with external scheduling, where producers/consumers detecting a full/empty buffer block and prevent more producers/consumers from entering the monitor until the buffer has a free/empty slot.
+Figure~\ref{f:BBExt} shows a \CFA generic bounded-buffer with external scheduling, where producers/consumers detecting a full/empty buffer block and prevent more producers/consumers from entering the monitor until the buffer has a free/empty slot.
 External scheduling is controlled by the @waitfor@ statement, which atomically blocks the calling thread, releases the monitor lock, and restricts the routine calls that can next acquire mutual exclusion.
 If the buffer is full, only calls to @remove@ can acquire the buffer, and if the buffer is empty, only calls to @insert@ can acquire the buffer.
-Threads making calls to routines that are currently excluded block outside (external) of the monitor on a calling queue, versus blocking on condition queues inside (internal) of the monitor.
+Threads making calls to routines that are currently excluded, block outside (external) of the monitor on a calling queue, versus blocking on condition queues inside (internal) of the monitor.
 % External scheduling is more constrained and explicit, which helps programmers reduce the non-deterministic nature of concurrency.
 External scheduling allows users to wait for events from other threads without concern of unrelated events occurring.

@@ -1543 +1537 @@
 For internal scheduling, non-blocking signalling (as in the producer/consumer example) is used when the signaller is providing the cooperation for a waiting thread;
 the signaller enters the monitor and changes state, detects a waiting thread that can use the state, performs a non-blocking signal on the condition queue for the waiting thread, and exits the monitor to run concurrently.
-The waiter unblocks next, uses/takes the state, and exits the monitor.
+The waiter unblocks next from the urgent queue, uses/takes the state, and exits the monitor.
 Blocking signalling is the reverse, where the waiter is providing the cooperation for the signalling thread;
 the signaller enters the monitor, detects a waiting thread providing the necessary state, performs a blocking signal to place it on the urgent queue and unblock the waiter.
 The waiter changes state and exits the monitor, and the signaller unblocks next from the urgent queue to use/take the state.

-Figure~\ref{f:DatingService} shows a dating service demonstrating the two forms of signalling: non-blocking and blocking.
+Figure~\ref{f:DatingService} shows a dating service demonstrating non-blocking and blocking signalling.
 The dating service matches girl and boy threads with matching compatibility codes so they can exchange phone numbers.
 A thread blocks until an appropriate partner arrives.
-The complexity is exchanging phone number in the monitor because the monitor mutual-exclusion property prevents exchanging numbers.
-For internal scheduling, the @exchange@ condition is necessary to block the thread finding the match, while the matcher unblocks to take the oppose number, post its phone number, and unblock the partner.
-For external scheduling, the implicit urgent-condition replaces the explict @exchange@-condition and @signal_block@ puts the finding thread on the urgent condition and unblocks the matcher..
+The complexity is exchanging phone numbers in the monitor because of the mutual-exclusion property.
+For signal scheduling, the @exchange@ condition is necessary to block the thread finding the match, while the matcher unblocks to take the opposite number, post its phone number, and unblock the partner.
+For signal-block scheduling, the implicit urgent-queue replaces the explicit @exchange@-condition and @signal_block@ puts the finding thread on the urgent condition and unblocks the matcher.

 The dating service is an example of a monitor that cannot be written using external scheduling because it requires knowledge of calling parameters to make scheduling decisions, and parameters of waiting threads are unavailable;

@@ -1656 +1650 @@
 Similarly, for @waitfor( rtn )@, the default semantics is to atomically block the acceptor and release all acquired mutex types in the parameter list, \ie @waitfor( rtn, m1, m2 )@.
 To override the implicit multi-monitor wait, specific mutex parameter(s) can be specified, \eg @waitfor( rtn, m1 )@.
-Waitfor statically verifies the released monitors are the same as the acquired mutex-parameters of the given routine or routine pointer.
+@waitfor@ statically verifies the released monitors are the same as the acquired mutex-parameters of the given routine or routine pointer.
 To statically verify the released monitors match with the accepted routine's mutex parameters, the routine (pointer) prototype must be accessible.

@@ -1687 +1681 @@
 For example, there are no loops in either bounded buffer solution in Figure~\ref{f:GenericBoundedBuffer}.
 Supporting barging prevention as well as extending internal scheduling to multiple monitors is the main source of complexity in the design and implementation of \CFA concurrency.
+Furthermore, \CFA concurrency has no spurious wakeup~\cite[\S~9]{Buhr05a}, which eliminates an implicit form of barging.

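[Editorial sketch, not part of the changeset: the barging-prevention guarantee is what allows the following internal-scheduling sketch, a single-slot buffer with illustrative names, to use @if@ rather than a @while@ loop around @wait@.]
\begin{cfa}
monitor Slot {
	condition full, empty;
	int elem;
	bool occupied;                     $\C{// assume initialized to false (constructor omitted)}$
};
void put( Slot & mutex s, int v ) with( s ) {
	if ( occupied ) wait( empty );     $\C{// block, implicitly releasing the monitor}$
	elem = v;  occupied = true;
	signal( full );                    $\C{// non-blocking signal}$
}
int get( Slot & mutex s ) with( s ) {
	if ( ! occupied ) wait( full );
	occupied = false;
	signal( empty );
	return elem;
}
\end{cfa}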
@@ -1764 +1759 @@

 While a number of approaches were examined~\cite[\S~4.3]{Delisle18}, the solution chosen for \CFA is a novel technique called \newterm{partial signalling}.
-Signalled threads are moved to an urgent queue and the waiter at the front defines the set of monitors necessary for it to unblock.
+Signalled threads are moved to the urgent queue and the waiter at the front defines the set of monitors necessary for it to unblock.
 Partial signalling transfers ownership of monitors to the front waiter.
 When the signaller thread exits or waits in the monitor the front waiter is unblocked if all its monitors are released.

@@ -1815 +1810 @@
 Figure~\ref{fig:BulkMonitor} shows the \CFA monitor implementation.
 The mutex routine called is associated with each thread on the entry queue, while a list of acceptable routines is kept separately.
-The accepted list is a variable-sized array of accepted routine pointers, so the single instruction bitmask comparison is replaced by dereferencing a pointer followed by a linear search.
+The accepted list is a variable-sized array of accepted routine pointers, so the single instruction bitmask comparison is replaced by dereferencing a pointer followed by a (usually short) linear search.

@@ -1821 +1816 @@
 \subsection{Multi-Monitor Scheduling}
 \label{s:Multi-MonitorScheduling}

-External scheduling, like internal scheduling, becomes significantly more complex when introducing multi-monitor syntax.
-Even in the simplest possible case, new semantics needs to be established:
+External scheduling, like internal scheduling, becomes significantly more complex for multi-monitor semantics.
+Even in the simplest case, new semantics needs to be established.
 \begin{cfa}
 monitor M {};

@@ -1843 +1838 @@
 }
 \end{cfa}
-Again, the set of monitors passed to the @waitfor@ statement must be entirely contained in the set of monitors already acquired by accepting routine.
+Again, the set of monitors passed to the @waitfor@ statement must be entirely contained in the set of monitors already acquired by the accepting routine.

 Note, for internal and external scheduling with multiple monitors, a signalling or accepting thread must match exactly, \ie partial matching results in waiting.

@@ -1879 +1874 @@
 \lstMakeShortInline@%
 \end{cquote}
+For both internal and external scheduling, the set of monitors is disjoint so unblocking is impossible.

@@ -1908 +1904 @@
 In all cases, the statement following is executed \emph{after} a clause is executed to know which of the clauses executed.

-A group of conditional @waitfor@ clauses is \emph{not} the same as a group of @if@ statements, e.g.:
+Note, a group of conditional @waitfor@ clauses is \emph{not} the same as a group of @if@ statements, e.g.:
 if ( C1 ) waitfor( mem1 );       when ( C1 ) waitfor( mem1 );

@@ -1916 +1912 @@
 The right example accepts either @mem1@ or @mem2@ if @C1@ and @C2@ are true.

-An interesting use of @waitfor@ is accepting the @mutex@ destructor to know when an object deallocated.
+An interesting use of @waitfor@ is accepting the @mutex@ destructor to know when an object is deallocated.
 \begin{cfa}
 void insert( Buffer(T) & mutex buffer, T elem ) with( buffer ) {
@@ -1927 +1923 @@
 }
 \end{cfa}
-However, the @waitfor@ semantics do not work, since using an object after its destructor is called is undefined.
-Therefore, to make this useful capability work, the semantics for accepting the destructor is the same as @signal@, \ie the call to the destructor is placed on the urgent queue and the acceptor continues execution, which throws an exception to the acceptor and then deallocates the object.
+When the buffer is deallocated, the current waiter is unblocked and informed, so it can perform an appropriate action.
+However, the basic @waitfor@ semantics do not support this functionality, since using an object after its destructor is called is undefined.
+Therefore, to make this useful capability work, the semantics for accepting the destructor is the same as @signal@, \ie the call to the destructor is placed on the urgent queue and the acceptor continues execution, which throws an exception to the acceptor and then the caller is unblocked from the urgent queue to deallocate the object.
 Accepting the destructor is an idiomatic way to terminate a thread in \CFA.

@@ -1934 +1931 @@

 \subsection{\protect\lstinline@mutex@ Threads}

-Threads in \CFA are monitors, so all monitor features are available when using threads.
-Figure~\ref{f:pingpong} shows an example of two threads calling and accepting calls from each other in a cycle.
-Note, both ping/pong threads are globally declared, @pi@/@po@, and hence, start (and possibly complete) before the program starts.
+Threads in \CFA are monitors to allow direct communication among threads, \ie threads can have mutex routines that are called by other threads.
+Hence, all monitor features are available when using threads.
+Figure~\ref{f:pingpong} shows an example of two threads directly calling each other and accepting calls from each other in a cycle.
+Note, both ping/pong threads are globally declared, @pi@/@po@, and hence, start (and possibly complete) before the program main starts.

@@ -1979 +1977 @@
 Now, high-performance applications must care about parallelism, which requires concurrency.
 The lowest-level approach of parallelism is to use \newterm{kernel threads} in combination with semantics like @fork@, @join@, \etc.
-However, kernel threads are better as an implementation tool because of complexity and high cost.
-Therefore, different abstractions are layered onto kernel threads to simplify them.
+However, kernel threads are better as an implementation tool because of complexity and higher cost.
+Therefore, different abstractions are often layered onto kernel threads to simplify them, \eg pthreads.

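[Editorial sketch, not part of the changeset: for contrast, the raw kernel-thread interface these abstractions layer over, in a standard pthreads example (plain C).]
\begin{cfa}
#include <pthread.h>
void * worker( void * arg ) {              $\C{// thread body}$
	return NULL;
}
int main() {
	pthread_t t;
	pthread_create( &t, NULL, worker, NULL );  $\C{// fork a kernel thread}$
	pthread_join( t, NULL );                   $\C{// join: wait for termination}$
}
\end{cfa}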
@@ -1986 +1984 @@

 A direct improvement on kernel threads is user threads, \eg Erlang~\cite{Erlang} and \uC~\cite{uC++book}.
-This approach provides an interface that matches the language paradigms, more control over concurrency in the language runtime, and an abstract (and portable) interface to the underlying kernel threads across operating systems.
+This approach provides an interface that matches the language paradigms, more control over concurrency by the language runtime, and an abstract (and portable) interface to the underlying kernel threads across operating systems.
 In many cases, user threads can be used on a much larger scale (100,000 threads).
-Like kernel threads, user threads support preemption, which maximizes nondeterminism, but introduces concurrency errors: race, livelock, starvation, and deadlock.
-\CFA adopts user-threads as they represent the truest realization of concurrency and can build the following approaches and more, \eg actors~\cite{Actors}.
+Like kernel threads, user threads support preemption, which maximizes nondeterminism, but introduces the same concurrency errors: race, livelock, starvation, and deadlock.
+\CFA adopts user-threads as they represent the truest realization of concurrency and can build any other concurrency approach, \eg thread pools and actors~\cite{Actors}.

@@ -2008 +2006 @@
 While it is possible to tune the thread pool with sufficient threads, it becomes difficult to obtain high throughput and good core utilization as job interaction increases.
 As well, concurrency errors return, which thread pools are supposed to mitigate.
-The gold standard for thread pool is Intel's TBB library~\cite{TBB}.
+Intel's TBB library~\cite{TBB} is the gold standard for thread pools.

@@ -2014 +2012 @@

 Figure~\ref{f:RunTimeStructure} illustrates the runtime structure of a \CFA program.
-In addition to the new kinds of objects introduced by \CFA, there are two more runtime entities used to control parallel execution.
+In addition to the new kinds of objects introduced by \CFA, there are two more runtime entities used to control parallel execution: cluster and (virtual) processor.
 An executing thread is illustrated by its containment in a processor.

@@ -2028 +2026 @@
 \label{s:RuntimeStructureCluster}

-A \newterm{cluster} is a collection of threads and virtual processors (abstraction a kernel thread) that execute the threads (like a virtual machine).
+A \newterm{cluster} is a collection of threads and virtual processors (abstract kernel-thread) that execute the threads (like a virtual machine).
 The purpose of a cluster is to control the amount of parallelism that is possible among threads, plus scheduling and other execution defaults.
 The default cluster-scheduler is single-queue multi-server, which provides automatic load-balancing of threads on processors.

@@ -2035 +2033 @@
 No automatic load balancing among clusters is performed by \CFA.

-When a \CFA program begins execution, it creates two clusters: system and user.
-The system cluster contains a processor that does not execute user threads.
-Instead, the system cluster handles system-related operations, such as catching errors that occur on the user clusters, printing appropriate error information, and shutting down \CFA.
-A user cluster is created to contain the user threads.
+When a \CFA program begins execution, it creates a user cluster with a single processor and a special processor to handle preemption that does not execute user threads.
+The user cluster is created to contain the application user-threads.
 Having all threads execute on the one cluster often maximizes utilization of processors, which minimizes runtime.
-However, because of limitations of the underlying operating system, special hardware, or scheduling requirements (real-time), it is sometimes necessary to have multiple clusters.
+However, because of limitations of the underlying operating system, heterogeneous hardware, or scheduling requirements (real-time), it is sometimes necessary to have multiple clusters.

@@ -2049 +2045 @@
 Programs may use more virtual processors than hardware processors.
 On a multiprocessor, kernel threads are distributed across the hardware processors resulting in virtual processors executing in parallel.
-(It is possible to use affinity to lock a virtual processor onto a particular hardware processor~\cite{affinityLinux, affinityWindows, affinityFreebsd, affinityNetbsd, affinityMacosx}, which is used when caching issues occur or for heterogeneous hardware processor.)
+(It is possible to use affinity to lock a virtual processor onto a particular hardware processor~\cite{affinityLinux, affinityWindows, affinityFreebsd, affinityNetbsd, affinityMacosx}, which is used when caching issues occur or for heterogeneous hardware processors.)
 The \CFA runtime attempts to block unused processors and unblock processors as the system load increases;
 balancing the workload with processors is difficult.

@@ -2071 +2067 @@
 As well, chained stacks require all modules be recompiled to use this feature, which breaks backward compatibility with existing C libraries.
 In the long term, it is likely C libraries will migrate to stack chaining to support concurrency, at only a minimal cost to sequential programs.
-Nevertheless, experience teaching \uC~\cite{CS343} shows fixed-sized stacks are rarely an issue in the most concurrent programs.
+Nevertheless, experience teaching \uC~\cite{CS343} shows fixed-sized stacks are rarely an issue in most concurrent programs.

 A primary implementation challenge is avoiding contention from dynamically allocating memory because of bulk acquire, \eg the internal-scheduling design is (almost) free of allocations.

@@ -2089 +2085 @@
 This method is a 2-step context-switch and provides a clear distinction between user and kernel code, where scheduling and other system operations happen.
 The alternative 1-step context-switch uses the \emph{from} thread's stack to schedule and then context-switches directly to the \emph{to} thread's stack.
-Experimental results (not shown) show the performance difference between these two approaches is virtually equivalent, because the 1-step performance is dominated by locking instructions to prevent a race condition.
+Experimental results (not presented) show the performance difference between these two approaches is virtually equivalent, because the 1-step performance is dominated by a lock instruction to prevent a race condition.

 All kernel threads (@pthreads@) are created with a stack.
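[Editorial sketch, not part of the changeset: relating the cluster/processor structure described above to code; declaration-driven creation and the constructor parameter are assumptions, not confirmed API.]
\begin{cfa}
cluster clus;                          $\C{// new cluster with the default scheduler}$
processor p1( clus ), p2( clus );      $\C{// two virtual processors (kernel threads) on clus}$
\end{cfa}
Destruction of the @processor@ objects would release the underlying kernel threads, following the RAII style used throughout \CFA.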
@@ -2098 +2094 @@
 Finally, an important aspect for a complete threading system is preemption, which introduces extra non-determinism via transparent interleaving, rather than cooperation among threads for proper scheduling and processor fairness from long-running threads.
 Because preemption frequency is usually long, 1 millisecond, performance cost is negligible.
-
 Preemption is normally handled by setting a count-down timer on each virtual processor.
 When the timer expires, an interrupt is delivered, and the interrupt handler resets the count-down timer, and if the virtual processor is executing in user code, the signal handler performs a user-level context-switch, or if executing in the language runtime-kernel, the preemption is ignored or rolled forward to the point where the runtime kernel context switches back to user code.

@@ -2106 +2101 @@
 therefore, all virtual processors in a cluster need to have the same signal mask.

-However, on UNIX systems:
+However, on current UNIX systems:
 \begin{quote}
 A process-directed signal may be delivered to any one of the threads that does not currently have the signal blocked.

@@ -2113 +2108 @@
 \end{quote}
 Hence, the timer-expiry signal, which is generated \emph{externally} by the UNIX kernel to the UNIX process, is delivered to any of its UNIX subprocesses (kernel threads).
-To ensure each virtual processor receives its own preemption signals, a discrete-event simulation is run on one virtual processor, and only it sets timer events.
+To ensure each virtual processor receives its own preemption signals, a discrete-event simulation is run on a special virtual processor, and only it sets and receives timer events.
 Virtual processors register an expiration time with the discrete-event simulator, which is inserted in sorted order.
 The simulation sets the count-down timer to the value at the head of the event list, and when the timer expires, all events less than or equal to the current time are processed.

@@ -2125 +2120 @@
 Table~\ref{t:machine} shows the specifications of the computer used to run the benchmarks, and the versions of the software used in the comparison.

-\begin{table}[h]
+\begin{table}
 \centering
 \caption{Experiment environment}

@@ -2163 +2158 @@
 The method used to get time is @clock_gettime( CLOCK_REALTIME )@.
 Each benchmark is performed @N@ times, where @N@ varies depending on the benchmark, the total time is divided by @N@ to obtain the average time for a benchmark.
+All omitted tests for other languages are functionally identical to the shown \CFA test.

@@ -2173 +2169 @@
 The thread context switch is 2-step using yield, \ie enter and return from the runtime kernel.
 Figure~\ref{f:ctx-switch} shows the code for coroutines/threads with all results in Table~\ref{tab:ctx-switch}.
-All omitted tests for other languages are functionally identical to this test (as for all other tests).
 The difference in performance between coroutine and thread context-switch is the cost of scheduling for threads, whereas coroutines are self-scheduling.

@@ -2205 +2200 @@
 \end{lrbox}

-\subfloat[Coroutine]{\label{f:GlobalVariables}\usebox\myboxA}
+\subfloat[Coroutine]{\usebox\myboxA}
 \quad
-\subfloat[Thread]{\label{f:ExternalState}\usebox\myboxB}
+\subfloat[Thread]{\usebox\myboxB}
 \captionof{figure}{\CFA context-switch benchmark}
 \label{f:ctx-switch}
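[Editorial sketch, not part of the changeset: the measurement loop implied by the benchmark description; @N@ and the empty loop body are placeholders for the benchmarked operation, and this is not the paper's actual harness.]
\begin{cfa}
#include <time.h>
enum { N = 1000000 };
int main() {
	struct timespec start, end;
	clock_gettime( CLOCK_REALTIME, &start );
	for ( unsigned int i = 0; i < N; i += 1 ) {}  $\C{// empty body; insert benchmarked operation, \eg a context switch}$
	clock_gettime( CLOCK_REALTIME, &end );
	double avg = ( (end.tv_sec - start.tv_sec) * 1E9 + (end.tv_nsec - start.tv_nsec) ) / N;  $\C{// average ns per operation}$
}
\end{cfa}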