Changeset 9f66811 for doc/papers/concurrency/Paper.tex
- Timestamp:
- Jun 28, 2018, 11:53:25 PM (6 years ago)
- Branches:
- ADT, aaron-thesis, arm-eh, ast-experimental, cleanup-dtors, deferred_resn, demangler, enum, forall-pointer-decay, jacob/cs343-translation, jenkins-sandbox, master, new-ast, new-ast-unique-expr, no_list, persistent-indexer, pthread-emulation, qualifiedEnum
- Children:
- 713926ca, c5283ba
- Parents:
- 944ce47
- File:
-
- 1 edited
Legend:
- Unmodified
- Added
- Removed
-
doc/papers/concurrency/Paper.tex
r944ce47 r9f66811 32 32 \renewcommand{\linenumberfont}{\scriptsize\sffamily} 33 33 34 \renewcommand{\topfraction}{0.8} % float must be greater than X of the page before it is forced onto its own page 35 \renewcommand{\bottomfraction}{0.8} % float must be greater than X of the page before it is forced onto its own page 36 \renewcommand{\floatpagefraction}{0.8} % float must be greater than X of the page before it is forced onto its own page 34 37 \renewcommand{\textfraction}{0.0} % the entire page maybe devoted to floats with no text on the page at all 35 38 … … 132 135 \makeatother 133 136 134 \newenvironment{cquote}{% 135 \list{}{\lstset{resetmargins=true,aboveskip=0pt,belowskip=0pt}\topsep=3pt\parsep=0pt\leftmargin=\parindentlnth\rightmargin\leftmargin}% 136 \item\relax 137 }{% 138 \endlist 139 }% cquote 137 \newenvironment{cquote} 138 {\list{}{\lstset{resetmargins=true,aboveskip=0pt,belowskip=0pt}\topsep=3pt\parsep=0pt\leftmargin=\parindentlnth\rightmargin\leftmargin}% 139 \item\relax} 140 {\endlist} 141 142 %\newenvironment{cquote}{% 143 %\list{}{\lstset{resetmargins=true,aboveskip=0pt,belowskip=0pt}\topsep=3pt\parsep=0pt\leftmargin=\parindentlnth\rightmargin\leftmargin}% 144 %\item\relax% 145 %}{% 146 %\endlist% 147 %}% cquote 140 148 141 149 % CFA programming language, based on ANSI C (with some gcc additions) … … 271 279 Concurrency tools handle mutual exclusion and synchronization, while parallelism tools handle performance, cost, and resource utilization. 272 280 273 The proposed concurrency API is implemented in a dialect of C, called \CFA .281 The proposed concurrency API is implemented in a dialect of C, called \CFA (pronounced C-for-all). 274 282 The paper discusses how the language features are added to the \CFA translator with respect to parsing, semantics, and type checking, and the corresponding high-performance runtime-library to implement the concurrent features. 275 283 … … 281 289 282 290 \CFA is a non-object-oriented extension of ISO-C, and hence, supports all C paradigms. 283 %It is a non-object-oriented system-language, meaning most of the major abstractions have either no runtime overhead or can be opted out easily.284 291 Like C, the building blocks of \CFA are structures and routines. 285 292 Virtually all of the code generated by the \CFA translator respects C memory layouts and calling conventions. 286 While \CFA is not object oriented, lacking the concept of a receiver (\eg @this@) and nominal inheritance-relationships, C does havea notion of objects: ``region of data storage in the execution environment, the contents of which can represent values''~\cite[3.15]{C11}.287 While some \CFA features are common in object-oriented programming, they are independent capabilitiesallowing \CFA to adopt them while maintaining a procedural paradigm.293 While \CFA is not object oriented, lacking the concept of a receiver (\eg @this@) and nominal inheritance-relationships, C has a notion of objects: ``region of data storage in the execution environment, the contents of which can represent values''~\cite[3.15]{C11}. 294 While some object-oriented features appear in \CFA, they are independent capabilities, allowing \CFA to adopt them while maintaining a procedural paradigm. 288 295 289 296 … … 338 345 Object-oriented programming languages only provide implicit qualification for the receiver. 339 346 340 In detail, the @with@ statement has the form:347 In detail, the @with@-statement syntax is: 341 348 \begin{cfa} 342 349 $\emph{with-statement}$: … … 348 355 (Enumerations are already opened.) 349 356 The object is the implicit qualifier for the open structure-fields. 350 All expressions in the expression list are open in parallel within the compound statement, which is different from Pascal, which nests the openings from left to right.357 All expressions in the expression list are opened in parallel within the compound statement, which is different from Pascal, which nests the openings from left to right. 351 358 352 359 … … 354 361 355 362 \CFA maximizes the ability to reuse names via overloading to aggressively address the naming problem. 356 Both variables and routines may be overloaded, where selection is based on types, and number of returns (as in Ada~\cite{Ada}) and arguments. 363 Both variables and routines may be overloaded, where selection is based on number and types of returns and arguments (as in Ada~\cite{Ada}). 364 \newpage 365 \vspace*{-2\baselineskip}%??? 357 366 \begin{cquote} 358 \vspace*{-\baselineskip}%??? 367 \begin{cfa} 368 // selection based on type 369 \end{cfa} 359 370 \lstDeleteShortInline@% 360 \begin{cfa}361 // selection based on type362 \end{cfa}363 371 \begin{tabular}{@{}l@{\hspace{2\parindentlnth}}|@{\hspace{2\parindentlnth}}l@{}} 364 372 \begin{cfa} … … 411 419 Therefore, overloading eliminates long prefixes and other naming conventions to prevent name clashes. 412 420 As seen in Section~\ref{s:Concurrency}, routine @main@ is heavily overloaded. 413 For example, variable overloading is useful in the parallel semantics of the @with@ statement for fields with the same name:421 As another example, variable overloading is useful in the parallel semantics of the @with@ statement for fields with the same name: 414 422 \begin{cfa} 415 423 struct S { int `i`; int j; double m; } s; … … 464 472 \lstMakeShortInline@% 465 473 \end{cquote} 466 While concurrency does not use operator overloading directly, it provides an introduction for the syntax of constructors.467 474 468 475 … … 470 477 471 478 Object lifetime is a challenge in non-managed programming languages. 472 \CFA responds with \CC-like constructors and destructors :479 \CFA responds with \CC-like constructors and destructors, using a different operator-overloading syntax. 473 480 \begin{cfa} 474 481 struct VLA { int len, * data; }; $\C{// variable length array of integers}$ … … 490 497 Like \CC, construction is implicit on allocation (stack/heap) and destruction is implicit on deallocation. 491 498 The object and all their fields are constructed/destructed. 492 \CFA also provides @new@ and @delete@ , which behave like @malloc@ and @free@, in addition to constructing and destructing objects:499 \CFA also provides @new@ and @delete@ as library routines, which behave like @malloc@ and @free@, in addition to constructing and destructing objects: 493 500 \begin{cfa} 494 501 { … … 518 525 int i = sum( sa, 5 ); $\C{// use S's 0 construction and +}$ 519 526 \end{cfa} 520 The builtin type @zero_t@ (and @one_t@) overload constant 0 (and 1) for a new types, where both 0 and 1 have special meaning in C. 527 Type variables can be @otype@ or @dtype@. 528 @otype@ refers to a \emph{complete type}, \ie, a type with size, alignment, default constructor, copy constructor, destructor, and assignment operator. 529 @dtype@ refers to an \emph{incomplete type}, \eg, void or a forward-declared type. 530 The builtin types @zero_t@ and @one_t@ overload constant 0 and 1 for a new types, where both 0 and 1 have special meaning in C. 521 531 522 532 \CFA provides \newterm{traits} to name a group of type assertions, where the trait name allows specifying the same set of assertions in multiple locations, preventing repetition mistakes at each routine declaration: … … 533 543 \end{cfa} 534 544 535 Assertions can be @otype@ or @dtype@.536 @otype@ refers to a \emph{complete} object, \ie an object has a size, alignment, default constructor, copy constructor, destructor and an assignment operator.537 @dtype@ refers to an \emph{incomplete} object, \ie, an object only has a size and alignment.538 539 545 Using the return type for overload discrimination, it is possible to write a type-safe @alloc@ based on the C @malloc@: 540 546 \begin{cfa} … … 554 560 In coroutining, the single thread is self-scheduling across the stacks, so execution is deterministic, \ie the execution path from input to output is fixed and predictable. 555 561 A \newterm{stackless} coroutine executes on the caller's stack~\cite{Python} but this approach is restrictive, \eg preventing modularization and supporting only iterator/generator-style programming; 556 a \newterm{stackful l} coroutine executes on its own stack, allowing full generality.557 Only stackful lcoroutines are a stepping stone to concurrency.562 a \newterm{stackful} coroutine executes on its own stack, allowing full generality. 563 Only stackful coroutines are a stepping stone to concurrency. 558 564 559 565 The transition to concurrency, even for execution with a single thread and multiple stacks, occurs when coroutines also context switch to a \newterm{scheduling oracle}, introducing non-determinism from the coroutine perspective~\cite[\S~3]{Buhr05a}. … … 561 567 The resulting execution system now follows a cooperative threading-model, called \newterm{non-preemptive scheduling}. 562 568 563 Because the scheduler is special, it can either be a stackless or stackful lcoroutine.569 Because the scheduler is special, it can either be a stackless or stackful coroutine. 564 570 For stackless, the scheduler performs scheduling on the stack of the current coroutine and switches directly to the next coroutine, so there is one context switch. 565 For stackful l, the current coroutine switches to the scheduler, which performs scheduling, and it then switches to the next coroutine, so there are two context switches.566 A stackful l scheduler is often used for simplicity and security, even through there is a slightly higher runtime-cost.571 For stackful, the current coroutine switches to the scheduler, which performs scheduling, and it then switches to the next coroutine, so there are two context switches. 572 A stackful scheduler is often used for simplicity and security. 567 573 568 574 Regardless of the approach used, a subset of concurrency related challenges start to appear. 569 575 For the complete set of concurrency challenges to occur, the missing feature is \newterm{preemption}, where context switching occurs randomly between any two instructions, often based on a timer interrupt, called \newterm{preemptive scheduling}. 570 While a scheduler introduces uncertainty in the order of execution, preemption introduces uncertainty where context switches occur.576 While a scheduler introduces uncertainty in the order of execution, preemption introduces uncertainty about where context switches occur. 571 577 Interestingly, uncertainty is necessary for the runtime (operating) system to give the illusion of parallelism on a single processor and increase performance on multiple processors. 572 578 The reason is that only the runtime has complete knowledge about resources and how to best utilized them. … … 574 580 otherwise, it is impossible to write meaningful programs. 575 581 Optimal performance in concurrent applications is often obtained by having as much non-determinism as correctness allows. 576 577 578 \subsection{\protect\CFA's Thread Building Blocks}579 582 580 583 An important missing feature in C is threading\footnote{While the C11 standard defines a \protect\lstinline@threads.h@ header, it is minimal and defined as optional. … … 594 597 This capability is accomplish via the coroutine's stack, where suspend/resume context switch among stacks. 595 598 Because threading design-challenges are present in coroutines, their design effort is relevant, and this effort can be easily exposed to programmers giving them a useful new programming paradigm because a coroutine handles the class of problems that need to retain state between calls, \eg plugins, device drivers, and finite-state machines. 596 Therefore, the core \CFA coroutine-API for has two fundamental features:independent call-stacks and @suspend@/@resume@ operations.599 Therefore, the two fundamental features of the core \CFA coroutine-API are independent call-stacks and @suspend@/@resume@ operations. 597 600 598 601 For example, a problem made easier with coroutines is unbounded generators, \eg generating an infinite sequence of Fibonacci numbers … … 722 725 Figure~\ref{f:Coroutine3States} creates a @coroutine@ type, @`coroutine` Fib { int fn; }@, which provides communication, @fn@, for the \newterm{coroutine main}, @main@, which runs on the coroutine stack, and possibly multiple interface routines, \eg @next@. 723 726 Like the structure in Figure~\ref{f:ExternalState}, the coroutine type allows multiple instances, where instances of this type are passed to the (overloaded) coroutine main. 724 The coroutine main's stack holds the state for the next generation, @f1@ and @f2@, and the code has the three suspend points, representing the three states in the Fibonacci formula, to context switch back to the caller's @resume@.727 The coroutine main's stack holds the state for the next generation, @f1@ and @f2@, and the code represents the three states in the Fibonacci formula via the three suspend points, to context switch back to the caller's @resume@. 725 728 The interface routine @next@, takes a Fibonacci instance and context switches to it using @resume@; 726 729 on restart, the Fibonacci field, @fn@, contains the next value in the sequence, which is returned. … … 835 838 The previous examples are \newterm{asymmetric (semi) coroutine}s because one coroutine always calls a resuming routine for another coroutine, and the resumed coroutine always suspends back to its last resumer, similar to call/return for normal routines. 836 839 However, @resume@ and @suspend@ context switch among existing stack-frames, rather than create new ones so there is no stack growth. 837 \newterm{Symmetric (full) coroutine}s have a coroutine call to a resuming routine for another coroutine, and its coroutine main has a call toanother resuming routine, which eventually forms a resuming-call cycle.840 \newterm{Symmetric (full) coroutine}s have a coroutine call to a resuming routine for another coroutine, and its coroutine main calls another resuming routine, which eventually forms a resuming-call cycle. 838 841 (The trivial cycle is a coroutine resuming itself.) 839 842 This control flow is similar to recursion for normal routines, but again there is no stack growth from the context switch. … … 926 929 The call from the consumer to @payment@ introduces the cycle between producer and consumer. 927 930 When @payment@ is called, the consumer copies values into the producer's communication variable and a resume is executed. 928 The context switch restarts the producer at the point where it waslast context switched, so it continues in @delivery@ after the resume.931 The context switch restarts the producer at the point where it last context switched, so it continues in @delivery@ after the resume. 929 932 930 933 @delivery@ returns the status value in @prod@'s coroutine main, where the status is printed. … … 956 959 For example, for threads if the thread is implicitly started, it must start \emph{after} all constructors, because the thread relies on a completely initialized object, but the inherited constructor runs \emph{before} the derived. 957 960 958 An alternative lyis composition:961 An alternative is composition: 959 962 \begin{cfa} 960 963 struct mycoroutine { … … 1004 1007 Users wanting to extend coroutines or build their own for various reasons can only do so in ways offered by the language. 1005 1008 Furthermore, implementing coroutines without language supports also displays the power of a programming language. 1006 While this is ultimately the option used for idiomatic \CFA code, coroutines and threads can still be constructed without using thelanguage support.1007 The reserved keyword simply eases use for the common case s.1008 1009 Part of the mechanism to generalize coroutines is using a \CFA trait, which defines a coroutine as anything satisfying the trait @is_coroutine@, and this trait is used to restrictcoroutine-manipulation routines:1009 While this is ultimately the option used for idiomatic \CFA code, coroutines and threads can still be constructed without language support. 1010 The reserved keyword simply eases use for the common case. 1011 1012 Part of the mechanism to generalize coroutines is using a \CFA trait, which defines a coroutine as anything satisfying the trait @is_coroutine@, and this trait restricts the available set of coroutine-manipulation routines: 1010 1013 \begin{cfa} 1011 1014 trait is_coroutine( `dtype` T ) { … … 1019 1022 The routine definitions ensures there is a statically-typed @main@ routine that is the starting point (first stack frame) of a coroutine, and a mechanism to get (read) the currently executing coroutine handle. 1020 1023 The @main@ routine has no return value or additional parameters because the coroutine type allows an arbitrary number of interface routines with corresponding arbitrary typed input/output values versus fixed ones. 1021 The generic routines @suspend@ and @resume@ can be redefined, but any object passed to them is a coroutine since it must satisfy the @is_coroutine@ trait to compile.1022 1024 The advantage of this approach is that users can easily create different types of coroutines, \eg changing the memory layout of a coroutine is trivial when implementing the @get_coroutine@ routine, and possibly redefining @suspend@ and @resume@. 1023 1025 The \CFA keyword @coroutine@ implicitly implements the getter and forward declarations required for implementing the coroutine main: … … 1150 1152 \begin{cfa} 1151 1153 int main() { 1152 MyThread * heapLive d;1154 MyThread * heapLive; 1153 1155 { 1154 MyThread blockLive d;$\C{// fork block-based thread}$1155 heapLive d= `new`( MyThread ); $\C{// fork heap-based thread}$1156 MyThread blockLive; $\C{// fork block-based thread}$ 1157 heapLive = `new`( MyThread ); $\C{// fork heap-based thread}$ 1156 1158 ... 1157 1159 } $\C{// join block-based thread}$ 1158 1160 ... 1159 `delete`( heapLive d); $\C{// join heap-based thread}$1161 `delete`( heapLive ); $\C{// join heap-based thread}$ 1160 1162 } 1161 1163 \end{cfa} 1162 1164 The heap-based approach allows arbitrary thread-creation topologies, with respect to fork/join-style concurrency. 1163 1165 1164 Figure~\ref{s:ConcurrentMatrixSummation} shows concurrently adding the rows of a matrix and then totalling the subtotals sequential , after all the row threads have terminated.1166 Figure~\ref{s:ConcurrentMatrixSummation} shows concurrently adding the rows of a matrix and then totalling the subtotals sequentially, after all the row threads have terminated. 1165 1167 The program uses heap-based threads because each thread needs different constructor values. 1166 1168 (Python provides a simple iteration mechanism to initialize array elements to different values allowing stack allocation.) … … 1215 1217 However, for productivity it is always desirable to use the highest-level construct that provides the necessary efficiency~\cite{Hochstein05}. 1216 1218 A newer approach for restricting non-determinism is transactional memory~\cite{Herlihy93}. 1217 While this approach is pursued in hardware~\cite{Nakaike15} and system languages, like \CC~\cite{Cpp-Transactions}, the performance and feature set is still too restrictive to be the main concurrency paradigm for system languages, which is why it was rejected as the core paradigm for concurrency in \CFA.1219 While this approach is pursued in hardware~\cite{Nakaike15} and system languages, like \CC~\cite{Cpp-Transactions}, the performance and feature set is still too restrictive to be the main concurrency paradigm for system languages, which is why it is rejected as the core paradigm for concurrency in \CFA. 1218 1220 1219 1221 One of the most natural, elegant, and efficient mechanisms for mutual exclusion and synchronization for shared-memory systems is the \emph{monitor}. … … 1244 1246 higher-level mechanisms often simplify usage by adding better coupling between synchronization and data, \eg message passing, or offering a simpler solution to otherwise involved challenges, \eg barrier lock. 1245 1247 Often synchronization is used to order access to a critical section, \eg ensuring a reader thread is the next kind of thread to enter a critical section. 1246 If a writer thread is scheduled for next access, but another reader thread acquires the critical section first, that reader has\newterm{barged}.1248 If a writer thread is scheduled for next access, but another reader thread acquires the critical section first, that reader \newterm{barged}. 1247 1249 Barging can result in staleness/freshness problems, where a reader barges ahead of a writer and reads temporally stale data, or a writer barges ahead of another writer overwriting data with a fresh value preventing the previous value from ever being read (lost computation). 1248 1250 Preventing or detecting barging is an involved challenge with low-level locks, which can be made much easier by higher-level constructs. … … 1314 1316 \end{cfa} 1315 1317 1316 Mandatory monitor qualifiers have the benefit of being self-documenting, but requiring both @mutex@ and \lstinline[morekeywords=nomutex]@nomutex@ for all monitor parameter is redundant.1318 The benefit of mandatory monitor qualifiers is self-documentation, but requiring both @mutex@ and \lstinline[morekeywords=nomutex]@nomutex@ for all monitor parameter is redundant. 1317 1319 Instead, one of qualifier semantics can be the default, and the other required. 1318 1320 For example, assume the safe @mutex@ qualifier for all monitor parameters because assuming \lstinline[morekeywords=nomutex]@nomutex@ may cause subtle errors. … … 1340 1342 1341 1343 To make the issue tractable, \CFA only acquires one monitor per parameter with at most one level of indirection. 1342 However, the C type-system has an ambiguitywith respects to arrays.1344 However, there is an ambiguity in the C type-system with respects to arrays. 1343 1345 Is the argument for @f2@ a single object or an array of objects? 1344 1346 If it is an array, only the first element of the array is acquired, which seems unsafe; … … 1355 1357 1356 1358 For object-oriented monitors, calling a mutex member \emph{implicitly} acquires mutual exclusion of the receiver object, @`rec`.foo(...)@. 1357 \CFA has no receiver, and hence, must use an explicit mechanism to specify which object has mutual exclusion acquired.1359 \CFA has no receiver, and hence, must use an explicit mechanism to specify which object acquires mutual exclusion. 1358 1360 A positive consequence of this design decision is the ability to support multi-monitor routines. 1359 1361 \begin{cfa} … … 1387 1389 While \CFA provides only a partial solution, the \CFA partial solution handles many useful cases. 1388 1390 \begin{cfa} 1389 monitor Bank { ... };1390 void deposit( Bank & `mutex` b, int deposit );1391 void transfer( Bank & `mutex` mybank, Bank & `mutex` yourbank, int me2you) {1392 deposit( my bank, `-`me2you );$\C{// debit}$1393 deposit( your bank, me2you );$\C{// credit}$1391 monitor BankAccount { ... }; 1392 void deposit( BankAccount & `mutex` b, int deposit ); 1393 void transfer( BankAccount & `mutex` my, BankAccount & `mutex` your, int me2you ) { 1394 deposit( my, `-`me2you ); $\C{// debit}$ 1395 deposit( your, me2you ); $\C{// credit}$ 1394 1396 } 1395 1397 \end{cfa} … … 1398 1400 1399 1401 1400 \subsection{\protect\lstinline|mutex| statement} \label{mutex-stmt} 1402 \subsection{\protect\lstinline@mutex@ statement} 1403 \label{mutex-stmt} 1401 1404 1402 1405 The monitor call-semantics associate all locking semantics to routines. … … 1525 1528 \end{figure} 1526 1529 1527 Figure~\ref{f:BBExt} shows a \CFA generic bounded-buffer with external scheduling, where producers/consumers detecting a full/empty buffer block and prevent more producers/consumers from entering the monitor until the buffer has a free/empty slot.1530 Figure~\ref{f:BBExt} shows a \CFA generic bounded-buffer with external scheduling, where producers/consumers detecting a full/empty buffer block and prevent more producers/consumers from entering the monitor until there is a free/empty slot in the buffer. 1528 1531 External scheduling is controlled by the @waitfor@ statement, which atomically blocks the calling thread, releases the monitor lock, and restricts the routine calls that can next acquire mutual exclusion. 1529 1532 If the buffer is full, only calls to @remove@ can acquire the buffer, and if the buffer is empty, only calls to @insert@ can acquire the buffer. … … 1646 1649 } 1647 1650 \end{cfa} 1648 must have acquired monitor locks that aregreater than or equal to the number of locks for the waiting thread signalled from the condition queue.1651 must acquire a set of monitor locks greater than or equal to the number of locks for the waiting thread signalled from the condition queue. 1649 1652 1650 1653 Similarly, for @waitfor( rtn )@, the default semantics is to atomically block the acceptor and release all acquired mutex types in the parameter list, \ie @waitfor( rtn, m1, m2 )@. … … 1653 1656 To statically verify the released monitors match with the accepted routine's mutex parameters, the routine (pointer) prototype must be accessible. 1654 1657 1655 When an overloaded routine appears in an @waitfor@ statement, calls to any routine with that name are accepted.1656 The rationale is that members with the same name should perform a similar function, and therefore, all should be eligible to accept a call.1658 % When an overloaded routine appears in an @waitfor@ statement, calls to any routine with that name are accepted. 1659 % The rationale is that members with the same name should perform a similar function, and therefore, all should be eligible to accept a call. 1657 1660 As always, overloaded routines can be disambiguated using a cast: 1658 1661 \begin{cfa} 1659 1662 void rtn( M & mutex m ); 1660 1663 `int` rtn( M & mutex m ); 1661 waitfor( (`int` (*)( M & mutex ))rtn, m 1, m2);1664 waitfor( (`int` (*)( M & mutex ))rtn, m ); 1662 1665 \end{cfa} 1663 1666 … … 1681 1684 For example, there are no loops in either bounded buffer solution in Figure~\ref{f:GenericBoundedBuffer}. 1682 1685 Supporting barging prevention as well as extending internal scheduling to multiple monitors is the main source of complexity in the design and implementation of \CFA concurrency. 1683 Furthermore, \CFA concurrency has no superious wakeup~\cite[\S~9]{Buhr05a}, which eliminates an implict form of barging.1686 Furthermore, there is no spurious wakeup~\cite[\S~9]{Buhr05a} in \CFA, which eliminates an implict form of barging. 1684 1687 1685 1688 … … 1753 1756 1754 1757 One scheduling solution is for the signaller to keep ownership of all locks until the last lock is ready to be transferred, because this semantics fits most closely to the behaviour of single-monitor scheduling. 1755 However, Figure~\ref{f:OtherWaitingThread} shows this solution is complex depending on other waiters, resulting i s choices when the signaller finishes the inner mutex-statement.1756 The si ngaller can retain @m2@ until completion of the outer mutex statement and pass the locks to waiter W1, or it can pass @m2@ to waiter W2 after completing the inner mutex-statement, while continuing to hold @m1@.1758 However, Figure~\ref{f:OtherWaitingThread} shows this solution is complex depending on other waiters, resulting in options when the signaller finishes the inner mutex-statement. 1759 The signaller can retain @m2@ until completion of the outer mutex statement and pass the locks to waiter W1, or it can pass @m2@ to waiter W2 after completing the inner mutex-statement, while continuing to hold @m1@. 1757 1760 In the latter case, waiter W2 must eventually pass @m2@ to waiter W1, which is complex because W1 may have waited before W2, so W2 is unaware of it. 1758 1761 Furthermore, there is an execution sequence where the signaller always finds waiter W2, and hence, waiter W1 starves. … … 1762 1765 Partial signalling transfers ownership of monitors to the front waiter. 1763 1766 When the signaller thread exits or waits in the monitor the front waiter is unblocked if all its monitors are released. 1764 Th is solution has the benefit that complexity is encapsulatedinto only two actions: passing monitors to the next owner when they should be released and conditionally waking threads if all conditions are met.1767 The benefit of this solution is encapsulating complexity into only two actions: passing monitors to the next owner when they should be released and conditionally waking threads if all conditions are met. 1765 1768 1766 1769 … … 1829 1832 waitfor( f, m2 ); $\C{\color{red}// wait for call to f with argument m2}$ 1830 1833 \end{cfa} 1831 Routine @g@ has acquired both locks, so when routine @f@ is called, the lock for monitor @m2@ is passed from @g@ to @f@, while @g@ still holds lock @m1@.1834 Both locks are acquired by routine @g@, so when routine @f@ is called, the lock for monitor @m2@ is passed from @g@ to @f@, while @g@ still holds lock @m1@. 1832 1835 This behaviour can be extended to the multi-monitor @waitfor@ statement. 1833 1836 \begin{cfa} … … 1925 1928 When the buffer is deallocated, the current waiter is unblocked and informed, so it can perform an appropriate action. 1926 1929 However, the basic @waitfor@ semantics do not support this functionality, since using an object after its destructor is called is undefined. 1927 Therefore, to make this useful capability work, the semantics for accepting the destructor is the same as @signal@, \ie the call to the destructor is placed on the urgent queue and the acceptor continues execution, which throws an exception to the acceptor and then the caller is unbloc ed from the urgent queue to deallocate the object.1930 Therefore, to make this useful capability work, the semantics for accepting the destructor is the same as @signal@, \ie the call to the destructor is placed on the urgent queue and the acceptor continues execution, which throws an exception to the acceptor and then the caller is unblocked from the urgent queue to deallocate the object. 1928 1931 Accepting the destructor is an idiomatic way to terminate a thread in \CFA. 1929 1932 … … 1933 1936 Threads in \CFA are monitors to allow direct communication among threads, \ie threads can have mutex routines that are called by other threads. 1934 1937 Hence, all monitor features are available when using threads. 1935 Figure~\ref{f:pingpong} shows an example of two threads directly calling each other and accepting calls from each other in a cycle. 1936 Note, both ping/pong threads are globally declared, @pi@/@po@, and hence, start (and possibly complete) before the program main starts. 1937 1938 \begin{figure} 1939 \lstDeleteShortInline@% 1940 \begin{cquote} 1938 The following shows an example of two threads directly calling each other and accepting calls from each other in a cycle. 1941 1939 \begin{cfa} 1942 1940 thread Ping {} pi; … … 1946 1944 int main() {} 1947 1945 \end{cfa} 1946 \vspace{-0.8\baselineskip} 1947 \begin{cquote} 1948 1948 \begin{tabular}{@{}l@{\hspace{3\parindentlnth}}l@{}} 1949 1949 \begin{cfa} … … 1965 1965 \end{cfa} 1966 1966 \end{tabular} 1967 \lstMakeShortInline@%1968 1967 \end{cquote} 1969 \caption{Threads ping/pong using external scheduling} 1970 \label{f:pingpong} 1971 \end{figure} 1968 % \lstMakeShortInline@% 1969 % \caption{Threads ping/pong using external scheduling} 1970 % \label{f:pingpong} 1971 % \end{figure} 1972 Note, the ping/pong threads are globally declared, @pi@/@po@, and hence, start (and possibly complete) before the program main starts. 1973 1974 1975 \subsection{Low-level Locks} 1976 1977 For completeness and efficiency, \CFA provides a standard set of low-level locks: recursive mutex, condition, semaphore, barrier, \etc, and atomic instructions: @fetchAssign@, @fetchAdd@, @testSet@, @compareSet@, \etc. 1978 1972 1979 1973 1980 \section{Parallelism} 1974 1981 1975 1982 Historically, computer performance was about processor speeds. 1976 However, with heat dissipation being a direct consequence of speed increase, parallelism has becomethe new source for increased performance~\cite{Sutter05, Sutter05b}.1983 However, with heat dissipation being a direct consequence of speed increase, parallelism is the new source for increased performance~\cite{Sutter05, Sutter05b}. 1977 1984 Now, high-performance applications must care about parallelism, which requires concurrency. 1978 1985 The lowest-level approach of parallelism is to use \newterm{kernel threads} in combination with semantics like @fork@, @join@, \etc. … … 2000 2007 \subsection{Thread Pools} 2001 2008 2002 In contrast to direct threading is indirect \newterm{thread pools}, where small jobs (work units) are insert into a work pool for execution.2009 In contrast to direct threading is indirect \newterm{thread pools}, where small jobs (work units) are inserted into a work pool for execution. 2003 2010 If the jobs are dependent, \ie interact, there is an implicit/explicit dependency graph that ties them together. 2004 2011 While removing direct concurrency, and hence the amount of context switching, thread pools significantly limit the interaction that can occur among jobs. 2005 Indeed, jobs should not block because that also block the underlying thread, which effectively means the CPU utilization, and therefore throughput, suffers.2012 Indeed, jobs should not block because that also blocks the underlying thread, which effectively means the CPU utilization, and therefore throughput, suffers. 2006 2013 While it is possible to tune the thread pool with sufficient threads, it becomes difficult to obtain high throughput and good core utilization as job interaction increases. 2007 2014 As well, concurrency errors return, which threads pools are suppose to mitigate. 2008 Intel's TBB library~\cite{TBB} is the gold standard for thread pools.2009 2015 2010 2016 … … 2036 2042 The user cluster is created to contain the application user-threads. 2037 2043 Having all threads execute on the one cluster often maximizes utilization of processors, which minimizes runtime. 2038 However, because of limitations of the underlying operating system, heterogeneous hardware, or scheduling requirements (real-time), it is sometimes necessary to have multiple clusters.2044 However, because of limitations of the underlying operating system, heterogeneous hardware, or scheduling requirements (real-time), multiple clusters are sometimes necessary. 2039 2045 2040 2046 … … 2074 2080 This storage is allocated at the base of a thread's stack before blocking, which means programmers must add a small amount of extra space for stacks. 2075 2081 2076 In \CFA, ordering of monitor acquisition relies on memory ordering to prevent deadlock~\cite{Havender68}, because all objects are guaranteed tohave distinct non-overlapping memory layouts, and mutual-exclusion for a monitor is only defined for its lifetime.2082 In \CFA, ordering of monitor acquisition relies on memory ordering to prevent deadlock~\cite{Havender68}, because all objects have distinct non-overlapping memory layouts, and mutual-exclusion for a monitor is only defined for its lifetime. 2077 2083 When a mutex call is made, pointers to the concerned monitors are aggregated into a variable-length array and sorted. 2078 2084 This array persists for the entire duration of the mutual-exclusion and its ordering reused extensively. … … 2085 2091 This method is a 2-step context-switch and provides a clear distinction between user and kernel code, where scheduling and other system operations happen. 2086 2092 The alternative 1-step context-switch uses the \emph{from} thread's stack to schedule and then context-switches directly to the \emph{to} thread's stack. 2087 Experimental results (not presented) show the performance difference between these two approaches is virtually equivalent, because the 1-step performance is dominated by a lock instructionto prevent a race condition.2093 Experimental results (not presented) show the performance difference between these two approaches is virtually equivalent, because both approaches are dominated by locking to prevent a race condition. 2088 2094 2089 2095 All kernel threads (@pthreads@) created a stack. … … 2097 2103 When the timer expires, an interrupt is delivered, and the interrupt handler resets the count-down timer, and if the virtual processor is executing in user code, the signal handler performs a user-level context-switch, or if executing in the language runtime-kernel, the preemption is ignored or rolled forward to the point where the runtime kernel context switches back to user code. 2098 2104 Multiple signal handlers may be pending. 2099 When control eventually switches back to the signal handler, it returns normally, and execution continues in the interrupted user thread, even though the return from the signal handler may be on a different kernel thread than the one where the signal was delivered.2105 When control eventually switches back to the signal handler, it returns normally, and execution continues in the interrupted user thread, even though the return from the signal handler may be on a different kernel thread than the one where the signal is delivered. 2100 2106 The only issue with this approach is that signal masks from one kernel thread may be restored on another as part of returning from the signal handler; 2101 therefore, all virtual processors in a cluster need to have the same signal mask.2107 therefore, the same signal mask is required for all virtual processors in a cluster. 2102 2108 2103 2109 However, on current UNIX systems: … … 2277 2283 Figure~\ref{f:int-sched} shows the code for \CFA, with results in Table~\ref{tab:int-sched}. 2278 2284 Note, the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects. 2285 Java scheduling is significantly greater because the benchmark explicitly creates multiple thread in order to prevent the JIT from making the program sequential, \ie removing all locking. 2279 2286 2280 2287 \begin{figure} … … 2439 2446 However, no single scheduler is optimal for all workloads and therefore there is value in being able to change the scheduler for given programs. 2440 2447 One solution is to offer various tweaking options, allowing the scheduler to be adjusted to the requirements of the workload. 2441 However, to be truly flexible, it is necessary to have a pluggable scheduler.2448 However, to be truly flexible, a pluggable scheduler is necessary. 2442 2449 Currently, the \CFA pluggable scheduler is too simple to handle complex scheduling, \eg quality of service and real-time, where the scheduler must interact with mutex objects to deal with issues like priority inversion. 2443 2450 … … 2449 2456 At its core, non-blocking I/O is an operating-system level feature queuing IO operations, \eg network operations, and registering for notifications instead of waiting for requests to complete. 2450 2457 Current trends use asynchronous programming like callbacks, futures, and/or promises, \eg Node.js~\cite{NodeJs} for JavaScript, Spring MVC~\cite{SpringMVC} for Java, and Django~\cite{Django} for Python. 2451 However, these solutions lead to code that is hard create, read and maintain.2458 However, these solutions lead to code that is hard to create, read and maintain. 2452 2459 A better approach is to tie non-blocking I/O into the concurrency system to provide ease of use with low overhead, \eg thread-per-connection web-services. 2453 2460 A non-blocking I/O library is currently under development for \CFA. … … 2459 2466 Examples of such tools can include futures and promises~\cite{promises}, executors and actors. 2460 2467 These additional features are useful when monitors offer a level of abstraction that is inadequate for certain tasks. 2468 As well, new \CFA extensions should make it possible to create a uniform interface for virtually all mutual exclusion, including monitors and low-level locks. 2461 2469 2462 2470 \paragraph{Implicit Threading} … … 2467 2475 The canonical example of implicit concurrency is concurrent nested @for@ loops, which are amenable to divide and conquer algorithms~\cite{uC++book}. 2468 2476 The \CFA language features should make it possible to develop a reasonable number of implicit concurrency mechanism to solve basic HPC data-concurrency problems. 2469 However, implicit concurrency is a restrictive solution and has itslimitations, so it can never replace explicit concurrent programming.2477 However, implicit concurrency is a restrictive solution with significant limitations, so it can never replace explicit concurrent programming. 2470 2478 2471 2479
Note: See TracChangeset
for help on using the changeset viewer.