Changeset 8b58bae for doc/papers/concurrency/Paper.tex
- Timestamp: Jun 24, 2020, 5:00:59 PM
- Branches: ADT, arm-eh, ast-experimental, enum, forall-pointer-decay, jacob/cs343-translation, master, new-ast, new-ast-unique-expr, pthread-emulation, qualifiedEnum
- Children: c953163
- Parents: 9791ab5, 7f9968a
- Files: 1 edited

Note: this is a merge changeset; the changes displayed below correspond to the merge itself, not to either parent individually.
doc/papers/concurrency/Paper.tex
--- Paper.tex (r9791ab5)
+++ Paper.tex (r8b58bae)

 \CFA~\cite{Moss18,Cforall} is a modern, polymorphic, non-object-oriented\footnote{
-\CFA has object-oriented features, such as constructors, destructors, virtuals and simple trait/interface inheritance.
+\CFA has object-oriented features, such as constructors, destructors, and simple trait/interface inheritance.
 % Go interfaces, Rust traits, Swift Protocols, Haskell Type Classes and Java Interfaces.
 % "Trait inheritance" works for me. "Interface inheritance" might also be a good choice, and distinguish clearly from implementation inheritance.
-% You'll want to be a little bit careful with terms like "structural" and "nominal" inheritance as well. CFA has structural inheritance (I think Go as well) -- it's inferred based on the structure of the code. Java, Rust, and Haskell (not sure about Swift) have nominal inheritance, where there needs to be a specific statement that "this type inherits from this type".
-However, functions \emph{cannot} be nested in structures, so there is no lexical binding between a structure and set of functions implemented by an implicit \lstinline@this@ (receiver) parameter.},
+% You'll want to be a little bit careful with terms like "structural" and "nominal" inheritance as well. CFA has structural inheritance (I think Go as well) -- it's inferred based on the structure of the code.
+% Java, Rust, and Haskell (not sure about Swift) have nominal inheritance, where there needs to be a specific statement that "this type inherits from this type".
+However, functions \emph{cannot} be nested in structures and there is no mechanism to designate a function parameter as a receiver, \lstinline@this@, parameter.},
 backwards-compatible extension of the C programming language.
 In many ways, \CFA is to C as Scala~\cite{Scala} is to Java, providing a vehicle for new typing and control-flow capabilities on top of a highly popular programming language\footnote{
…
 Coroutines are only a stepping stone towards concurrency where the commonality is that coroutines and threads retain state between calls.

-\Celeven/\CCeleven define concurrency~\cite[\S~7.26]{C11}, but it is largely wrappers for a subset of the pthreads library~\cite{Pthreads}.\footnote{Pthreads concurrency is based on simple thread fork and join in a function and mutex or condition locks, which is low-level and error-prone}
+\Celeven and \CCeleven define concurrency~\cite[\S~7.26]{C11}, but it is largely wrappers for a subset of the pthreads library~\cite{Pthreads}.\footnote{Pthreads concurrency is based on simple thread fork and join in a function and mutex or condition locks, which is low-level and error-prone}
 Interestingly, almost a decade after the \Celeven standard, the most recent versions of gcc, clang, and msvc do not support the \Celeven include @threads.h@, indicating no interest in the C11 concurrency approach (possibly because of the recent effort to add concurrency to \CC).
 While the \Celeven standard does not state a threading model, the historical association with pthreads suggests implementations would adopt kernel-level threading (1:1)~\cite{ThreadModel}, as for \CC.
…
 \label{s:FundamentalExecutionProperties}

-The features in a programming language should be composed from a set of fundamental properties rather than an ad hoc collection chosen by the designers.
+The features in a programming language should be composed of a set of fundamental properties rather than an ad hoc collection chosen by the designers.
 To this end, the control-flow features created for \CFA are based on the fundamental properties of any language with function-stack control-flow (see also \uC~\cite[pp.~140-142]{uC++}).
-The fundamental properties are execution state, thread, and mutual-exclusion/synchronization (MES).
+The fundamental properties are execution state, thread, and mutual-exclusion/synchronization.
 These independent properties can be used to compose different language features, forming a compositional hierarchy, where the combination of all three is the most advanced feature, called a thread.
 While it is possible for a language to only provide threads for composing programs~\cite{Hermes90}, this unnecessarily complicates and makes inefficient solutions to certain classes of problems.
 As is shown, each of the non-rejected composed language features solves a particular set of problems, and hence, has a defensible position in a programming language.
-If a compositional feature is missing, a programmer has too few fundamental properties resulting in a complex and/or is inefficient solution.
+If a compositional feature is missing, a programmer has too few fundamental properties resulting in a complex and/or inefficient solution.

 In detail, the fundamental properties are:
 \begin{description}[leftmargin=\parindent,topsep=3pt,parsep=0pt]
 \item[\newterm{execution state}:]
-is the state information needed by a control-flow feature to initialize, manage compute data and execution location(s), and de-initialize, \eg calling a function initializes a stack frame including contained objects with constructors, manages local data in blocks and return locations during calls, and de-initializes the frame by running any object destructors and management operations.
+is the state information needed by a control-flow feature to initialize and manage both compute data and execution location(s), and de-initialize.
+For example, calling a function initializes a stack frame including contained objects with constructors, manages local data in blocks and return locations during calls, and de-initializes the frame by running any object destructors and management operations.
 State is retained in fixed-sized aggregate structures (objects) and dynamic-sized stack(s), often allocated in the heap(s) managed by the runtime system.
 The lifetime of state varies with the control-flow feature, where longer life-time and dynamic size provide greater power but also increase usage complexity and cost.
…
 Multiple threads provide \emph{concurrent execution};
 concurrent execution becomes parallel when run on multiple processing units, \eg hyper-threading, cores, or sockets.
-There must be language mechanisms to create, block and unblock, and join with a thread, even if the mechanism is indirect.
+A programmer needs mechanisms to create, block and unblock, and join with a thread, even if these basic mechanisms are supplied indirectly through high-level features.

-\item[\newterm{MES}:]
-is the concurrency mechanisms to perform an action without interruption and establish timing relationships among multiple threads.
+\item[\newterm{mutual-exclusion / synchronization (MES)}:]
+is the concurrency mechanism to perform an action without interruption and establish timing relationships among multiple threads.
 We contented these two properties are independent, \ie mutual exclusion cannot provide synchronization and vice versa without introducing additional threads~\cite[\S~4]{Buhr05a}.
-Limiting MES, \eg no access to shared data, results in contrived solutions and inefficiency on multi-core von Neumann computers where shared memory is a foundational aspect of its design.
+Limiting MES functionality results in contrived solutions and inefficiency on multi-core von Neumann computers where shared memory is a foundational aspect of its design.
 \end{description}
-These properties are fundamental because they cannot be built from existing language features, \eg a basic programming language like C99~\cite{C99} cannot create new control-flow features, concurrency, or provide MES without atomic hardware mechanisms.
+These properties are fundamental as they cannot be built from existing language features, \eg a basic programming language like C99~\cite{C99} cannot create new control-flow features, concurrency, or provide MES without (atomic) hardware mechanisms.
…
 \renewcommand{\arraystretch}{1.25}
 %\setlength{\tabcolsep}{5pt}
+\vspace*{-5pt}
 \begin{tabular}{c|c||l|l}
 \multicolumn{2}{c||}{execution properties} & \multicolumn{2}{c}{mutual exclusion / synchronization} \\
…
 Yes (stackful) & Yes & \textbf{11}\ \ \ @thread@ & \textbf{12}\ \ @mutex@ @thread@ \\
 \end{tabular}
+\vspace*{-8pt}
 \end{table}
…
 A @mutex@ structure, often called a \newterm{monitor}, provides a high-level interface for race-free access of shared data in concurrent programming-languages.
 Case 3 is case 1 where the structure can implicitly retain execution state and access functions use this execution state to resume/suspend across \emph{callers}, but resume/suspend does not retain a function's local state.
-A stackless structure, often called a \newterm{generator} or \emph{iterator}, is \newterm{stackless} because it still borrow the caller's stack and thread, but the stack is used only to preserve state across its callees not callers.
+A stackless structure, often called a \newterm{generator} or \emph{iterator}, is \newterm{stackless} because it still borrows the caller's stack and thread, but the stack is used only to preserve state across its callees not callers.
 Generators provide the first step toward directly solving problems like finite-state machines that retain data and execution state between calls, whereas normal functions restart on each call.
 Case 4 is cases 2 and 3 with thread safety during execution of the generator's access functions.
…
 A stackful generator, often called a \newterm{coroutine}, is \newterm{stackful} because resume/suspend now context switch to/from the caller's and coroutine's stack.
 A coroutine extends the state retained between calls beyond the generator's structure to arbitrary call depth in the access functions.
-Cases 7 and 8 are rejected because a new thread must have its own stack, where the thread begins and stack frames are stored for calls, \ie it is unrealistic for a thread to borrow a stack.
-Cases 9 and 10 are rejected because a thread needs a growable stack to accept calls, make calls, block, or be preempted, all of which compound to require an unknown amount of execution state.
-If this kind of thread exists, it must execute to completion, \ie computation only, which severely restricts runtime management.
+Cases 7, 8, 9 and 10 are rejected because a new thread must have its own stack, where the thread begins and stack frames are stored for calls, \ie it is unrealistic for a thread to borrow a stack.
+For cases 9 and 10, the stackless frame is not growable, precluding accepting nested calls, making calls, blocking as it requires calls, or preemption as it requires pushing an interrupt frame, all of which compound to require an unknown amount of execution state.
+Hence, if this kind of uninterruptable thread exists, it must execute to completion, \ie computation only, which severely restricts runtime management.
 Cases 11 and 12 are a stackful thread with and without safe access to shared state.
 A thread is the language mechanism to start another thread of control in a program with growable execution state for call/return execution.
…
 The call to @start@ is the first @resume@ of @prod@, which remembers the program main as the starter and creates @prod@'s stack with a frame for @prod@'s coroutine main at the top, and context switches to it.
 @prod@'s coroutine main starts, creates local-state variables that are retained between coroutine activations, and executes $N$ iterations, each generating two random values, calling the consumer's @deliver@ function to transfer the values, and printing the status returned from the consumer.
-The producer call to @delivery@ transfers values into the consumer's communication variables, resumes the consumer, and returns the consumer status.
+The producer's call to @delivery@ transfers values into the consumer's communication variables, resumes the consumer, and returns the consumer status.
 Similarly on the first resume, @cons@'s stack is created and initialized, holding local-state variables retained between subsequent activations of the coroutine.
 The symmetric coroutine cycle forms when the consumer calls the producer's @payment@ function, which resumes the producer in the consumer's delivery function.
 When the producer calls @delivery@ again, it resumes the consumer in the @payment@ function.
-Both interface function than return to thetheir corresponding coroutine-main functions for the next cycle.
+Both interface functions then return to their corresponding coroutine-main functions for the next cycle.
 Figure~\ref{f:ProdConsRuntimeStacks} shows the runtime stacks of the program main, and the coroutine mains for @prod@ and @cons@ during the cycling.
 As a consequence of a coroutine retaining its last resumer for suspending back, these reverse pointers allow @suspend@ to cycle \emph{backwards} around a symmetric coroutine cycle.
…
 Terminating a coroutine cycle is more complex than a generator cycle, because it requires context switching to the program main's \emph{stack} to shutdown the program, whereas generators started by the program main run on its stack.
-Furthermore, each deallocated coroutine must execute all destructors for object allocated in the coroutine type \emph{and} allocated on the coroutine's stack at the point of suspension, which can be arbitrarily deep.
+Furthermore, each deallocated coroutine must execute all destructors for objects allocated in the coroutine type \emph{and} allocated on the coroutine's stack at the point of suspension, which can be arbitrarily deep.
 In the example, termination begins with the producer's loop stopping after N iterations and calling the consumer's @stop@ function, which sets the @done@ flag, resumes the consumer in function @payment@, terminating the call, and the consumer's loop in its coroutine main.
 % (Not shown is having @prod@ raise a nonlocal @stop@ exception at @cons@ after it finishes generating values and suspend back to @cons@, which catches the @stop@ exception to terminate its loop.)
…
 if @ping@ ends first, it resumes its starter the program main on return.
 Regardless of the cycle complexity, the starter structure always leads back to the program main, but the path can be entered at an arbitrary point.
-Once back at the program main (creator), coroutines @ping@ and @pong@ are deallocated, runnning any destructors for objects within the coroutine and possibly deallocating any coroutine stacks for non-terminated coroutines, where stack deallocation implies stack unwinding to find destructors for allocated objects on the stack.
-Hence, the \CFA termination semantics for the generator and coroutine ensure correct deallocation semnatics, regardless of the coroutine's state (terminated or active), like any other aggregate object.
+Once back at the program main (creator), coroutines @ping@ and @pong@ are deallocated, running any destructors for objects within the coroutine and possibly deallocating any coroutine stacks for non-terminated coroutines, where stack deallocation implies stack unwinding to find destructors for allocated objects on the stack.
+Hence, the \CFA termination semantics for the generator and coroutine ensure correct deallocation semantics, regardless of the coroutine's state (terminated or active), like any other aggregate object.
…
 A significant implementation challenge for generators and coroutines (and threads in Section~\ref{s:threads}) is adding extra fields to the custom types and related functions, \eg inserting code after/before the coroutine constructor/destructor and @main@ to create/initialize/de-initialize/destroy any extra fields, \eg the coroutine stack.
-There are several solutions to these problem, which follow from the object-oriented flavour of adopting custom types.
+There are several solutions to this problem, which follow from the object-oriented flavour of adopting custom types.
 For object-oriented languages, inheritance is used to provide extra fields and code via explicit inheritance:
…
 forall( `dtype` T | is_coroutine(T) ) void $suspend$( T & ), resume( T & );
 \end{cfa}
-Note, copying generators, coroutines, and threads is undefined because muliple objects cannot execute on a shared stack and stack copying does not work in unmanaged languages (no garbage collection), like C, because the stack may contain pointers to objects within it that require updating for the copy.
+Note, copying generators, coroutines, and threads is undefined because multiple objects cannot execute on a shared stack and stack copying does not work in unmanaged languages (no garbage collection), like C, because the stack may contain pointers to objects within it that require updating for the copy.
 The \CFA @dtype@ property provides no \emph{implicit} copying operations and the @is_coroutine@ trait provides no \emph{explicit} copying operations, so all coroutines must be passed by reference or pointer.
 The function definitions ensure there is a statically typed @main@ function that is the starting point (first stack frame) of a coroutine, and a mechanism to read the coroutine descriptor from its handle.
…
 MyThread * team = factory( 10 );
 // concurrency
-`delete( team );` $\C{// deallocate heap-based threads, implicit joins before destruction}\CRT$
+`adelete( team );` $\C{// deallocate heap-based threads, implicit joins before destruction}\CRT$
 }
 \end{cfa}
…
 Unrestricted nondeterminism is meaningless as there is no way to know when a result is completed and safe to access.
 To produce meaningful execution requires clawing back some determinism using mutual exclusion and synchronization, where mutual exclusion provides access control for threads using shared data, and synchronization is a timing relationship among threads~\cite[\S~4]{Buhr05a}.
-The shared data protected by mutual exlusion is called a \newterm{critical section}~\cite{Dijkstra65}, and the protection can be simple, only 1 thread, or complex, only N kinds of threads, \eg group~\cite{Joung00} or readers/writer~\cite{Courtois71} problems.
+The shared data protected by mutual exclusion is called a \newterm{critical section}~\cite{Dijkstra65}, and the protection can be simple, only 1 thread, or complex, only N kinds of threads, \eg group~\cite{Joung00} or readers/writer~\cite{Courtois71} problems.
 Without synchronization control in a critical section, an arriving thread can barge ahead of preexisting waiter threads resulting in short/long-term starvation, staleness and freshness problems, and incorrect transfer of data.
 Preventing or detecting barging is a challenge with low-level locks, but made easier through higher-level constructs.
…
 \end{cquote}
 The @dtype@ property prevents \emph{implicit} copy operations and the @is_monitor@ trait provides no \emph{explicit} copy operations, so monitors must be passed by reference or pointer.
-Similarly, the function definitions ensures there is a mechanism to read the monitor descriptor from its handle, and a special destructor to prevent deallocation if a thread is using the shared data.
+Similarly, the function definitions ensure there is a mechanism to read the monitor descriptor from its handle, and a special destructor to prevent deallocation if a thread is using the shared data.
 The custom monitor type also inserts any locks needed to implement the mutual exclusion semantics.
 \CFA relies heavily on traits as an abstraction mechanism, so the @mutex@ qualifier prevents coincidentally matching of a monitor trait with a type that is not a monitor, similar to coincidental inheritance where a shape and playing card can both be drawable.
…
 One scheduling solution is for the signaller S to keep ownership of all locks until the last lock is ready to be transferred, because this semantics fits most closely to the behaviour of single-monitor scheduling.
-However, this solution is inefficient if W2 waited first and can be immediate passed @m2@ when released, while S retains @m1@ until completion of the outer mutex statement.
+However, this solution is inefficient if W2 waited first and immediate passed @m2@ when released, while S retains @m1@ until completion of the outer mutex statement.
 If W1 waited first, the signaller must retain @m1@ amd @m2@ until completion of the outer mutex statement and then pass both to W1.
 % Furthermore, there is an execution sequence where the signaller always finds waiter W2, and hence, waiter W1 starves.
-To support this efficient semantics and prevent barging, the implementation maintains a list of monitors acquired for each blocked thread.
+To support these efficient semantics and prevent barging, the implementation maintains a list of monitors acquired for each blocked thread.
 When a signaller exits or waits in a mutex function or statement, the front waiter on urgent is unblocked if all its monitors are released.
-Implementing a fast subset check for the necessary released monitors is important and discussed in the following sections.
+Implementing a fast subset check for the necessarily released monitors is important and discussed in the following sections.
 % The benefit is encapsulating complexity into only two actions: passing monitors to the next owner when they should be released and conditionally waking threads if all conditions are met.
…
 Hence, function pointers are used to identify the functions listed in the @waitfor@ statement, stored in a variable-sized array.
 Then, the same implementation approach used for the urgent stack (see Section~\ref{s:Scheduling}) is used for the calling queue.
-Each caller has a list of monitors acquired, and the @waitfor@ statement performs a short linear search matching functions in the @waitfor@ list with called functions, and then verifying the associated mutex locks can be transfers.
+Each caller has a list of monitors acquired, and the @waitfor@ statement performs a short linear search matching functions in the @waitfor@ list with called functions, and then verifying the associated mutex locks can be transferred.
…
 The \CFA program @main@ uses the call/return paradigm to directly communicate with the @GoRtn main@, whereas Go switches to the unbuffered channel paradigm to indirectly communicate with the goroutine.
 Communication by multiple threads is safe for the @gortn@ thread via mutex calls in \CFA or channel assignment in Go.
-The different between call and channel send occurs for buffered channels making the send asynchronous.
-In \CFA, asynchronous call and multiple buffers is provided using an administrator and worker threads~\cite{Gentleman81} and/or futures (not discussed).
+The difference between call and channel send occurs for buffered channels making the send asynchronous.
+In \CFA, asynchronous call and multiple buffers are provided using an administrator and worker threads~\cite{Gentleman81} and/or futures (not discussed).
 Figure~\ref{f:DirectCommunicationDatingService} shows the dating-service problem in Figure~\ref{f:DatingServiceMonitor} extended from indirect monitor communication to direct thread communication.
-When converting a monitor to a thread (server), the coding pattern is to move as much code as possible from the accepted functions into the thread main so it does an much work as possible.
+When converting a monitor to a thread (server), the coding pattern is to move as much code as possible from the accepted functions into the thread main so it does as much work as possible.
 Notice, the dating server is postponing requests for an unspecified time while continuing to accept new requests.
 For complex servers, \eg web-servers, there can be hundreds of lines of code in the thread main and safe interaction with clients can be complex.
…
 For completeness and efficiency, \CFA provides a standard set of low-level locks: recursive mutex, condition, semaphore, barrier, \etc, and atomic instructions: @fetchAssign@, @fetchAdd@, @testSet@, @compareSet@, \etc.
-Some of these low-level mechanism are used to build the \CFA runtime, but we always advocate using high-level mechanisms whenever possible.
+Some of these low-level mechanisms are used to build the \CFA runtime, but we always advocate using high-level mechanisms whenever possible.
…
 To test the performance of the \CFA runtime, a series of microbenchmarks are used to compare \CFA with pthreads, Java 11.0.6, Go 1.12.6, Rust 1.37.0, Python 3.7.6, Node.js 12.14.1, and \uC 7.0.0.
-For comparison, the package must be multi-processor (M:N), which excludes libdil and/libmil~\cite{libdill} (M:1)), and use a shared-memory programming model, \eg not message passing.
+For comparison, the package must be multi-processor (M:N), which excludes libdil and libmil~\cite{libdill} (M:1)), and use a shared-memory programming model, \eg not message passing.
 The benchmark computer is an AMD Opteron\texttrademark\ 6380 NUMA 64-core, 8 socket, 2.5 GHz processor, running Ubuntu 16.04.6 LTS, and pthreads/\CFA/\uC are compiled with gcc 9.2.1.
…
 Figure~\ref{f:schedint} shows the code for \CFA, with results in Table~\ref{t:schedint}.
 Note, the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects.
-Java scheduling is significantly greater because the benchmark explicitly creates multiple thread in order to prevent the JIT from making the program sequential, \ie removing all locking.
+Java scheduling is significantly greater because the benchmark explicitly creates multiple threads in order to prevent the JIT from making the program sequential, \ie removing all locking.
…
 This type of concurrency can be achieved both at the language level and at the library level.
 The canonical example of implicit concurrency is concurrent nested @for@ loops, which are amenable to divide and conquer algorithms~\cite{uC++book}.
-The \CFA language features should make it possible to develop a reasonable number of implicit concurrency mechanism to solve basic HPC data-concurrency problems.
+The \CFA language features should make it possible to develop a reasonable number of implicit concurrency mechanisms to solve basic HPC data-concurrency problems.
 However, implicit concurrency is a restrictive solution with significant limitations, so it can never replace explicit concurrent programming.