Timestamp:
Jun 24, 2020, 5:00:59 PM
Author:
Thierry Delisle <tdelisle@…>
Branches:
ADT, arm-eh, ast-experimental, enum, forall-pointer-decay, jacob/cs343-translation, master, new-ast, new-ast-unique-expr, pthread-emulation, qualifiedEnum
Children:
c953163
Parents:
9791ab5 (diff), 7f9968a (diff)
Note: this is a merge changeset, the changes displayed below correspond to the merge itself.
Use the (diff) links above to see all the changes relative to each parent.
Message:

Merge branch 'master' into relaxed_ready

File:
1 edited

  • doc/papers/concurrency/Paper.tex

    r9791ab5 r8b58bae  
    292292
    293293\CFA~\cite{Moss18,Cforall} is a modern, polymorphic, non-object-oriented\footnote{
    294 \CFA has object-oriented features, such as constructors, destructors, virtuals and simple trait/interface inheritance.
     294\CFA has object-oriented features, such as constructors, destructors, and simple trait/interface inheritance.
    295295% Go interfaces, Rust traits, Swift Protocols, Haskell Type Classes and Java Interfaces.
    296296% "Trait inheritance" works for me. "Interface inheritance" might also be a good choice, and distinguish clearly from implementation inheritance.
    297 % You'll want to be a little bit careful with terms like "structural" and "nominal" inheritance as well. CFA has structural inheritance (I think Go as well) -- it's inferred based on the structure of the code. Java, Rust, and Haskell (not sure about Swift) have nominal inheritance, where there needs to be a specific statement that "this type inherits from this type".
    298 However, functions \emph{cannot} be nested in structures, so there is no lexical binding between a structure and set of functions implemented by an implicit \lstinline@this@ (receiver) parameter.},
     297% You'll want to be a little bit careful with terms like "structural" and "nominal" inheritance as well. CFA has structural inheritance (I think Go as well) -- it's inferred based on the structure of the code.
     298% Java, Rust, and Haskell (not sure about Swift) have nominal inheritance, where there needs to be a specific statement that "this type inherits from this type".
 299However, functions \emph{cannot} be nested in structures, and there is no mechanism to designate a function parameter as a receiver (\lstinline@this@) parameter.},
    299300backwards-compatible extension of the C programming language.
    300301In many ways, \CFA is to C as Scala~\cite{Scala} is to Java, providing a vehicle for new typing and control-flow capabilities on top of a highly popular programming language\footnote{
     
    317318Coroutines are only a stepping stone towards concurrency where the commonality is that coroutines and threads retain state between calls.
    318319
    319 \Celeven/\CCeleven define concurrency~\cite[\S~7.26]{C11}, but it is largely wrappers for a subset of the pthreads library~\cite{Pthreads}.\footnote{Pthreads concurrency is based on simple thread fork and join in a function and mutex or condition locks, which is low-level and error-prone}
 320\Celeven and \CCeleven define concurrency~\cite[\S~7.26]{C11}, but these are largely wrappers for a subset of the pthreads library~\cite{Pthreads}.\footnote{Pthreads concurrency is based on simple thread fork and join in a function and mutex or condition locks, which is low-level and error-prone.}
    320321Interestingly, almost a decade after the \Celeven standard, the most recent versions of gcc, clang, and msvc do not support the \Celeven include @threads.h@, indicating no interest in the C11 concurrency approach (possibly because of the recent effort to add concurrency to \CC).
    321322While the \Celeven standard does not state a threading model, the historical association with pthreads suggests implementations would adopt kernel-level threading (1:1)~\cite{ThreadModel}, as for \CC.
     
    392393\label{s:FundamentalExecutionProperties}
    393394
    394 The features in a programming language should be composed from a set of fundamental properties rather than an ad hoc collection chosen by the designers.
     395The features in a programming language should be composed of a set of fundamental properties rather than an ad hoc collection chosen by the designers.
    395396To this end, the control-flow features created for \CFA are based on the fundamental properties of any language with function-stack control-flow (see also \uC~\cite[pp.~140-142]{uC++}).
    396 The fundamental properties are execution state, thread, and mutual-exclusion/synchronization (MES).
     397The fundamental properties are execution state, thread, and mutual-exclusion/synchronization.
    397398These independent properties can be used to compose different language features, forming a compositional hierarchy, where the combination of all three is the most advanced feature, called a thread.
    398399While it is possible for a language to provide only threads for composing programs~\cite{Hermes90}, this unnecessarily complicates, and makes inefficient, the solutions to certain classes of problems.
    399400As is shown, each of the non-rejected composed language features solves a particular set of problems, and hence, has a defensible position in a programming language.
    400 If a compositional feature is missing, a programmer has too few fundamental properties resulting in a complex and/or is inefficient solution.
     401If a compositional feature is missing, a programmer has too few fundamental properties resulting in a complex and/or inefficient solution.
    401402
    402403In detail, the fundamental properties are:
    403404\begin{description}[leftmargin=\parindent,topsep=3pt,parsep=0pt]
    404405\item[\newterm{execution state}:]
    405 is the state information needed by a control-flow feature to initialize, manage compute data and execution location(s), and de-initialize, \eg calling a function initializes a stack frame including contained objects with constructors, manages local data in blocks and return locations during calls, and de-initializes the frame by running any object destructors and management operations.
     406is the state information needed by a control-flow feature to initialize and manage both compute data and execution location(s), and de-initialize.
     407For example, calling a function initializes a stack frame including contained objects with constructors, manages local data in blocks and return locations during calls, and de-initializes the frame by running any object destructors and management operations.
    406408State is retained in fixed-sized aggregate structures (objects) and dynamic-sized stack(s), often allocated in the heap(s) managed by the runtime system.
    407409The lifetime of state varies with the control-flow feature, where longer life-time and dynamic size provide greater power but also increase usage complexity and cost.
     
    414416Multiple threads provide \emph{concurrent execution};
    415417concurrent execution becomes parallel when run on multiple processing units, \eg hyper-threading, cores, or sockets.
    416 There must be language mechanisms to create, block and unblock, and join with a thread, even if the mechanism is indirect.
    417 
    418 \item[\newterm{MES}:]
    419 is the concurrency mechanisms to perform an action without interruption and establish timing relationships among multiple threads.
     418A programmer needs mechanisms to create, block and unblock, and join with a thread, even if these basic mechanisms are supplied indirectly through high-level features.
     419
     420\item[\newterm{mutual-exclusion / synchronization (MES)}:]
     421is the concurrency mechanism to perform an action without interruption and establish timing relationships among multiple threads.
    420422We contend these two properties are independent, \ie mutual exclusion cannot provide synchronization and vice versa without introducing additional threads~\cite[\S~4]{Buhr05a}.
    421 Limiting MES, \eg no access to shared data, results in contrived solutions and inefficiency on multi-core von Neumann computers where shared memory is a foundational aspect of its design.
 423Limiting MES functionality results in contrived solutions and inefficiency on multi-core von Neumann computers, where shared memory is a foundational aspect of their design.
    422424\end{description}
    423 These properties are fundamental because they cannot be built from existing language features, \eg a basic programming language like C99~\cite{C99} cannot create new control-flow features, concurrency, or provide MES without atomic hardware mechanisms.
     425These properties are fundamental as they cannot be built from existing language features, \eg a basic programming language like C99~\cite{C99} cannot create new control-flow features, concurrency, or provide MES without (atomic) hardware mechanisms.
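To make the three properties concrete, the following sketch uses the \CFA features introduced later in the paper to exhibit each property in isolation (a sketch only; the header names are assumptions, not the exact library API):
\begin{cfa}
#include <coroutine.hfa>   // assumed header names
#include <thread.hfa>
coroutine Counter { int n; };   // execution state: n and resume location survive suspends
void main( Counter & c ) with( c ) {
	for ( n = 0;; n += 1 ) suspend;   // save execution state, return to last resumer
}
monitor Total { int sum; };   // MES: mutual exclusion over sum
void add( Total & mutex t, int v ) { t.sum += v; }   // race-free access
thread Worker {};   // thread: a new thread of control with its own stack
void main( Worker & ) { /* runs concurrently once a Worker is declared */ }
\end{cfa}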
    424426
    425427
     
    443445\renewcommand{\arraystretch}{1.25}
    444446%\setlength{\tabcolsep}{5pt}
     447\vspace*{-5pt}
    445448\begin{tabular}{c|c||l|l}
    446449\multicolumn{2}{c||}{execution properties} & \multicolumn{2}{c}{mutual exclusion / synchronization} \\
     
    461464Yes (stackful)          & Yes           & \textbf{11}\ \ \ @thread@                             & \textbf{12}\ \ @mutex@ @thread@               \\
    462465\end{tabular}
     466\vspace*{-8pt}
    463467\end{table}
    464468
     
    468472A @mutex@ structure, often called a \newterm{monitor}, provides a high-level interface for race-free access of shared data in concurrent programming-languages.
    469473Case 3 is case 1 where the structure can implicitly retain execution state and access functions use this execution state to resume/suspend across \emph{callers}, but resume/suspend does not retain a function's local state.
    470 A stackless structure, often called a \newterm{generator} or \emph{iterator}, is \newterm{stackless} because it still borrow the caller's stack and thread, but the stack is used only to preserve state across its callees not callers.
 474A stackless structure, often called a \newterm{generator} or \emph{iterator}, is \newterm{stackless} because it still borrows the caller's stack and thread, but the stack is used only to preserve state across its callees, not its callers.
    471475Generators provide the first step toward directly solving problems like finite-state machines that retain data and execution state between calls, whereas normal functions restart on each call.
    472476Case 4 is cases 2 and 3 with thread safety during execution of the generator's access functions.
     
    475479A stackful generator, often called a \newterm{coroutine}, is \newterm{stackful} because resume/suspend now context switch to/from the caller's and coroutine's stack.
    476480A coroutine extends the state retained between calls beyond the generator's structure to arbitrary call depth in the access functions.
    477 Cases 7 and 8 are rejected because a new thread must have its own stack, where the thread begins and stack frames are stored for calls, \ie it is unrealistic for a thread to borrow a stack.
    478 Cases 9 and 10 are rejected because a thread needs a growable stack to accept calls, make calls, block, or be preempted, all of which compound to require an unknown amount of execution state.
    479 If this kind of thread exists, it must execute to completion, \ie computation only, which severely restricts runtime management.
     481Cases 7, 8, 9, and 10 are rejected because a new thread must have its own stack, where the thread begins and stack frames are stored for calls, \ie it is unrealistic for a thread to borrow a stack.
     482For cases 9 and 10, the stackless frame is not growable, precluding accepting nested calls, making calls, blocking (which requires calls), or preemption (which requires pushing an interrupt frame), all of which compound to require an unknown amount of execution state.
     483Hence, if this kind of uninterruptible thread exists, it must execute to completion, \ie computation only, which severely restricts runtime management.
    480484Cases 11 and 12 are a stackful thread with and without safe access to shared state.
    481485A thread is the language mechanism to start another thread of control in a program with growable execution state for call/return execution.
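The distinction between cases 3 and 5 is visible in a short sketch (illustrative names; the nested-suspend helper assumes @suspend@ is usable in a called frame, which is precisely what stackfulness provides):
\begin{cfa}
generator Gen { int i; };   // case 3: borrows caller's stack and thread
void main( Gen & g ) with( g ) {
	for ( i = 0;; i += 1 ) suspend;   // no frames may be pending across suspend
}
coroutine Coro { int i; };   // case 5: own stack, suspend at arbitrary call depth
void helper( Coro & c ) { suspend; }   // suspend from a nested frame
void main( Coro & c ) with( c ) {
	for ( i = 0;; i += 1 ) helper( c );
}
\end{cfa}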
     
    13961400The call to @start@ is the first @resume@ of @prod@, which remembers the program main as the starter and creates @prod@'s stack with a frame for @prod@'s coroutine main at the top, and context switches to it.
    13971401@prod@'s coroutine main starts, creates local-state variables that are retained between coroutine activations, and executes $N$ iterations, each generating two random values, calling the consumer's @deliver@ function to transfer the values, and printing the status returned from the consumer.
    1398 The producer call to @delivery@ transfers values into the consumer's communication variables, resumes the consumer, and returns the consumer status.
     1402The producer's call to @delivery@ transfers values into the consumer's communication variables, resumes the consumer, and returns the consumer status.
    13991403Similarly on the first resume, @cons@'s stack is created and initialized, holding local-state variables retained between subsequent activations of the coroutine.
    14001404The symmetric coroutine cycle forms when the consumer calls the producer's @payment@ function, which resumes the producer in the consumer's delivery function.
    14011405When the producer calls @delivery@ again, it resumes the consumer in the @payment@ function.
    1402 Both interface function than return to the their corresponding coroutine-main functions for the next cycle.
     1406Both interface functions then return to their corresponding coroutine-main functions for the next cycle.
    14031407Figure~\ref{f:ProdConsRuntimeStacks} shows the runtime stacks of the program main, and the coroutine mains for @prod@ and @cons@ during the cycling.
    14041408As a consequence of a coroutine retaining its last resumer for suspending back, these reverse pointers allow @suspend@ to cycle \emph{backwards} around a symmetric coroutine cycle.
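The two interface functions reduce to the following shape (a sketch simplified from the figure; field and function names are illustrative):
\begin{cfa}
int delivery( Cons & cons, int v1, int v2 ) {
	cons.p1 = v1; cons.p2 = v2;   // transfer into communication variables
	resume( cons );   // consumer restarts here and later resumes the producer
	return cons.status;   // status set by consumer before its next payment
}
void payment( Prod & prod, int money ) {
	prod.money = money;
	resume( prod );   // producer restarts in delivery, completing the cycle
}
\end{cfa}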
     
    14141418
    14151419Terminating a coroutine cycle is more complex than a generator cycle, because it requires context switching to the program main's \emph{stack} to shutdown the program, whereas generators started by the program main run on its stack.
    1416 Furthermore, each deallocated coroutine must execute all destructors for object allocated in the coroutine type \emph{and} allocated on the coroutine's stack at the point of suspension, which can be arbitrarily deep.
     1420Furthermore, each deallocated coroutine must execute all destructors for objects allocated in the coroutine type \emph{and} allocated on the coroutine's stack at the point of suspension, which can be arbitrarily deep.
    14171421In the example, termination begins with the producer's loop stopping after $N$ iterations and calling the consumer's @stop@ function, which sets the @done@ flag and resumes the consumer in function @payment@, terminating both the call and the consumer's loop in its coroutine main.
    14181422% (Not shown is having @prod@ raise a nonlocal @stop@ exception at @cons@ after it finishes generating values and suspend back to @cons@, which catches the @stop@ exception to terminate its loop.)
     
    14381442if @ping@ ends first, it resumes its starter, the program main, on return.
    14391443Regardless of the cycle complexity, the starter structure always leads back to the program main, but the path can be entered at an arbitrary point.
    1440 Once back at the program main (creator), coroutines @ping@ and @pong@ are deallocated, runnning any destructors for objects within the coroutine and possibly deallocating any coroutine stacks for non-terminated coroutines, where stack deallocation implies stack unwinding to find destructors for allocated objects on the stack.
    1441 Hence, the \CFA termination semantics for the generator and coroutine ensure correct deallocation semnatics, regardless of the coroutine's state (terminated or active), like any other aggregate object.
     1444Once back at the program main (creator), coroutines @ping@ and @pong@ are deallocated, running any destructors for objects within the coroutine and possibly deallocating any coroutine stacks for non-terminated coroutines, where stack deallocation implies stack unwinding to find destructors for allocated objects on the stack.
     1445Hence, the \CFA termination semantics for the generator and coroutine ensure correct deallocation semantics, regardless of the coroutine's state (terminated or active), like any other aggregate object.
    14421446
    14431447
     
    14451449
    14461450A significant implementation challenge for generators and coroutines (and threads in Section~\ref{s:threads}) is adding extra fields to the custom types and related functions, \eg inserting code after/before the coroutine constructor/destructor and @main@ to create/initialize/de-initialize/destroy any extra fields, \eg the coroutine stack.
    1447 There are several solutions to these problem, which follow from the object-oriented flavour of adopting custom types.
     1451There are several solutions to this problem, which follow from the object-oriented flavour of adopting custom types.
    14481452
    14491453For object-oriented languages, inheritance is used to provide extra fields and code via explicit inheritance:
     
    14801484forall( `dtype` T | is_coroutine(T) ) void $suspend$( T & ), resume( T & );
    14811485\end{cfa}
    1482 Note, copying generators, coroutines, and threads is undefined because muliple objects cannot execute on a shared stack and stack copying does not work in unmanaged languages (no garbage collection), like C, because the stack may contain pointers to objects within it that require updating for the copy.
     1486Note, copying generators, coroutines, and threads is undefined because multiple objects cannot execute on a shared stack and stack copying does not work in unmanaged languages (no garbage collection), like C, because the stack may contain pointers to objects within it that require updating for the copy.
    14831487The \CFA @dtype@ property provides no \emph{implicit} copying operations and the @is_coroutine@ trait provides no \emph{explicit} copying operations, so all coroutines must be passed by reference or pointer.
    14841488The function definitions ensure there is a statically typed @main@ function that is the starting point (first stack frame) of a coroutine, and a mechanism to read the coroutine descriptor from its handle.
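A minimal conforming type then looks as follows (a sketch; the @coroutine@ type generator inserts the descriptor and stack fields implicitly):
\begin{cfa}
coroutine MyCoro { int state; };   // hidden fields inserted by the type generator
void main( MyCoro & c ) { c.state = 42; suspend; }   // satisfies the trait's main requirement
int test() {
	MyCoro c;   // stack allocated; passed by reference below, never copied
	resume( c );   // first resume creates the stack and starts main
	return c.state;
}
\end{cfa}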
     
    16251629        MyThread * team = factory( 10 );
    16261630        // concurrency
    1627         `delete( team );` $\C{// deallocate heap-based threads, implicit joins before destruction}\CRT$
     1631        `adelete( team );` $\C{// deallocate heap-based threads, implicit joins before destruction}\CRT$
    16281632}
    16291633\end{cfa}
     
    17021706Unrestricted nondeterminism is meaningless as there is no way to know when a result is completed and safe to access.
    17031707To produce meaningful execution requires clawing back some determinism using mutual exclusion and synchronization, where mutual exclusion provides access control for threads using shared data, and synchronization is a timing relationship among threads~\cite[\S~4]{Buhr05a}.
    1704 The shared data protected by mutual exlusion is called a \newterm{critical section}~\cite{Dijkstra65}, and the protection can be simple, only 1 thread, or complex, only N kinds of threads, \eg group~\cite{Joung00} or readers/writer~\cite{Courtois71} problems.
     1708The shared data protected by mutual exclusion is called a \newterm{critical section}~\cite{Dijkstra65}, and the protection can be simple, only 1 thread, or complex, only N kinds of threads, \eg group~\cite{Joung00} or readers/writer~\cite{Courtois71} problems.
    17051709Without synchronization control in a critical section, an arriving thread can barge ahead of preexisting waiter threads resulting in short/long-term starvation, staleness and freshness problems, and incorrect transfer of data.
    17061710Preventing or detecting barging is a challenge with low-level locks, but made easier through higher-level constructs.
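For example, a simple one-thread critical section can be packaged with its protection using the monitor features introduced next (a sketch):
\begin{cfa}
monitor Account { int balance; };
void deposit( Account & mutex a, int v ) {
	a.balance += v;   // critical section: at most one thread at a time
}
int balance( Account & mutex a ) { return a.balance; }
\end{cfa}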
     
    18261830\end{cquote}
    18271831The @dtype@ property prevents \emph{implicit} copy operations and the @is_monitor@ trait provides no \emph{explicit} copy operations, so monitors must be passed by reference or pointer.
    1828 Similarly, the function definitions ensures there is a mechanism to read the monitor descriptor from its handle, and a special destructor to prevent deallocation if a thread is using the shared data.
     1832Similarly, the function definitions ensure there is a mechanism to read the monitor descriptor from its handle, and a special destructor to prevent deallocation if a thread is using the shared data.
    18291833The custom monitor type also inserts any locks needed to implement the mutual exclusion semantics.
    18301834\CFA relies heavily on traits as an abstraction mechanism, so the @mutex@ qualifier prevents coincidentally matching of a monitor trait with a type that is not a monitor, similar to coincidental inheritance where a shape and playing card can both be drawable.
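For example (a sketch), a structurally similar type without the @monitor@ qualifier cannot coincidentally match a @mutex@ parameter:
\begin{cfa}
monitor Lockable { int x; };
struct Plain { int x; };   // structurally similar, but not a monitor
void f( Lockable & mutex l );   // ok: Lockable satisfies is_monitor
// void g( Plain & mutex p );   // rejected: Plain lacks the monitor descriptor
\end{cfa}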
     
    24792483
    24802484One scheduling solution is for the signaller S to keep ownership of all locks until the last lock is ready to be transferred, because this semantics fits most closely to the behaviour of single-monitor scheduling.
    2481 However, this solution is inefficient if W2 waited first and can be immediate passed @m2@ when released, while S retains @m1@ until completion of the outer mutex statement.
     2485However, this solution is inefficient if W2 waited first and can be immediately passed @m2@ when released, while S retains @m1@ until completion of the outer mutex statement.
    24822486If W1 waited first, the signaller must retain @m1@ and @m2@ until completion of the outer mutex statement and then pass both to W1.
    24832487% Furthermore, there is an execution sequence where the signaller always finds waiter W2, and hence, waiter W1 starves.
    2484 To support this efficient semantics and prevent barging, the implementation maintains a list of monitors acquired for each blocked thread.
     2488To support these efficient semantics and prevent barging, the implementation maintains a list of monitors acquired for each blocked thread.
    24852489When a signaller exits or waits in a mutex function or statement, the front waiter on urgent is unblocked if all its monitors are released.
    2486 Implementing a fast subset check for the necessary released monitors is important and discussed in the following sections.
     2490Implementing a fast subset check for the necessarily released monitors is important and discussed in the following sections.
    24872491% The benefit is encapsulating complexity into only two actions: passing monitors to the next owner when they should be released and conditionally waking threads if all conditions are met.
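The scenario reduces to the following sketch (assuming the @mutex@ statement and condition operations shown earlier; names are illustrative):
\begin{cfa}
monitor M {};
M m1, m2;
condition c;
void W1() { mutex( m1, m2 ) wait( c ); }   // needs both m1 and m2 to continue
void W2() { mutex( m2 ) wait( c ); }   // needs only m2 to continue
void S() {
	mutex( m1 ) {
		mutex( m2 ) signal( c );   // front waiter moves to urgent
	}   // m2 released here: W2 could continue now
}   // m1 released here: W1's full monitor set is available
\end{cfa}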
    24882492
     
    25432547Hence, function pointers are used to identify the functions listed in the @waitfor@ statement, stored in a variable-sized array.
    25442548Then, the same implementation approach used for the urgent stack (see Section~\ref{s:Scheduling}) is used for the calling queue.
    2545 Each caller has a list of monitors acquired, and the @waitfor@ statement performs a short linear search matching functions in the @waitfor@ list with called functions, and then verifying the associated mutex locks can be transfers.
     2549Each caller has a list of monitors acquired, and the @waitfor@ statement performs a short linear search matching functions in the @waitfor@ list with called functions, and then verifying the associated mutex locks can be transferred.
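For example, a bounded buffer's accept loop has the following shape (a sketch using the paper's @waitfor@ and @when@ clauses):
\begin{cfa}
monitor Buffer { int elems[10]; int cnt; };
void insert( Buffer & mutex b, int v );
int remove( Buffer & mutex b );
void loop( Buffer & mutex b ) {   // accept loop: match callers against listed functions
	for () {
		when ( b.cnt < 10 ) waitfor( insert, b )   // accept a call to insert
		or when ( b.cnt > 0 ) waitfor( remove, b );   // or a call to remove
	}
}
\end{cfa}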
    25462550
    25472551
     
    27782782The \CFA program @main@ uses the call/return paradigm to directly communicate with the @GoRtn main@, whereas Go switches to the unbuffered channel paradigm to indirectly communicate with the goroutine.
    27792783Communication by multiple threads is safe for the @gortn@ thread via mutex calls in \CFA or channel assignment in Go.
    2780 The different between call and channel send occurs for buffered channels making the send asynchronous.
    2781 In \CFA, asynchronous call and multiple buffers is provided using an administrator and worker threads~\cite{Gentleman81} and/or futures (not discussed).
     2784The difference between call and channel send occurs for buffered channels making the send asynchronous.
     2785In \CFA, asynchronous call and multiple buffers are provided using an administrator and worker threads~\cite{Gentleman81} and/or futures (not discussed).
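A hedged sketch of that administrator/worker shape, where the caller's @put@ returns immediately and so emulates a buffered-channel send (names are illustrative, and the bounded-buffer details, blocking when full or empty, are elided):
\begin{cfa}
monitor Queue { int reqs[10]; int head, tail, cnt; };
void put( Queue & mutex q, int r ) {   // asynchronous "send": caller returns immediately
	q.reqs[q.tail] = r; q.tail = (q.tail + 1) % 10; q.cnt += 1;
}
int get( Queue & mutex q ) {   // worker takes the next request
	int r = q.reqs[q.head]; q.head = (q.head + 1) % 10; q.cnt -= 1;
	return r;
}
thread Worker { Queue * q; };
void ?{}( Worker & w, Queue * q ) { w.q = q; }
void main( Worker & w ) {
	for () { int r = get( *w.q ); /* perform request r */ }
}
\end{cfa}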
    27822786
    27832787Figure~\ref{f:DirectCommunicationDatingService} shows the dating-service problem in Figure~\ref{f:DatingServiceMonitor} extended from indirect monitor communication to direct thread communication.
    2784 When converting a monitor to a thread (server), the coding pattern is to move as much code as possible from the accepted functions into the thread main so it does an much work as possible.
     2788When converting a monitor to a thread (server), the coding pattern is to move as much code as possible from the accepted functions into the thread main so it does as much work as possible.
    27852789Notice, the dating server is postponing requests for an unspecified time while continuing to accept new requests.
    27862790For complex servers, \eg web-servers, there can be hundreds of lines of code in the thread main and safe interaction with clients can be complex.
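The pattern has the following shape (a sketch): the accepted functions only record the request, and the thread main performs the work:
\begin{cfa}
thread Server { int pending; bool done; };
void request( Server & mutex s, int r ) { s.pending = r; }   // accepted function: store only
void stop( Server & mutex s ) { s.done = true; }
void main( Server & s ) {
	while ( ! s.done ) {
		waitfor( request, s ) { /* bulk of the work on s.pending happens here */ }
		or waitfor( stop, s );
	}
}
\end{cfa}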
     
    27902794
    27912795For completeness and efficiency, \CFA provides a standard set of low-level locks: recursive mutex, condition, semaphore, barrier, \etc, and atomic instructions: @fetchAssign@, @fetchAdd@, @testSet@, @compareSet@, \etc.
    2792 Some of these low-level mechanism are used to build the \CFA runtime, but we always advocate using high-level mechanisms whenever possible.
     2796Some of these low-level mechanisms are used to build the \CFA runtime, but we always advocate using high-level mechanisms whenever possible.
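For example, a shared counter can be maintained lock-free with @fetchAdd@ (a sketch assuming the obvious signature; the exact library interface is not shown here):
\begin{cfa}
int counter = 0;
void hit() {
	fetchAdd( counter, 1 );   // atomic counter += 1, no lock required
}
\end{cfa}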
    27932797
    27942798
     
    29802984
    29812985To test the performance of the \CFA runtime, a series of microbenchmarks is used to compare \CFA with pthreads, Java 11.0.6, Go 1.12.6, Rust 1.37.0, Python 3.7.6, Node.js 12.14.1, and \uC 7.0.0.
    2982 For comparison, the package must be multi-processor (M:N), which excludes libdil and /libmil~\cite{libdill} (M:1)), and use a shared-memory programming model, \eg not message passing.
     2986For comparison, the package must be multi-processor (M:N), which excludes libdill and libmill~\cite{libdill} (M:1), and use a shared-memory programming model, \eg not message passing.
    29832987The benchmark computer is an AMD Opteron\texttrademark\ 6380 NUMA 64-core, 8 socket, 2.5 GHz processor, running Ubuntu 16.04.6 LTS, and pthreads/\CFA/\uC are compiled with gcc 9.2.1.
    29842988
     
    30493053Figure~\ref{f:schedint} shows the code for \CFA, with results in Table~\ref{t:schedint}.
    30503054Note, the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects.
    3051 Java scheduling is significantly greater because the benchmark explicitly creates multiple thread in order to prevent the JIT from making the program sequential, \ie removing all locking.
     3055Java scheduling is significantly greater because the benchmark explicitly creates multiple threads in order to prevent the JIT from making the program sequential, \ie removing all locking.
    30523056
    30533057\begin{multicols}{2}
     
    33083312This type of concurrency can be achieved both at the language level and at the library level.
    33093313The canonical example of implicit concurrency is concurrent nested @for@ loops, which are amenable to divide-and-conquer algorithms~\cite{uC++book}.
    3310 The \CFA language features should make it possible to develop a reasonable number of implicit concurrency mechanism to solve basic HPC data-concurrency problems.
     3314The \CFA language features should make it possible to develop a reasonable number of implicit concurrency mechanisms to solve basic HPC data-concurrency problems.
    33113315However, implicit concurrency is a restrictive solution with significant limitations, so it can never replace explicit concurrent programming.
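For example, an outer loop over matrix rows can be made concurrent with one thread per row, in the style of the paper's matrix-sum thread example (a sketch; names are illustrative):
\begin{cfa}
thread Adder { int * row; int cols; long * subtotal; };
void ?{}( Adder & a, int row[], int cols, long * subtotal ) {
	a.row = row; a.cols = cols; a.subtotal = subtotal;   // thread starts after constructor
}
void main( Adder & a ) with( a ) {
	*subtotal = 0;
	for ( c; cols ) *subtotal += row[c];   // inner loop remains sequential
}
\end{cfa}
One @Adder@ per row is declared in a block, and the implicit join at block exit collects the subtotals.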
    33123316