Timestamp:
Jun 24, 2020, 5:00:59 PM
Author:
Thierry Delisle <tdelisle@…>
Branches:
ADT, arm-eh, ast-experimental, enum, forall-pointer-decay, jacob/cs343-translation, master, new-ast, new-ast-unique-expr, pthread-emulation, qualifiedEnum
Children:
c953163
Parents:
9791ab5 (diff), 7f9968a (diff)
Note: this is a merge changeset, the changes displayed below correspond to the merge itself.
Use the (diff) links above to see all the changes relative to each parent.
Message:

Merge branch 'master' into relaxed_ready

File:
1 edited

  • doc/papers/concurrency/Paper.tex

    r9791ab5 r8b58bae  
    292292
    293293\CFA~\cite{Moss18,Cforall} is a modern, polymorphic, non-object-oriented\footnote{
    294 \CFA has object-oriented features, such as constructors, destructors, virtuals and simple trait/interface inheritance.
     294\CFA has object-oriented features, such as constructors, destructors, and simple trait/interface inheritance.
    295295% Go interfaces, Rust traits, Swift Protocols, Haskell Type Classes and Java Interfaces.
    296296% "Trait inheritance" works for me. "Interface inheritance" might also be a good choice, and distinguish clearly from implementation inheritance.
    297 % You'll want to be a little bit careful with terms like "structural" and "nominal" inheritance as well. CFA has structural inheritance (I think Go as well) -- it's inferred based on the structure of the code. Java, Rust, and Haskell (not sure about Swift) have nominal inheritance, where there needs to be a specific statement that "this type inherits from this type".
    298 However, functions \emph{cannot} be nested in structures, so there is no lexical binding between a structure and set of functions implemented by an implicit \lstinline@this@ (receiver) parameter.},
     297% You'll want to be a little bit careful with terms like "structural" and "nominal" inheritance as well. CFA has structural inheritance (I think Go as well) -- it's inferred based on the structure of the code.
     298% Java, Rust, and Haskell (not sure about Swift) have nominal inheritance, where there needs to be a specific statement that "this type inherits from this type".
 299However, functions \emph{cannot} be nested in structures, and there is no mechanism to designate a function parameter as a receiver (\lstinline@this@) parameter.},
    299300backwards-compatible extension of the C programming language.
    300301In many ways, \CFA is to C as Scala~\cite{Scala} is to Java, providing a vehicle for new typing and control-flow capabilities on top of a highly popular programming language\footnote{
     
    317318Coroutines are only a stepping stone towards concurrency where the commonality is that coroutines and threads retain state between calls.
    318319
    319 \Celeven/\CCeleven define concurrency~\cite[\S~7.26]{C11}, but it is largely wrappers for a subset of the pthreads library~\cite{Pthreads}.\footnote{Pthreads concurrency is based on simple thread fork and join in a function and mutex or condition locks, which is low-level and error-prone}
 320\Celeven and \CCeleven define concurrency~\cite[\S~7.26]{C11}, but these are largely wrappers for a subset of the pthreads library~\cite{Pthreads}.\footnote{Pthreads concurrency is based on simple thread fork and join in a function and mutex or condition locks, which is low-level and error-prone.}
    320321Interestingly, almost a decade after the \Celeven standard, the most recent versions of gcc, clang, and msvc do not support the \Celeven include @threads.h@, indicating no interest in the C11 concurrency approach (possibly because of the recent effort to add concurrency to \CC).
    321322While the \Celeven standard does not state a threading model, the historical association with pthreads suggests implementations would adopt kernel-level threading (1:1)~\cite{ThreadModel}, as for \CC.
     
    392393\label{s:FundamentalExecutionProperties}
    393394
    394 The features in a programming language should be composed from a set of fundamental properties rather than an ad hoc collection chosen by the designers.
     395The features in a programming language should be composed of a set of fundamental properties rather than an ad hoc collection chosen by the designers.
    395396To this end, the control-flow features created for \CFA are based on the fundamental properties of any language with function-stack control-flow (see also \uC~\cite[pp.~140-142]{uC++}).
    396 The fundamental properties are execution state, thread, and mutual-exclusion/synchronization (MES).
     397The fundamental properties are execution state, thread, and mutual-exclusion/synchronization.
    397398These independent properties can be used to compose different language features, forming a compositional hierarchy, where the combination of all three is the most advanced feature, called a thread.
    398399While it is possible for a language to provide only threads for composing programs~\cite{Hermes90}, this unnecessarily complicates, and makes inefficient, the solutions to certain classes of problems.
    399400As is shown, each of the non-rejected composed language features solves a particular set of problems, and hence, has a defensible position in a programming language.
    400 If a compositional feature is missing, a programmer has too few fundamental properties resulting in a complex and/or is inefficient solution.
     401If a compositional feature is missing, a programmer has too few fundamental properties resulting in a complex and/or inefficient solution.
    401402
    402403In detail, the fundamental properties are:
    403404\begin{description}[leftmargin=\parindent,topsep=3pt,parsep=0pt]
    404405\item[\newterm{execution state}:]
    405 is the state information needed by a control-flow feature to initialize, manage compute data and execution location(s), and de-initialize, \eg calling a function initializes a stack frame including contained objects with constructors, manages local data in blocks and return locations during calls, and de-initializes the frame by running any object destructors and management operations.
     406is the state information needed by a control-flow feature to initialize and manage both compute data and execution location(s), and de-initialize.
     407For example, calling a function initializes a stack frame including contained objects with constructors, manages local data in blocks and return locations during calls, and de-initializes the frame by running any object destructors and management operations.
    406408State is retained in fixed-sized aggregate structures (objects) and dynamic-sized stack(s), often allocated in the heap(s) managed by the runtime system.
    407409The lifetime of state varies with the control-flow feature, where longer life-time and dynamic size provide greater power but also increase usage complexity and cost.
     
    414416Multiple threads provide \emph{concurrent execution};
    415417concurrent execution becomes parallel when run on multiple processing units, \eg hyper-threading, cores, or sockets.
    416 There must be language mechanisms to create, block and unblock, and join with a thread, even if the mechanism is indirect.
    417 
    418 \item[\newterm{MES}:]
    419 is the concurrency mechanisms to perform an action without interruption and establish timing relationships among multiple threads.
     418A programmer needs mechanisms to create, block and unblock, and join with a thread, even if these basic mechanisms are supplied indirectly through high-level features.
     419
     420\item[\newterm{mutual-exclusion / synchronization (MES)}:]
     421is the concurrency mechanism to perform an action without interruption and establish timing relationships among multiple threads.
    420422We contend these two properties are independent, \ie mutual exclusion cannot provide synchronization and vice versa without introducing additional threads~\cite[\S~4]{Buhr05a}.
    421 Limiting MES, \eg no access to shared data, results in contrived solutions and inefficiency on multi-core von Neumann computers where shared memory is a foundational aspect of its design.
 423Limiting MES functionality results in contrived solutions and inefficiency on multi-core von Neumann computers, where shared memory is a foundational aspect of their design.
    422424\end{description}
    423 These properties are fundamental because they cannot be built from existing language features, \eg a basic programming language like C99~\cite{C99} cannot create new control-flow features, concurrency, or provide MES without atomic hardware mechanisms.
     425These properties are fundamental as they cannot be built from existing language features, \eg a basic programming language like C99~\cite{C99} cannot create new control-flow features, concurrency, or provide MES without (atomic) hardware mechanisms.
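To make the three properties concrete, the following sketch uses the \CFA features introduced later in the paper to exhibit each property in isolation (a sketch only; the header names are assumptions, not the exact library API):
\begin{cfa}
#include <coroutine.hfa>   // assumed header names
#include <thread.hfa>
coroutine Counter { int n; };   // execution state: n and resume location survive suspends
void main( Counter & c ) with( c ) {
	for ( n = 0;; n += 1 ) suspend;   // save execution state, return to last resumer
}
monitor Total { int sum; };   // MES: mutual exclusion over sum
void add( Total & mutex t, int v ) { t.sum += v; }   // race-free access
thread Worker {};   // thread: a new thread of control with its own stack
void main( Worker & ) { /* runs concurrently once a Worker is declared */ }
\end{cfa}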
    424426
    425427
     
    443445\renewcommand{\arraystretch}{1.25}
    444446%\setlength{\tabcolsep}{5pt}
     447\vspace*{-5pt}
    445448\begin{tabular}{c|c||l|l}
    446449\multicolumn{2}{c||}{execution properties} & \multicolumn{2}{c}{mutual exclusion / synchronization} \\
     
    461464Yes (stackful)          & Yes           & \textbf{11}\ \ \ @thread@                             & \textbf{12}\ \ @mutex@ @thread@               \\
    462465\end{tabular}
     466\vspace*{-8pt}
    463467\end{table}
    464468
     
    468472A @mutex@ structure, often called a \newterm{monitor}, provides a high-level interface for race-free access of shared data in concurrent programming-languages.
    469473Case 3 is case 1 where the structure can implicitly retain execution state and access functions use this execution state to resume/suspend across \emph{callers}, but resume/suspend does not retain a function's local state.
    470 A stackless structure, often called a \newterm{generator} or \emph{iterator}, is \newterm{stackless} because it still borrow the caller's stack and thread, but the stack is used only to preserve state across its callees not callers.
 474A stackless structure, often called a \newterm{generator} or \emph{iterator}, is \newterm{stackless} because it still borrows the caller's stack and thread, but the stack is used only to preserve state across its callees, not its callers.
    471475Generators provide the first step toward directly solving problems like finite-state machines that retain data and execution state between calls, whereas normal functions restart on each call.
    472476Case 4 is cases 2 and 3 with thread safety during execution of the generator's access functions.
     
    475479A stackful generator, often called a \newterm{coroutine}, is \newterm{stackful} because resume/suspend now context switch to/from the caller's and coroutine's stack.
    476480A coroutine extends the state retained between calls beyond the generator's structure to arbitrary call depth in the access functions.
    477 Cases 7 and 8 are rejected because a new thread must have its own stack, where the thread begins and stack frames are stored for calls, \ie it is unrealistic for a thread to borrow a stack.
    478 Cases 9 and 10 are rejected because a thread needs a growable stack to accept calls, make calls, block, or be preempted, all of which compound to require an unknown amount of execution state.
    479 If this kind of thread exists, it must execute to completion, \ie computation only, which severely restricts runtime management.
     481Cases 7, 8, 9, and 10 are rejected because a new thread must have its own stack, where the thread begins and stack frames are stored for calls, \ie it is unrealistic for a thread to borrow a stack.
     482For cases 9 and 10, the stackless frame is not growable, precluding accepting nested calls, making calls, blocking (which requires calls), or preemption (which requires pushing an interrupt frame), all of which compound to require an unknown amount of execution state.
     483Hence, if this kind of uninterruptible thread exists, it must execute to completion, \ie computation only, which severely restricts runtime management.
    480484Cases 11 and 12 are a stackful thread with and without safe access to shared state.
    481485A thread is the language mechanism to start another thread of control in a program with growable execution state for call/return execution.
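The distinction between cases 3 and 5 is visible in a short sketch (illustrative names; the nested-suspend helper assumes @suspend@ is usable in a called frame, which is precisely what stackfulness provides):
\begin{cfa}
generator Gen { int i; };   // case 3: borrows caller's stack and thread
void main( Gen & g ) with( g ) {
	for ( i = 0;; i += 1 ) suspend;   // no frames may be pending across suspend
}
coroutine Coro { int i; };   // case 5: own stack, suspend at arbitrary call depth
void helper( Coro & c ) { suspend; }   // suspend from a nested frame
void main( Coro & c ) with( c ) {
	for ( i = 0;; i += 1 ) helper( c );
}
\end{cfa}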
     
    13961400The call to @start@ is the first @resume@ of @prod@, which remembers the program main as the starter and creates @prod@'s stack with a frame for @prod@'s coroutine main at the top, and context switches to it.
    13971401@prod@'s coroutine main starts, creates local-state variables that are retained between coroutine activations, and executes $N$ iterations, each generating two random values, calling the consumer's @deliver@ function to transfer the values, and printing the status returned from the consumer.
    1398 The producer call to @delivery@ transfers values into the consumer's communication variables, resumes the consumer, and returns the consumer status.
     1402The producer's call to @delivery@ transfers values into the consumer's communication variables, resumes the consumer, and returns the consumer status.
    13991403Similarly on the first resume, @cons@'s stack is created and initialized, holding local-state variables retained between subsequent activations of the coroutine.
    14001404The symmetric coroutine cycle forms when the consumer calls the producer's @payment@ function, which resumes the producer in the consumer's delivery function.
    14011405When the producer calls @delivery@ again, it resumes the consumer in the @payment@ function.
    1402 Both interface function than return to the their corresponding coroutine-main functions for the next cycle.
     1406Both interface functions then return to their corresponding coroutine-main functions for the next cycle.
    14031407Figure~\ref{f:ProdConsRuntimeStacks} shows the runtime stacks of the program main, and the coroutine mains for @prod@ and @cons@ during the cycling.
    14041408As a consequence of a coroutine retaining its last resumer for suspending back, these reverse pointers allow @suspend@ to cycle \emph{backwards} around a symmetric coroutine cycle.
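The two interface functions reduce to the following shape (a sketch simplified from the figure; field and function names are illustrative):
\begin{cfa}
int delivery( Cons & cons, int v1, int v2 ) {
	cons.p1 = v1; cons.p2 = v2;   // transfer into communication variables
	resume( cons );   // consumer restarts here and later resumes the producer
	return cons.status;   // status set by consumer before its next payment
}
void payment( Prod & prod, int money ) {
	prod.money = money;
	resume( prod );   // producer restarts in delivery, completing the cycle
}
\end{cfa}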
     
    14141418
    14151419Terminating a coroutine cycle is more complex than a generator cycle, because it requires context switching to the program main's \emph{stack} to shutdown the program, whereas generators started by the program main run on its stack.
    1416 Furthermore, each deallocated coroutine must execute all destructors for object allocated in the coroutine type \emph{and} allocated on the coroutine's stack at the point of suspension, which can be arbitrarily deep.
     1420Furthermore, each deallocated coroutine must execute all destructors for objects allocated in the coroutine type \emph{and} allocated on the coroutine's stack at the point of suspension, which can be arbitrarily deep.
    14171421In the example, termination begins with the producer's loop stopping after $N$ iterations and calling the consumer's @stop@ function, which sets the @done@ flag and resumes the consumer in function @payment@, terminating both the call and the consumer's loop in its coroutine main.
    14181422% (Not shown is having @prod@ raise a nonlocal @stop@ exception at @cons@ after it finishes generating values and suspend back to @cons@, which catches the @stop@ exception to terminate its loop.)
     
    14381442if @ping@ ends first, it resumes its starter, the program main, on return.
    14391443Regardless of the cycle complexity, the starter structure always leads back to the program main, but the path can be entered at an arbitrary point.
    1440 Once back at the program main (creator), coroutines @ping@ and @pong@ are deallocated, runnning any destructors for objects within the coroutine and possibly deallocating any coroutine stacks for non-terminated coroutines, where stack deallocation implies stack unwinding to find destructors for allocated objects on the stack.
    1441 Hence, the \CFA termination semantics for the generator and coroutine ensure correct deallocation semnatics, regardless of the coroutine's state (terminated or active), like any other aggregate object.
     1444Once back at the program main (creator), coroutines @ping@ and @pong@ are deallocated, running any destructors for objects within the coroutine and possibly deallocating any coroutine stacks for non-terminated coroutines, where stack deallocation implies stack unwinding to find destructors for allocated objects on the stack.
     1445Hence, the \CFA termination semantics for the generator and coroutine ensure correct deallocation semantics, regardless of the coroutine's state (terminated or active), like any other aggregate object.
    14421446
    14431447
     
    14451449
    14461450A significant implementation challenge for generators and coroutines (and threads in Section~\ref{s:threads}) is adding extra fields to the custom types and related functions, \eg inserting code after/before the coroutine constructor/destructor and @main@ to create/initialize/de-initialize/destroy any extra fields, \eg the coroutine stack.
    1447 There are several solutions to these problem, which follow from the object-oriented flavour of adopting custom types.
     1451There are several solutions to this problem, which follow from the object-oriented flavour of adopting custom types.
    14481452
    14491453For object-oriented languages, inheritance is used to provide extra fields and code via explicit inheritance:
     
    14801484forall( `dtype` T | is_coroutine(T) ) void $suspend$( T & ), resume( T & );
    14811485\end{cfa}
    1482 Note, copying generators, coroutines, and threads is undefined because muliple objects cannot execute on a shared stack and stack copying does not work in unmanaged languages (no garbage collection), like C, because the stack may contain pointers to objects within it that require updating for the copy.
     1486Note, copying generators, coroutines, and threads is undefined because multiple objects cannot execute on a shared stack and stack copying does not work in unmanaged languages (no garbage collection), like C, because the stack may contain pointers to objects within it that require updating for the copy.
    14831487The \CFA @dtype@ property provides no \emph{implicit} copying operations and the @is_coroutine@ trait provides no \emph{explicit} copying operations, so all coroutines must be passed by reference or pointer.
    14841488The function definitions ensure there is a statically typed @main@ function that is the starting point (first stack frame) of a coroutine, and a mechanism to read the coroutine descriptor from its handle.
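A minimal conforming type then looks as follows (a sketch; the @coroutine@ type generator inserts the descriptor and stack fields implicitly):
\begin{cfa}
coroutine MyCoro { int state; };   // hidden fields inserted by the type generator
void main( MyCoro & c ) { c.state = 42; suspend; }   // satisfies the trait's main requirement
int test() {
	MyCoro c;   // stack allocated; passed by reference below, never copied
	resume( c );   // first resume creates the stack and starts main
	return c.state;
}
\end{cfa}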
     
    16251629        MyThread * team = factory( 10 );
    16261630        // concurrency
    1627         `delete( team );` $\C{// deallocate heap-based threads, implicit joins before destruction}\CRT$
     1631        `adelete( team );` $\C{// deallocate heap-based threads, implicit joins before destruction}\CRT$
    16281632}
    16291633\end{cfa}
     
    17021706Unrestricted nondeterminism is meaningless as there is no way to know when a result is completed and safe to access.
    17031707To produce meaningful execution requires clawing back some determinism using mutual exclusion and synchronization, where mutual exclusion provides access control for threads using shared data, and synchronization is a timing relationship among threads~\cite[\S~4]{Buhr05a}.
    1704 The shared data protected by mutual exlusion is called a \newterm{critical section}~\cite{Dijkstra65}, and the protection can be simple, only 1 thread, or complex, only N kinds of threads, \eg group~\cite{Joung00} or readers/writer~\cite{Courtois71} problems.
     1708The shared data protected by mutual exclusion is called a \newterm{critical section}~\cite{Dijkstra65}, and the protection can be simple, only 1 thread, or complex, only N kinds of threads, \eg group~\cite{Joung00} or readers/writer~\cite{Courtois71} problems.
    17051709Without synchronization control in a critical section, an arriving thread can barge ahead of preexisting waiter threads resulting in short/long-term starvation, staleness and freshness problems, and incorrect transfer of data.
    17061710Preventing or detecting barging is a challenge with low-level locks, but made easier through higher-level constructs.
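For example, a simple one-thread critical section can be packaged with its protection using the monitor features introduced next (a sketch):
\begin{cfa}
monitor Account { int balance; };
void deposit( Account & mutex a, int v ) {
	a.balance += v;   // critical section: at most one thread at a time
}
int balance( Account & mutex a ) { return a.balance; }
\end{cfa}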
     
    18261830\end{cquote}
    18271831The @dtype@ property prevents \emph{implicit} copy operations and the @is_monitor@ trait provides no \emph{explicit} copy operations, so monitors must be passed by reference or pointer.
    1828 Similarly, the function definitions ensures there is a mechanism to read the monitor descriptor from its handle, and a special destructor to prevent deallocation if a thread is using the shared data.
     1832Similarly, the function definitions ensure there is a mechanism to read the monitor descriptor from its handle, and a special destructor to prevent deallocation if a thread is using the shared data.
    18291833The custom monitor type also inserts any locks needed to implement the mutual exclusion semantics.
    18301834\CFA relies heavily on traits as an abstraction mechanism, so the @mutex@ qualifier prevents coincidentally matching of a monitor trait with a type that is not a monitor, similar to coincidental inheritance where a shape and playing card can both be drawable.
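For example (a sketch), a structurally similar type without the @monitor@ qualifier cannot coincidentally match a @mutex@ parameter:
\begin{cfa}
monitor Lockable { int x; };
struct Plain { int x; };   // structurally similar, but not a monitor
void f( Lockable & mutex l );   // ok: Lockable satisfies is_monitor
// void g( Plain & mutex p );   // rejected: Plain lacks the monitor descriptor
\end{cfa}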
     
    24792483
    24802484One scheduling solution is for the signaller S to keep ownership of all locks until the last lock is ready to be transferred, because this semantics fits most closely to the behaviour of single-monitor scheduling.
    2481 However, this solution is inefficient if W2 waited first and can be immediate passed @m2@ when released, while S retains @m1@ until completion of the outer mutex statement.
     2485However, this solution is inefficient if W2 waited first and can be immediately passed @m2@ when released, while S retains @m1@ until completion of the outer mutex statement.
    24822486If W1 waited first, the signaller must retain @m1@ and @m2@ until completion of the outer mutex statement and then pass both to W1.
    24832487% Furthermore, there is an execution sequence where the signaller always finds waiter W2, and hence, waiter W1 starves.
    2484 To support this efficient semantics and prevent barging, the implementation maintains a list of monitors acquired for each blocked thread.
     2488To support these efficient semantics and prevent barging, the implementation maintains a list of monitors acquired for each blocked thread.
    24852489When a signaller exits or waits in a mutex function or statement, the front waiter on urgent is unblocked if all its monitors are released.
    2486 Implementing a fast subset check for the necessary released monitors is important and discussed in the following sections.
     2490Implementing a fast subset check for the necessarily released monitors is important and discussed in the following sections.
    24872491% The benefit is encapsulating complexity into only two actions: passing monitors to the next owner when they should be released and conditionally waking threads if all conditions are met.
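The scenario reduces to the following sketch (assuming the @mutex@ statement and condition operations shown earlier; names are illustrative):
\begin{cfa}
monitor M {};
M m1, m2;
condition c;
void W1() { mutex( m1, m2 ) wait( c ); }   // needs both m1 and m2 to continue
void W2() { mutex( m2 ) wait( c ); }   // needs only m2 to continue
void S() {
	mutex( m1 ) {
		mutex( m2 ) signal( c );   // front waiter moves to urgent
	}   // m2 released here: W2 could continue now
}   // m1 released here: W1's full monitor set is available
\end{cfa}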
    24882492
     
    25432547Hence, function pointers are used to identify the functions listed in the @waitfor@ statement, stored in a variable-sized array.
    25442548Then, the same implementation approach used for the urgent stack (see Section~\ref{s:Scheduling}) is used for the calling queue.
    2545 Each caller has a list of monitors acquired, and the @waitfor@ statement performs a short linear search matching functions in the @waitfor@ list with called functions, and then verifying the associated mutex locks can be transfers.
     2549Each caller has a list of monitors acquired, and the @waitfor@ statement performs a short linear search matching functions in the @waitfor@ list with called functions, and then verifying the associated mutex locks can be transferred.
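For example, a bounded buffer's accept loop has the following shape (a sketch using the paper's @waitfor@ and @when@ clauses):
\begin{cfa}
monitor Buffer { int elems[10]; int cnt; };
void insert( Buffer & mutex b, int v );
int remove( Buffer & mutex b );
void loop( Buffer & mutex b ) {   // accept loop: match callers against listed functions
	for () {
		when ( b.cnt < 10 ) waitfor( insert, b )   // accept a call to insert
		or when ( b.cnt > 0 ) waitfor( remove, b );   // or a call to remove
	}
}
\end{cfa}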
    25462550
    25472551
     
    27782782The \CFA program @main@ uses the call/return paradigm to directly communicate with the @GoRtn main@, whereas Go switches to the unbuffered channel paradigm to indirectly communicate with the goroutine.
    27792783Communication by multiple threads is safe for the @gortn@ thread via mutex calls in \CFA or channel assignment in Go.
    2780 The different between call and channel send occurs for buffered channels making the send asynchronous.
    2781 In \CFA, asynchronous call and multiple buffers is provided using an administrator and worker threads~\cite{Gentleman81} and/or futures (not discussed).
     2784The difference between call and channel send occurs for buffered channels making the send asynchronous.
     2785In \CFA, asynchronous call and multiple buffers are provided using an administrator and worker threads~\cite{Gentleman81} and/or futures (not discussed).
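A hedged sketch of that administrator/worker shape, where the caller's @put@ returns immediately and so emulates a buffered-channel send (names are illustrative, and the bounded-buffer details, blocking when full or empty, are elided):
\begin{cfa}
monitor Queue { int reqs[10]; int head, tail, cnt; };
void put( Queue & mutex q, int r ) {   // asynchronous "send": caller returns immediately
	q.reqs[q.tail] = r; q.tail = (q.tail + 1) % 10; q.cnt += 1;
}
int get( Queue & mutex q ) {   // worker takes the next request
	int r = q.reqs[q.head]; q.head = (q.head + 1) % 10; q.cnt -= 1;
	return r;
}
thread Worker { Queue * q; };
void ?{}( Worker & w, Queue * q ) { w.q = q; }
void main( Worker & w ) {
	for () { int r = get( *w.q ); /* perform request r */ }
}
\end{cfa}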
    27822786
    27832787Figure~\ref{f:DirectCommunicationDatingService} shows the dating-service problem in Figure~\ref{f:DatingServiceMonitor} extended from indirect monitor communication to direct thread communication.
    2784 When converting a monitor to a thread (server), the coding pattern is to move as much code as possible from the accepted functions into the thread main so it does an much work as possible.
     2788When converting a monitor to a thread (server), the coding pattern is to move as much code as possible from the accepted functions into the thread main so it does as much work as possible.
    27852789Notice, the dating server is postponing requests for an unspecified time while continuing to accept new requests.
    27862790For complex servers, \eg web-servers, there can be hundreds of lines of code in the thread main and safe interaction with clients can be complex.
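The pattern has the following shape (a sketch): the accepted functions only record the request, and the thread main performs the work:
\begin{cfa}
thread Server { int pending; bool done; };
void request( Server & mutex s, int r ) { s.pending = r; }   // accepted function: store only
void stop( Server & mutex s ) { s.done = true; }
void main( Server & s ) {
	while ( ! s.done ) {
		waitfor( request, s ) { /* bulk of the work on s.pending happens here */ }
		or waitfor( stop, s );
	}
}
\end{cfa}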
     
    27902794
    27912795For completeness and efficiency, \CFA provides a standard set of low-level locks: recursive mutex, condition, semaphore, barrier, \etc, and atomic instructions: @fetchAssign@, @fetchAdd@, @testSet@, @compareSet@, \etc.
    2792 Some of these low-level mechanism are used to build the \CFA runtime, but we always advocate using high-level mechanisms whenever possible.
     2796Some of these low-level mechanisms are used to build the \CFA runtime, but we always advocate using high-level mechanisms whenever possible.
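For example, a shared counter can be maintained lock-free with @fetchAdd@ (a sketch assuming the obvious signature; the exact library interface is not shown here):
\begin{cfa}
int counter = 0;
void hit() {
	fetchAdd( counter, 1 );   // atomic counter += 1, no lock required
}
\end{cfa}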
    27932797
    27942798
     
    29802984
    29812985To test the performance of the \CFA runtime, a series of microbenchmarks is used to compare \CFA with pthreads, Java 11.0.6, Go 1.12.6, Rust 1.37.0, Python 3.7.6, Node.js 12.14.1, and \uC 7.0.0.
    2982 For comparison, the package must be multi-processor (M:N), which excludes libdil and /libmil~\cite{libdill} (M:1)), and use a shared-memory programming model, \eg not message passing.
     2986For comparison, the package must be multi-processor (M:N), which excludes libdill and libmill~\cite{libdill} (M:1), and use a shared-memory programming model, \eg not message passing.
    29832987The benchmark computer is an AMD Opteron\texttrademark\ 6380 NUMA 64-core, 8 socket, 2.5 GHz processor, running Ubuntu 16.04.6 LTS, and pthreads/\CFA/\uC are compiled with gcc 9.2.1.
    29842988
     
    30493053Figure~\ref{f:schedint} shows the code for \CFA, with results in Table~\ref{t:schedint}.
    30503054Note, the incremental cost of bulk acquire for \CFA, which is largely a fixed cost for small numbers of mutex objects.
    3051 Java scheduling is significantly greater because the benchmark explicitly creates multiple thread in order to prevent the JIT from making the program sequential, \ie removing all locking.
     3055Java scheduling is significantly greater because the benchmark explicitly creates multiple threads in order to prevent the JIT from making the program sequential, \ie removing all locking.
    30523056
    30533057\begin{multicols}{2}
     
    33083312This type of concurrency can be achieved both at the language level and at the library level.
    33093313The canonical example of implicit concurrency is concurrent nested @for@ loops, which are amenable to divide-and-conquer algorithms~\cite{uC++book}.
    3310 The \CFA language features should make it possible to develop a reasonable number of implicit concurrency mechanism to solve basic HPC data-concurrency problems.
     3314The \CFA language features should make it possible to develop a reasonable number of implicit concurrency mechanisms to solve basic HPC data-concurrency problems.
    33113315However, implicit concurrency is a restrictive solution with significant limitations, so it can never replace explicit concurrent programming.
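For example, an outer loop over matrix rows can be made concurrent with one thread per row, in the style of the paper's matrix-sum thread example (a sketch; names are illustrative):
\begin{cfa}
thread Adder { int * row; int cols; long * subtotal; };
void ?{}( Adder & a, int row[], int cols, long * subtotal ) {
	a.row = row; a.cols = cols; a.subtotal = subtotal;   // thread starts after constructor
}
void main( Adder & a ) with( a ) {
	*subtotal = 0;
	for ( c; cols ) *subtotal += row[c];   // inner loop remains sequential
}
\end{cfa}
One @Adder@ per row is declared in a block, and the implicit join at block exit collects the subtotals.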
    33123316