Ignore:
Timestamp:
Jun 6, 2020, 4:59:19 PM (3 years ago)
Author:
Peter A. Buhr <pabuhr@…>
Branches:
arm-eh, enum, forall-pointer-decay, jacob/cs343-translation, master, new-ast, new-ast-unique-expr, pthread-emulation, qualifiedEnum
Children:
393d59a
Parents:
9246ec6
Message:

update concurrency paper to address referee comments and generate responses to comments

File:
1 edited

Legend:

Unmodified
Added
Removed
  • doc/papers/concurrency/Paper.tex

    r9246ec6 r27125d0  
    386386Section~\ref{s:Monitor} shows how both mutual exclusion and synchronization are safely embedded in the @monitor@ and @thread@ constructs.
    387387Section~\ref{s:CFARuntimeStructure} describes the large-scale mechanism to structure threads and virtual processors (kernel threads).
    388 Section~\ref{s:Performance} uses microbenchmarks to compare \CFA threading with pthreads, Java 11.0.6, Go 1.12.6, Rust 1.37.0, Python 3.7.6, Node.js 12.14.1, and \uC 7.0.0.
     388Section~\ref{s:Performance} uses microbenchmarks to compare \CFA threading with pthreads, Java 11.0.6, Go 1.12.6, Rust 1.37.0, Python 3.7.6, Node.js v12.18.0, and \uC 7.0.0.
    389389
    390390
     
    893893With respect to safety, we believe static analysis can discriminate persistent generator state from temporary generator-main state and raise a compile-time error for temporary usage spanning suspend points.
    894894Our experience using generators is that the problems have simple data state, including local state, but complex execution state, so the burden of creating the generator type is small.
    895 As well, C programmers are not afraid of this kind of semantic programming requirement, if it results in very small, fast generators.
     895As well, C programmers are not afraid of this kind of semantic programming requirement, if it results in very small and fast generators.
    896896
    897897Figure~\ref{f:CFAFormatGen} shows an asymmetric \newterm{input generator}, @Fmt@, for restructuring text into groups of characters of fixed-size blocks, \ie the input on the left is reformatted into the output on the right, where newlines are ignored.
     
    915915The destructor provides a newline, if formatted text ends with a full line.
    916916Figure~\ref{f:CFormatGenImpl} shows the C implementation of the \CFA input generator with one additional field and the computed @goto@.
    917 For contrast, Figure~\ref{f:PythonFormatter} shows the equivalent Python format generator with the same properties as the format generator.
     917For contrast, Figure~\ref{f:PythonFormatter} shows the equivalent Python format generator with the same properties as the \CFA format generator.
    918918
    919919% https://dl-acm-org.proxy.lib.uwaterloo.ca/
    920920
    921 Figure~\ref{f:DeviceDriverGen} shows an important application for an asymmetric generator, a device-driver, because device drivers are a significant source of operating-system errors: 85\% in Windows XP~\cite[p.~78]{Swift05} and 51.6\% in Linux~\cite[p.~1358,]{Xiao19}. %\cite{Palix11}
     921An important application for the asymmetric generator is a device-driver, because device drivers are a significant source of operating-system errors: 85\% in Windows XP~\cite[p.~78]{Swift05} and 51.6\% in Linux~\cite[p.~1358,]{Xiao19}. %\cite{Palix11}
    922922Swift \etal~\cite[p.~86]{Swift05} restructure device drivers using the Extension Procedure Call (XPC) within the kernel via functions @nooks_driver_call@ and @nooks_kernel_call@, which have coroutine properties context switching to separate stacks with explicit hand-off calls;
    923923however, the calls do not retain execution state, and hence always start from the top.
     
    925925However, Adya \etal~\cite{Adya02} argue against stack ripping in Section 3.2 and suggest a hybrid approach in Section 4 using cooperatively scheduled \emph{fibers}, which is coroutining.
    926926
    927 As an example, the following protocol:
     927Figure~\ref{f:DeviceDriverGen} shows the generator advantages in implementing a simple network device-driver with the following protocol:
    928928\begin{center}
    929929\ldots\, STX \ldots\, message \ldots\, ESC ETX \ldots\, message \ldots\, ETX 2-byte crc \ldots
    930930\end{center}
    931 is for a simple network message beginning with the control character STX, ending with an ETX, and followed by a 2-byte cyclic-redundancy check.
     931where the network message begins with the control character STX, ends with an ETX, and is followed by a 2-byte cyclic-redundancy check.
    932932Control characters may appear in a message if preceded by an ESC.
    933933When a message byte arrives, it triggers an interrupt, and the operating system services the interrupt by calling the device driver with the byte read from a hardware register.
    934 The device driver returns a status code of its current state, and when a complete message is obtained, the operating system read the message accumulated in the supplied buffer.
     934The device driver returns a status code of its current state, and when a complete message is obtained, the operating system reads the message accumulated in the supplied buffer.
    935935Hence, the device driver is an input/output generator, where the cost of resuming the device-driver generator is the same as call and return, so performance in an operating-system kernel is excellent.
    936936The key benefits of using a generator are correctness, safety, and maintenance because the execution states are transcribed directly into the programming language rather than table lookup or stack ripping.
    937 The conclusion is that FSMs are complex and occur in important domains, so direct generator support is important in a system programming language.
     937% The conclusion is that FSMs are complex and occur in important domains, so direct generator support is important in a system programming language.
    938938
    939939\begin{figure}
     
    992992\end{figure}
    993993
    994 Figure~\ref{f:CFAPingPongGen} shows a symmetric generator, where the generator resumes another generator, forming a resume/resume cycle.
     994Generators can also have symmetric activation using resume/resume to create control-flow cycles among generators.
    995995(The trivial cycle is a generator resuming itself.)
    996996This control flow is similar to recursion for functions but without stack growth.
    997 Figure~\ref{f:PingPongFullCoroutineSteps} shows the steps for symmetric control-flow are creating, executing, and terminating the cycle.
     997Figure~\ref{f:PingPongFullCoroutineSteps} shows the steps for symmetric control-flow using for the ping/pong program in Figure~\ref{f:CFAPingPongGen}.
     998The program starts by creating the generators, @ping@ and @pong@, and then assigns the partners that form the cycle.
    998999Constructing the cycle must deal with definition-before-use to close the cycle, \ie, the first generator must know about the last generator, which is not within scope.
    9991000(This issue occurs for any cyclic data structure.)
    1000 The example creates the generators, @ping@ and @pong@, and then assigns the partners that form the cycle.
    10011001% (Alternatively, the constructor can assign the partners as they are declared, except the first, and the first-generator partner is set after the last generator declaration to close the cycle.)
    1002 Once the cycle is formed, the program main resumes one of the generators, @ping@, and the generators can then traverse an arbitrary cycle using @resume@ to activate partner generator(s).
     1002Once the cycle is formed, the program main resumes one of the generators, @ping@, and the generators can then traverse an arbitrary number of cycles using @resume@ to activate partner generator(s).
    10031003Terminating the cycle is accomplished by @suspend@ or @return@, both of which go back to the stack frame that started the cycle (program main in the example).
    10041004Note, the creator and starter may be different, \eg if the creator calls another function that starts the cycle.
     
    10061006Also, since local variables are not retained in the generator function, there are no objects with destructors to be called, so the cost is the same as a function return.
    10071007Destructor cost occurs when the generator instance is deallocated by the creator.
     1008
     1009\begin{figure}
     1010\centering
     1011\input{FullCoroutinePhases.pstex_t}
     1012\vspace*{-10pt}
     1013\caption{Symmetric coroutine steps: Ping / Pong}
     1014\label{f:PingPongFullCoroutineSteps}
     1015\end{figure}
    10081016
    10091017\begin{figure}
     
    10691077\end{figure}
    10701078
    1071 \begin{figure}
    1072 \centering
    1073 \input{FullCoroutinePhases.pstex_t}
    1074 \vspace*{-10pt}
    1075 \caption{Symmetric coroutine steps: Ping / Pong}
    1076 \label{f:PingPongFullCoroutineSteps}
    1077 \end{figure}
    1078 
    10791079Figure~\ref{f:CPingPongSim} shows the C implementation of the \CFA symmetric generator, where there is still only one additional field, @restart@, but @resume@ is more complex because it does a forward rather than backward jump.
    10801080Before the jump, the parameter for the next call @partner@ is placed into the register used for the first parameter, @rdi@, and the remaining registers are reset for a return.
     
    11001100\label{s:Coroutine}
    11011101
    1102 Stackful coroutines (Table~\ref{t:ExecutionPropertyComposition} case 5) extend generator semantics, \ie there is an implicit closure and @suspend@ may appear in a helper function called from the coroutine main.
    1103 A coroutine is specified by replacing @generator@ with @coroutine@ for the type.
     1102Stackful coroutines (Table~\ref{t:ExecutionPropertyComposition} case 5) extend generator semantics with an implicit closure and @suspend@ may appear in a helper function called from the coroutine main because of the separate stack.
     1103Note, simulating coroutines with stacks of generators, \eg Python with @yield from@ cannot handle symmetric control-flow.
     1104Furthermore, all stack components must be of generators, so it is impossible to call a library function passing a generator that yields.
     1105Creating a generator copy of the library function maybe impossible because the library function is opaque.
     1106
     1107A \CFA coroutine is specified by replacing @generator@ with @coroutine@ for the type.
    11041108Coroutine generality results in higher cost for creation, due to dynamic stack allocation, for execution, due to context switching among stacks, and for terminating, due to possible stack unwinding and dynamic stack deallocation.
    11051109A series of different kinds of coroutines and their implementations demonstrate how coroutines extend generators.
    11061110
    11071111First, the previous generator examples are converted to their coroutine counterparts, allowing local-state variables to be moved from the generator type into the coroutine main.
     1112Now the coroutine type only contains communication variables between interface functions and the coroutine main.
    11081113\begin{center}
    11091114\begin{tabular}{@{}l|l|l|l@{}}
     
    11471152}
    11481153\end{cfa}
    1149 A call to this function is placed at the end of the driver's coroutine-main.
     1154A call to this function is placed at the end of the device driver's coroutine-main.
    11501155For complex finite-state machines, refactoring is part of normal program abstraction, especially when code is used in multiple places.
    11511156Again, this complexity is usually associated with execution state rather than data state.
     
    14781483The \CFA @dtype@ property provides no \emph{implicit} copying operations and the @is_coroutine@ trait provides no \emph{explicit} copying operations, so all coroutines must be passed by reference or pointer.
    14791484The function definitions ensure there is a statically typed @main@ function that is the starting point (first stack frame) of a coroutine, and a mechanism to read the coroutine descriptor from its handle.
    1480 The @main@ function has no return value or additional parameters because the coroutine type allows an arbitrary number of interface functions with corresponding arbitrary typed input and output values versus fixed ones.
     1485The @main@ function has no return value or additional parameters because the coroutine type allows an arbitrary number of interface functions with arbitrary typed input and output values versus fixed ones.
    14811486The advantage of this approach is that users can easily create different types of coroutines, \eg changing the memory layout of a coroutine is trivial when implementing the @get_coroutine@ function, and possibly redefining \textsf{suspend} and @resume@.
    14821487
     
    15231528The coroutine descriptor contains all implicit declarations needed by the runtime, \eg @suspend@/@resume@, and can be part of the coroutine handle or separate.
    15241529The coroutine stack can appear in a number of locations and be fixed or variable sized.
    1525 Hence, the coroutine's stack could be a variable-length structure (VLS)\footnote{
    1526 We are examining VLSs, where fields can be variable-sized structures or arrays.
    1527 Once allocated, a VLS is fixed sized.}
     1530Hence, the coroutine's stack could be a variable-length structure (VLS)
     1531% \footnote{
     1532% We are examining VLSs, where fields can be variable-sized structures or arrays.
     1533% Once allocated, a VLS is fixed sized.}
    15281534on the allocating stack, provided the allocating stack is large enough.
    15291535For a VLS stack allocation and deallocation is an inexpensive adjustment of the stack pointer, modulo any stack constructor costs to initial frame setup.
     
    15701576\label{s:threads}
    15711577
    1572 Threading (Table~\ref{t:ExecutionPropertyComposition} case 11) needs the ability to start a thread and wait for its completion.
    1573 A common API for this ability is @fork@ and @join@.
     1578Threading (Table~\ref{t:ExecutionPropertyComposition} case 11) needs the ability to start a thread and wait for its completion, where a common API is @fork@ and @join@.
    15741579\vspace{4pt}
    15751580\par\noindent
     
    17251730For these reasons, \CFA selected monitors as the core high-level concurrency construct, upon which higher-level approaches can be easily constructed.
    17261731
    1727 Figure~\ref{f:AtomicCounter} compares a \CFA and Java monitor implementing an atomic counter.\footnote{
    1728 Like other concurrent programming languages, \CFA and Java have performant specializations for the basic types using atomic instructions.}
     1732Figure~\ref{f:AtomicCounter} compares a \CFA and Java monitor implementing an atomic counter.
     1733(Like other concurrent programming languages, \CFA and Java have performant specializations for the basic types using atomic instructions.)
    17291734A \newterm{monitor} is a set of functions that ensure mutual exclusion when accessing shared state.
    17301735(Note, in \CFA, @monitor@ is short-hand for @mutex struct@.)
     
    29862991The total time is divided by @N@ to obtain the average time for a benchmark.
    29872992Each benchmark experiment is run 13 times and the average appears in the table.
    2988 All omitted tests for other languages are functionally identical to the \CFA tests and available online~\cite{CforallBenchMarks}.
     2993All omitted tests for other languages are functionally identical to the \CFA tests and available online~\cite{CforallConcurrentBenchmarks}.
    29892994% tar --exclude-ignore=exclude -cvhf benchmark.tar benchmark
     2995% cp -p benchmark.tar /u/cforall/public_html/doc/concurrent_benchmark.tar
    29902996
    29912997\paragraph{Creation}
     
    30283034\uC thread                              & 523.4         & 523.9         & 7.7           \\
    30293035Python generator                & 123.2         & 124.3         & 4.1           \\
    3030 Node.js generator               & 32.3          & 32.2          & 0.3           \\
     3036Node.js generator               & 33.4          & 33.5          & 0.3           \\
    30313037Goroutine thread                & 751.0         & 750.5         & 3.1           \\
    30323038Rust tokio thread               & 1860.0        & 1881.1        & 37.6          \\
     
    32333239Python generator        & 40.9          & 41.3          & 1.5   \\
    32343240Node.js await           & 1852.2        & 1854.7        & 16.4  \\
    3235 Node.js generator       & 32.6          & 32.2          & 1.0   \\
     3241Node.js generator       & 33.3          & 33.4          & 0.3   \\
    32363242Goroutine thread        & 143.0         & 143.3         & 1.1   \\
    32373243Rust async await        & 32.0          & 32.0          & 0.0   \\
     
    32713277While control flow in \CFA has a strong start, development is still underway to complete a number of missing features.
    32723278
    3273 \vspace{-5pt}
    3274 \paragraph{Flexible Scheduling}
    3275 \label{futur:sched}
    3276 
     3279\medskip
     3280\textbf{Flexible Scheduling:}
    32773281An important part of concurrency is scheduling.
    32783282Different scheduling algorithms can affect performance, both in terms of average and variation.
     
    32823286Currently, the \CFA pluggable scheduler is too simple to handle complex scheduling, \eg quality of service and real-time, where the scheduler must interact with mutex objects to deal with issues like priority inversion~\cite{Buhr00b}.
    32833287
    3284 \vspace{-5pt}
    3285 \paragraph{Non-Blocking I/O}
    3286 \label{futur:nbio}
     3288\smallskip
     3289\textbf{Non-Blocking I/O:}
    32873290Many modern workloads are not bound by computation but IO operations, common cases being web servers and XaaS~\cite{XaaS} (anything as a service).
    32883291These types of workloads require significant engineering to amortizing costs of blocking IO-operations.
     
    32933296A non-blocking I/O library is currently under development for \CFA.
    32943297
    3295 \vspace{-5pt}
    3296 \paragraph{Other Concurrency Tools}
    3297 \label{futur:tools}
    3298 
     3298\smallskip
     3299\textbf{Other Concurrency Tools:}
    32993300While monitors offer flexible and powerful concurrency for \CFA, other concurrency tools are also necessary for a complete multi-paradigm concurrency package.
    33003301Examples of such tools can include futures and promises~\cite{promises}, executors and actors.
     
    33023303As well, new \CFA extensions should make it possible to create a uniform interface for virtually all mutual exclusion, including monitors and low-level locks.
    33033304
    3304 \vspace{-5pt}
    3305 \paragraph{Implicit Threading}
    3306 \label{futur:implcit}
    3307 
    3308 Basic \emph{embarrassingly parallel} applications can benefit greatly from implicit concurrency, where sequential programs are converted to concurrent, possibly with some help from pragmas to guide the conversion.
     3305\smallskip
     3306\textbf{Implicit Threading:}
     3307Basic \emph{embarrassingly parallel} applications can benefit greatly from implicit concurrency, where sequential programs are converted to concurrent, with some help from pragmas to guide the conversion.
    33093308This type of concurrency can be achieved both at the language level and at the library level.
    33103309The canonical example of implicit concurrency is concurrent nested @for@ loops, which are amenable to divide and conquer algorithms~\cite{uC++book}.
Note: See TracChangeset for help on using the changeset viewer.