Changeset 27125d0

Timestamp:

Jun 6, 2020, 4:59:19 PM

Author:

Peter A. Buhr <pabuhr@…>

Branches:

ADT, arm-eh, ast-experimental, enum, forall-pointer-decay, jacob/cs343-translation, master, new-ast, new-ast-unique-expr, pthread-emulation, qualifiedEnum

Message:

update concurrency paper to address referee comments and generate responses to comments

Files:

4 edited

bibliography/pl.bib (modified) (2 diffs)
papers/AMA/AMA-stix/ama/WileyNJD-v2.cls (modified) (1 diff)
papers/concurrency/Paper.tex (modified) (20 diffs)
papers/concurrency/response2 (modified) (8 diffs)

doc/bibliography/pl.bib

-                      r9246ec6
+                      r27125d0
+}
 @misc{CforallBenchMarks,
+@misc{CforallConcurrentBenchmarks,
     contributer = {pabuhr@plg},
     key         = {Cforall Benchmarks},
     author      = {{\textsf{C}{$\mathbf{\forall}$} Benchmarks}},
     howpublished= {\href{https://plg.uwaterloo.ca/~cforall/benchmark.tar}{https://\-plg.uwaterloo.ca/\-$\sim$cforall/\-benchmark.tar}},
+    howpublished= {\href{https://plg.uwaterloo.ca/~cforall/doc/CforallConcurrentBenchmarks.tar}{https://\-plg.uwaterloo.ca/\-$\sim$cforall/\-doc/\-CforallConcurrentBenchmarks.tar}},
+}
 …
     author      = {Adya, Atul and Howell, Jon and Theimer, Marvin and Bolosky, William J. and Douceur, John R.},
     title       = {Cooperative Task Management Without Manual Stack Management},
     booktitle   = {Proceedings of the General Track of the Annual Conference on USENIX Annual Technical Conference},
+    booktitle   = {Proc. of the General Track USENIX Tech. Conf.},
     series      = {ATEC '02},
     year        = {2002},

doc/papers/AMA/AMA-stix/ama/WileyNJD-v2.cls

-                      r9246ec6
+                      r27125d0
      \@afterheading}
 \renewcommand\section{\@startsection{section}{1}{\z@}{-20pt \@plus -2pt \@minus -2pt}{8\p@}{\sectionfont}}%
+\renewcommand\section{\@startsection{section}{1}{\z@}{-20pt \@plus -2pt \@minus -2pt}{7\p@}{\sectionfont}}%
 \renewcommand\subsection{\@startsection{subsection}{2}{\z@}{-18pt \@plus -2pt \@minus -2pt}{5\p@}{\subsectionfont}}%
 \renewcommand\subsubsection{\@startsection{subsubsection}{3}{\z@}{-20pt \@plus -2pt \@minus -2pt}{2\p@}{\subsubsectionfont}}%
+\renewcommand\subsubsection{\@startsection{subsubsection}{3}{\z@}{-16pt \@plus -2pt \@minus -2pt}{2\p@}{\subsubsectionfont}}%
+%
 \newskip\secruleskip\secruleskip8.5\p@%

doc/papers/concurrency/Paper.tex

-                      r9246ec6
+                      r27125d0
 Section~\ref{s:Monitor} shows how both mutual exclusion and synchronization are safely embedded in the @monitor@ and @thread@ constructs.
 Section~\ref{s:CFARuntimeStructure} describes the large-scale mechanism to structure threads and virtual processors (kernel threads).
 Section~\ref{s:Performance} uses microbenchmarks to compare \CFA threading with pthreads, Java 11.0.6, Go 1.12.6, Rust 1.37.0, Python 3.7.6, Node.js 12.14.1, and \uC 7.0.0.
+Section~\ref{s:Performance} uses microbenchmarks to compare \CFA threading with pthreads, Java 11.0.6, Go 1.12.6, Rust 1.37.0, Python 3.7.6, Node.js v12.18.0, and \uC 7.0.0.
 …
 With respect to safety, we believe static analysis can discriminate persistent generator state from temporary generator-main state and raise a compile-time error for temporary usage spanning suspend points.
 Our experience using generators is that the problems have simple data state, including local state, but complex execution state, so the burden of creating the generator type is small.
 As well, C programmers are not afraid of this kind of semantic programming requirement, if it results in very small, fast generators.
+As well, C programmers are not afraid of this kind of semantic programming requirement, if it results in very small and fast generators.
 Figure~\ref{f:CFAFormatGen} shows an asymmetric \newterm{input generator}, @Fmt@, for restructuring text into groups of characters of fixed-size blocks, \ie the input on the left is reformatted into the output on the right, where newlines are ignored.
 …
 The destructor provides a newline, if formatted text ends with a full line.
 Figure~\ref{f:CFormatGenImpl} shows the C implementation of the \CFA input generator with one additional field and the computed @goto@.
 For contrast, Figure~\ref{f:PythonFormatter} shows the equivalent Python format generator with the same properties as the format generator.
+For contrast, Figure~\ref{f:PythonFormatter} shows the equivalent Python format generator with the same properties as the \CFA format generator.
 % https://dl-acm-org.proxy.lib.uwaterloo.ca/
 Figure~\ref{f:DeviceDriverGen} shows an important application for an asymmetric generator, a device-driver, because device drivers are a significant source of operating-system errors: 85\% in Windows XP~\cite[p.~78]{Swift05} and 51.6\% in Linux~\cite[p.~1358,]{Xiao19}. %\cite{Palix11}
+An important application for the asymmetric generator is a device-driver, because device drivers are a significant source of operating-system errors: 85\% in Windows XP~\cite[p.~78]{Swift05} and 51.6\% in Linux~\cite[p.~1358,]{Xiao19}. %\cite{Palix11}
 Swift \etal~\cite[p.~86]{Swift05} restructure device drivers using the Extension Procedure Call (XPC) within the kernel via functions @nooks_driver_call@ and @nooks_kernel_call@, which have coroutine properties context switching to separate stacks with explicit hand-off calls;
 however, the calls do not retain execution state, and hence always start from the top.
 …
 However, Adya \etal~\cite{Adya02} argue against stack ripping in Section 3.2 and suggest a hybrid approach in Section 4 using cooperatively scheduled \emph{fibers}, which is coroutining.
 As an example, the following protocol:
+Figure~\ref{f:DeviceDriverGen} shows the generator advantages in implementing a simple network device-driver with the following protocol:
 \begin{center}
 \ldots\, STX \ldots\, message \ldots\, ESC ETX \ldots\, message \ldots\, ETX 2-byte crc \ldots
 \end{center}
 is for a simple network message beginning with the control character STX, ending with an ETX, and followed by a 2-byte cyclic-redundancy check.
+where the network message begins with the control character STX, ends with an ETX, and is followed by a 2-byte cyclic-redundancy check.
 Control characters may appear in a message if preceded by an ESC.
 When a message byte arrives, it triggers an interrupt, and the operating system services the interrupt by calling the device driver with the byte read from a hardware register.
 The device driver returns a status code of its current state, and when a complete message is obtained, the operating system read the message accumulated in the supplied buffer.
+The device driver returns a status code of its current state, and when a complete message is obtained, the operating system reads the message accumulated in the supplied buffer.
 Hence, the device driver is an input/output generator, where the cost of resuming the device-driver generator is the same as call and return, so performance in an operating-system kernel is excellent.
 The key benefits of using a generator are correctness, safety, and maintenance because the execution states are transcribed directly into the programming language rather than table lookup or stack ripping.
 The conclusion is that FSMs are complex and occur in important domains, so direct generator support is important in a system programming language.
+% The conclusion is that FSMs are complex and occur in important domains, so direct generator support is important in a system programming language.
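
The device-driver protocol described in this hunk can be sketched as a Python generator, the language the paper itself uses for contrast. This is only an illustrative sketch, not the paper's \CFA code from Figure~\ref{f:DeviceDriverGen}: the names `Status`, `checksum`, `driver`, and `feed` are hypothetical, and the 16-bit sum is an assumed placeholder for the real cyclic-redundancy check, which the text does not specify.

```python
from enum import Enum

STX, ETX, ESC = 0x02, 0x03, 0x1B  # ASCII control characters


class Status(Enum):
    CONT = "continue"     # message still in progress
    MSG = "message"       # complete message with a matching check
    ESTX = "missing STX"  # byte arrived outside a message
    ECRC = "bad check"    # 2-byte check did not match


def checksum(payload):
    # Assumption: stand-in for the real CRC, a 16-bit sum of the payload.
    return sum(payload) & 0xFFFF


def driver(buffer):
    """Generator: send() one byte per interrupt, get a Status back."""
    status = Status.CONT
    while True:
        byte = yield status
        if byte != STX:             # ignore bytes until a message starts
            status = Status.ESTX
            continue
        buffer.clear()
        while True:
            byte = yield Status.CONT
            if byte == ESC:         # escaped byte is data, even a control char
                byte = yield Status.CONT
            elif byte == ETX:       # unescaped ETX ends the message body
                break
            buffer.append(byte)
        high = yield Status.CONT    # two check bytes, high byte first
        low = yield Status.CONT
        received = (high << 8) | low
        status = Status.MSG if received == checksum(buffer) else Status.ECRC


def feed(data):
    """Drive the generator as the OS would, one byte per 'interrupt'."""
    buffer = bytearray()
    gen = driver(buffer)
    next(gen)                       # prime: run to the first yield
    statuses = [gen.send(b) for b in data]
    return statuses, bytes(buffer)
```

Each `yield` marks one execution state of the protocol, so the state machine is written as straight-line control flow rather than as a transition table, which is the benefit the paragraph above attributes to generators.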
 \begin{figure}
 …
 \end{figure}
 Figure~\ref{f:CFAPingPongGen} shows a symmetric generator, where the generator resumes another generator, forming a resume/resume cycle.
+Generators can also have symmetric activation using resume/resume to create control-flow cycles among generators.
 (The trivial cycle is a generator resuming itself.)
 This control flow is similar to recursion for functions but without stack growth.
+Figure~\ref{f:PingPongFullCoroutineSteps} shows that the steps for symmetric control-flow are creating, executing, and terminating the cycle.
+Figure~\ref{f:PingPongFullCoroutineSteps} shows the steps for symmetric control-flow using the ping/pong program in Figure~\ref{f:CFAPingPongGen}.
+The program starts by creating the generators, @ping@ and @pong@, and then assigns the partners that form the cycle.
 Constructing the cycle must deal with definition-before-use to close the cycle, \ie, the first generator must know about the last generator, which is not within scope.
 (This issue occurs for any cyclic data structure.)
-The example creates the generators, @ping@ and @pong@, and then assigns the partners that form the cycle.
 % (Alternatively, the constructor can assign the partners as they are declared, except the first, and the first-generator partner is set after the last generator declaration to close the cycle.)
 Once the cycle is formed, the program main resumes one of the generators, @ping@, and the generators can then traverse an arbitrary cycle using @resume@ to activate partner generator(s).
+Once the cycle is formed, the program main resumes one of the generators, @ping@, and the generators can then traverse an arbitrary number of cycles using @resume@ to activate partner generator(s).
 Terminating the cycle is accomplished by @suspend@ or @return@, both of which go back to the stack frame that started the cycle (program main in the example).
 Note, the creator and starter may be different, \eg if the creator calls another function that starts the cycle.
 …
 Also, since local variables are not retained in the generator function, there are no objects with destructors to be called, so the cost is the same as a function return.
 Destructor cost occurs when the generator instance is deallocated by the creator.
+\begin{figure}
+\centering
+\input{FullCoroutinePhases.pstex_t}
+\vspace*{-10pt}
+\caption{Symmetric coroutine steps: Ping / Pong}
+\label{f:PingPongFullCoroutineSteps}
+\end{figure}
 \begin{figure}
 …
 \end{figure}
-\begin{figure}
-\centering
-\input{FullCoroutinePhases.pstex_t}
-\vspace*{-10pt}
-\caption{Symmetric coroutine steps: Ping / Pong}
-\label{f:PingPongFullCoroutineSteps}
-\end{figure}
 Figure~\ref{f:CPingPongSim} shows the C implementation of the \CFA symmetric generator, where there is still only one additional field, @restart@, but @resume@ is more complex because it does a forward rather than backward jump.
 Before the jump, the parameter for the next call @partner@ is placed into the register used for the first parameter, @rdi@, and the remaining registers are reset for a return.
 …
 \label{s:Coroutine}
+Stackful coroutines (Table~\ref{t:ExecutionPropertyComposition} case 5) extend generator semantics, \ie there is an implicit closure and @suspend@ may appear in a helper function called from the coroutine main.
+A coroutine is specified by replacing @generator@ with @coroutine@ for the type.
+Stackful coroutines (Table~\ref{t:ExecutionPropertyComposition} case 5) extend generator semantics with an implicit closure and @suspend@ may appear in a helper function called from the coroutine main because of the separate stack.
+Note, simulating coroutines with stacks of generators, \eg Python with @yield from@, cannot handle symmetric control-flow.
+Furthermore, all stack components must be generators, so it is impossible to call a library function passing a generator that yields.
+Creating a generator copy of the library function may be impossible because the library function is opaque.
+A \CFA coroutine is specified by replacing @generator@ with @coroutine@ for the type.
 Coroutine generality results in higher cost for creation, due to dynamic stack allocation, for execution, due to context switching among stacks, and for terminating, due to possible stack unwinding and dynamic stack deallocation.
 A series of different kinds of coroutines and their implementations demonstrate how coroutines extend generators.
 First, the previous generator examples are converted to their coroutine counterparts, allowing local-state variables to be moved from the generator type into the coroutine main.
+Now the coroutine type only contains communication variables between interface functions and the coroutine main.
 \begin{center}
 \begin{tabular}{@{}l|l|l|l@{}}
 …
+}
 \end{cfa}
 A call to this function is placed at the end of the driver's coroutine-main.
+A call to this function is placed at the end of the device driver's coroutine-main.
 For complex finite-state machines, refactoring is part of normal program abstraction, especially when code is used in multiple places.
 Again, this complexity is usually associated with execution state rather than data state.
 …
 The \CFA @dtype@ property provides no \emph{implicit} copying operations and the @is_coroutine@ trait provides no \emph{explicit} copying operations, so all coroutines must be passed by reference or pointer.
 The function definitions ensure there is a statically typed @main@ function that is the starting point (first stack frame) of a coroutine, and a mechanism to read the coroutine descriptor from its handle.
 The @main@ function has no return value or additional parameters because the coroutine type allows an arbitrary number of interface functions with corresponding arbitrary typed input and output values versus fixed ones.
+The @main@ function has no return value or additional parameters because the coroutine type allows an arbitrary number of interface functions with arbitrary typed input and output values versus fixed ones.
 The advantage of this approach is that users can easily create different types of coroutines, \eg changing the memory layout of a coroutine is trivial when implementing the @get_coroutine@ function, and possibly redefining \textsf{suspend} and @resume@.
 …
 The coroutine descriptor contains all implicit declarations needed by the runtime, \eg @suspend@/@resume@, and can be part of the coroutine handle or separate.
 The coroutine stack can appear in a number of locations and be fixed or variable sized.
+Hence, the coroutine's stack could be a variable-length structure (VLS)\footnote{
+We are examining VLSs, where fields can be variable-sized structures or arrays.
+Once allocated, a VLS is fixed sized.}
+Hence, the coroutine's stack could be a variable-length structure (VLS)
+% \footnote{
+% We are examining VLSs, where fields can be variable-sized structures or arrays.
+% Once allocated, a VLS is fixed sized.}
 on the allocating stack, provided the allocating stack is large enough.
 For a VLS stack, allocation and deallocation are inexpensive adjustments of the stack pointer, modulo any stack constructor costs for initial frame setup.
 …
 \label{s:threads}
+Threading (Table~\ref{t:ExecutionPropertyComposition} case 11) needs the ability to start a thread and wait for its completion.
+A common API for this ability is @fork@ and @join@.
+Threading (Table~\ref{t:ExecutionPropertyComposition} case 11) needs the ability to start a thread and wait for its completion, where a common API is @fork@ and @join@.
 \vspace{4pt}
 \par\noindent
 …
 For these reasons, \CFA selected monitors as the core high-level concurrency construct, upon which higher-level approaches can be easily constructed.
 Figure~\ref{f:AtomicCounter} compares a \CFA and Java monitor implementing an atomic counter.\footnote{
+Like other concurrent programming languages, \CFA and Java have performant specializations for the basic types using atomic instructions.}
+Figure~\ref{f:AtomicCounter} compares a \CFA and Java monitor implementing an atomic counter.
+(Like other concurrent programming languages, \CFA and Java have performant specializations for the basic types using atomic instructions.)
 A \newterm{monitor} is a set of functions that ensure mutual exclusion when accessing shared state.
 (Note, in \CFA, @monitor@ is short-hand for @mutex struct@.)
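
As a rough illustration of the monitor idea in this hunk (a set of functions that ensure mutual exclusion on shared state), the following Python sketch guards a counter with a single lock. It is not the \CFA or Java code from Figure~\ref{f:AtomicCounter}: the names `AtomicCounter` and `run` are hypothetical, and the explicit lock stands in for what a monitor provides implicitly.

```python
import threading


class AtomicCounter:
    """Monitor-style counter: every public method holds the same lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def increment(self):
        with self._lock:            # mutual exclusion for the update
            self._value += 1
            return self._value

    def value(self):
        with self._lock:
            return self._value


def run(workers=4, per_worker=1000):
    """Start worker threads, wait for all of them, and return the total."""
    counter = AtomicCounter()

    def work():
        for _ in range(per_worker):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()                   # fork
    for t in threads:
        t.join()                    # join
    return counter.value()
```

The `start`/`join` pair in `run` also shows the fork/join pattern mentioned in the threading hunk above.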
 …
 The total time is divided by @N@ to obtain the average time for a benchmark.
 Each benchmark experiment is run 13 times and the average appears in the table.
 All omitted tests for other languages are functionally identical to the \CFA tests and available online~\cite{CforallBenchMarks}.
+All omitted tests for other languages are functionally identical to the \CFA tests and available online~\cite{CforallConcurrentBenchmarks}.
 % tar --exclude-ignore=exclude -cvhf benchmark.tar benchmark
+% cp -p benchmark.tar /u/cforall/public_html/doc/concurrent_benchmark.tar
 \paragraph{Creation}
 …
 \uC thread                              & 523.4         & 523.9         & 7.7           \\
 Python generator                & 123.2         & 124.3         & 4.1           \\
 Node.js generator               & 32.3          & 32.2          & 0.3           \\
+Node.js generator               & 33.4          & 33.5          & 0.3           \\
 Goroutine thread                & 751.0         & 750.5         & 3.1           \\
 Rust tokio thread               & 1860.0        & 1881.1        & 37.6          \\
 …
 Python generator        & 40.9          & 41.3          & 1.5   \\
 Node.js await           & 1852.2        & 1854.7        & 16.4  \\
 Node.js generator       & 32.6          & 32.2          & 1.0   \\
+Node.js generator       & 33.3          & 33.4          & 0.3   \\
 Goroutine thread        & 143.0         & 143.3         & 1.1   \\
 Rust async await        & 32.0          & 32.0          & 0.0   \\
 …
 While control flow in \CFA has a strong start, development is still underway to complete a number of missing features.
+\vspace{-5pt}
+\paragraph{Flexible Scheduling}
+\label{futur:sched}
+\medskip
+\textbf{Flexible Scheduling:}
 An important part of concurrency is scheduling.
 Different scheduling algorithms can affect performance, both in terms of average and variation.
 …
 Currently, the \CFA pluggable scheduler is too simple to handle complex scheduling, \eg quality of service and real-time, where the scheduler must interact with mutex objects to deal with issues like priority inversion~\cite{Buhr00b}.
+\vspace{-5pt}
+\paragraph{Non-Blocking I/O}
+\label{futur:nbio}
+\smallskip
+\textbf{Non-Blocking I/O:}
 Many modern workloads are not bound by computation but IO operations, common cases being web servers and XaaS~\cite{XaaS} (anything as a service).
 These types of workloads require significant engineering to amortize the costs of blocking IO-operations.
 …
 A non-blocking I/O library is currently under development for \CFA.
+\vspace{-5pt}
+\paragraph{Other Concurrency Tools}
+\label{futur:tools}
+\smallskip
+\textbf{Other Concurrency Tools:}
 While monitors offer flexible and powerful concurrency for \CFA, other concurrency tools are also necessary for a complete multi-paradigm concurrency package.
 Examples of such tools can include futures and promises~\cite{promises}, executors and actors.
 …
 As well, new \CFA extensions should make it possible to create a uniform interface for virtually all mutual exclusion, including monitors and low-level locks.
+\vspace{-5pt}
+\paragraph{Implicit Threading}
+\label{futur:implcit}
+Basic \emph{embarrassingly parallel} applications can benefit greatly from implicit concurrency, where sequential programs are converted to concurrent, possibly with some help from pragmas to guide the conversion.
+\smallskip
+\textbf{Implicit Threading:}
+Basic \emph{embarrassingly parallel} applications can benefit greatly from implicit concurrency, where sequential programs are converted to concurrent, with some help from pragmas to guide the conversion.
 This type of concurrency can be achieved both at the language level and at the library level.
 The canonical example of implicit concurrency is concurrent nested @for@ loops, which are amenable to divide and conquer algorithms~\cite{uC++book}.

doc/papers/concurrency/response2

-                      r9246ec6
+                      r27125d0
     OOPSLA 2017.
+Of our testing languages, only Java is JITTED. To ensure the Java test-programs
+correctly measured the specific feature, we consulted with Dave Dice at Oracle
+who works directly on the development of the Oracle JVM Just-in-Time
+Compiler. We modified our test programs based on his advice, and he validated
+our programs as correctly measuring the specified language feature. Hence, we
+have taken into account all issues related to performing benchmarks in JITTED
+languages.  Dave's help is recognized in the Acknowledgment section. Also, all
+the benchmark programs are publicly available for independent verification.
+Of our testing languages, only Java and Node.js are JITTED. To ensure the Java
+test-programs correctly measured the specific feature, we consulted with Dave
+Dice at Oracle who works directly on the development of the Oracle JVM
+Just-in-Time Compiler. We modified our test programs based on his advice, and
+he validated our programs as correctly measuring the specified language
+feature. Hence, we have taken into account all issues related to performing
+benchmarks in JITTED languages.  Dave's help is recognized in the
+Acknowledgment section. Also, all the benchmark programs are publicly available
+for independent verification.
+Similarly, we verified our Node.js programs with Gregor Richards, an expert in
+just-in-time compilation for dynamic typing.
 …
 critical section); threads can arrive at any time in any order, where the
 mutual exclusion for the critical section ensures one thread executes at a
 time. Interestingly, Reed & Kanodia's mutual exclusion is Habermann's
 communication not mutual exclusion. These papers only buttress our contention
+time. Interestingly, Reed & Kanodia's critical region is Habermann's
+communication not critical section. These papers only buttress our contention
 about the confusion of these terms in the literature.
 …
 I think we may be differing on the meaning of stack. You may be imagining a
 modern stack that grows and shrink dynamically. Whereas early Fortran
+modern stack that grows and shrinks dynamically. Whereas early Fortran
 preallocated a stack frame for each function, like Python allocates a frame for
 a generator.  Within each preallocated Fortran function is a frame for local
 variables and a pointer to store the return value for a call.  The Fortran
 call/return mechanism than uses these frames to build a traditional call stack
+call/return mechanism uses these frames to build a traditional call stack
 linked by the return pointer. The only restriction is that a function stack
 frame can only be used once, implying no direct or indirect recursion.  Hence,
 …
     build coroutines by stacks of generators invoking one another.
+As we point out, coroutines built from stacks of generators have problems, such
+as no symmetric control-flow. Furthermore, stacks of generators have a problem
+with the following programming pattern.  logger is a library function called
 from a function or a coroutine, where the doit for the coroutine suspends. With
+stacks of generators, there has to be a function and generator version of
+logger to support these two scenarios. If logger is a library function, it may
 be impossible to create the generator logger because the logger function is
+Coroutines built from stacks of generators have problems, such as no symmetric
+control-flow. Furthermore, stacks of generators have a problem with the
+following programming pattern.  logger is a library function called from a
+function or a coroutine, where the doit for the coroutine suspends. With stacks
+of generators, there has to be a function and generator version of logger to
+support these two scenarios. If logger is a library function, it may be
+impossible to create the generator logger because the logger function is
 opaque.
 …
+  }
+Additional text has been added to the start of Section 3.2 to address this comment.
 …
     * prefer generators for simple computations that yield up many values,
-This description does not cover output or matching generators that do not yield
-many or any values. For example, the output generator Fmt yields no value; the
-device driver yields a value occasionally once a message is found. Furthermore,
-real device drivers are not simple; there can have hundreds of states and
-transitions. Imagine the complex event-engine for a web-server written as a
-generator.
     * prefer coroutines for more complex processes that have significant
       internal structure,
+As for generators, complexity is not the criterion for selection. A coroutine
+brings generality to the implementation because of the additional stack, whereas
+generators have restrictions on standard software-engineering practices: variable
+placement, no helper functions without creating an explicit generator stack,
+and no symmetric control-flow. Whereas, the producer/consumer example in Figure
+uses stack variable placement, helpers, and simple ping/pong-style symmetric
+We do not believe the number of values yielded is an important factor in
+choosing between a generator or coroutine; either form can receive (input) or
+return (output) millions of values. As well, simple versus complex computations
+is also not a criterion for selection as both forms can be very
+sophisticated. As stated in the paper, a coroutine brings generality to the
+implementation because of the additional stack, whereas generators have
+restrictions on standard software-engineering practices: variable placement, no
+helper functions without creating an explicit generator stack, and no symmetric
 control-flow.
 …
     in purpose.
 Given that you asked about this it before, I believe other readers might also
+ask the same question because async-await is very popular. So I think this
+section does help to position the work in the paper among other work, and
 hence, it is appropriate to keep it in the paper.
+Given that you asked about this before, I believe other readers might also ask
+the same question because async-await is very popular. So I think this section
+does help to position the work in the paper among other work, and hence, it is
+appropriate to keep it in the paper.
 …
 To handle the other 5% of the cases, there is a trivial Cforall pattern
 providing Java-style start/join. The additional cost for this pattern is 2
 light-weight thread context-switches.
+providing Java-style start/join. The additional cost for this pattern is small
+in comparison to the work performed between the start and join.
   thread T {};
