Reviewing: 1

    I still have a couple of issues --- perhaps the largest is that it's
    still not clear at this point in the paper what some of these options
    are, or crucially how they would be used. I don't know if it's
    possible to give high-level examples or use cases to be clear about
    these up front - or if that would duplicate too much information from
    later in the paper - either way expanding out the discussion - even if
    just two a couple of sentences for each row - would help me more.

Section 2.1 is changed to address this suggestion.


    * 1st para section 2 begs the question: why not support each
      dimension independently, and let the programmer or library designer
      combine features?

As seen in Table 1, not all of the combinations work, and having programmers
directly use these low-level mechanisms is error prone. Accessing these
fundamental mechanisms through higher-level constructs has always been the
purpose of a programming language.


    * Why must there "be language mechanisms to create, block/unblock, and join
      with a thread"?  There aren't in Smalltalk (although there are in the
      runtime).  Especially given in Cforall those mechanisms are *implicit* on
      thread creation and destruction?

Fixed, changed sentence to:

 A programmer needs mechanisms to create, block and unblock, and join with a
 thread, even if these basic mechanisms are supplied indirectly through
 high-level features.


   * "Case 1 is a function that borrows storage for its state (stack
     frame/activation) and a thread from its invoker"
  
     this much makes perfect sense to me, but I don't understand how a
     non-stateful, non-threaded function can then retain
  
     "this state across callees, ie, function local-variables are
     retained on the stack across calls."
  
     how can it retain function-local values *across calls* when it
     doesn't have any functional-local state?

In the following example:

  void foo() {
     // local variables and code
  }
  void bar() {
     // local variables
     foo();
  }

bar is the caller and foo is the callee. bar borrows the program stack and
thread to make the call to foo. When foo, the callee, is executing, bar's local
variables (state) are retained on the *borrowed* stack across the call. (Note, I
added *borrowed* to that sentence in the paper to help clarify.)  Furthermore,
foo's local variables are also retained on the borrowed stack. When foo and bar
return, all of their local state is gone (not retained). This behaviour is
standard call/return semantics in an imperative language.


     I'm not sure if I see two separate cases here - roughly equivalent
     to C functions without static storage, and then C functions *with*
     static storage.

Yes, but there is only one instance of the static storage across all
activations of the C function. For generators and coroutines, each instance has
its own state, like an object in an OO language.
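
For example, in plain C (a minimal sketch with a hypothetical function name):

  int counter() {
      static int calls = 0;   // one instance of calls, shared by every activation of counter
      return ++calls;
  }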


     I assumed that was the distinction between cases 1 & 3; but perhaps the
     actual distinction is that 3 has a suspend/resume point, and so the
     "state" in figure 1 is this component of execution state (viz figs 1 & 2),
     not the state representing the cross-call variables?

So case 3 is like an object with the added ability to retain where it was last
executing.  When a generator is resumed, the generator object (structure
instance) is passed as an explicit reference, and within this object is the
restart location in the generator's "main". When the generator main executes,
it uses the borrowed stack for its local variables and any functions it calls,
just like an object's member function borrows the stack for its local variables
but also has an implicit receiver to the object state.  A generator can have
static storage, too, which is a single instance shared across all generator
instances of that type, as for static storage in an object type. All the kinds
of storage are at play, with the same semantics as in other languages.
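
As a sketch in the style of the paper's generators (the names Gen and next are
hypothetical, and the syntax is approximate):

  generator Gen { int i; };              // per-instance state, like object fields
  void main( Gen & g ) with( g ) {       // generator main borrows the resumer's stack
      for ( i = 0;; i += 1 ) suspend;    // restart location is saved in g at each suspend
  }
  int next( Gen & g ) { resume( g ); return g.i; }

  // Gen g1, g2; two independent instances, each with its own i and restart location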


    > but such evaluation isn't appropriate for garbage-collected or JITTed
    > languages like Java or Go.
    For JITTed languages in particular, reporting peak performance needs to
    "warm up" the JIT with a number of iterators before beginning
    measurement. Actually for JIT's its even worse: see Edd Barrett et al
    OOPSLA 2017.
   
Of our tested languages, only Java and Node.js are JITTed. To ensure the Java
test programs correctly measured the specific feature, we consulted with Dave
Dice at Oracle, who works directly on the development of the Oracle JVM
Just-in-Time Compiler. We modified our test programs based on his advice, and
he validated our programs as correctly measuring the specified language
feature. Dave's help is recognized in the Acknowledgment section.  Similarly,
we verified our Node.js programs with Gregor Richards, an expert in
just-in-time compilation for dynamic typing.  Hence, we have taken into account
all issues related to benchmarking JITTed languages.  Also, all the benchmark
programs are publicly available for independent verification.


   * footnote A - I've looked at various other papers & the website to try to
     understand how "object-oriented" Cforall is - I'm still not sure.  This
     footnote says Cforall has "virtuals" - presumably virtual functions,
     i.e. dynamic dispatch - and inheritance: that really is OO as far as I
     (and most OO people) are concerned.  For example Haskell doesn't have
     inheritance, so it's not OO; while CLOS (the Common Lisp *Object* System)
     or things like Cecil and Dylan are considered OO even though they have
     "multiple function parameters as receivers", lack "lexical binding between
     a structure and set of functions", and don't have explicit receiver
     invocation syntax.  Python has receiver syntax, but unlike Java or
     Smalltalk or C++, method declarations still need to have an explicit
     "self" receiver parameter.  Seems to me that Go, for example, is
     more-or-less OO with interfaces, methods, and dynamic dispatch (yes also
     and an explicit receiver syntax but that's not determinative); while Rust
     lacks dynamic dispatch built-in.  C is not OO as a language, but as you
     say given it supports function pointers with structures, it does support
     an OO programming style.
   
     This is why I again recommend just not buying into this fight: not making
     any claims about whether Cforall is OO or is not - because as I see it,
     the rest of the paper doesn't depend on whether Cforall is OO or not.
     That said: this is just a recommendation, and I won't quibble over this
     any further.

We believe it is important to identify Cforall as a non-OO language because
that choice heavily influences the syntax and semantics used to build its
concurrency.
Since many aspects of Cforall are not OO, the rest of the paper *does* depend
on Cforall being identified as non-OO, otherwise readers would have
significantly different expectations for the design. We did not mean to suggest
that languages that support function pointers with structures support an OO
style. We revised the footnote to avoid this interpretation. Finally, Golang
does not identify itself as an OO language.
https://golang.org/doc/faq#Is_Go_an_object-oriented_language


   * is a "monitor function" the same as a "mutex function"?
     if so the paper should pick one term; if not, make the distinction clear.

Fixed. Picked "mutex". Changed the language and all places in the paper.


   * "As stated on line 1 because state declarations from the generator
      type can be moved out of the coroutine type into the coroutine main"
  
      OK sure, but again: *why* would a programmer want to do that?
      (Other than, I guess, to show the difference between coroutines &
      generators?)  Perhaps another way to put this is that the first
      para of 3.2 gives the disadvantages of coroutines vs-a-vs
      generators, briefly describes the extended semantics, but never
      actually says why a programmer may want those extended semantics,
      or how they would benefit.  I don't mean to belabour the point,
      but (generalist?) readers like me would generally benefit from
      those kinds of discussions about each feature throughout the
      paper: why might a programmer want to use them?

On page 8, it states:

  Having to manually create the generator closure by moving local-state
  variables into the generator type is an additional programmer burden (removed
  by the coroutine in Section 3.2). ...

Also, these variables can now be refactored into a helper function, where the
helper function suspends at arbitrary call depth. So imagine a coroutine helper
function that is called only occasionally within the coroutine but has a large
array that is retained across suspends within the helper function. For a
generator, this large array has to be declared in the generator type, enlarging
each generator instance even though the array is only used occasionally.
Whereas, the coroutine only needs the array allocated while the helper is
running. A coroutine does have a stack, which occupies storage, but the maximum
stack size only needs to be that of the call chain allocating the most storage,
whereas the generator type has the maximum size of all the variables that could
ever be created.
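
The following minimal sketch illustrates the pattern, assuming the paper's
coroutine syntax; the names helper and rare_event, and the array size, are
hypothetical:

  #include <coroutine.hfa>

  coroutine C { int rare_event; };
  void helper( C & c ) {
      int work[100000];                   // large array lives on the coroutine's stack
      // ... fill and use work ...
      suspend;                            // array is retained across this suspend
      // ... use work after resumption ...
  }
  void main( C & c ) with( c ) {
      for ( ;; ) {
          if ( rare_event ) helper( c );  // storage for work exists only while helper is active
          suspend;
      }
  }

With a generator, work would have to be a field of the generator type, enlarging
every instance; with the coroutine, it occupies stack space only during calls to
helper.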


    > p17 if the multiple-monitor entry procedure really is novel, write a paper
    > about that, and only about that.
    > We do not believe this is a practical suggestion.
    * I'm honestly not trying to be snide here: I'm not an expert on monitor or
      concurrent implementations. Brinch Hansen's original monitors were single
      acquire; this draft does not cite any other previous work that I could
      see. I'm not suggesting that the brief mention of this mechanism
      necessarily be removed from this paper, but if this is novel (and a clear
      advance over a classical OO monitor a-la Java which only acquires the
      distinguished receiver) then that would be worth another paper in itself.

First, to explain multiple-monitor entry in Cforall as a separate paper would
require significant background on Cforall concurrency, which means repeating
large sections of this paper. Second, it feels like a paper just on
multiple-monitor entry would be a little thin, even if the capability is novel
to the best of our knowledge. Third, we feel multiple-monitor entry springs
naturally from the overarching theme of the paper that Cforall is a non-OO
programming language, allowing multiple mutex receivers.


    My typo: the paper's conclusion should come at the end, after the
    future work section.

Combined into a Conclusions and Future Work section.



Reviewing: 2

    on the Boehm paper and whether code is "all sequential to the compiler": I
    now understand the authors' position better and suspect we are in violent
    agreement, except for whether it's appropriate to use the rather breezy
    phrase "all sequential to the compiler". It would be straightforward to
    clarify that code not using the atomics features is optimized *as if* it
    were sequential, i.e. on the assumption of a lack of data races.

Fixed, "as inline and library code is compiled as sequential without any
explicit concurrent directive."


    on the distinction between "mutual exclusion" and "synchronization": the
    added citation does help, in that it makes a coherent case for the
    definition the authors prefer. However, the text could usefully clarify
    that this is a matter of definition not of fact, given especially that in
    my assessment the authors' preferred definition is not the most common
    one. (Although the mention of Hoare's apparent use of this definition is
    one data point, countervailing ones are found in many contemporaneous or
    later papers, e.g. Habermann's 1972 "Synchronization of Communicating
    Processes" (CACM 15(3)), Reed & Kanodia's 1979 "Synchronization with
    eventcounts and sequencers" (CACM (22(2)) and so on.)

Fixed, "We contend these two properties are independent, ...".

With respect to the two papers, Habermann fundamentally agrees with our
definitions, where the term mutual exclusion is the same but Habermann uses the
term "communication" for our synchronization. However, the term "communication"
is rarely used to mean synchronization. The fact that Habermann collectively
calls these two mechanisms synchronization is the source of the confusion.
Reed & Kanodia
state:

  By mutual exclusion we mean any mechanism that forces the time ordering of
  execution of pieces of code, called critical regions, in a system of
  concurrent processes to be a total ordering.

But there is no timing order for a critical region (which I assume means
critical section); threads can arrive at any time in any order, where the
mutual exclusion for the critical section ensures one thread executes at a
time. Interestingly, Reed & Kanodia's critical region is Habermann's
communication not critical section. These papers only buttress our contention
about the confusion of these terms in the literature.


    section 2 (an expanded version of what was previously section 5.9) lacks
    examples and is generally obscure and allusory ("the most advanced feature"
    -- name it! "in triplets" -- there is only one triplet!;

Fixed.

 These independent properties can be used to compose different language
 features, forming a compositional hierarchy, where the combination of all
 three is the most advanced feature, called a thread/task/process. While it is
 possible for a language to only provide threads for composing
 programs~\cite{Hermes90}, this unnecessarily complicates and makes inefficient
 solutions to certain classes of problems.


    what are "execution locations"? "initialize" and "de-initialize"

Fixed through an example at the end of the sentence.

 \item[\newterm{execution state}:] is the state information needed by a
 control-flow feature to initialize and manage both compute data and execution
 location(s), and de-initialize. For example, calling a function initializes a
 stack frame including contained objects with constructors, manages local data
 in blocks and return locations during calls, and de-initializes the frame by
 running any object destructors and management operations.


    what? "borrowed from the invoker" is a concept in need of explaining or at
    least a fully explained example -- in what sense does a plain function
    borrow" its stack frame?

A function has no storage of its own except static storage; its local storage
comes from the program stack. For a function to run, it must borrow storage for
its stack frame from somewhere. When the function returns, the borrowed storage
is returned. Do you have a more appropriate word?


    "computation only" as opposed to what?

This is a term in concurrency for operations that compute without blocking,
i.e., the operation starts with everything it needs to compute its result,
runs to completion, and blocks only once at the end to return its result.
Computation-only code occurs in "embarrassingly parallel" problems.


    in 2.2, in what way is a "request" fundamental to "synchronization"?

I assume you are referring to the last bullet.

 Synchronization must be able to control the service order of requests
 including prioritizing selection from different kinds of outstanding requests,
 and postponing a request for an unspecified time while continuing to accept
 new requests.

Habermann states on page 173:

 Looking at deposit we see that sender and receiver should be synchronized with
 respect to buffer overflow: deposit of another message must be delayed if
 there is no empty message frame, and such delay should be removed when the
 first empty frame becomes available.  This is programmed as
 synchronization(frame) : deposit is preceded by wait(frame), accept is
 followed by signal( frame), and the constant C[frame] is set equal to
 "bufsize."

Here synchronization is controlling the service order of requests: when the
buffer is full, requests for insert (sender) are postponed until a request to
remove (receiver) occurs.  Without the ability to control service order among
requests, the producer/consumer problem with a bounded buffer cannot be
solved. Hence, this capability is fundamental.
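
As a minimal sketch in the style of the paper's monitors (the buffer size and
the names are hypothetical):

  monitor Buffer {
      condition frame, data;                 // queues of postponed insert/remove requests
      int elements[10], count;
  };
  void insert( Buffer & mutex buf, int elem ) with( buf ) {
      if ( count == 10 ) wait( frame );      // postpone request until an empty frame is available
      elements[count++] = elem;
      signal( data );                        // a postponed remove request may now proceed
  }
  int remove( Buffer & mutex buf ) with( buf ) {
      if ( count == 0 ) wait( data );        // postpone request until a message is available
      int elem = elements[--count];
      signal( frame );                       // a postponed insert request may now proceed
      return elem;
  }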


    and the "implicitly" versus "explicitly" point needs stating as elsewhere,
    with a concrete example e.g. Java built-in mutexes versus
    java.util.concurrent).

Fixed.

 MES must be available implicitly in language constructs, \eg Java built-in
 monitors, as well as explicitly for specialized requirements, \eg
 @java.util.concurrent@, because requiring programmers to build MES using
 low-level locks often leads to incorrect programs.


    section 6: 6.2 omits the most important facts in preference for otherwise
    inscrutable detail: "identify the kind of parameter" (first say *that there
    are* kinds of parameter, and what "kinds" means!); "mutex parameters are
    documentation" is misleading (they are also semantically significant!) and
    fails to say *what* they mean; the most important thing is surely that
    'mutex' is a language feature for performing lock/unlock operations at
    function entry/exit. So say it!

These sections have been rewritten to address the comments.


    The meanings of examples f3 and f4 remain unclear.

Rewrote the paragraph.


    Meanwhile in 6.3, "urgent" is not introduced (we are supposed to infer its
    meaning from Figure 12,

Defined Hoare's urgent list at the start of the paragraph.


    but that Figure is incomprehensible to me), and we are told of "external
    scheduling"'s long history in Ada but not clearly what it actually means;

We do not know how to address your comment because the description is clear to
us and other non-reviewers who have read the paper. I had forgotten an
important citation to prior work on this topic, which is now referenced:

   Buhr Peter A., Fortier Michel, Coffin Michael H.. Monitor
   Classification. ACM Computing Surveys. 1995; 27(1):63-107.

This citation is a 45-page paper expanding on the topic of internal scheduling.
I also added a citation to Hoare's monitor paper where signal_block is defined:

  When a process signals a condition on which another process is waiting, the
  signalling process must wait until the resumed process permits it to
  proceed.

External scheduling from Ada was subsequently added to uC++ and Cforall.
Furthermore, Figure 20 shows a direct comparison of CFA procedure-call
external-scheduling with Go channel external-scheduling. So external scheduling
exists in languages beyond Ada.


    6.4's description of "waitfor" tells us it is different from an if-else
    chain but tries to use two *different* inputs to tell us that the behavior
    is different; tell us an instance where *the same* values of C1 and C2 give
    different behavior (I even wrote out a truth table and still don't see the
    semantic difference)

Again, it is unclear what the problem is. For the if-statement, if C1 is true,
it only waits for a call to mem1, even if C2 is true and there is an
outstanding call to mem2. For the waitfor, if both C1 and C2 are true and there
is a call to mem2 but not mem1, it accepts the call to mem2, which cannot
happen with the if-statement. So for all true when clauses, any outstanding
call is immediately accepted. If there are no outstanding calls, the waitfor
blocks until the next call to any of the true when clauses occurs. I added the
following sentence to further clarify.

 Hence, the @waitfor@ has parallel semantics, accepting any true @when@ clause.

Note, the same parallel semantics exists for the Go select statement with
respect to waiting for a set of channels to receive data. While Go select does
not have a when clause, it would be trivial to add it, making the select more
expressive.
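
As a sketch of the difference, assuming mutex functions mem1 and mem2 of a
monitor m (syntax follows the paper's extended waitfor):

  // if-statement: when C1 is true, only a call to mem1 is accepted,
  // even if C2 is true and a call to mem2 is already pending
  if ( C1 ) waitfor( mem1 : m );
  else if ( C2 ) waitfor( mem2 : m );

  // waitfor: any pending call to a true when clause is accepted immediately;
  // with C1 and C2 both true and only a call to mem2 pending, mem2 is accepted
  when ( C1 ) waitfor( mem1 : m );
  or when ( C2 ) waitfor( mem2 : m );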


    The authors frequently use bracketed phrases, and sometimes slashes "/", in
    ways that are confusing and/or detrimental to readability.  Page 13 line
    2's "forward (backward)" is one particularly egregious example.  In general
    I would recommend the the authors try to limit their use of parentheses and
    slashes as a means of forcing a clearer wording to emerge.

Many of the slashes and parentheticals have been removed. Some are retained to
express joined concepts: call/return, suspend/resume, resume/resume, I/O.


    Also, the use of "eg." is often cursory and does not explain the examples
    given, which are frequently a one- or two-word phrase of unclear referent.

A few of these are fixed.


    Considering the revision more broadly, none of the more extensive or
    creative rewrites I suggested in my previous review have been attempted,
    nor any equivalent efforts to improve its readability.

If you reread the previous response, we addressed all of your suggestions except

        An expositional idea occurs: start the paper with a strawman
        naive/limited realisation of coroutines -- say, Simon Tatham's popular
        "Coroutines in C" web page -- and identify point by point what the
        limitations are and how C\/ overcomes them. Currently the presentation
        is often flat (lacking motivating contrasts) and backwards (stating
        solutions before problems). The foregoing approach might fix both of
        these.
    
    We prefer the current structure of our paper and believe the paper does
    explain basic coding limitations and how they are overcome using high-level
    control-flow mechanisms.

and we have addressed readability issues in this version.


    The hoisting of the former section 5.9 is a good idea, but the newly added
    material accompanying it (around Table 1) suffers fresh deficiencies in
    clarity. Overall the paper is longer than before, even though (as my
    previous review stated), I believe a shorter paper is required in order to
    serve the likely purpose of publication. (Indeed, the authors' letter
    implies that a key goal of publication is to build community and gain
    external users.)

This comment is the referee's opinion, with which we do not agree. Our
previous 35-page SP&E paper on Cforall:

  Moss Aaron, Schluntz Robert, Buhr Peter A.. Cforall: Adding Modern
  Programming Language Features to C. Softw. Pract. Exper. 2018;
  48(12):2111-2146.

has a similar structure and style to this paper, and it received an award from
John Wiley & Sons:

 Software: Practice & Experience for articles published between January 2017
 and December 2018, "most downloads in the 12 months following online
 publication showing the work generated immediate impact and visibility,
 contributing significantly to advancement in the field".

So we have demonstrated an ability to build community and gain external users.

  
    Given this trajectory, I no longer see a path to an acceptable revision of
    the present submission. Instead I suggest the authors consider splitting
    the paper in two: one half about coroutines and stack management, the other
    about mutexes, monitors and the runtime. (A briefer presentation of the
    runtime may be helpful in the first paper also, and a brief recap of the
    generator and coroutine support is obviously needed in the second too.)

Again we disagree with the referee's suggestion to vastly restructure the
paper. What advantage is there in presenting exactly the same material across
two papers, which would likely end up longer than a single paper? The current
paper has a clear theme that fundamental execution properties generate a set of
basic language mechanisms, and we then proceed to show how these mechanisms can
be designed into the programming language Cforall.


    I do not buy the authors' defense of the limited practical experience or
    "non-micro" benchmarking presented. Yes, gaining external users is hard and
    I am sympathetic on that point. But building something at least *somewhat*
    substantial with your own system should be within reach, and without it the
    "practice and experience" aspects of the work have not been explored.
    Clearly C\/ is the product of a lot of work over an extended period, so it
    is a surprise that no such experience is readily available for inclusion.

Understood. There are no agreed-upon concurrency benchmarks, which is why
micro-benchmarks are often used. Currently, the entire Cforall runtime is
written in Cforall (10,000+ lines of code (LOC)). This runtime is designed to
be thread safe, automatically detects the use of concurrency features at link
time, and bootstraps into a threaded runtime almost immediately at program
startup so threads can be declared as global variables and may run to
completion before the program main starts. The concurrent core of the runtime
is 3,500 LOC and bootstraps from low-level atomic primitives into Cforall locks
and high-level features. In other words, the concurrent core uses itself as
quickly as possible to start using high-level concurrency. There are 12,000+
LOC in the Cforall tests-suite used to verify language features, which are run
nightly. Of these, there are 2000+ LOC running standard concurrent tests, such
as aggressively testing each language feature, and classical examples such as
bounded buffer, dating service, matrix summation, quickSort, binary insertion
sort, etc.  More experience will be available soon, based on ongoing work in
the "future works" section. Specifically, non-blocking I/O is working with the
new Linux io_uring interface and a new high-performance ready-queue is under
construction to take into account this change. With non-blocking I/O, it will
be possible to write applications like high-performance web servers, as is now
done in Rust and Go. Also completed are Java-style executors for work-based
concurrent programming and futures. Under construction is a high-performance
actor system.


    It does not seem right to state that a stack is essential to Von Neumann
    architectures -- since the earliest Von Neumann machines (and indeed early
    Fortran) did not use one.

Reference Manual Fortran II for the IBM 704 Data Processing System, 1958 IBM, page 2
https://archive.computerhistory.org/resources/text/Fortran/102653989.05.01.acc.pdf

  Since a subprogram may call for other subprograms to any desired depth, a
  particular CALL statement may be defined by a pyramid of multi-level
  subprograms.

I think we may be differing on the meaning of stack. You may be imagining a
modern stack that grows and shrinks dynamically. Whereas early Fortran
preallocated a stack frame for each function, like Python allocates a frame for
a generator.  Within each preallocated Fortran function is a frame for local
variables and a slot to store the return address for a call.  The Fortran
call/return mechanism uses these frames to build a traditional call stack
linked by the return pointer. The only restriction is that a function stack
frame can only be used once, implying no direct or indirect recursion.  Hence,
without a stack mechanism, there can be no call/return to "any desired depth",
where the maximum desired depth is limited by the number of functions. So
call/return requires some form of a stack, virtually all programming language
have call/return, past and present, and these languages run on Von Neumann
machines that do not distinguish between program and memory space, have mutable
state, and the concept of a pointer to data or code.


    To elaborate on something another reviewer commented on: it is a surprise
    to find a "Future work" section *after* the "Conclusion" section. A
    "Conclusions and future work" section often works well.

Done.



Reviewing: 3

    but it remains really difficult to have a good sense of which idea I should
    use and when. This applies in different ways to different features from the
    language:

    * coroutines/generators/threads: here there is some discussion, but it can
      be improved.
    * internal/external scheduling: I didn't find any direct comparison between
      these features, except by way of example.

See changes below.


    I would have preferred something more like a table or a few paragraphs
    highlighting the key reasons one would pick one construct or the other.

Section 2.1 is changed to address this suggestion.


    The discussion of clusters and pre-emption in particular feels quite rushed.

We believe a brief introduction to the Cforall runtime structure is important
because clustering within a user-level runtime, as opposed to a distributed
system, is unusual. Furthermore, the explanation of preemption is important
because several newer languages, like Go and Rust tokio, are not preemptive.
Rust threads are preemptive only because Rust uses kernel threads, which UNIX
preempts.


    * Recommend to shorten the comparison on coroutine/generator/threads in
      Section 2 to a paragraph with a few examples, or possibly a table
      explaining the trade-offs between the constructs

Not done, see below.


    * Recommend to clarify the relationship between internal/external
      scheduling -- is one more general but more error-prone or low-level?

Done, see below.


    There is obviously a lot of overlap between these features, and in
    particular between coroutines and generators. As noted in the previous
    review, many languages have chosen to offer *only* generators, and to
    build coroutines by stacks of generators invoking one another.

Coroutines built from stacks of generators have problems, such as no symmetric
control-flow. Furthermore, stacks of generators have a problem with the
following programming pattern: logger is a library function called from an
ordinary function or from a coroutine, where the doit for the coroutine
suspends. With stacks of generators, there have to be both a function version
and a generator version of logger to support these two scenarios. If logger is
a library function, it may be impossible to create the generator version
because the logger function is opaque.

  #include <fstream.hfa>
  #include <coroutine.hfa>

  forall( otype T | { void doit( T & ); } )   // library function: opaque, cannot be rewritten as a generator
  void logger( T & t ) {
      doit( t );
  }

  coroutine C {};
  void main( C & c ) with( c ) {
      void doit( C & c ) { suspend; }   // coroutine version of doit suspends inside the library call
      logger( c );
  }
  void mem( C & c ) {
      resume( c );
  }

  int main() {
      C c;
      mem( c );   // first resume runs the coroutine main until the suspend inside doit
      mem( c );   // second resume restarts doit and finishes the coroutine main

      struct S {};
      S s;
      void doit( S & s ) {}   // plain-function version of doit does not suspend
      logger( s );            // same library logger used from an ordinary function
  }

Additional text has been added to the start of Section 3.2 addressing this
comment.


    In fact, the end of Section 2.1 (on page 5) contains a particular paragraph
    that embodies this "top down" approach. It starts, "programmers can now
    answer three basic questions", and thus gives some practical advice for
    which construct you should use and when. I think giving some examples of
    specific applications that this paragraph, combined with some examples of
    cases where each construct was needed, would be a better approach.

    I don't think this comparison needs to be very long. It seems clear enough
    that one would

    * prefer generators for simple computations that yield up many values,
    * prefer coroutines for more complex processes that have significant
      internal structure,

We do not believe the number of values yielded is an important factor in
choosing between a generator or a coroutine; either form can receive (input) or
return (output) millions of values. As well, simple versus complex computation
is not an absolute criterion for selection, as both forms can be simple or
complex.

As stated in the paper, a coroutine brings generality to the implementation
because of the additional stack, whereas generators have restrictions on
standard software-engineering practices: variable placement, no helper
functions without creating an explicit generator stack, and no symmetric
control-flow.  It is true that simple generators probably suffer less from
these restrictions. Also, simplicity means the amount of work done is small, so
the cost of suspend/resume becomes a factor, where experiments show
suspend/resume for a generator is 10 times faster than for a coroutine.


    * prefer threads for cases where parallel execution is desired or needed.

Agreed, but this description does not mention mutual exclusion and
synchronization, which is essential in any meaningful concurrent program.

Our point here is to illustrate that a casual "top down" explanation is
insufficient to explain the complexity of the underlying execution properties.
We presented some rules of thumb at the end of Section 2, but programmers must
understand all the underlying mechanisms and their interactions to exploit the
execution properties to their fullest, and to understand when a programming
language does or does not provide a desired mechanism.


    I did appreciate the comparison in Section 2.3 between async-await in
    JS/Java and generators/coroutines. I agree with its premise that those
    mechanisms are a poor replacement for generators (and, indeed, JS has a
    distinct generator mechanism, for example, in part for this reason).  I
    believe I may have asked for this in a previous review, but having read it,
    I wonder if it is really necessary, since those mechanisms are so different
    in purpose.

Given that you asked about this before, I believe other readers might also ask
the same question because async-await is very popular. So I think this section
does help to position the work in the paper among other work, and hence, it is
appropriate to keep it in the paper.


    I find the motivation for supporting both internal and external scheduling
    to be fairly implicit. After several reads through the section, I came to
    the conclusion that internal scheduling is more expressive than external
    scheduling, but sometimes less convenient or clear. Is this correct? If
    not, it'd be useful to clarify where external scheduling is more
    expressive.

    I would find it very interesting to try and capture some of the properties
    that make internal vs external scheduling the better choice.

    For example, it seems to me that external scheduling works well if there
    are only a few "key" operations, but that internal scheduling might be
    better otherwise, simply because it would be useful to have the ability to
    name a signal that can be referenced by many methods.

To address this point, the last paragraph on page 22 (now page 23) has been
augmented to the following:

 Given external and internal scheduling, what guidelines can a programmer use
 to select between them?  In general, external scheduling is easier to
 understand and code because only the next logical action (mutex function(s))
 is stated, and the monitor implicitly handles all the details.  Therefore,
 there are no condition variables, and hence, no wait and signal, which reduces
 coding complexity and synchronization errors.  If external scheduling is
 simpler than internal, why not use it all the time?  Unfortunately, external
 scheduling cannot be used if: scheduling depends on parameter value(s) or
 scheduling must block across an unknown series of calls on a condition
 variable (\ie internal scheduling).  For example, the dating service cannot be
 written using external scheduling.  First, scheduling requires knowledge of
 calling parameters to make matching decisions and parameters of calling
 threads are unavailable within the monitor.  Specifically, a thread within the
 monitor cannot examine the @ccode@ of threads waiting on the calling queue to
 determine if there is a matching partner.  (Similarly, if the bounded buffer
 or readers/writer are restructured with a single interface function with a
 parameter denoting producer/consumer or reader/writer, they cannot be solved
 with external scheduling.)  Second, a scheduling decision may be delayed
 across an unknown number of calls when there is no immediate match so the
 thread in the monitor must block on a condition.  Specifically, if a thread
 determines there is no opposite calling thread with the same @ccode@, it must
 wait an unknown period until a matching thread arrives.  For complex
 synchronization, both external and internal scheduling can be used to take
 advantage of the best properties of each.


    Consider the bounded buffer from Figure 13: if it had multiple methods for
    removing elements, and not just `remove`, then the `waitfor(remove)` call
    in `insert` might not be sufficient.

Section 6.4 Extended waitfor shows the waitfor is very powerful and can handle
your request:

  waitfor( remove : buffer ); or waitfor( remove2 : buffer );

and its shorthand form (not shown in the paper)

  waitfor( remove, remove2 : buffer );

A call to either of these remove functions satisfies the waitfor (exact
selection details are discussed in Section 6.4).


    The same is true, I think, of the `signal_block` function, which I
    have not encountered before;

In Tony Hoare's seminal paper on Monitors "Monitors: An Operating System
Structuring Concept", it states on page 551:

 When a process signals a condition on which another process is waiting, the
 signalling process must wait until the resumed process permits it to
 proceed. We therefore introduce for each monitor a second semaphore "urgent"
 (initialized to 0), on which signalling processes suspend themselves by the
 operation P(urgent).

Hence, the original definition of signal is in fact signal_block, i.e., the
signaller blocks. Later monitor implementations switched to nonblocking
signalling because most signals occur immediately before a return, which allows
the signaller to continue execution, exit the monitor, and run concurrently
with the signalled thread that restarts in the monitor. When the signaller is
not going to exit immediately, signal_block is appropriate.
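
As a sketch of the two semantics (names hypothetical, in the style of the
paper's monitors):

  monitor M { condition c; };
  void waiter( M & mutex m ) with( m ) {
      wait( c );            // block until signalled
  }
  void nonblocking_signaller( M & mutex m ) with( m ) {
      signal( c );          // signaller continues; the waiter restarts after the signaller exits
  }
  void blocking_signaller( M & mutex m ) with( m ) {
      signal_block( c );    // waiter runs immediately; the signaller resumes afterwards
  }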


    it seems like its behavior can be modeled with multiple condition
    variables, but that's clearly more complex.

Yes. Buhr, Fortier and Coffin show in Monitor Classification, ACM Computing
Surveys, 27(1):63-107, that all extant monitors with different signalling
semantics can be transformed into each other. However, some transformations are
complex and runtime expensive.


    One question I had about `signal_block`: what happens if one signals
    but no other thread is waiting? Does it block until some other thread
    waits? Or is that user error?

On page 20, it states:

  Signalling is unconditional because signalling an empty condition queue does
  nothing.

To the best of our knowledge, all monitors have the same semantics for
signalling an empty condition queue, regardless of the kind of signal, i.e.,
signal or signal_block.


    I believe that one difference between the Go program and the Cforall
    equivalent is that the Goroutine has an associated queue, so that
    multiple messages could be enqueued, whereas the Cforall equivalent is
    effectively a "bounded buffer" of length 1. Is that correct?

Actually, the buffer length is 0 for the Cforall call and the Go unbuffered
send, so both are synchronous communication.


    I think this should be stated explicitly. (Presumably, one could modify the
    Cforall program to include an explicit vector of queued messages if
    desired, but you would also be reimplementing the channel abstraction.)

Fixed, by adding the following sentences:

  The difference between call and channel send occurs for buffered channels,
  making the send asynchronous.  In \CFA, asynchronous call and multiple
  buffers are provided using administrator and worker
  threads~\cite{Gentleman81} and/or futures (not discussed).


    Also, in Figure 20, I believe that there is a missing `mutex` keyword.
    
Fixed.


    I was glad to see that the paper acknowledged that Cforall still had
    low-level atomic operations, even if their use is discouraged in favor of
    higher-level alternatives.

There was never an attempt to avoid acknowledging that Cforall has low-level
atomic operations. The original version of the paper stated:

  6.6 Low-level Locks
  For completeness and efficiency, Cforall provides a standard set of low-level
  locks: recursive mutex, condition, semaphore, barrier, etc., and atomic
  instructions: fetchAssign, fetchAdd, testSet, compareSet, etc.

and that section is still in the paper. In fact, we use these low-level
mechanisms to build all of the high-level concurrency constructs in Cforall.
  

    However, I still feel that the conclusion overstates the value of the
    contribution here when it says that "Cforall high-level race-free monitors
    and threads provide the core mechanisms for mutual exclusion and
    synchronization, without the need for volatile and atomics". I feel
    confident that Java programmers, for example, would be advised to stick
    with synchronized methods whenever possible, and it seems to me that they
    offer similar advantages -- but they sometimes wind up using volatiles for
    performance reasons.

I think we are agreeing violently. 99.9% of Java/Cforall/Go/Rust concurrent
programs can achieve very good performance without volatile or atomics because
the runtime system has already used these low-level capabilities to build an
efficient set of high-level concurrency constructs.

0.1% of the time programmers need to build their own locks and synchronization
mechanisms. This need also occurs for storage management. Both of these
mechanisms are allowed in Cforall but are fraught with danger and should be
discouraged. Specifically, it takes a 7th-dan black-belt programmer to
understand fencing for a WSO memory model, such as on the ARM. It doesn't help
that the C++ atomics are baroque and incomprehensible. I'm sure Hans Boehm,
Doug Lea, Dave Dice, and I would agree that 99% of hand-crafted locks created
by programmers are broken and/or non-portable.


    I was also confused by the term "race-free" in that sentence. In
    particular, I don't think that Cforall has any mechanisms for preventing
    *data races*, and it clearly doesn't prevent "race conditions" (which would
    bar all sorts of useful programs). I suppose that "race free" here might be
    referring to the improvements such as removing barging behavior.

We use the term "race free" to mean the same as Boehm/Adve's "data-race
freedom" in

  Boehm Hans-J., Adve Sarita V., You Don't Know Jack About Shared Variables or
  Memory Models. Communications ACM. 2012; 55(2):48-54.
  https://queue.acm.org/detail.cfm?id=2088916

which is cited in the paper. Furthermore, we never said that Cforall has
mechanisms for preventing *all* data races. We said Cforall high-level
race-free monitors and threads[, when used with mutex access functions] (added
to the paper) imply no data races within these constructs, unless a programmer
directly publishes shared state. This approach is exactly what
Boehm/Adve advocate for the vast majority of concurrent programming.


    It would perhaps be more interesting to see a comparison built using [tokio] or
    [async-std], two of the more prominent user-space threading libraries that
    build on Rust's async-await feature (which operates quite differently than
    Javascript's async-await, in that it doesn't cause every aync function call to
    schedule a distinct task).

Done.


    Several figures used the `with` keyword. I deduced that `with(foo)` permits
    one to write `bar` instead of `foo.bar`. It seems worth introducing.
    Apologies if this is stated in the paper, if so I missed it.

Page 6, footnote F states:

  The Cforall "with" clause opens an aggregate scope making its fields directly
  accessible, like Pascal "with", but using parallel semantics; multiple
  aggregates may be opened.
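
For example (a minimal sketch):

  struct S { int i, j; };
  void f( S & s ) with( s ) {   // open the scope of s
      i = 1;  j = 2;            // equivalent to s.i = 1; s.j = 2;
  }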


    On page 20, section 6.3, "external scheduling and vice versus" should be
    "external scheduling and vice versa".

Fixed.


    On page 5, section 2.3, the paper states "we content" but it should be "we
    contend".

Fixed.


    Page 1. I don't believe that it s fair to imply that Scala is "research
    vehicle" as it is used by major players, Twitter being the most prominent
    example.

Fixed. Removed "research".


    Page 15. Must Cforall threads start after construction (e.g. see your
    example on page 15, line 21)?

Yes. Our experience in Java, uC++ and Cforall is that 95% of the time
programmers want threads to start immediately. (Most Java programs have no code
between a thread declaration and the call to start the thread.)  Therefore,
this semantic should be the default because (see page 13):

  Alternatives, such as explicitly starting threads as in Java, are repetitive
  and forgetting to call start is a common source of errors.

To handle the other 5% of the cases, there is a trivial Cforall pattern
providing Java-style start/join. The additional cost for this pattern is small
in comparison to the work performed between the start and join.

  thread T {};
  void start( T & mutex ) {} // any function name
  void join( T & mutex ) {} // any function name
  void main( T & t ) {
      sout | "start";
      waitfor( start : t ); // wait to be started
      sout | "restart"; // perform work
      waitfor( join : t ); // wait to be joined
      sout | "join";
  }

  int main() {
      T t[3]; // threads start and delay
      sout | "need to start";
      for ( i; 3 ) start( t[i] );
      sout | "need to join";
      for ( i; 3 ) join( t[i] );
      sout | "threads stopped";
  } // threads deleted from stack

  $ a.out
  need to start
  start
  start
  start
  need to join
  restart
  restart
  restart
  threads stopped
  join
  join
  join


    Page 18, line 17: is using

Fixed.
