Index: doc/papers/concurrency/response2
===================================================================
--- doc/papers/concurrency/response2	(revision 04b4a7113ee67169f02e5f1dd053025c3ef887de)
+++ doc/papers/concurrency/response2	(revision 04b4a7113ee67169f02e5f1dd053025c3ef887de)
@@ -0,0 +1,1008 @@
+Reviewing: 1
+
+    I still have a couple of issues --- perhaps the largest is that it's
+    still not clear at this point in the paper what some of these options
+    are, or crucially how they would be used. I don't know if it's
+    possible to give high-level examples or use cases to be clear about
+    these up front - or if that would duplicate too much information from
+    later in the paper - either way expanding out the discussion - even if
+    just two a couple of sentences for each row - would help me more.
+
+Section 2.1 is changed to address this suggestion.
+
+
+    * 1st para section 2 begs the question: why not support each
+      dimension independently, and let the programmer or library designer
+      combine features?
+
+As seen in Table 1, not all of the combinations work, and having programmers
+directly use these low-level mechanisms is error prone. Accessing these
+fundamental mechanisms through higher-level constructs has always been the
+purpose of a programming language.
+
+
+    * Why must there "be language mechanisms to create, block/unblock, and join
+      with a thread"?  There aren't in Smalltalk (although there are in the
+      runtime).  Especially given in Cforall those mechanisms are *implicit* on
+      thread creation and destruction?
+
+The best description of Smalltalk concurrency I can find is in J. Hunt,
+Smalltalk and Object Orientation, Springer-Verlag London Limited, 1997, Chapter
+31 Concurrency in Smalltalk. It states on page 332:
+
+  For a process to be spawned from the current process there must be some way
+  of creating a new process. This is done using one of four messages to a
+  block. These messages are:
+
+    aBlock fork: This creates and schedules a process which will execute the
+    block. The priority of this process is inherited from the parent process.
+    ...
+
+  The Semaphore class provides facilities for achieving simple synchronization,
+  it is simple because it only allows for two forms of communication signal and
+  wait.
+
+Hence, "aBlock fork" creates, "Semaphore" blocks/unblocks (as does message send
+to an aBlock object), and garbage collection of an aBlock joins with its
+thread. The fact that a programmer *implicitly* does "fork", "block"/"unblock",
+and "join", does not change their fundamental requirement.
+
+
+   * "Case 1 is a function that borrows storage for its state (stack
+     frame/activation) and a thread from its invoker"
+  
+     this much makes perfect sense to me, but I don't understand how a
+     non-stateful, non-threaded function can then retain
+  
+     "this state across callees, ie, function local-variables are
+     retained on the stack across calls."
+  
+     how can it retain function-local values *across calls* when it
+     doesn't have any functional-local state?
+
+In the following example:
+
+  void foo() {
+     // local variables and code
+  }
+  void bar() {
+     // local variables
+     foo();
+  }
+
+bar is the caller and foo is the callee. bar borrows the program stack and
+thread to make the call to foo. When foo, the callee, is executing, bar's local
+variables (state) are retained on the *borrowed* stack across the call. (Note, I
+added *borrowed* to that sentence in the paper to help clarify.)  Furthermore,
+foo's local variables are also retained on the borrowed stack. When foo and bar
+return, all of their local state is gone (not retained). This behaviour is
+standard call/return semantics in an imperative language.
+
+
+     I'm not sure if I see two separate cases here - roughly equivalent
+     to C functions without static storage, and then C functions *with*
+     static storage.
+
+Yes, but there is only one instance of the static storage across all
+activations of the C function. For generators and coroutines, each instance has
+its own state, like an object in an OO language.
+
+
+     I assumed that was the distinction between cases 1 & 3; but perhaps the
+     actual distinction is that 3 has a suspend/resume point, and so the
+     "state" in figure 1 is this component of execution state (viz figs 1 & 2),
+     not the state representing the cross-call variables?
+
+So case 3 is like an object with the added ability to retain where it was last
+executing.  When a generator is resumed, the generator object (structure
+instance) is passed as an explicit reference, and within this object is the
+restart location in the generator's "main". When the generator main executes,
+it uses the borrowed stack for its local variables and any functions it calls,
+just like an object member borrows the stack for its local variables but also
+has an implicit receiver to the object state.  A generator can have static
+storage, too, which is a single instance across all generator instances of that
+type, as for static storage in an object type. All the kinds of storage are
+at play with semantics that is virtually the same as in other languages.
+
+
+    > but such evaluation isn't appropriate for garbage-collected or JITTed
+    > languages like Java or Go.
+    For JITTed languages in particular, reporting peak performance needs to
+    "warm up" the JIT with a number of iterators before beginning
+    measurement. Actually for JIT's its even worse: see Edd Barrett et al
+    OOPSLA 2017.
+   
+Of our testing languages, only Java is JITTED. To ensure the Java test-programs
+correctly measured the specific feature, we consulted with Dave Dice at Oracle
+who works directly on the development of the Oracle JVM Just-in-Time
+Compiler. We modified our test programs based on his advice, and he validated
+our programs as correctly measuring the specified language feature. Hence, we
+have taken into account all issues related to performing benchmarks in JITTED
+languages.  Dave's help is recognized in the Acknowledgment section. Also, all
+the benchmark programs are publicly available for independent verification.
+
+
+   * footnote A - I've looked at various other papers & the website to try to
+     understand how "object-oriented" Cforall is - I'm still not sure.  This
+     footnote says Cforall has "virtuals" - presumably virtual functions,
+     i.e. dynamic dispatch - and inheritance: that really is OO as far as I
+     (and most OO people) are concerned.  For example Haskell doesn't have
+     inheritance, so it's not OO; while CLOS (the Common Lisp *Object* System)
+     or things like Cecil and Dylan are considered OO even though they have
+     "multiple function parameters as receivers", lack "lexical binding between
+     a structure and set of functions", and don't have explicit receiver
+     invocation syntax.  Python has receiver syntax, but unlike Java or
+     Smalltalk or C++, method declarations still need to have an explicit
+     "self" receiver parameter.  Seems to me that Go, for example, is
+     more-or-less OO with interfaces, methods, and dynamic dispatch (yes also
+     and an explicit receiver syntax but that's not determinative); while Rust
+     lacks dynamic dispatch built-in.  C is not OO as a language, but as you
+     say given it supports function pointers with structures, it does support
+     an OO programming style.
+   
+     This is why I again recommend just not buying into this fight: not making
+     any claims about whether Cforall is OO or is not - because as I see it,
+     the rest of the paper doesn't depend on whether Cforall is OO or not.
+     That said: this is just a recommendation, and I won't quibble over this
+     any further.
+
+We believe it is important to identify Cforall as a non-OO language because it
+heavily influences the syntax and semantics used to build its concurrency.
+Since many aspects of Cforall are not OO, the rest of the paper *does* depend
+on Cforall being identified as non-OO, otherwise readers would have
+significantly different expectations for the design. We believe your definition
+of OO is too broad, such as including C. Just because a programming language
+can support aspects of the OO programming style, does not make it OO. (Just
+because a duck can swim does not make it a fish.)
+
+Our definition of non-OO follows directly from the Wikipedia entry:
+
+  Object-oriented programming (OOP) is a programming paradigm based on the
+  concept of "objects", which can contain data, in the form of fields (often
+  known as attributes or properties), and code, in the form of procedures
+  (often known as methods). A feature of objects is an object's procedures that
+  can access and often modify the data fields of the object with which they are
+  associated (objects have a notion of "this" or "self").
+  https://en.wikipedia.org/wiki/Object-oriented_programming
+
+Cforall fails this definition as code cannot appear in an "object" and there is
+no implicit receiver. As well, Cforall, Go, and Rust do not have nominal
+inheritance and they are not considered OO languages, e.g.:
+
+ "**Is Go an object-oriented language?** Yes and no. Although Go has types and
+ methods and allows an object-oriented style of programming, there is no type
+ hierarchy. The concept of "interface" in Go provides a different approach
+ that we believe is easy to use and in some ways more general. There are also
+ ways to embed types in other types to provide something analogous-but not
+ identical-to subclassing. Moreover, methods in Go are more general than in
+ C++ or Java: they can be defined for any sort of data, even built-in types
+ such as plain, "unboxed" integers. They are not restricted to structs (classes)."
+ https://golang.org/doc/faq#Is_Go_an_object-oriented_language
+
+
+   * is a "monitor function" the same as a "mutex function"?
+     if so the paper should pick one term; if not, make the distinction clear.
+
+Fixed. Picked "mutex". Changed the language and all places in the paper.
+
+
+   * "As stated on line 1 because state declarations from the generator
+      type can be moved out of the coroutine type into the coroutine main"
+  
+      OK sure, but again: *why* would a programmer want to do that?
+      (Other than, I guess, to show the difference between coroutines &
+      generators?)  Perhaps another way to put this is that the first
+      para of 3.2 gives the disadvantages of coroutines vs-a-vs
+      generators, briefly describes the extended semantics, but never
+      actually says why a programmer may want those extended semantics,
+      or how they would benefit.  I don't mean to belabour the point,
+      but (generalist?) readers like me would generally benefit from
+      those kinds of discussions about each feature throughout the
+      paper: why might a programmer want to use them?
+
+On page 8, it states:
+
+  Having to manually create the generator closure by moving local-state
+  variables into the generator type is an additional programmer burden (removed
+  by the coroutine in Section 3.2). ...
+
+Also, these variables can now be refactored into helper functions, where a
+helper function suspends at arbitrary call depth. So imagine a coroutine helper
+function that is only called occasionally within the coroutine but it has a
+large array that is retained across suspends within the helper function. For a
+generator, this large array has to be declared in the generator type, enlarging
+each generator instance even though the array is only used occasionally.
+In contrast, the coroutine only needs the array allocated when needed. Now a
+coroutine has a stack which occupies storage, but the maximum stack size only
+needs to be the call chain allocating the most storage, whereas the generator
+has a maximum size of all variables that could be created.
+
+
+    > p17 if the multiple-monitor entry procedure really is novel, write a paper
+    > about that, and only about that.
+    > We do not believe this is a practical suggestion.
+    * I'm honestly not trying to be snide here: I'm not an expert on monitor or
+      concurrent implementations. Brinch Hansen's original monitors were single
+      acquire; this draft does not cite any other previous work that I could
+      see. I'm not suggesting that the brief mention of this mechanism
+      necessarily be removed from this paper, but if this is novel (and a clear
+      advance over a classical OO monitor a-la Java which only acquires the
+      distinguished receiver) then that would be worth another paper in itself.
+
+First, to explain multiple-monitor entry in Cforall as a separate paper would
+require significant background on Cforall concurrency, which means repeating
+large sections of this paper. Second, it feels like a paper just on
+multiple-monitor entry would be a little thin, even if the capability is novel
+to the best of our knowledge. Third, we feel multiple-monitor entry springs
+naturally from the overarching tone in the paper that Cforall is a non-OO
+programming language, allowing multiple-mutex receivers.
+
+
+    My typo: the paper's conclusion should come at the end, after the
+    future work section.
+
+Combined into a Conclusions and Future Work section.
+
+
+
+Reviewing: 2
+
+    on the Boehm paper and whether code is "all sequential to the compiler": I
+    now understand the authors' position better and suspect we are in violent
+    agreement, except for whether it's appropriate to use the rather breezy
+    phrase "all sequential to the compiler". It would be straightforward to
+    clarify that code not using the atomics features is optimized *as if* it
+    were sequential, i.e. on the assumption of a lack of data races.
+
+Fixed, "as inline and library code is compiled as sequential without any
+explicit concurrent directive."
+
+
+    on the distinction between "mutual exclusion" and "synchronization": the
+    added citation does help, in that it makes a coherent case for the
+    definition the authors prefer. However, the text could usefully clarify
+    that this is a matter of definition not of fact, given especially that in
+    my assessment the authors' preferred definition is not the most common
+    one. (Although the mention of Hoare's apparent use of this definition is
+    one data point, countervailing ones are found in many contemporaneous or
+    later papers, e.g. Habermann's 1972 "Synchronization of Communicating
+    Processes" (CACM 15(3)), Reed & Kanodia's 1979 "Synchronization with
+    eventcounts and sequencers" (CACM (22(2)) and so on.)
+
+Fixed, "We contend these two properties are independent, ...".
+
+With respect to the two papers, Habermann fundamentally agrees with our
+definitions, where the term mutual exclusion is the same but Habermann uses the
+term "communication" for our synchronization. However, the term "communication"
+is rarely used to mean synchronization. The fact that Habermann collectively
+calls these two mechanisms synchronization is the confusion.  Reed & Kanodia
+state:
+
+  By mutual exclusion we mean any mechanism that forces the time ordering of
+  execution of pieces of code, called critical regions, in a system of
+  concurrent processes to be a total ordering.
+
+But there is no timing order for a critical region (which I assume means
+critical section); threads can arrive at any time in any order, where the
+mutual exclusion for the critical section ensures one thread executes at a
+time. Interestingly, Reed & Kanodia's mutual exclusion is Habermann's
+communication not mutual exclusion. These papers only buttress our contention
+about the confusion of these terms in the literature.
+
+
+    section 2 (an expanded version of what was previously section 5.9) lacks
+    examples and is generally obscure and allusory ("the most advanced feature"
+    -- name it! "in triplets" -- there is only one triplet!;
+
+Fixed.
+
+ These independent properties can be used to compose different language
+ features, forming a compositional hierarchy, where the combination of all
+ three is the most advanced feature, called a thread/task/process. While it is
+ possible for a language to only provide threads for composing
+ programs~\cite{Hermes90}, this unnecessarily complicates and makes inefficient
+ solutions to certain classes of problems.
+
+
+    what are "execution locations"? "initialize" and "de-initialize"
+
+Fixed through an example at the end of the sentence.
+
+ \item[\newterm{execution state}:] is the state information needed by a
+ control-flow feature to initialize, manage compute data and execution
+ location(s), and de-initialize, \eg calling a function initializes a stack
+ frame including contained objects with constructors, manages local data in
+ blocks and return locations during calls, and de-initializes the frame by
+ running any object destructors and management operations.
+
+
+    what? "borrowed from the invoker" is a concept in need of explaining or at
+    least a fully explained example -- in what sense does a plain function
+    "borrow" its stack frame?
+
+A function has no storage except for static storage; it gets its storage from
+the program stack. For a function to run it must borrow storage from somewhere.
+When the function returns, the borrowed storage is returned. Do you have a more
+appropriate word?
+
+    "computation only" as opposed to what?
+
+This is a term in concurrency for operations that compute without blocking,
+i.e., the operation starts with everything it needs to compute its result and
+runs to completion, blocking only when it is done and returns its result.
+
+
+    in 2.2, in what way is a "request" fundamental to "synchronization"?
+
+I assume you are referring to the last bullet.
+
+ Synchronization must be able to control the service order of requests
+ including prioritizing selection from different kinds of outstanding requests,
+ and postponing a request for an unspecified time while continuing to accept
+ new requests.
+
+Habermann states on page 173
+
+ Looking at deposit we see that sender and receiver should be synchronized with
+ respect to buffer overflow: deposit of another message must be delayed if
+ there is no empty message frame, and such delay should be removed when the
+ first empty frame becomes available.  This is programmed as
+ synchronization(frame) : deposit is preceded by wait(frame), accept is
+ followed by signal( frame), and the constant C[frame] is set equal to
+ "bufsize."
+
+Here synchronization is controlling the service order of requests: when the
+buffer is full, requests for insert (sender) are postponed until a request to
+remove (receiver) occurs.  Without the ability to control service order among
+requests, the producer/consumer problem with a bounded buffer cannot be
+solved. Hence, this capability is fundamental.
+
+
+    and the "implicitly" versus "explicitly" point needs stating as elsewhere,
+    with a concrete example e.g. Java built-in mutexes versus
+    java.util.concurrent).
+
+Fixed.
+
+ MES must be available implicitly in language constructs, \eg Java built-in
+ monitors, as well as explicitly for specialized requirements, \eg
+ @java.util.concurrent@, because requiring programmers to build MES using
+ low-level locks often leads to incorrect programs.
+
+
+    section 6: 6.2 omits the most important facts in preference for otherwise
+    inscrutable detail: "identify the kind of parameter" (first say *that there
+    are* kinds of parameter, and what "kinds" means!); "mutex parameters are
+    documentation" is misleading (they are also semantically significant!) and
+    fails to say *what* they mean; the most important thing is surely that
+    'mutex' is a language feature for performing lock/unlock operations at
+    function entry/exit. So say it!
+
+These sections have been rewritten to address the comments.
+
+
+    The meanings of examples f3 and f4 remain unclear.
+
+Rewrote the paragraph.
+
+
+    Meanwhile in 6.3, "urgent" is not introduced (we are supposed to infer its
+    meaning from Figure 12,
+
+Defined Hoare's urgent list at the start of the paragraph.
+
+
+    but that Figure is incomprehensible to me), and we are told of "external
+    scheduling"'s long history in Ada but not clearly what it actually means;
+
+We do not know how to address your comment because the description is clear to
+us and other non-reviewers who have read the paper. I had forgotten an
+important citation to prior work on this topic, which is now referenced:
+
+   Buhr Peter A., Fortier Michel, Coffin Michael H.. Monitor
+   Classification. ACM Computing Surveys. 1995; 27(1):63-107.
+
+This citation is a 45-page paper expanding on the topic of internal scheduling.
+I also added a citation to Hoare's monitor paper where signal_block is defined:
+
+  When a process signals a condition on which another process is waiting, the
+  signalling process must wait until the resumed process permits it to
+  proceed.
+
+External scheduling from Ada was subsequently added to uC++ and Cforall.
+Furthermore, Figure 20 shows a direct comparison of CFA procedure-call
+external-scheduling with Go channel external-scheduling. So external scheduling
+exists in languages beyond Ada.
+
+
+    6.4's description of "waitfor" tells us it is different from an if-else
+    chain but tries to use two *different* inputs to tell us that the behavior
+    is different; tell us an instance where *the same* values of C1 and C2 give
+    different behavior (I even wrote out a truth table and still don't see the
+    semantic difference)
+
+Again, it is unclear what the problem is. For the if-statement, if C1 is true,
+it only waits for a call to mem1, even if C2 is true and there is an
+outstanding call to mem2. For the waitfor, if both C1 and C2 are true and there
+is a call to mem2 but not mem1, it accepts the call to mem2, which cannot
+happen with the if-statement. So for all true when clauses, any outstanding
+call is immediately accepted. If there are no outstanding calls, the waitfor
+blocks until the next call to any of the true when clauses occurs. I added the
+following sentence to further clarify.
+
+ Hence, the @waitfor@ has parallel semantics, accepting any true @when@ clause.
+
+Note, the same parallel semantics exists for the Go select statement with
+respect to waiting for a set of channels to receive data. While Go select does
+not have a when clause, it would be trivial to add it, making the select more
+expressive.
+
+
+    The authors frequently use bracketed phrases, and sometimes slashes "/", in
+    ways that are confusing and/or detrimental to readability.  Page 13 line
+    2's "forward (backward)" is one particularly egregious example.  In general
+    I would recommend the the authors try to limit their use of parentheses and
+    slashes as a means of forcing a clearer wording to emerge.
+
+Many of the slashes and parentheticals have been removed. Some are retained to
+express joined concepts: call/return, suspend/resume, resume/resume, I/O.
+
+
+    Also, the use of "eg." is often cursory and does not explain the examples
+    given, which are frequently a one- or two-word phrase of unclear referent.
+
+A few of these are fixed.
+
+
+    Considering the revision more broadly, none of the more extensive or
+    creative rewrites I suggested in my previous review have been attempted,
+    nor any equivalent efforts to improve its readability.
+
+If you reread the previous response, we addressed all of your suggestions except
+
+        An expositional idea occurs: start the paper with a strawman
+        naive/limited realisation of coroutines -- say, Simon Tatham's popular
+        "Coroutines in C" web page -- and identify point by point what the
+        limitations are and how C\/ overcomes them. Currently the presentation
+        is often flat (lacking motivating contrasts) and backwards (stating
+        solutions before problems). The foregoing approach might fix both of
+        these.
+    
+    We prefer the current structure of our paper and believe the paper does
+    explain basic coding limitations and how they are overcome in using
+    high-level control-flow mechanisms.
+
+and we have addressed readability issues in this version.
+
+
+    The hoisting of the former section 5.9 is a good idea, but the newly added
+    material accompanying it (around Table 1) suffers fresh deficiencies in
+    clarity. Overall the paper is longer than before, even though (as my
+    previous review stated), I believe a shorter paper is required in order to
+    serve the likely purpose of publication. (Indeed, the authors' letter
+    implies that a key goal of publication is to build community and gain
+    external users.)
+
+This comment is the referee's opinion, with which we do not agree. Our
+previous 35-page SP&E paper on Cforall:
+
+  Moss Aaron, Schluntz Robert, Buhr Peter A.. Cforall: Adding Modern
+  Programming Language Features to C. Softw. Pract. Exper. 2018;
+  48(12):2111-2146.
+
+has a similar structure and style to this paper, and it received an award from
+John Wiley & Sons:
+
+ Software: Practice & Experience for articles published between January 2017
+ and December 2018, "most downloads in the 12 months following online
+ publication showing the work generated immediate impact and visibility,
+ contributing significantly to advancement in the field".
+
+So we have demonstrated an ability to build community and gain external users.
+
+  
+    Given this trajectory, I no longer see a path to an acceptable revision of
+    the present submission. Instead I suggest the authors consider splitting
+    the paper in two: one half about coroutines and stack management, the other
+    about mutexes, monitors and the runtime. (A briefer presentation of the
+    runtime may be helpful in the first paper also, and a brief recap of the
+    generator and coroutine support is obviously needed in the second too.)
+
+Again we disagree with the referee's suggestion to vastly restructure the
+paper. What advantage is there in presenting exactly the same material across
+two papers, which will likely end up longer than a single paper? The current
+paper has a clear theme that fundamental execution properties generate a set of
+basic language mechanisms, and we then proceed to show how these mechanisms can
+be designed into the programming language Cforall.
+
+
+    I do not buy the authors' defense of the limited practical experience or
+    "non-micro" benchmarking presented. Yes, gaining external users is hard and
+    I am sympathetic on that point. But building something at least *somewhat*
+    substantial with your own system should be within reach, and without it the
+    "practice and experience" aspects of the work have not been explored.
+    Clearly C\/ is the product of a lot of work over an extended period, so it
+    is a surprise that no such experience is readily available for inclusion.
+
+Understood. There are no agreed-upon concurrency benchmarks, which is why
+micro-benchmarks are often used. Currently, the entire Cforall runtime is
+written in Cforall (10,000+ lines of code (LOC)). This runtime is designed to
+be thread safe, automatically detects the use of concurrency features at link
+time, and bootstraps into a threaded runtime almost immediately at program
+startup so threads can be declared as global variables and may run to
+completion before the program main starts. The concurrent core of the runtime
+is 3,500 LOC and bootstraps from low-level atomic primitives into Cforall locks
+and high-level features. In other words, the concurrent core uses itself as
+quickly as possible to start using high-level concurrency. There are 12,000+
+LOC in the Cforall test suite used to verify language features, which are run
+nightly. Of these, there are 2000+ LOC running standard concurrent tests, such
+as aggressively testing each language feature, and classical examples such as
+bounded buffer, dating service, matrix summation, quickSort, binary insertion
+sort, etc.  More experience will be available soon, based on ongoing work in
+the "future works" section. Specifically, non-blocking I/O is working with the
+new Linux io_uring interface and a new high-performance ready-queue is under
+construction to take into account this change. With non-blocking I/O, it will
+be possible to write applications like high-performance web servers, as is now
+done in Rust and Go. Also completed are Java-style executors for work-based
+concurrent programming and futures. Under construction is a high-performance
+actor system.
+
+
+    It does not seem right to state that a stack is essential to Von Neumann
+    architectures -- since the earliest Von Neumann machines (and indeed early
+    Fortran) did not use one.
+
+Reference Manual Fortran II for the IBM 704 Data Processing System, 1958 IBM, page 2
+https://archive.computerhistory.org/resources/text/Fortran/102653989.05.01.acc.pdf
+
+  Since a subprogram may call for other subprograms to any desired depth, a
+  particular CALL statement may be defined by a pyramid of multi-level
+  subprograms.
+
+I think we may be differing on the meaning of stack. You may be imagining a
+modern stack that grows and shrinks dynamically. In contrast, early Fortran
+preallocated a stack frame for each function, like Python allocates a frame for
+a generator.  Within each preallocated Fortran function is a frame for local
+variables and a pointer to store the return value for a call.  The Fortran
+call/return mechanism then uses these frames to build a traditional call stack
+linked by the return pointer. The only restriction is that a function stack
+frame can only be used once, implying no direct or indirect recursion.  Hence,
+without a stack mechanism, there can be no call/return to "any desired depth",
+where the maximum desired depth is limited by the number of functions. So
+call/return requires some form of a stack, virtually all programming languages
+have call/return, past and present, and these languages run on Von Neumann
+machines that do not distinguish between program and memory space, have mutable
+state, and the concept of a pointer to data or code.
+
+
+    To elaborate on something another reviewer commented on: it is a surprise
+    to find a "Future work" section *after* the "Conclusion" section. A
+    "Conclusions and future work" section often works well.
+
+Done.
+
+
+
+Reviewing: 3
+
+    but it remains really difficult to have a good sense of which idea I should
+    use and when. This applies in different ways to different features from the
+    language:
+
+    * coroutines/generators/threads: here there is some discussion, but it can
+      be improved.
+    * interal/external scheduling: I didn't find any direct comparison between
+      these features, except by way of example.
+
+See changes below.
+
+
+    I would have preferred something more like a table or a few paragraphs
+    highlighting the key reasons one would pick one construct or the other.
+
+Section 2.1 is changed to address this suggestion.
+
+
+    The discussion of clusters and pre-emption in particular feels quite rushed.
+
+We believe a brief introduction to the Cforall runtime structure is important
+because clustering within a user-level versus distributed system is unusual.
+Furthermore, the explanation of preemption is important because several new
+languages, like Go and Rust tokio, are not preemptive. Rust threads are
+preemptive only because they use kernel threads, which UNIX preempts.
+
+
+    * Recommend to shorten the comparison on coroutine/generator/threads in
+      Section 2 to a paragraph with a few examples, or possibly a table
+      explaining the trade-offs between the constructs
+
+Not done, see below.
+
+
+    * Recommend to clarify the relationship between internal/external
+      scheduling -- is one more general but more error-prone or low-level?
+
+Done, see below.
+
+
+    There is obviously a lot of overlap between these features, and in
+    particular between coroutines and generators. As noted in the previous
+    review, many languages have chosen to offer *only* generators, and to
+    build coroutines by stacks of generators invoking one another.
+
+As we point out, coroutines built from stacks of generators have problems, such
+as no symmetric control-flow. Furthermore, stacks of generators have a problem
+with the following programming pattern.  logger is a library function called
+from a function or a coroutine, where the doit for the coroutine suspends. With
+stacks of generators, there has to be a function and generator version of
+logger to support these two scenarios. If logger is a library function, it may
+be impossible to create the generator logger because the logger function is
+opaque.
+
+  #include <fstream.hfa>
+  #include <coroutine.hfa>
+
+  forall( otype T | { void doit( T ); } )
+  void logger( T & t ) {
+      doit( t );
+  }
+
+  coroutine C {};
+  void main( C & c ) with( c ) {
+      void doit( C & c ) { suspend; }
+      logger( c );
+  }
+  void mem( C & c ) {
+      resume( c );
+  }
+
+  int main() {
+      C c;
+      mem( c );
+      mem( c );
+  
+      struct S {};
+      S s;
+      void doit( S & s ) {}
+      logger( s );
+  }
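+
+The same two-version problem can be sketched with Python generators, which are
+stackless. This is an illustrative sketch only; all names below are
+hypothetical and it is not the paper's code. A suspend inside doit reaches the
+resumer only if every intermediate routine, including logger, is rewritten as
+a generator that delegates with "yield from".
+
```python
import types


def logger(t, doit):
    """Library function: calls doit as an ordinary function."""
    return doit(t)


def logger_gen(t, doit_gen):
    """Duplicate generator version of logger, needed so that a suspend
    inside doit_gen propagates through logger to the resumer."""
    result = yield from doit_gen(t)
    return result


def plain_doit(t):
    """doit for an ordinary caller: no suspend."""
    return f"logged {t}"


def suspending_doit(t):
    """doit for a coroutine-like caller: suspends once before finishing."""
    yield f"suspended in {t}"  # suspend back to the resumer
    return f"logged {t}"


def worker_with_function_logger():
    # logger only hands back an unstarted generator object; nothing suspends.
    yield logger("C", suspending_doit)


def worker_with_generator_logger():
    result = yield from logger_gen("C", suspending_doit)
    yield result


if __name__ == "__main__":
    print(logger("S", plain_doit))
    print(list(worker_with_generator_logger()))
```
+
+With a stackful coroutine, as in the Cforall code above, the single logger
+serves both callers; with stacked generators, logger_gen must exist alongside
+logger, which is impossible if logger is opaque library code.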
+
+
+
+    In fact, the end of Section 2.1 (on page 5) contains a particular paragraph
+    that embodies this "top down" approach. It starts, "programmers can now
+    answer three basic questions", and thus gives some practical advice for
+    which construct you should use and when. I think giving some examples of
+    specific applications that this paragraph, combined with some examples of
+    cases where each construct was needed, would be a better approach.
+
+    I don't think this comparison needs to be very long. It seems clear enough
+    that one would
+
+    * prefer generators for simple computations that yield up many values,
+
+This description does not cover output or matching generators that do not yield
+many or any values. For example, the output generator Fmt yields no value; the
+device driver yields a value occasionally once a message is found. Furthermore,
+real device drivers are not simple; they can have hundreds of states and
+transitions. Imagine the complex event-engine for a web-server written as a
+generator.
+
+    * prefer coroutines for more complex processes that have significant
+      internal structure,
+
+As for generators, complexity is not the criterion for selection. A coroutine
+brings generality to the implementation because of the additional stack,
+whereas generators have restrictions on standard software-engineering
+practices: variable placement, no helper functions without creating an
+explicit generator stack, and no symmetric control-flow. In contrast, the
+producer/consumer example in Figure 7 uses stack variable placement, helpers,
+and simple ping/pong-style symmetric control-flow.
+
+    * prefer threads for cases where parallel execution is desired or needed.
+
+Agreed, but this description does not mention mutual exclusion and
+synchronization, which is essential in any meaningful concurrent program.
+
+Our point here is to illustrate that a casual "top down" explanation is
+insufficient to explain the complexity of the underlying execution properties.
+We presented some rules of thumb at the end of Section 2 but programmers must
+understand all the underlying mechanisms and their interactions to exploit the
+execution properties to their fullest, and to understand when a programming
+language does or does not provide a desired mechanism.
+
+
+    I did appreciate the comparison in Section 2.3 between async-await in
+    JS/Java and generators/coroutines. I agree with its premise that those
+    mechanisms are a poor replacement for generators (and, indeed, JS has a
+    distinct generator mechanism, for example, in part for this reason).  I
+    believe I may have asked for this in a previous review, but having read it,
+    I wonder if it is really necessary, since those mechanisms are so different
+    in purpose.
+
+Given that you asked about this before, I believe other readers might also
+ask the same question because async-await is very popular. So I think this
+section does help to position the work in the paper among other work, and
+hence, it is appropriate to keep it in the paper.
+
+
+    I find the motivation for supporting both internal and external scheduling
+    to be fairly implicit. After several reads through the section, I came to
+    the conclusion that internal scheduling is more expressive than external
+    scheduling, but sometimes less convenient or clear. Is this correct? If
+    not, it'd be useful to clarify where external scheduling is more
+    expressive.
+
+    I would find it very interesting to try and capture some of the properties
+    that make internal vs external scheduling the better choice.
+
+    For example, it seems to me that external scheduling works well if there
+    are only a few "key" operations, but that internal scheduling might be
+    better otherwise, simply because it would be useful to have the ability to
+    name a signal that can be referenced by many methods.
+
+To address this point, the last paragraph on page 22 (now page 23) has been
+augmented to the following:
+
+ Given external and internal scheduling, what guidelines can a programmer use
+ to select between them?  In general, external scheduling is easier to
+ understand and code because only the next logical action (mutex function(s))
+ is stated, and the monitor implicitly handles all the details.  Therefore,
+ there are no condition variables, and hence, no wait and signal, which reduces
+ coding complexity and synchronization errors.  If external scheduling is
+ simpler than internal, why not use it all the time?  Unfortunately, external
+ scheduling cannot be used if: scheduling depends on parameter value(s) or
+ scheduling must block across an unknown series of calls on a condition
+ variable (\ie internal scheduling).  For example, the dating service cannot be
+ written using external scheduling.  First, scheduling requires knowledge of
+ calling parameters to make matching decisions and parameters of calling
+ threads are unavailable within the monitor.  Specifically, a thread within the
+ monitor cannot examine the @ccode@ of threads waiting on the calling queue to
+ determine if there is a matching partner.  (Similarly, if the bounded buffer
+ or readers/writer are restructured with a single interface function with a
+ parameter denoting producer/consumer or reader/writer, they cannot be solved
+ with external scheduling.)  Second, a scheduling decision may be delayed
+ across an unknown number of calls when there is no immediate match so the
+ thread in the monitor must block on a condition.  Specifically, if a thread
+ determines there is no opposite calling thread with the same @ccode@, it must
+ wait an unknown period until a matching thread arrives.  For complex
+ synchronization, both external and internal scheduling can be used to take
+ advantage of the best properties of each.
+
+
+    Consider the bounded buffer from Figure 13: if it had multiple methods for
+    removing elements, and not just `remove`, then the `waitfor(remove)` call
+    in `insert` might not be sufficient.
+
+Section 6.4 Extended waitfor shows the waitfor is very powerful and can handle
+your request:
+
+  waitfor( remove : buffer ); or waitfor( remove2 : buffer );
+
+and its shorthand form (not shown in the paper)
+
+  waitfor( remove, remove2 : buffer );
+
+A call to one of these remove functions satisfies the waitfor (exact
+selection details are discussed in Section 6.4).
+
+
+    The same is true, I think, of the `signal_block` function, which I
+    have not encountered before;
+
+In Tony Hoare's seminal paper on Monitors "Monitors: An Operating System
+Structuring Concept", it states on page 551:
+
+ When a process signals a condition on which another process is waiting, the
+ signalling process must wait until the resumed process permits it to
+ proceed. We therefore introduce for each monitor a second semaphore "urgent"
+ (initialized to 0), on which signalling processes suspend themselves by the
+ operation P(urgent).
+
+Hence, the original definition of signal is in fact signal_block, i.e., the
+signaller blocks. Later implementations of monitor switched to signaller
+nonblocking because most signals occur before returns, which allows the
+signaller to continue execution, exit the monitor, and run concurrently with
+the signalled thread that restarts in the monitor. When the signaller is not
+going to exit immediately, signal_block is appropriate.
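+
+The signaller-blocking semantics can be sketched in Python with the
+semaphores from Hoare's description. This is a minimal sketch under stated
+assumptions, not the Cforall implementation: the class HoareMonitor, its
+single condition queue, and the demo threads are all hypothetical names.
+
```python
import threading


class HoareMonitor:
    """Monitor with one condition queue, following Hoare's semaphore scheme."""

    def __init__(self):
        self.mutex = threading.Semaphore(1)   # monitor entry lock
        self.urgent = threading.Semaphore(0)  # blocked signallers
        self.urgent_count = 0
        self.cond = threading.Semaphore(0)    # the single condition queue
        self.cond_count = 0

    def enter(self):
        self.mutex.acquire()

    def leave(self):
        # Blocked signallers have priority over new callers.
        if self.urgent_count > 0:
            self.urgent.release()
        else:
            self.mutex.release()

    def wait(self):
        self.cond_count += 1
        self.leave()
        self.cond.acquire()
        self.cond_count -= 1

    def signal_block(self):
        # Signalling an empty condition queue does nothing.
        if self.cond_count > 0:
            self.urgent_count += 1
            self.cond.release()    # hand the monitor to the waiter
            self.urgent.acquire()  # block until the waiter leaves or waits
            self.urgent_count -= 1


def run_demo():
    """Run one waiter and one signaller; return the order of events."""
    monitor, log = HoareMonitor(), []
    waiting = threading.Event()

    def waiter():
        monitor.enter()
        log.append("waiter waits")
        waiting.set()
        monitor.wait()
        log.append("waiter resumed")
        monitor.leave()

    def signaller():
        waiting.wait()  # demo only: make sure the waiter entered first
        monitor.enter()
        log.append("signaller signals")
        monitor.signal_block()
        log.append("signaller continues")
        monitor.leave()

    threads = [threading.Thread(target=f) for f in (waiter, signaller)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


if __name__ == "__main__":
    print(run_demo())
```
+
+In this sketch the waiter always resumes before the signaller continues, which
+is the signal_block ordering; a signaller-nonblocking monitor would instead
+let the signaller run on and exit before the waiter restarts.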
+
+
+    it seems like its behavior can be modeled with multiple condition
+    variables, but that's clearly more complex.
+
+Yes. Buhr, Fortier and Coffin show in Monitor Classification, ACM Computing
+Surveys, 27(1):63-107, that all extant monitors with different signalling
+semantics can be transformed into each other. However, some transformations are
+complex and runtime expensive.
+
+
+    One question I had about `signal_block`: what happens if one signals
+    but no other thread is waiting? Does it block until some other thread
+    waits? Or is that user error?
+
+On page 20, it states:
+
+  Signalling is unconditional because signalling an empty condition queue does
+  nothing.
+
+To the best of our knowledge, all monitors have the same semantics for
+signalling an empty condition queue, regardless of the kind of signal, i.e.,
+signal or signal_block.
+
+    I believe that one difference between the Go program and the Cforall
+    equivalent is that the Goroutine has an associated queue, so that
+    multiple messages could be enqueued, whereas the Cforall equivalent is
+    effectively a "bounded buffer" of length 1. Is that correct?
+
+Actually, the buffer length is 0 for the Cforall call and the Go unbuffered
+send so both are synchronous communication.
+
+    I think this should be stated explicitly. (Presumably, one could modify the
+    Cforall program to include an explicit vector of queued messages if
+    desired, but you would also be reimplementing the channel abstraction.)
+
+Fixed, by adding the following sentences:
+
+  The difference between call and channel send occurs for buffered channels
+  making the send asynchronous.  In \CFA, asynchronous call and multiple
+  buffers are provided using an administrator and worker
+  threads~\cite{Gentleman81} and/or futures (not discussed).
+
+
+    Also, in Figure 20, I believe that there is a missing `mutex` keyword.
+    
+Fixed.
+
+
+    I was glad to see that the paper acknowledged that Cforall still had
+    low-level atomic operations, even if their use is discouraged in favor of
+    higher-level alternatives.
+
+There was never an attempt to not acknowledge that Cforall had low-level
+atomic operations. The original version of the paper stated:
+
+  6.6 Low-level Locks
+  For completeness and efficiency, Cforall provides a standard set of low-level
+  locks: recursive mutex, condition, semaphore, barrier, etc., and atomic
+  instructions: fetchAssign, fetchAdd, testSet, compareSet, etc.
+
+and that section is still in the paper. In fact, we use these low-level
+mechanisms to build all of the high-level concurrency constructs in Cforall.
+  
+
+    However, I still feel that the conclusion overstates the value of the
+    contribution here when it says that "Cforall high-level race-free monitors
+    and threads provide the core mechanisms for mutual exclusion and
+    synchronization, without the need for volatile and atomics". I feel
+    confident that Java programmers, for example, would be advised to stick
+    with synchronized methods whenever possible, and it seems to me that they
+    offer similar advantages -- but they sometimes wind up using volatiles for
+    performance reasons.
+
+I think we are agreeing violently. 99.9% of Java/Cforall/Go/Rust concurrent
+programs can achieve very good performance without volatile or atomics because
+the runtime system has already used these low-level capabilities to build an
+efficient set of high-level concurrency constructs.
+
+The other 0.1% of the time, programmers need to build their own locks and
+synchronization mechanisms. This need also occurs for storage management. Both
+of these mechanisms are allowed in Cforall but are fraught with danger and
+should be discouraged. Specifically, it takes a 7th dan Black Belt programmer
+to understand fencing for a WSO memory model, such as on the ARM. It doesn't
+help that the C++ atomics are baroque and incomprehensible. I'm sure Hans
+Boehm, Doug Lea, Dave Dice and I would agree that 99% of hand-crafted locks
+created by programmers are broken and/or non-portable.
+
+
+    I was also confused by the term "race-free" in that sentence. In
+    particular, I don't think that Cforall has any mechanisms for preventing
+    *data races*, and it clearly doesn't prevent "race conditions" (which would
+    bar all sorts of useful programs). I suppose that "race free" here might be
+    referring to the improvements such as removing barging behavior.
+
+We use the term "race free" to mean the same as Boehm/Adve's "data-race
+freedom" in
+
+  Boehm Hans-J., Adve Sarita V., You Don't Know Jack About Shared Variables or
+  Memory Models. Communications ACM. 2012; 55(2):48-54.
+  https://queue.acm.org/detail.cfm?id=2088916
+
+which is cited in the paper. Furthermore, we never said that Cforall has
+mechanisms for preventing *all* data races. We said that Cforall high-level
+race-free monitors and threads[, when used with mutex access function] (added
+to the paper) imply no data races within these constructs, unless a
+programmer directly publishes shared state. This approach is exactly what
+Boehm/Adve advocate for the vast majority of concurrent programming.
+
+
+    It would perhaps be more interesting to see a comparison built using [tokio] or
+    [async-std], two of the more prominent user-space threading libraries that
+    build on Rust's async-await feature (which operates quite differently than
+    Javascript's async-await, in that it doesn't cause every aync function call to
+    schedule a distinct task).
+
+Done.
+
+
+    Several figures used the `with` keyword. I deduced that `with(foo)` permits
+    one to write `bar` instead of `foo.bar`. It seems worth introducing.
+    Apologies if this is stated in the paper, if so I missed it.
+
+Page 6, footnote F states:
+
+  The Cforall "with" clause opens an aggregate scope making its fields directly
+  accessible, like Pascal "with", but using parallel semantics; multiple
+  aggregates may be opened.
+
+
+    On page 20, section 6.3, "external scheduling and vice versus" should be
+    "external scheduling and vice versa".
+
+Fixed.
+
+
+    On page 5, section 2.3, the paper states "we content" but it should be "we
+    contend".
+
+Fixed.
+
+
+    Page 1. I don't believe that it s fair to imply that Scala is "research
+    vehicle" as it is used by major players, Twitter being the most prominent
+    example.
+
+Fixed. Removed "research".
+
+
+    Page 15. Must Cforall threads start after construction (e.g. see your
+    example on page 15, line 21)?
+
+Yes. Our experience in Java, uC++ and Cforall is that 95% of the time
+programmers want threads to start immediately. (Most Java programs have no code
+between a thread declaration and the call to start the thread.)  Therefore,
+this semantic should be the default because (see page 13):
+
+  Alternatives, such as explicitly starting threads as in Java, are repetitive
+  and forgetting to call start is a common source of errors.
+
+To handle the other 5% of the cases, there is a trivial Cforall pattern
+providing Java-style start/join. The additional cost for this pattern is 2
+light-weight thread context-switches.
+
+  thread T {};
+  void start( T & mutex ) {} // any function name
+  void join( T & mutex ) {} // any function name
+  void main( T & t ) {
+      sout | "start";
+      waitfor( start : t ); // wait to be started
+      sout | "restart"; // perform work
+      waitfor( join : t ); // wait to be joined
+      sout | "join";
+  }
+  int main() {
+      T t[3]; // threads start and delay
+      sout | "need to start";
+      for ( i; 3 ) start( t[i] );
+      sout | "need to join";
+      for ( i; 3 ) join( t[i] );
+      sout | "threads stopped";
+  } // threads deleted from stack
+
+  $ a.out
+  need to start
+  start
+  start
+  start
+  need to join
+  restart
+  restart
+  restart
+  threads stopped
+  join
+  join
+  join
+
+
+    Page 18, line 17: is using
+
+Fixed.
