# source:doc/papers/concurrency/response2@04b4a71

Last change on this file since 04b4a71 was 04b4a71, checked in by Peter A. Buhr <pabuhr@…>, 2 years ago

update concurrency paper with referee changes and generate a response to the referee's report

Reviewing: 1

    I still have a couple of issues --- perhaps the largest is that it's
    still not clear at this point in the paper what some of these options
    are, or crucially how they would be used. I don't know if it's
    possible to give high-level examples or use cases to be clear about
    these up front - or if that would duplicate too much information from
    later in the paper - either way expanding out the discussion - even if
    just a couple of sentences for each row - would help me more.

Section 2.1 is changed to address this suggestion.


    * 1st para section 2 begs the question: why not support each
      dimension independently, and let the programmer or library designer
      combine features?

As seen in Table 1, not all of the combinations work, and having programmers
directly use these low-level mechanisms is error-prone. Accessing these
fundamental mechanisms through higher-level constructs has always been the
purpose of a programming language.


    * Why must there "be language mechanisms to create, block/unblock, and join
      with a thread"?  There aren't in Smalltalk (although there are in the
      runtime).  Especially given in Cforall those mechanisms are *implicit* on

The best description of Smalltalk concurrency I can find is in J. Hunt,
Smalltalk and Object Orientation, Springer-Verlag London Limited, 1997, Chapter
31 Concurrency in Smalltalk. It states on page 332:

  For a process to be spawned from the current process there must be some way
  of creating a new process. This is done using one of four messages to a
  block. These messages are:

    aBlock fork: This creates and schedules a process which will execute the
    block. The priority of this process is inherited from the parent process.
    ...

  The Semaphore class provides facilities for achieving simple synchronization,
  it is simple because it only allows for two forms of communication signal and
  wait.

Hence, "aBlock fork" creates, "Semaphore" blocks/unblocks (as does message send
to an aBlock object), and garbage collection of an aBlock joins with its
thread. The fact that a programmer *implicitly* does "fork", "block"/"unblock",
and "join" does not change their fundamental requirement.


51   * "Case 1 is a function that borrows storage for its state (stack
52     frame/activation) and a thread from its invoker"
53
54     this much makes perfect sense to me, but I don't understand how a
55     non-stateful, non-threaded function can then retain
56
57     "this state across callees, ie, function local-variables are
58     retained on the stack across calls."
59
60     how can it retain function-local values *across calls* when it
61     doesn't have any functional-local state?
62
63In the following example:
64
65  void foo() {
66     // local variables and code
67  }
68  void bar() {
69     // local variables
70     foo();
71  }
72
73bar is the caller and foo is the callee. bar borrows the program stack and
74thread to make the call to foo. When foo, the callee, is executing, bar's local
75variables (state) is retained on the *borrowed* stack across the call. (Note, I
76added *borrowed* to that sentence in the paper to help clarify.)  Furthermore,
77foo's local variables are also retain on the borrowed stack. When foo and bar
78return, all of their local state is gone (not retained). This behaviour is
79standard call/return semantics in an imperative language.
80
81
     I'm not sure if I see two separate cases here - roughly equivalent
     to C functions without static storage, and then C functions *with*
     static storage.

Yes, but there is only one instance of the static storage across all
activations of the C function. For generators and coroutines, each instance has
its own state, like an object in an OO language.


     I assumed that was the distinction between cases 1 & 3; but perhaps the
     actual distinction is that 3 has a suspend/resume point, and so the
     "state" in figure 1 is this component of execution state (viz figs 1 & 2),
     not the state representing the cross-call variables?

So case 3 is like an object with the added ability to retain where it was last
executing.  When a generator is resumed, the generator object (structure
instance) is passed as an explicit reference, and within this object is the
restart location in the generator's "main". When the generator main executes,
it uses the borrowed stack for its local variables and any functions it calls,
just like an object member borrows the stack for its local variables but also
has an implicit receiver to the object state.  A generator can have static
storage, too, which is a single instance across all generator instances of that
type, as for static storage in an object type. All the kinds of storage are
at play with semantics that are virtually the same as in other languages.


    > but such evaluation isn't appropriate for garbage-collected or JITTed
    > languages like Java or Go.
    For JITTed languages in particular, reporting peak performance needs to
    "warm up" the JIT with a number of iterations before beginning
    measurement. Actually for JITs it's even worse: see Edd Barrett et al
    OOPSLA 2017.

Of our testing languages, only Java is JITted. To ensure the Java test programs
correctly measured the specific feature, we consulted with Dave Dice at Oracle,
who works directly on the development of the Oracle JVM Just-in-Time
Compiler. We modified our test programs based on his advice, and he validated
our programs as correctly measuring the specified language feature. Hence, we
have taken into account all issues related to performing benchmarks in JITted
languages.  Dave's help is recognized in the Acknowledgment section. Also, all
the benchmark programs are publicly available for independent verification.


   * footnote A - I've looked at various other papers & the website to try to
     understand how "object-oriented" Cforall is - I'm still not sure.  This
     footnote says Cforall has "virtuals" - presumably virtual functions,
     i.e. dynamic dispatch - and inheritance: that really is OO as far as I
     (and most OO people) are concerned.  For example Haskell doesn't have
     inheritance, so it's not OO; while CLOS (the Common Lisp *Object* System)
     or things like Cecil and Dylan are considered OO even though they have
     "multiple function parameters as receivers", lack "lexical binding between
     a structure and set of functions", and don't have explicit receiver
     invocation syntax.  Python has receiver syntax, but unlike Java or
     Smalltalk or C++, method declarations still need to have an explicit
     "self" receiver parameter.  Seems to me that Go, for example, is
     more-or-less OO with interfaces, methods, and dynamic dispatch (yes also
     an explicit receiver syntax but that's not determinative); while Rust
     lacks dynamic dispatch built-in.  C is not OO as a language, but as you
     say given it supports function pointers with structures, it does support
     an OO programming style.

     This is why I again recommend just not buying into this fight: not making
     any claims about whether Cforall is OO or is not - because as I see it,
     the rest of the paper doesn't depend on whether Cforall is OO or not.
     That said: this is just a recommendation, and I won't quibble over this
     any further.

We believe it is important to identify Cforall as a non-OO language because it
heavily influences the syntax and semantics used to build its concurrency.
Since many aspects of Cforall are not OO, the rest of the paper *does* depend
on Cforall being identified as non-OO, otherwise readers would have
significantly different expectations for the design. We believe your definition
of OO is too broad, such as including C. Just because a programming language
can support aspects of the OO programming style does not make it OO. (Just
because a duck can swim does not make it a fish.)

Our definition of non-OO follows directly from the Wikipedia entry:

  Object-oriented programming (OOP) is a programming paradigm based on the
  concept of "objects", which can contain data, in the form of fields (often
  known as attributes or properties), and code, in the form of procedures
  (often known as methods). A feature of objects is an object's procedures that
  can access and often modify the data fields of the object with which they are
  associated (objects have a notion of "this" or "self").
  https://en.wikipedia.org/wiki/Object-oriented_programming

Cforall fails this definition as code cannot appear in an "object" and there is
no implicit receiver. As well, Cforall, Go, and Rust do not have nominal
inheritance and they are not considered OO languages, e.g.:

172 "**Is Go an object-oriented language?** Yes and no. Although Go has types and
173 methods and allows an object-oriented style of programming, there is no type
174 hierarchy. The concept of "interface" in Go provides a different approach
175 that we believe is easy to use and in some ways more general. There are also
176 ways to embed types in other types to provide something analogous-but not
177 identical-to subclassing. Moreover, methods in Go are more general than in
178 C++ or Java: they can be defined for any sort of data, even built-in types
179 such as plain, "unboxed" integers. They are not restricted to structs (classes).
180 https://golang.org/doc/faq#Is_Go_an_object-oriented_language
181
182
   * is a "monitor function" the same as a "mutex function"?
     if so the paper should pick one term; if not, make the distinction clear.

Fixed. Picked "mutex". Changed the language and all places in the paper.


189   * "As stated on line 1 because state declarations from the generator
190      type can be moved out of the coroutine type into the coroutine main"
191
192      OK sure, but again: *why* would a programmer want to do that?
193      (Other than, I guess, to show the difference between coroutines &
194      generators?)  Perhaps another way to put this is that the first
195      para of 3.2 gives the disadvantages of coroutines vs-a-vs
196      generators, briefly describes the extended semantics, but never
197      actually says why a programmer may want those extended semantics,
198      or how they would benefit.  I don't mean to belabour the point,
199      but (generalist?) readers like me would generally benefit from
200      those kinds of discussions about each feature throughout the
201      paper: why might a programmer want to use them?
202
203On page 8, it states:
204
205  Having to manually create the generator closure by moving local-state
206  variables into the generator type is an additional programmer burden (removed
207  by the coroutine in Section 3.2). ...
208
209also these variables can now be refactored into helper function, where the
210helper function suspends at arbitrary call depth. So imagine a coroutine helper
211function that is only called occasionally within the coroutine but it has a
212large array that is retained across suspends within the helper function. For a
213generator, this large array has to be declared in the generator type enlarging
214each generator instance even through the array is only used occasionally.
215Whereas, the coroutine only needs the array allocated when needed. Now a
216coroutine has a stack which occupies storage, but the maximum stack size only
217needs to be the call chain allocating the most storage, where as the generator
218has a maximum size of all variable that could be created.


    > p17 if the multiple-monitor entry procedure really is novel, write a paper
    > We do not believe this is a practical suggestion.
    * I'm honestly not trying to be snide here: I'm not an expert on monitor or
      concurrent implementations. Brinch Hansen's original monitors were single
      acquire; this draft does not cite any other previous work that I could
      see. I'm not suggesting that the brief mention of this mechanism
      necessarily be removed from this paper, but if this is novel (and a clear
      advance over a classical OO monitor a-la Java which only acquires the
      distinguished receiver) then that would be worth another paper in itself.

First, to explain multiple-monitor entry in Cforall as a separate paper would
require significant background on Cforall concurrency, which means repeating
large sections of this paper. Second, it feels like a paper just on
multiple-monitor entry would be a little thin, even if the capability is novel
to the best of our knowledge. Third, we feel multiple-monitor entry springs
naturally from the overarching tone in the paper that Cforall is a non-OO
language.


    My typo: the paper's conclusion should come at the end, after the
    future work section.

Combined into a Conclusions and Future Work section.



Reviewing: 2

    on the Boehm paper and whether code is "all sequential to the compiler": I
    now understand the authors' position better and suspect we are in violent
    agreement, except for whether it's appropriate to use the rather breezy
    phrase "all sequential to the compiler". It would be straightforward to
    clarify that code not using the atomics features is optimized *as if* it
    were sequential, i.e. on the assumption of a lack of data races.

Fixed, "as inline and library code is compiled as sequential without any
explicit concurrent directive."


    on the distinction between "mutual exclusion" and "synchronization": the
    added citation does help, in that it makes a coherent case for the
    definition the authors prefer. However, the text could usefully clarify
    that this is a matter of definition not of fact, given especially that in
    my assessment the authors' preferred definition is not the most common
    one. (Although the mention of Hoare's apparent use of this definition is
    one data point, countervailing ones are found in many contemporaneous or
    later papers, e.g. Habermann's 1972 "Synchronization of Communicating
    Processes" (CACM 15(3)), Reed & Kanodia's 1979 "Synchronization with
    eventcounts and sequencers" (CACM 22(2)) and so on.)

Fixed, "We contend these two properties are independent, ...".

With respect to the two papers, Habermann fundamentally agrees with our
definitions, where the term mutual exclusion is the same but Habermann uses the
term "communication" for our synchronization. However, the term "communication"
is rarely used to mean synchronization. The fact that Habermann collectively
calls these two mechanisms synchronization is the source of the confusion.  Reed &
Kanodia state:

  By mutual exclusion we mean any mechanism that forces the time ordering of
  execution of pieces of code, called critical regions, in a system of
  concurrent processes to be a total ordering.

But there is no timing order for a critical region (which I assume means
critical section); threads can arrive at any time in any order, where the
mutual exclusion for the critical section ensures one thread executes at a
time. Interestingly, Reed & Kanodia's mutual exclusion is Habermann's
communication, not mutual exclusion. These papers only buttress our contention
about the confusion of these terms in the literature.


    section 2 (an expanded version of what was previously section 5.9) lacks
    examples and is generally obscure and allusory ("the most advanced feature"
    -- name it! "in triplets" -- there is only one triplet!;

Fixed.

 These independent properties can be used to compose different language
 features, forming a compositional hierarchy, where the combination of all
 possible for a language to only provide threads for composing
 programs~\cite{Hermes90}, this unnecessarily complicates and makes inefficient
 solutions to certain classes of problems.


    what are "execution locations"? "initialize" and "de-initialize"

Fixed through an example at the end of the sentence.

 \item[\newterm{execution state}:] is the state information needed by a
 control-flow feature to initialize, manage compute data and execution
 location(s), and de-initialize, \eg calling a function initializes a stack
 frame including contained objects with constructors, manages local data in
 blocks and return locations during calls, and de-initializes the frame by
 running any object destructors and management operations.


    what? "borrowed from the invoker" is a concept in need of explaining or at
    least a fully explained example -- in what sense does a plain function
    "borrow" its stack frame?

A function has no storage except for static storage; it gets its storage from
the program stack. For a function to run, it must borrow storage from somewhere.
When the function returns, the borrowed storage is returned. Do you have a more
appropriate word?

    "computation only" as opposed to what?

This is a term in concurrency for operations that compute without blocking,
i.e., the operation starts with everything it needs to compute its result and
runs to completion, only blocking when it is done and returns its result.


    in 2.2, in what way is a "request" fundamental to "synchronization"?

I assume you are referring to the last bullet.

 Synchronization must be able to control the service order of requests
 including prioritizing selection from different kinds of outstanding requests,
 and postponing a request for an unspecified time while continuing to accept
 new requests.

Habermann states on page 173:

 Looking at deposit we see that sender and receiver should be synchronized with
 respect to buffer overflow: deposit of another message must be delayed if
 there is no empty message frame, and such delay should be removed when the
 first empty frame becomes available.  This is programmed as
 synchronization(frame): deposit is preceded by wait(frame), accept is
 followed by signal(frame), and the constant C[frame] is set equal to
 "bufsize."

Here synchronization is controlling the service order of requests: when the
buffer is full, requests to insert (sender) are postponed until a request to
remove (receiver) occurs.  Without the ability to control service order among
requests, the producer/consumer problem with a bounded buffer cannot be
solved. Hence, this capability is fundamental.


    and the "implicitly" versus "explicitly" point needs stating as elsewhere,
    with a concrete example (e.g. Java built-in mutexes versus
    java.util.concurrent).

Fixed.

 MES must be available implicitly in language constructs, \eg Java built-in
 monitors, as well as explicitly for specialized requirements, \eg
 @java.util.concurrent@, because requiring programmers to build MES using
 low-level locks often leads to incorrect programs.


    section 6: 6.2 omits the most important facts in preference for otherwise
    inscrutable detail: "identify the kind of parameter" (first say *that there
    are* kinds of parameter, and what "kinds" means!); "mutex parameters are
    documentation" is misleading (they are also semantically significant!) and
    fails to say *what* they mean; the most important thing is surely that
    'mutex' is a language feature for performing lock/unlock operations at
    function entry/exit. So say it!



    The meanings of examples f3 and f4 remain unclear.

Rewrote the paragraph.


    Meanwhile in 6.3, "urgent" is not introduced (we are supposed to infer its
    meaning from Figure 12,

Defined Hoare's urgent list at the start of the paragraph.


    but that Figure is incomprehensible to me), and we are told of "external
    scheduling"'s long history in Ada but not clearly what it actually means;

We do not know how to address your comment because the description is clear to
us and other non-reviewers who have read the paper. I had forgotten an
important citation to prior work on this topic, which is now referenced:

   Buhr Peter A., Fortier Michel, Coffin Michael H.. Monitor
   Classification. ACM Computing Surveys. 1995; 27(1):63-107.

This citation is a 45-page paper expanding on the topic of internal scheduling.
I also added a citation to Hoare's monitor paper, where signal_block is defined:

  When a process signals a condition on which another process is waiting, the
  signalling process must wait until the resumed process permits it to
  proceed.

Furthermore, Figure 20 shows a direct comparison of CFA procedure-call
external-scheduling with Go channel external-scheduling. So external scheduling


    6.4's description of "waitfor" tells us it is different from an if-else
    chain but tries to use two *different* inputs to tell us that the behavior
    is different; tell us an instance where *the same* values of C1 and C2 give
    different behavior (I even wrote out a truth table and still don't see the
    semantic difference)

Again, it is unclear what the problem is. For the if-statement, if C1 is true,
it only waits for a call to mem1, even if C2 is true and there is an
outstanding call to mem2. For the waitfor, if both C1 and C2 are true and there
is a call to mem2 but not mem1, it accepts the call to mem2, which cannot
happen with the if-statement. So for all true when clauses, any outstanding
call is immediately accepted. If there are no outstanding calls, the waitfor
blocks until the next call to any of the true when clauses occurs. I added the
following sentence to further clarify.

 Hence, the @waitfor@ has parallel semantics, accepting any true @when@ clause.

Note, the same parallel semantics exists for the Go select statement with
respect to waiting for a set of channels to receive data. While Go select does
not have a when clause, it would be trivial to add it, making the select more
expressive.


    The authors frequently use bracketed phrases, and sometimes slashes "/", in
    ways that are confusing and/or detrimental to readability.  Page 13 line
    2's "forward (backward)" is one particularly egregious example.  In general
    I would recommend that the authors try to limit their use of parentheses and
    slashes as a means of forcing a clearer wording to emerge.

Many of the slashes and parentheticals have been removed. Some are retained to
express joined concepts: call/return, suspend/resume, resume/resume, I/O.


    Also, the use of "eg." is often cursory and does not explain the examples
    given, which are frequently a one- or two-word phrase of unclear referent.

A few of these are fixed.


    Considering the revision more broadly, none of the more extensive or
    creative rewrites I suggested in my previous review have been attempted,
    nor any equivalent efforts to improve its readability.


        An expositional idea occurs: start the paper with a strawman
        naive/limited realisation of coroutines -- say, Simon Tatham's popular
        "Coroutines in C" web page -- and identify point by point what the
        limitations are and how C\/ overcomes them. Currently the presentation
        is often flat (lacking motivating contrasts) and backwards (stating
        solutions before problems). The foregoing approach might fix both of
        these.

    We prefer the current structure of our paper and believe the paper does
    explain basic coding limitations and how they are overcome in using
    high-level control-flow mechanisms.


477
478    The hoisting of the former section 5.9 is a good idea, but the newly added
479    material accompanying it (around Table 1) suffers fresh deficiencies in
480    clarity. Overall the paper is longer than before, even though (as my
481    previous review stated), I believe a shorter paper is required in order to
482    serve the likely purpose of publication. (Indeed, the authors' letter
483    implies that a key goal of publication is to build community and gain
484    external users.)
485
486This comment is the referee's opinion, which we do not agree with it. Our
487previous 35 page SP&E paper on Cforall:
488
489  Moss Aaron, Schluntz Robert, Buhr Peter A.. Cforall: Adding Modern
490  Programming Language Features to C. Softw. Pract. Exper. 2018;
491  48(12):2111-2146.
492
493has a similar structure and style to this paper, and it received an award from
494John Wiley & Sons:
495
496 Software: Practice & Experience for articles published between January 2017
497 and December 2018, "most downloads in the 12 months following online
498 publication showing the work generated immediate impact and visibility,
499 contributing significantly to advancement in the field".
500
501So we have demonstrated an ability to build community and gain external users.
502
503
    Given this trajectory, I no longer see a path to an acceptable revision of
    the present submission. Instead I suggest the authors consider splitting
    the paper in two: one half about coroutines and stack management, the other
    about mutexes, monitors and the runtime. (A briefer presentation of the
    runtime may be helpful in the first paper also, and a brief recap of the
    generator and coroutine support is obviously needed in the second too.)

Again, we disagree with the referee's suggestion to vastly restructure the
paper. What advantage is there in presenting exactly the same material across
two papers, which will likely end up longer than a single paper? The current
paper has a clear theme that fundamental execution properties generate a set of
basic language mechanisms, and we then proceed to show how these mechanisms can
be designed into the programming language Cforall.


    I do not buy the authors' defense of the limited practical experience or
    "non-micro" benchmarking presented. Yes, gaining external users is hard and
    I am sympathetic on that point. But building something at least *somewhat*
    substantial with your own system should be within reach, and without it the
    "practice and experience" aspects of the work have not been explored.
    Clearly C\/ is the product of a lot of work over an extended period, so it
    is a surprise that no such experience is readily available for inclusion.

Understood. There are no agreed-upon concurrency benchmarks, which is why
micro-benchmarks are often used. Currently, the entire Cforall runtime is
written in Cforall (10,000+ lines of code (LOC)). This runtime is designed to
be thread safe, automatically detects the use of concurrency features at link
time, and bootstraps into a threaded runtime almost immediately at program
startup, so threads can be declared as global variables and may run to
completion before the program main starts. The concurrent core of the runtime
is 3,500 LOC and bootstraps from low-level atomic primitives into Cforall locks
and high-level features. In other words, the concurrent core uses itself as
quickly as possible to start using high-level concurrency. There are 12,000+
LOC in the Cforall test-suite used to verify language features, which are run
nightly. Of these, there are 2000+ LOC running standard concurrent tests, such
as aggressively testing each language feature, and classical examples such as
bounded buffer, dating service, matrix summation, quicksort, binary insertion
sort, etc.  More experience will be available soon, based on ongoing work in
the "future works" section. Specifically, non-blocking I/O is working with the
new Linux io_uring interface, and a new high-performance ready-queue is under
construction to take this change into account. With non-blocking I/O, it will
be possible to write applications like high-performance web servers, as is now
done in Rust and Go. Also completed is Java-style executors for work-based
concurrent programming and futures. Under construction is a high-performance
actor system.


    It does not seem right to state that a stack is essential to Von Neumann
    architectures -- since the earliest Von Neumann machines (and indeed early
    Fortran) did not use one.

Reference Manual Fortran II for the IBM 704 Data Processing System, 1958 IBM, page 2
https://archive.computerhistory.org/resources/text/Fortran/102653989.05.01.acc.pdf

  Since a subprogram may call for other subprograms to any desired depth, a
  particular CALL statement may be defined by a pyramid of multi-level
  subprograms.

I think we may be differing on the meaning of stack. You may be imagining a
modern stack that grows and shrinks dynamically. Whereas early Fortran
preallocated a stack frame for each function, like Python allocates a frame for
a generator.  Within each preallocated Fortran function is a frame for local
variables and a pointer to store the return value for a call.  The Fortran
call/return mechanism then uses these frames to build a traditional call stack
linked by the return pointer. The only restriction is that a function stack
frame can only be used once, implying no direct or indirect recursion.  Hence,
without a stack mechanism, there can be no call/return to "any desired depth",
where the maximum desired depth is limited by the number of functions. So
call/return requires some form of a stack, virtually all programming languages
have call/return, past and present, and these languages run on Von Neumann
machines that do not distinguish between program and memory space, have mutable
state, and have the concept of a pointer to data or code.


    To elaborate on something another reviewer commented on: it is a surprise
    to find a "Future work" section *after* the "Conclusion" section. A
    "Conclusions and future work" section often works well.

Done.



Reviewing: 3

    but it remains really difficult to have a good sense of which idea I should
    use and when. This applies in different ways to different features from the
    language:

    * coroutines/generators/threads: here there is some discussion, but it can
      be improved.
    * internal/external scheduling: I didn't find any direct comparison between
      these features, except by way of example.

See changes below.


    I would have preferred something more like a table or a few paragraphs
    highlighting the key reasons one would pick one construct or the other.

Section 2.1 is changed to address this suggestion.


    The discussion of clusters and pre-emption in particular feels quite rushed.

We believe a brief introduction to the Cforall runtime structure is important
because clustering within a user-level versus distributed system is unusual.
Furthermore, the explanation of preemption is important because several new
languages, like Go and Rust tokio, are not preemptive. Rust threads are
preemptive only because it uses kernel threads, which UNIX preempts.
613


    * Recommend to shorten the comparison on coroutine/generator/threads in
      Section 2 to a paragraph with a few examples, or possibly a table
      explaining the trade-offs between the constructs

Not done, see below.


    * Recommend to clarify the relationship between internal/external
      scheduling -- is one more general but more error-prone or low-level?

Done, see below.


    There is obviously a lot of overlap between these features, and in
    particular between coroutines and generators. As noted in the previous
    review, many languages have chosen to offer *only* generators, and to
    build coroutines by stacks of generators invoking one another.

As we point out, coroutines built from stacks of generators have problems, such
as no symmetric control-flow. Furthermore, stacks of generators have a problem
with the following programming pattern: logger is a library function called
from a function or a coroutine, where the doit for the coroutine suspends. With
stacks of generators, there have to be both a function and a generator version
of logger to support these two scenarios. If logger is a library function, it
may be impossible to create the generator logger because the logger function is
opaque.

  #include <fstream.hfa>
  #include <coroutine.hfa>

  forall( otype T | { void doit( T ); } )
  void logger( T & t ) {
      doit( t );
  }

  coroutine C {};
  void main( C & c ) with( c ) {
      void doit( C & c ) { suspend; }
      logger( c );
  }
  void mem( C & c ) {
      resume( c );
  }

  int main() {
      C c;
      mem( c );
      mem( c );

      struct S {};
      S s;
      void doit( S & s ) {}
      logger( s );
  }


    In fact, the end of Section 2.1 (on page 5) contains a particular paragraph
    that embodies this "top down" approach. It starts, "programmers can now
    answer three basic questions", and thus gives some practical advice for
    which construct you should use and when. I think giving some examples of
    specific applications that this paragraph, combined with some examples of
    cases where each construct was needed, would be a better approach.

    I don't think this comparison needs to be very long. It seems clear enough
    that one would

    * prefer generators for simple computations that yield up many values,

This description does not cover output or matching generators that do not yield
many or any values. For example, the output generator Fmt yields no value; the
device driver yields a value only occasionally, once a message is found.
Furthermore, real device drivers are not simple; they can have hundreds of
states and transitions. Imagine the complex event-engine for a web-server
written as a generator.

    * prefer coroutines for more complex processes that have significant
      internal structure,

As for generators, complexity is not the criterion for selection. A coroutine
brings generality to the implementation because of the additional stack,
whereas generators restrict standard software-engineering practises: variable
placement, no helper functions without creating an explicit generator stack,
and no symmetric control-flow. In contrast, the producer/consumer example in
Figure 7 uses stack variable placement, helpers, and simple ping/pong-style
symmetric control-flow.

    * prefer threads for cases where parallel execution is desired or needed.

Agreed, but this description does not mention mutual exclusion and
synchronization, which is essential in any meaningful concurrent program.

Our point here is to illustrate that a casual "top down" explanation is
insufficient to explain the complexity of the underlying execution properties.
We presented some rules of thumb at the end of Section 2, but programmers must
understand all the underlying mechanisms and their interactions to exploit the
execution properties to their fullest, and to understand when a programming
language does or does not provide a desired mechanism.


    I did appreciate the comparison in Section 2.3 between async-await in
    JS/Java and generators/coroutines. I agree with its premise that those
    mechanisms are a poor replacement for generators (and, indeed, JS has a
    distinct generator mechanism, for example, in part for this reason).  I
    believe I may have asked for this in a previous review, but having read it,
    I wonder if it is really necessary, since those mechanisms are so different
    in purpose.

Other readers will likely ask the same question because async-await is very
popular. So I think this section does help to position the work in the paper
among other work, and hence, it is appropriate to keep it in the paper.


    I find the motivation for supporting both internal and external scheduling
    to be fairly implicit. After several reads through the section, I came to
    the conclusion that internal scheduling is more expressive than external
    scheduling, but sometimes less convenient or clear. Is this correct? If
    not, it'd be useful to clarify where external scheduling is more
    expressive.

    I would find it very interesting to try and capture some of the properties
    that make internal vs external scheduling the better choice.

    For example, it seems to me that external scheduling works well if there
    are only a few "key" operations, but that internal scheduling might be
    better otherwise, simply because it would be useful to have the ability to
    name a signal that can be referenced by many methods.

To address this point, the last paragraph on page 22 (now page 23) has been
augmented to the following:

 Given external and internal scheduling, what guidelines can a programmer use
 to select between them?  In general, external scheduling is easier to
 understand and code because only the next logical action (mutex function(s))
 is stated, and the monitor implicitly handles all the details.  Therefore,
 there are no condition variables, and hence, no wait and signal, which reduces
 coding complexity and synchronization errors.  If external scheduling is
 simpler than internal, why not use it all the time?  Unfortunately, external
 scheduling cannot be used if: scheduling depends on parameter value(s) or
 scheduling must block across an unknown series of calls on a condition
 variable (\ie internal scheduling).  For example, the dating service cannot be
 written using external scheduling.  First, scheduling requires knowledge of
 calling parameters to make matching decisions and parameters of calling
 threads are unavailable within the monitor.  Specifically, a thread within the
 monitor cannot examine the @ccode@ of threads waiting on the calling queue to
 determine if there is a matching partner.  (Similarly, if the bounded buffer
 or readers/writer are restructured with a single interface function with a
 parameter denoting producer/consumer or reader/writer, they cannot be solved
 with external scheduling.)  Second, a scheduling decision may be delayed
 across an unknown number of calls when there is no immediate match, so the
 thread in the monitor must block on a condition.  Specifically, if a thread
 determines there is no opposite calling thread with the same @ccode@, it must
 wait an unknown period until a matching thread arrives.  For complex
 synchronization, both external and internal scheduling can be used to take
 advantage of the best properties of each.


    Consider the bounded buffer from Figure 13: if it had multiple methods for
    removing elements, and not just remove, then the waitfor(remove) call
    in insert might not be sufficient.

Section 6.4 Extended waitfor shows the waitfor is very powerful and can handle
multiple functions, e.g.,

  waitfor( remove : buffer ); or waitfor( remove2 : buffer );

and its shorthand form (not shown in the paper)

  waitfor( remove, remove2 : t );

A call to one of these remove functions satisfies the waitfor (exact selection
details are discussed in Section 6.4).


    The same is true, I think, of the signal_block function, which I
    have not encountered before;

In Tony Hoare's seminal paper on Monitors "Monitors: An Operating System
Structuring Concept", it states on page 551:

 When a process signals a condition on which another process is waiting, the
 signalling process must wait until the resumed process permits it to
 proceed. We therefore introduce for each monitor a second semaphore "urgent"
 (initialized to 0), on which signalling processes suspend themselves by the
 operation P(urgent).

Hence, the original definition of signal is in fact signal_block, i.e., the
signaller blocks. Later monitor implementations switched to nonblocking
signalling because most signals occur just before returns, which allows the
signaller to continue execution, exit the monitor, and run concurrently with
the signalled thread that restarts in the monitor. When the signaller is not
going to exit immediately, signal_block is appropriate.


    it seems like its behavior can be modeled with multiple condition
    variables, but that's clearly more complex.

Yes. Buhr, Fortier and Coffin show in Monitor Classification, ACM Computing
Surveys, 27(1):63-107, that all extant monitors with different signalling
semantics can be transformed into each other. However, some transformations are
complex and runtime expensive.


    One question I had about signal_block: what happens if one signals
    but no other thread is waiting? Does it block until some other thread
    waits? Or is that user error?

On page 20, it states:

  Signalling is unconditional because signalling an empty condition queue does
  nothing.

To the best of our knowledge, all monitors have the same semantics for
signalling an empty condition queue, regardless of the kind of signal, i.e.,
signal or signal_block.

    I believe that one difference between the Go program and the Cforall
    equivalent is that the Goroutine has an associated queue, so that
    multiple messages could be enqueued, whereas the Cforall equivalent is
    effectively a "bounded buffer" of length 1. Is that correct?

Actually, the buffer length is 0 for the Cforall call and the Go unbuffered
send, so both are synchronous communication.

    I think this should be stated explicitly. (Presumably, one could modify the
    Cforall program to include an explicit vector of queued messages if
    desired, but you would also be reimplementing the channel abstraction.)

Fixed, by adding the following sentences:

  The difference between call and channel send occurs for buffered channels,
  making the send asynchronous.  In \CFA, asynchronous call and multiple
  buffers are provided using an administrator and worker
  threads~\cite{Gentleman81} and/or futures (not discussed).


    Also, in Figure 20, I believe that there is a missing mutex keyword.

Fixed.


    I was glad to see that the paper acknowledged that Cforall still had
    low-level atomic operations, even if their use is discouraged in favor of
    higher-level alternatives.

There was never an attempt to hide the fact that Cforall has low-level
atomic operations. The original version of the paper stated:

  6.6 Low-level Locks
  For completeness and efficiency, Cforall provides a standard set of low-level
  locks: recursive mutex, condition, semaphore, barrier, etc., and atomic
  instructions: fetchAssign, fetchAdd, testSet, compareSet, etc.

and that section is still in the paper. In fact, we use these low-level
mechanisms to build all of the high-level concurrency constructs in Cforall.


    However, I still feel that the conclusion overstates the value of the
    contribution here when it says that "Cforall high-level race-free monitors
    and threads provide the core mechanisms for mutual exclusion and
    synchronization, without the need for volatile and atomics". I feel
    confident that Java programmers, for example, would be advised to stick
    with synchronized methods whenever possible, and it seems to me that they
    offer similar advantages -- but they sometimes wind up using volatiles for
    performance reasons.

I think we are agreeing violently. 99.9% of Java/Cforall/Go/Rust concurrent
programs can achieve very good performance without volatile or atomics because
the runtime system has already used these low-level capabilities to build an
efficient set of high-level concurrency constructs.

0.1% of the time programmers need to build their own locks and synchronization
mechanisms. This need also occurs for storage management. Both of these
mechanisms are allowed in Cforall but are fraught with danger and should be
discouraged. Specifically, it takes a 7th dan Black Belt programmer to
understand fencing for a WSO memory model, such as on the ARM. It doesn't help
that the C++ atomics are baroque and incomprehensible. I'm sure Hans Boehm,
Doug Lea, Dave Dice and I would agree that 99% of hand-crafted locks created by
programmers are broken and/or non-portable.


    I was also confused by the term "race-free" in that sentence. In
    particular, I don't think that Cforall has any mechanisms for preventing
    *data races*, and it clearly doesn't prevent "race conditions" (which would
    bar all sorts of useful programs). I suppose that "race free" here might be
    referring to the improvements such as removing barging behavior.

We use the term "race free" to mean the same as Boehm/Adve's "data-race
freedom" in

  Boehm Hans-J., Adve Sarita V., You Don't Know Jack About Shared Variables or
  Memory Models. Communications ACM. 2012; 55(2):48-54.
  https://queue.acm.org/detail.cfm?id=2088916

which is cited in the paper. Furthermore, we never said that Cforall has
mechanisms for preventing *all* data races. We said Cforall high-level
constructs are race free, which (in wording now added to the paper) implies no
data races within these constructs, unless a programmer directly publishes
shared state. This approach is exactly what Boehm and Adve advocate.

    It would perhaps be more interesting to see a comparison built using [tokio] or
    [async-std], two of the more prominent user-space threading libraries that
    build on Rust's async-await feature (which operates quite differently than
    Javascript's async-await, in that it doesn't cause every async function call to
    start executing immediately).

Done.


    Several figures used the with keyword. I deduced that with(foo) permits
    one to write bar instead of foo.bar. It seems worth introducing.
    Apologies if this is stated in the paper, if so I missed it.

Page 6, footnote F states:

  The Cforall "with" clause opens an aggregate scope making its fields directly
  accessible, like Pascal "with", but using parallel semantics; multiple
  aggregates may be opened.


    On page 20, section 6.3, "external scheduling and vice versus" should be
    "external scheduling and vice versa".

Fixed.


    On page 5, section 2.3, the paper states "we content" but it should be "we
    contend".

Fixed.


    Page 1. I don't believe that it is fair to imply that Scala is "research
    vehicle" as it is used by major players, Twitter being the most prominent
    example.

Fixed. Removed "research".


    Page 15. Must Cforall threads start after construction (e.g. see your
    example on page 15, line 21)?

Yes. Our experience in Java, uC++ and Cforall is that 95% of the time
programmers want threads to start immediately. (Most Java programs have no code
between a thread declaration and the call to start the thread.)  Therefore,
this semantic should be the default because (see page 13):

  Alternatives, such as explicitly starting threads as in Java, are repetitive
  and forgetting to call start is a common source of errors.

To handle the other 5% of the cases, there is a trivial Cforall pattern
providing Java-style start/join. The additional cost for this pattern is 2
extra context switches (one to start and one to join).

  void start( T & mutex ) {} // any function name
  void join( T & mutex ) {} // any function name
  void main( T & t ) {
      sout | "start";
      waitfor( start : t ); // wait to be started
      sout | "restart"; // perform work
      waitfor( join : t ); // wait to be joined
      sout | "join";
  }
  int main() {
      T t[3]; // threads start and delay
      sout | "need to start";
      for ( i; 3 ) start( t[i] );
      sout | "need to join";
      for ( i; 3 ) join( t[i] );
  } // threads deleted from stack

  $ a.out
  need to start
  start
  start
  start
  need to join
  restart
  restart
  restart