source: doc/papers/concurrency/response2 @ f53afafb

ADTast-experimentalenumpthread-emulationqualifiedEnum
Last change on this file since f53afafb was 016b1eb, checked in by Peter A. Buhr <pabuhr@…>, 4 years ago

final changes for round 2 of the SP&E concurrency paper

  • Property mode set to 100644
File size: 44.7 KB
Line 
1Reviewing: 1
2
3    I still have a couple of issues --- perhaps the largest is that it's
4    still not clear at this point in the paper what some of these options
5    are, or crucially how they would be used. I don't know if it's
6    possible to give high-level examples or use cases to be clear about
7    these up front - or if that would duplicate too much information from
8    later in the paper - either way expanding out the discussion - even if
9    just two a couple of sentences for each row - would help me more.
10
11Section 2.1 is changed to address this suggestion.
12
13
14    * 1st para section 2 begs the question: why not support each
15      dimension independently, and let the programmer or library designer
16      combine features?
17
18As seen in Table 1, not all of the combinations work, and having programmers
19directly use these low-level mechanisms is error prone. Accessing these
20fundamental mechanisms through higher-level constructs has always been the
21purpose of a programming language.
22
23
24    * Why must there "be language mechanisms to create, block/unblock, and join
25      with a thread"?  There aren't in Smalltalk (although there are in the
26      runtime).  Especially given in Cforall those mechanisms are *implicit* on
27      thread creation and destruction?
28
29Fixed, changed sentence to:
30
31 A programmer needs mechanisms to create, block and unblock, and join with a
32 thread, even if these basic mechanisms are supplied indirectly through
33 high-level features.
34
35
36   * "Case 1 is a function that borrows storage for its state (stack
37     frame/activation) and a thread from its invoker"
38 
39     this much makes perfect sense to me, but I don't understand how a
40     non-stateful, non-threaded function can then retain
41 
42     "this state across callees, ie, function local-variables are
43     retained on the stack across calls."
44 
45     how can it retain function-local values *across calls* when it
46     doesn't have any functional-local state?
47
48In the following example:
49
50  void foo() {
51     // local variables and code
52  }
53  void bar() {
54     // local variables
55     foo();
56  }
57
58bar is the caller and foo is the callee. bar borrows the program stack and
59thread to make the call to foo. When foo, the callee, is executing, bar's local
60variables (state) is retained on the *borrowed* stack across the call. (Note, I
61added *borrowed* to that sentence in the paper to help clarify.)  Furthermore,
62foo's local variables are also retain on the borrowed stack. When foo and bar
63return, all of their local state is gone (not retained). This behaviour is
64standard call/return semantics in an imperative language.
65
66
67     I'm not sure if I see two separate cases here - roughly equivalent
68     to C functions without static storage, and then C functions *with*
69     static storage.
70
71Yes, but there is only one instance of the static storage across all
72activations of the C function. For generators and coroutines, each instance has
73its own state, like an object in an OO language.
74
75
76     I assumed that was the distinction between cases 1 & 3; but perhaps the
77     actual distinction is that 3 has a suspend/resume point, and so the
78     "state" in figure 1 is this component of execution state (viz figs 1 & 2),
79     not the state representing the cross-call variables?
80
81So case 3 is like an object with the added ability to retain where it was last
82executing.  When a generator is resumed, the generator object (structure
83instance) is passed as an explicit reference, and within this object is the
84restart location in the generator's "main". When the generator main executes,
85it uses the borrowed stack for its local variables and any functions it calls,
86just like an object member borrows the stack for its local variables but also
87has an implicit receiver to the object state.  A generator can have static
88storage, too, which is a single instance across all generator instances of that
89type, as for static storage in an object type. All the kinds of storage are
90at play with semantics that is the same as in other languages.
91
92
93    > but such evaluation isn't appropriate for garbage-collected or JITTed
94    > languages like Java or Go.
95    For JITTed languages in particular, reporting peak performance needs to
96    "warm up" the JIT with a number of iterators before beginning
97    measurement. Actually for JIT's its even worse: see Edd Barrett et al
98    OOPSLA 2017.
99   
100Of our testing languages, only Java and Node.js are JITTED. To ensure the Java
101test-programs correctly measured the specific feature, we consulted with Dave
102Dice at Oracle who works directly on the development of the Oracle JVM
103Just-in-Time Compiler. We modified our test programs based on his advise, and
104he validated our programs as correctly measuring the specified language
105feature. Dave's help is recognized in the Acknowledgment section.  Similarly,
106we verified our Node.js programs with Gregor Richards, an expert in
107just-in-time compilation for dynamic typing.  Hence, we have taken into account
108all issues related to performing benchmarks in JITTED languages.  Also, all the
109benchmark programs are publicly available for independent verification.
110
111
112   * footnote A - I've looked at various other papers & the website to try to
113     understand how "object-oriented" Cforall is - I'm still not sure.  This
114     footnote says Cforall has "virtuals" - presumably virtual functions,
115     i.e. dynamic dispatch - and inheritance: that really is OO as far as I
116     (and most OO people) are concerned.  For example Haskell doesn't have
117     inheritance, so it's not OO; while CLOS (the Common Lisp *Object* System)
118     or things like Cecil and Dylan are considered OO even though they have
119     "multiple function parameters as receivers", lack "lexical binding between
120     a structure and set of functions", and don't have explicit receiver
121     invocation syntax.  Python has receiver syntax, but unlike Java or
122     Smalltalk or C++, method declarations still need to have an explicit
123     "self" receiver parameter.  Seems to me that Go, for example, is
124     more-or-less OO with interfaces, methods, and dynamic dispatch (yes also
125     and an explicit receiver syntax but that's not determinative); while Rust
126     lacks dynamic dispatch built-in.  C is not OO as a language, but as you
127     say given it supports function pointers with structures, it does support
128     an OO programming style.
129   
130     This is why I again recommend just not buying into this fight: not making
131     any claims about whether Cforall is OO or is not - because as I see it,
132     the rest of the paper doesn't depend on whether Cforall is OO or not.
133     That said: this is just a recommendation, and I won't quibble over this
134     any further.
135
136We believe it is important to identify Cforall as a non-OO language because it
137heavily influences the syntax and semantics used to build its concurrency.
138Since many aspects of Cforall are not OO, the rest of the paper *does* depend
139on Cforall being identified as non-OO, otherwise readers would have
140significantly different expectations for the design. We did not mean to suggest
141that languages that support function pointers with structures support an OO
142style. We revised the footnote to avoid this interpretation. Finally, Golang
143does not identify itself as an OO language.
144https://golang.org/doc/faq#Is_Go_an_object-oriented_language
145
146
147   * is a "monitor function" the same as a "mutex function"?
148     if so the paper should pick one term; if not, make the distinction clear.
149
150Fixed. Picked "mutex". Changed the language and all places in the paper.
151
152
153   * "As stated on line 1 because state declarations from the generator
154      type can be moved out of the coroutine type into the coroutine main"
155 
156      OK sure, but again: *why* would a programmer want to do that?
157      (Other than, I guess, to show the difference between coroutines &
158      generators?)  Perhaps another way to put this is that the first
159      para of 3.2 gives the disadvantages of coroutines vs-a-vs
160      generators, briefly describes the extended semantics, but never
161      actually says why a programmer may want those extended semantics,
162      or how they would benefit.  I don't mean to belabour the point,
163      but (generalist?) readers like me would generally benefit from
164      those kinds of discussions about each feature throughout the
165      paper: why might a programmer want to use them?
166
167On page 8, it states:
168
169  Having to manually create the generator closure by moving local-state
170  variables into the generator type is an additional programmer burden (removed
171  by the coroutine in Section 3.2). ...
172
173also these variables can now be refactored into helper function, where the
174helper function suspends at arbitrary call depth. So imagine a coroutine helper
175function that is only called occasionally within the coroutine but it has a
176large array that is retained across suspends within the helper function. For a
177generator, this large array has to be declared in the generator type enlarging
178each generator instance even through the array is only used occasionally.
179Whereas, the coroutine only needs the array allocated when needed. Now a
180coroutine has a stack which occupies storage, but the maximum stack size only
181needs to be the call chain allocating the most storage, whereas the generator
182has a maximum size of all variable that could be created.
183
184
185    > p17 if the multiple-monitor entry procedure really is novel, write a paper
186    > about that, and only about that.
187    > We do not believe this is a practical suggestion.
188    * I'm honestly not trying to be snide here: I'm not an expert on monitor or
189      concurrent implementations. Brinch Hansen's original monitors were single
190      acquire; this draft does not cite any other previous work that I could
191      see. I'm not suggesting that the brief mention of this mechanism
192      necessarily be removed from this paper, but if this is novel (and a clear
193      advance over a classical OO monitor a-la Java which only acquires the
194      distinguished receiver) then that would be worth another paper in itself.
195
196First, to explain multiple-monitor entry in Cforall as a separate paper would
197require significant background on Cforall concurrency, which means repeating
198large sections of this paper. Second, it feels like a paper just on
199multiple-monitor entry would be a little thin, even if the capability is novel
200to the best of our knowledge. Third, we feel multiple-monitor entry springs
201naturally from the overarching tone in the paper that Cforall is a non-OO
202programming language, allowing multiple-mutex receivers.
203
204
205    My typo: the paper's conclusion should come at the end, after the
206    future work section.
207
208Combined into a Conclusions and Future Work section.
209
210
211
212Reviewing: 2
213
214    on the Boehm paper and whether code is "all sequential to the compiler": I
215    now understand the authors' position better and suspect we are in violent
216    agreement, except for whether it's appropriate to use the rather breezy
217    phrase "all sequential to the compiler". It would be straightforward to
218    clarify that code not using the atomics features is optimized *as if* it
219    were sequential, i.e. on the assumption of a lack of data races.
220
221Fixed, "as inline and library code is compiled as sequential without any
222explicit concurrent directive."
223
224
225    on the distinction between "mutual exclusion" and "synchronization": the
226    added citation does help, in that it makes a coherent case for the
227    definition the authors prefer. However, the text could usefully clarify
228    that this is a matter of definition not of fact, given especially that in
229    my assessment the authors' preferred definition is not the most common
230    one. (Although the mention of Hoare's apparent use of this definition is
231    one data point, countervailing ones are found in many contemporaneous or
232    later papers, e.g. Habermann's 1972 "Synchronization of Communicating
233    Processes" (CACM 15(3)), Reed & Kanodia's 1979 "Synchronization with
234    eventcounts and sequencers" (CACM (22(2)) and so on.)
235
236Fixed, "We contend these two properties are independent, ...".
237
238With respect to the two papers, Habermann fundamentally agrees with our
239definitions, where the term mutual exclusion is the same but Habermann uses the
240term "communication" for our synchronization. However, the term "communication"
241is rarely used to mean synchronization. The fact that Habermann collectively
242calls these two mechanisms synchronization is the confusion.  Reed & Kanodia
243state:
244
245  By mutual exclusion we mean any mechanism that forces the time ordering of
246  execution of pieces of code, called critical regions, in a system of
247  concurrent processes to be a total ordering.
248
249But there is no timing order for a critical region (which I assume means
250critical section); threads can arrive at any time in any order, where the
251mutual exclusion for the critical section ensures one thread executes at a
252time. Interestingly, Reed & Kanodia's critical region is Habermann's
253communication not critical section. These papers only buttress our contention
254about the confusion of these terms in the literature.
255
256
257    section 2 (an expanded version of what was previously section 5.9) lacks
258    examples and is generally obscure and allusory ("the most advanced feature"
259    -- name it! "in triplets" -- there is only one triplet!;
260
261Fixed.
262
263 These independent properties can be used to compose different language
264 features, forming a compositional hierarchy, where the combination of all
265 three is the most advanced feature, called a thread/task/process. While it is
266 possible for a language to only provide threads for composing
267 programs~\cite{Hermes90}, this unnecessarily complicates and makes inefficient
268 solutions to certain classes of problems.
269
270
271    what are "execution locations"? "initialize" and "de-initialize"
272
273Fixed through an example at the end of the sentence.
274
275 \item[\newterm{execution state}:] is the state information needed by a
276 control-flow feature to initialize and manage both compute data and execution
277 location(s), and de-initialize. For example, calling a function initializes a
278 stack frame including contained objects with constructors, manages local data
279 in blocks and return locations during calls, and de-initializes the frame by
280 running any object destructors and management operations.
281
282
283    what? "borrowed from the invoker" is a concept in need of explaining or at
284    least a fully explained example -- in what sense does a plain function
285    borrow" its stack frame?
286
287A function has no storage except for static storage; it gets its storage from
288the program stack. For a function to run it must borrow storage from somewhere.
289When the function returns, the borrowed storage is returned. Do you have a more
290appropriate word?
291
292
293    "computation only" as opposed to what?
294
295This is a term in concurrency for operations that compute without blocking,
296i.e., the operation starts with everything it needs to compute its result and
297runs to completion, blocking only when it is done and returns its result.
298Computation only occurs in "embarrassingly parallel" problems.
299
300
301    in 2.2, in what way is a "request" fundamental to "synchronization"?
302
303I assume you are referring to the last bullet.
304
305 Synchronization must be able to control the service order of requests
306 including prioritizing selection from different kinds of outstanding requests,
307 and postponing a request for an unspecified time while continuing to accept
308 new requests.
309
310Habermann states on page 173
311
312 Looking at deposit we see that sender and receiver should be synchronized with
313 respect to buffer overflow: deposit of another message must be delayed if
314 there is no empty message frame, and such delay should be removed when the
315 first empty frame becomes available.  This is programmed as
316 synchronization(frame) : deposit is preceded by wait(frame), accept is
317 followed by signal( frame), and the constant C[frame] is set equal to
318 "bufsize."
319
320Here synchronization is controlling the service order of requests: when the
321buffer is full, requests for insert (sender) are postponed until a request to
322remove (receiver) occurs.  Without the ability to control service order among
323requests, the producer/consumer problem with a bounded buffer cannot be
324solved. Hence, this capability is fundamental.
325
326
327    and the "implicitly" versus "explicitly" point needs stating as elsewhere,
328    with a concrete example e.g. Java built-in mutexes versus
329    java.util.concurrent).
330
331Fixed.
332
333 MES must be available implicitly in language constructs, \eg Java built-in
334 monitors, as well as explicitly for specialized requirements, \eg
335 @java.util.concurrent@, because requiring programmers to build MES using
336 low-level locks often leads to incorrect programs.
337
338
339    section 6: 6.2 omits the most important facts in preference for otherwise
340    inscrutable detail: "identify the kind of parameter" (first say *that there
341    are* kinds of parameter, and what "kinds" means!); "mutex parameters are
342    documentation" is misleading (they are also semantically significant!) and
343    fails to say *what* they mean; the most important thing is surely that
344    'mutex' is a language feature for performing lock/unlock operations at
345    function entry/exit. So say it!
346
347These sections have been rewritten address the comments.
348
349
350    The meanings of examples f3 and f4 remain unclear.
351
352Rewrote the paragraph.
353
354
355    Meanwhile in 6.3, "urgent" is not introduced (we are supposed to infer its
356    meaning from Figure 12,
357
358Defined Hoare's urgent list at the start of the paragraph.
359
360
361    but that Figure is incomprehensible to me), and we are told of "external
362    scheduling"'s long history in Ada but not clearly what it actually means;
363
364We do not know how to address your comment because the description is clear to
365us and other non-reviewers who have read the paper. I had forgotten an
366important citation to prior work on this topic, which is now referenced:
367
368   Buhr Peter A., Fortier Michel, Coffin Michael H.. Monitor
369   Classification. ACM Computing Surveys. 1995; 27(1):63-107.
370
371This citation is a 45 page paper expanding on the topic of internal scheduling.
372I also added a citation to Hoare's monitor paper where signal_block is defined:
373
374  When a process signals a condition on which another process is waiting, the
375  signalling process must wait until the resumed process permits it to
376  proceed.
377
378External scheduling from Ada was subsequently added to uC++ and Cforall.
379Furthermore, Figure 20 shows a direct comparison of CFA procedure-call
380external-scheduling with Go channel external-scheduling. So external scheduling
381exists in languages beyond Ada.
382
383
384    6.4's description of "waitfor" tells us it is different from an if-else
385    chain but tries to use two *different* inputs to tell us that the behavior
386    is different; tell us an instance where *the same* values of C1 and C2 give
387    different behavior (I even wrote out a truth table and still don't see the
388    semantic difference)
389
390Again, it is unclear what is the problem. For the if-statement, if C1 is true,
391it only waits for a call to mem1, even if C2 is true and there is an
392outstanding call to mem2. For the waitfor, if both C1 and C2 are true and there
393is a call to mem2 but not mem1, it accepts the call to mem2, which cannot
394happen with the if-statement. So for all true when clauses, any outstanding
395call is immediately accepted. If there are no outstanding calls, the waitfor
396blocks until the next call to any of the true when clauses occurs. I added the
397following sentence to further clarify.
398
399 Hence, the @waitfor@ has parallel semantics, accepting any true @when@ clause.
400
401Note, the same parallel semantics exists for the Go select statement with
402respect to waiting for a set of channels to receive data. While Go select does
403not have a when clause, it would be trivial to add it, making the select more
404expressive.
405
406
407    The authors frequently use bracketed phrases, and sometimes slashes "/", in
408    ways that are confusing and/or detrimental to readability.  Page 13 line
409    2's "forward (backward)" is one particularly egregious example.  In general
410    I would recommend the the authors try to limit their use of parentheses and
411    slashes as a means of forcing a clearer wording to emerge.
412
413Many of the slashes and parentheticals have been removed. Some are retained to
414express joined concepts: call/return, suspend/resume, resume/resume, I/O.
415
416
417    Also, the use of "eg." is often cursory and does not explain the examples
418    given, which are frequently a one- or two-word phrase of unclear referent.
419
420A few of these are fixed.
421
422
423    Considering the revision more broadly, none of the more extensive or
424    creative rewrites I suggested in my previous review have been attempted,
425    nor any equivalent efforts to improve its readability.
426
427If you reread the previous response, we addressed all of your suggestions except
428
429        An expositional idea occurs: start the paper with a strawman
430        naive/limited realisation of coroutines -- say, Simon Tatham's popular
431        "Coroutines in C" web page -- and identify point by point what the
432        limitations are and how C\/ overcomes them. Currently the presentation
433        is often flat (lacking motivating contrasts) and backwards (stating
434        solutions before problems). The foregoing approach might fix both of
435        these.
436   
437    We prefer the current structure of our paper and believe the paper does
438    explain basic coding limitations and how they are overcome in using
439    high-level control-floe mechanisms.
440
441and we have addressed readability issues in this version.
442
443
444    The hoisting of the former section 5.9 is a good idea, but the newly added
445    material accompanying it (around Table 1) suffers fresh deficiencies in
446    clarity. Overall the paper is longer than before, even though (as my
447    previous review stated), I believe a shorter paper is required in order to
448    serve the likely purpose of publication. (Indeed, the authors' letter
449    implies that a key goal of publication is to build community and gain
450    external users.)
451
452This comment is the referee's opinion, which we do not agree with it. Our
453previous 35 page SP&E paper on Cforall:
454
455  Moss Aaron, Schluntz Robert, Buhr Peter A.. Cforall: Adding Modern
456  Programming Language Features to C. Softw. Pract. Exper. 2018;
457  48(12):2111-2146.
458
459has a similar structure and style to this paper, and it received an award from
460John Wiley & Sons:
461
462 Software: Practice & Experience for articles published between January 2017
463 and December 2018, "most downloads in the 12 months following online
464 publication showing the work generated immediate impact and visibility,
465 contributing significantly to advancement in the field".
466
467So we have demonstrated an ability to build community and gain external users.
468
469 
470    Given this trajectory, I no longer see a path to an acceptable revision of
471    the present submission. Instead I suggest the authors consider splitting
472    the paper in two: one half about coroutines and stack management, the other
473    about mutexes, monitors and the runtime. (A briefer presentation of the
474    runtime may be helpful in the first paper also, and a brief recap of the
475    generator and coroutine support is obviously needed in the second too.)
476
477Again we disagree with the referee's suggestion to vastly restructure the
478paper. What advantage is there is presenting exactly the same material across
479two papers, which will likely end up longer than a single paper? The current
480paper has a clear theme that fundamental execution properties generate a set of
481basic language mechanisms, and we then proceed to show how these mechanisms can
482be designed into the programing language Cforall.
483
484
485    I do not buy the authors' defense of the limited practical experience or
486    "non-micro" benchmarking presented. Yes, gaining external users is hard and
487    I am sympathetic on that point. But building something at least *somewhat*
488    substantial with your own system should be within reach, and without it the
489    "practice and experience" aspects of the work have not been explored.
490    Clearly C\/ is the product of a lot of work over an extended period, so it
491    is a surprise that no such experience is readily available for inclusion.
492
493Understood. There are no agreed-upon concurrency benchmarks, which is why
494micro-benchmarks are often used. Currently, the entire Cforall runtime is
495written in Cforall (10,000+ lines of code (LOC)). This runtime is designed to
496be thread safe, automatically detects the use of concurrency features at link
497time, and bootstraps into a threaded runtime almost immediately at program
498startup so threads can be declared as global variables and may run to
499completion before the program main starts. The concurrent core of the runtime
500is 3,500 LOC and bootstraps from low-level atomic primitives into Cforall locks
501and high-level features. In other words, the concurrent core uses itself as
502quickly as possible to start using high-level concurrency. There are 12,000+
503LOC in the Cforall tests-suite used to verify language features, which are run
504nightly. Of theses, there are 2000+ LOC running standard concurrent tests, such
505as aggressively testing each language feature, and classical examples such as
506bounded buffer, dating service, matrix summation, quickSort, binary insertion
507sort, etc.  More experience will be available soon, based on ongoing work in
508the "future works" section. Specifically, non-blocking I/O is working with the
509new Linux io_uring interface and a new high-performance ready-queue is under
510construction to take into account this change. With non-blocking I/O, it will
511be possible to write applications like high-performance web servers, as is now
512done in Rust and Go. Also, completed is Java-style executors for work-based
513concurrent programming and futures. Under construction is a high-performance
514actor system.
515
516
517    It does not seem right to state that a stack is essential to Von Neumann
518    architectures -- since the earliest Von Neumann machines (and indeed early
519    Fortran) did not use one.
520
521Reference Manual Fortran II for the IBM 704 Data Processing System, 1958 IBM, page 2
522https://archive.computerhistory.org/resources/text/Fortran/102653989.05.01.acc.pdf
523
524  Since a subprogram may call for other subprograms to any desired depth, a
525  particular CALL statement may be defined by a pyramid of multi-level
526  subprograms.
527
528I think we may be differing on the meaning of stack. You may be imagining a
529modern stack that grows and shrinks dynamically. Whereas early Fortran
530preallocated a stack frame for each function, like Python allocates a frame for
531a generator.  Within each preallocated Fortran function is a frame for local
532variables and a pointer to store the return value for a call.  The Fortran
533call/return mechanism uses these frames to build a traditional call stack
534linked by the return pointer. The only restriction is that a function stack
535frame can only be used once, implying no direct or indirect recursion.  Hence,
536without a stack mechanism, there can be no call/return to "any desired depth",
537where the maximum desired depth is limited by the number of functions. So
538call/return requires some form of a stack, virtually all programming language
539have call/return, past and present, and these languages run on Von Neumann
540machines that do not distinguish between program and memory space, have mutable
541state, and the concept of a pointer to data or code.
542
543
544    To elaborate on something another reviewer commented on: it is a surprise
545    to find a "Future work" section *after* the "Conclusion" section. A
546    "Conclusions and future work" section often works well.
547
548Done.
549
550
551
552Reviewing: 3
553
554    but it remains really difficult to have a good sense of which idea I should
555    use and when. This applies in different ways to different features from the
556    language:
557
558    * coroutines/generators/threads: here there is some discussion, but it can
559      be improved.
560    * internal/external scheduling: I didn't find any direct comparison between
561      these features, except by way of example.
562
563See changes below.
564
565
566    I would have preferred something more like a table or a few paragraphs
567    highlighting the key reasons one would pick one construct or the other.
568
569Section 2.1 is changed to address this suggestion.
570
571
572    The discussion of clusters and pre-emption in particular feels quite rushed.
573
574We believe a brief introduction to the Cforall runtime structure is important
575because clustering within a user-level versus distributed system is unusual,
576Furthermore, the explanation of preemption is important because several new
577languages, like Go and Rust tokio, are not preemptive. Rust threads are
578preemptive only because it uses kernel threads, which UNIX preempts.
579
580
581    * Recommend to shorten the comparison on coroutine/generator/threads in
582      Section 2 to a paragraph with a few examples, or possibly a table
583      explaining the trade-offs between the constructs
584
585Not done, see below.
586
587
588    * Recommend to clarify the relationship between internal/external
589      scheduling -- is one more general but more error-prone or low-level?
590
591Done, see below.
592
593
594    There is obviously a lot of overlap between these features, and in
595    particular between coroutines and generators. As noted in the previous
596    review, many languages have chosen to offer *only* generators, and to
597    build coroutines by stacks of generators invoking one another.
598
599Coroutines built from stacks of generators have problems, such as no symmetric
600control-flow. Furthermore, stacks of generators have a problem with the
601following programming pattern.  logger is a library function called from a
602function or a coroutine, where the doit for the coroutine suspends. With stacks
603of generators, there has to be a function and generator version of logger to
604support these two scenarios. If logger is a library function, it may be
605impossible to create the generator logger because the logger function is
606opaque.
607
608  #include <fstream.hfa>
609  #include <coroutine.hfa>
610
611  forall( otype T | { void doit( T ); } )
612  void logger( T & t ) {
613      doit( t );
614  }
615
616  coroutine C {};
617  void main( C & c ) with( c ) {
618      void doit( C & c ) { suspend; }
619      logger( c );
620  }
621  void mem( C & c ) {
622      resume( c );
623  }
624
625  int main() {
626      C c;
627      mem( c );
628      mem( c );
629 
630      struct S {};
631      S s;
632      void doit( S & s ) {}
633      logger( s );
634  }
635
636Additional text has been added to the start of Section 3.2 addressing this
637comment.
638
639
640    In fact, the end of Section 2.1 (on page 5) contains a particular paragraph
641    that embodies this "top down" approach. It starts, "programmers can now
642    answer three basic questions", and thus gives some practical advice for
643    which construct you should use and when. I think giving some examples of
644    specific applications that this paragraph, combined with some examples of
645    cases where each construct was needed, would be a better approach.
646
647    I don't think this comparison needs to be very long. It seems clear enough
648    that one would
649
650    * prefer generators for simple computations that yield up many values,
651    * prefer coroutines for more complex processes that have significant
652      internal structure,
653
654We do not believe the number of values yielded is an important factor is
655choosing between a generator or coroutine; either form can receive (input) or
656return (output) millions of values. As well, simple versus complex computations
657is also not an absolute criterion for selection as both forms can be simple or
658complex.
659
660As stated in the paper, a coroutine brings generality to the implementation
661because of the addition stack, whereas generators have restrictions on standard
662software-engineering practices: variable placement, no helper functions without
663creating an explicit generator stack, and no symmetric control-flow.  It is
664true that simple generators probably suffer less from restrictions. Also,
665simplicity means the amount of work done is small so the cost of the
666suspend/resume becomes a factor, where experiments show suspend/resume for a
667generator is 10 times faster than for a coroutines.
668
669
670    * prefer threads for cases where parallel execution is desired or needed.
671
672Agreed, but this description does not mention mutual exclusion and
673synchronization, which is essential in any meaningful concurrent program.
674
675Our point here is to illustrate that a casual "top down" explanation is
676insufficient to explain the complexity of the underlying execution properties.
677We presented some rule-of-thumbs at the end of Section 2 but programmers must
678understand all the underlying mechanisms and their interactions to exploit the
679execution properties to their fullest, and to understand when a programming
680language does or does not provide a desired mechanism.
681
682
683    I did appreciate the comparison in Section 2.3 between async-await in
684    JS/Java and generators/coroutines. I agree with its premise that those
685    mechanisms are a poor replacement for generators (and, indeed, JS has a
686    distinct generator mechanism, for example, in part for this reason).  I
687    believe I may have asked for this in a previous review, but having read it,
688    I wonder if it is really necessary, since those mechanisms are so different
689    in purpose.
690
691Given that you asked about this before, I believe other readers might also ask
692the same question because async-await is very popular. So I think this section
693does help to position the work in the paper among other work, and hence, it is
694appropriate to keep it in the paper.
695
696
697    I find the motivation for supporting both internal and external scheduling
698    to be fairly implicit. After several reads through the section, I came to
699    the conclusion that internal scheduling is more expressive than external
700    scheduling, but sometimes less convenient or clear. Is this correct? If
701    not, it'd be useful to clarify where external scheduling is more
702    expressive.
703
704    I would find it very interesting to try and capture some of the properties
705    that make internal vs external scheduling the better choice.
706
707    For example, it seems to me that external scheduling works well if there
708    are only a few "key" operations, but that internal scheduling might be
709    better otherwise, simply because it would be useful to have the ability to
710    name a signal that can be referenced by many methods.
711
712To address this point, the last paragraph on page 22 (now page 23) has been
713augmented to the following:
714
715 Given external and internal scheduling, what guidelines can a programmer use
716 to select between them?  In general, external scheduling is easier to
717 understand and code because only the next logical action (mutex function(s))
718 is stated, and the monitor implicitly handles all the details.  Therefore,
719 there are no condition variables, and hence, no wait and signal, which reduces
720 coding complexity and synchronization errors.  If external scheduling is
721 simpler than internal, why not use it all the time?  Unfortunately, external
722 scheduling cannot be used if: scheduling depends on parameter value(s) or
723 scheduling must block across an unknown series of calls on a condition
724 variable (\ie internal scheduling).  For example, the dating service cannot be
725 written using external scheduling.  First, scheduling requires knowledge of
726 calling parameters to make matching decisions and parameters of calling
727 threads are unavailable within the monitor.  Specifically, a thread within the
728 monitor cannot examine the @ccode@ of threads waiting on the calling queue to
729 determine if there is a matching partner.  (Similarly, if the bounded buffer
730 or readers/writer are restructured with a single interface function with a
731 parameter denoting producer/consumer or reader/write, they cannot be solved
732 with external scheduling.)  Second, a scheduling decision may be delayed
733 across an unknown number of calls when there is no immediate match so the
734 thread in the monitor must block on a condition.  Specifically, if a thread
735 determines there is no opposite calling thread with the same @ccode@, it must
736 wait an unknown period until a matching thread arrives.  For complex
737 synchronization, both external and internal scheduling can be used to take
738 advantage of best of properties of each.
739
740
741    Consider the bounded buffer from Figure 13: if it had multiple methods for
742    removing elements, and not just `remove`, then the `waitfor(remove)` call
743    in `insert` might not be sufficient.
744
745Section 6.4 Extended waitfor shows the waitfor is very powerful and can handle
746your request:
747
748  waitfor( remove : buffer ); or waitfor( remove2 : buffer );
749
750and its shorthand form (not shown in the paper)
751
752  waitfor( remove, remove2 : buffer );
753
754A call to either of these remove functions satisfies the waitfor (exact
755selection details are discussed in Section 6.4).
756
757
758    The same is true, I think, of the `signal_block` function, which I
759    have not encountered before;
760
761In Tony Hoare's seminal paper on Monitors "Monitors: An Operating System
762Structuring Concept", it states on page 551:
763
764 When a process signals a condition on which another process is waiting, the
765 signalling process must wait until the resumed process permits it to
766 proceed. We therefore introduce for each monitor a second semaphore "urgent"
767 (initialized to 0), on which signalling processes suspend themselves by the
768 operation P(urgent).
769
770Hence, the original definition of signal is in fact signal_block, i.e., the
771signaller blocks. Later implementations of monitor switched to signaller
772nonblocking because most signals occur before returns, which allows the
773signaller to continue execution, exit the monitor, and run concurrently with
774the signalled thread that restarts in the monitor. When the signaller is not
775going to exit immediately, signal_block is appropriate.
776
777
778    it seems like its behavior can be modeled with multiple condition
779    variables, but that's clearly more complex.
780
781Yes. Buhr, Fortier and Coffin show in Monitor Classification, ACM Computing
782Surveys, 27(1):63-107, that all extant monitors with different signalling
783semantics can be transformed into each other. However, some transformations are
784complex and runtime expensive.
785
786
787    One question I had about `signal_block`: what happens if one signals
788    but no other thread is waiting? Does it block until some other thread
789    waits? Or is that user error?
790
791On page 20, it states:
792
793  Signalling is unconditional because signalling an empty condition queue does
794  nothing.
795
796To the best of our knowledge, all monitors have the same semantics for
797signalling an empty condition queue, regardless of the kind of signal, i.e.,
798signal or signal_block.
799
800
801    I believe that one difference between the Go program and the Cforall
802    equivalent is that the Goroutine has an associated queue, so that
803    multiple messages could be enqueued, whereas the Cforall equivalent is
804    effectively a "bounded buffer" of length 1. Is that correct?
805
806Actually, the buffer length is 0 for the Cforall call and the Go unbuffered
807send so both are synchronous communication.
808
809
810    I think this should be stated explicitly. (Presumably, one could modify the
811    Cforall program to include an explicit vector of queued messages if
812    desired, but you would also be reimplementing the channel abstraction.)
813
814Fixed, by adding the following sentences:
815
816  The different between call and channel send occurs for buffered channels
817  making the send asynchronous.  In \CFA, asynchronous call and multiple
818  buffers is provided using an administrator and worker
819  threads~\cite{Gentleman81} and/or futures (not discussed).
820
821
822    Also, in Figure 20, I believe that there is a missing `mutex` keyword.
823   
824Fixed.
825
826
827    I was glad to see that the paper acknowledged that Cforall still had
828    low-level atomic operations, even if their use is discouraged in favor of
829    higher-level alternatives.
830
831There was never an attempt to not acknowledged that Cforall had low-level
832atomic operations. The original version of the paper stated:
833
834  6.6 Low-level Locks
835  For completeness and efficiency, Cforall provides a standard set of low-level
836  locks: recursive mutex, condition, semaphore, barrier, etc., and atomic
837  instructions: fetchAssign, fetchAdd, testSet, compareSet, etc.
838
839and that section is still in the paper. In fact, we use these low-level
840mechanisms to build all of the high-level concurrency constructs in Cforall.
841 
842
843    However, I still feel that the conclusion overstates the value of the
844    contribution here when it says that "Cforall high-level race-free monitors
845    and threads provide the core mechanisms for mutual exclusion and
846    synchronization, without the need for volatile and atomics". I feel
847    confident that Java programmers, for example, would be advised to stick
848    with synchronized methods whenever possible, and it seems to me that they
849    offer similar advantages -- but they sometimes wind up using volatiles for
850    performance reasons.
851
852I think we are agreeing violently. 99.9% of Java/Cforall/Go/Rust concurrent
853programs can achieve very good performance without volatile or atomics because
854the runtime system has already used these low-level capabilities to build an
855efficient set of high-level concurrency constructs.
856
8570.1% of the time programmers need to build their own locks and synchronization
858mechanisms. This need also occurs for storage management. Both of these
859mechanisms are allowed in Cforall but are fraught with danger and should be
860discouraged. Specially, it takes a 7th dan Black Belt programmer to understand
861fencing for a WSO memory model, such as on the ARM. It doesn't help that the
862C++ atomics are baroque and incomprehensible. I'm sure Hans Boehm, Doug Lea,
863Dave Dice and me would agree that 99% of hand-crafted locks created by
864programmers are broken and/or non-portable.
865
866
867    I was also confused by the term "race-free" in that sentence. In
868    particular, I don't think that Cforall has any mechanisms for preventing
869    *data races*, and it clearly doesn't prevent "race conditions" (which would
870    bar all sorts of useful programs). I suppose that "race free" here might be
871    referring to the improvements such as removing barging behavior.
872
873We use the term "race free" to mean the same as Boehm/Adve's "data-race
874freedom" in
875
876  Boehm Hans-J., Adve Sarita V., You Don't Know Jack About Shared Variables or
877  Memory Models. Communications ACM. 2012; 55(2):48-54.
878  https://queue.acm.org/detail.cfm?id=2088916
879
880which is cited in the paper. Furthermore, we never said that Cforall has
881mechanisms for preventing *all* data races. We said Cforall high-level
882race-free monitors and threads[, when used with mutex access function] (added
883to the paper), implies no data races within these constructs, unless a
884programmer directly publishes shared state. This approach is exactly what
885Boehm/Adve advocate for the vast majority of concurrent programming.
886
887
888    It would perhaps be more interesting to see a comparison built using [tokio] or
889    [async-std], two of the more prominent user-space threading libraries that
890    build on Rust's async-await feature (which operates quite differently than
891    Javascript's async-await, in that it doesn't cause every aync function call to
892    schedule a distinct task).
893
894Done.
895
896
897    Several figures used the `with` keyword. I deduced that `with(foo)` permits
898    one to write `bar` instead of `foo.bar`. It seems worth introducing.
899    Apologies if this is stated in the paper, if so I missed it.
900
901Page 6, footnote F states:
902
903  The Cforall "with" clause opens an aggregate scope making its fields directly
904  accessible, like Pascal "with", but using parallel semantics; multiple
905  aggregates may be opened.
906
907
908    On page 20, section 6.3, "external scheduling and vice versus" should be
909    "external scheduling and vice versa".
910
911Fixed.
912
913
914    On page 5, section 2.3, the paper states "we content" but it should be "we
915    contend".
916
917Fixed.
918
919
920    Page 1. I don't believe that it s fair to imply that Scala is "research
921    vehicle" as it is used by major players, Twitter being the most prominent
922    example.
923
924Fixed. Removed "research".
925
926
927    Page 15. Must Cforall threads start after construction (e.g. see your
928    example on page 15, line 21)?
929
930Yes. Our experience in Java, uC++ and Cforall is that 95% of the time
931programmers want threads to start immediately. (Most Java programs have no code
932between a thread declaration and the call to start the thread.)  Therefore,
933this semantic should be the default because (see page 13):
934
935  Alternatives, such as explicitly starting threads as in Java, are repetitive
936  and forgetting to call start is a common source of errors.
937
938To handle the other 5% of the cases, there is a trivial Cforall pattern
939providing Java-style start/join. The additional cost for this pattern is small
940in comparison to the work performed between the start and join.
941
942  thread T {};
943  void start( T & mutex ) {} // any function name
944  void join( T & mutex ) {} // any function name
945  void main( T & t ) {
946      sout | "start";
947      waitfor( start : t ); // wait to be started
948      sout | "restart"; // perform work
949      waitfor( join : t ); // wait to be joined
950      sout | "join";
951  }
952
953  int main() {
954      T t[3]; // threads start and delay
955      sout | "need to start";
956      for ( i; 3 ) start( t[i] );
957      sout | "need to join";
958      for ( i; 3 ) join( t[i] );
959      sout | "threads stopped";
960  } // threads deleted from stack
961
962  $ a.out
963  need to start
964  start
965  start
966  start
967  need to join
968  restart
969  restart
970  restart
971  threads stopped
972  join
973  join
974  join
975
976
977    Page 18, line 17: is using
978
979Fixed.
Note: See TracBrowser for help on using the repository browser.