source: doc/papers/concurrency/response2 @ b54118a

arm-ehjacob/cs343-translationnew-astnew-ast-unique-expr
Last change on this file since b54118a was b54118a, checked in by Peter A. Buhr <pabuhr@…>, 18 months ago

update referee response

  • Property mode set to 100644
File size: 46.8 KB
Line 
1Reviewing: 1
2
3    I still have a couple of issues --- perhaps the largest is that it's
4    still not clear at this point in the paper what some of these options
5    are, or crucially how they would be used. I don't know if it's
6    possible to give high-level examples or use cases to be clear about
7    these up front - or if that would duplicate too much information from
8    later in the paper - either way expanding out the discussion - even if
9    just two a couple of sentences for each row - would help me more.
10
11Section 2.1 is changed to address this suggestion.
12
13
14    * 1st para section 2 begs the question: why not support each
15      dimension independently, and let the programmer or library designer
16      combine features?
17
18As seen in Table 1, not all of the combinations work, and having programmers
19directly use these low-level mechanisms is error prone. Accessing these
20fundamental mechanisms through higher-level constructs has always been the
21purpose of a programming language.
22
23
24    * Why must there "be language mechanisms to create, block/unblock, and join
25      with a thread"?  There aren't in Smalltalk (although there are in the
26      runtime).  Especially given in Cforall those mechanisms are *implicit* on
27      thread creation and destruction?
28
29The best description of Smalltalk concurrency I can find is in J. Hunt,
30Smalltalk and Object Orientation, Springer-Verlag London Limited, 1997, Chapter
3131 Concurrency in Smalltalk. It states on page 332:
32
33  For a process to be spawned from the current process there must be some way
34  of creating a new process. This is done using one of four messages to a
35  block. These messages are:
36
37    aBlock fork: This creates and schedules a process which will execute the
38    block. The priority of this process is inherited from the parent process.
39    ...
40
41  The Semaphore class provides facilities for achieving simple synchronization,
42  it is simple because it only allows for two forms of communication signal and
43  wait.
44
45Hence, "aBlock fork" creates, "Semaphore" blocks/unblocks (as does message send
46to an aBlock object), and garbage collection of an aBlock joins with its
47thread. The fact that a programmer *implicitly* does "fork", "block"/"unblock",
48and "join", does not change their fundamental requirement.
49
50
51   * "Case 1 is a function that borrows storage for its state (stack
52     frame/activation) and a thread from its invoker"
53 
54     this much makes perfect sense to me, but I don't understand how a
55     non-stateful, non-threaded function can then retain
56 
57     "this state across callees, ie, function local-variables are
58     retained on the stack across calls."
59 
60     how can it retain function-local values *across calls* when it
61     doesn't have any functional-local state?
62
63In the following example:
64
65  void foo() {
66     // local variables and code
67  }
68  void bar() {
69     // local variables
70     foo();
71  }
72
73bar is the caller and foo is the callee. bar borrows the program stack and
74thread to make the call to foo. When foo, the callee, is executing, bar's local
75variables (state) is retained on the *borrowed* stack across the call. (Note, I
76added *borrowed* to that sentence in the paper to help clarify.)  Furthermore,
77foo's local variables are also retain on the borrowed stack. When foo and bar
78return, all of their local state is gone (not retained). This behaviour is
79standard call/return semantics in an imperative language.
80
81
82     I'm not sure if I see two separate cases here - roughly equivalent
83     to C functions without static storage, and then C functions *with*
84     static storage.
85
86Yes, but there is only one instance of the static storage across all
87activations of the C function. For generators and coroutines, each instance has
88its own state, like an object in an OO language.
89
90
91     I assumed that was the distinction between cases 1 & 3; but perhaps the
92     actual distinction is that 3 has a suspend/resume point, and so the
93     "state" in figure 1 is this component of execution state (viz figs 1 & 2),
94     not the state representing the cross-call variables?
95
96So case 3 is like an object with the added ability to retain where it was last
97executing.  When a generator is resumed, the generator object (structure
98instance) is passed as an explicit reference, and within this object is the
99restart location in the generator's "main". When the generator main executes,
100it uses the borrowed stack for its local variables and any functions it calls,
101just like an object member borrows the stack for its local variables but also
102has an implicit receiver to the object state.  A generator can have static
103storage, too, which is a single instance across all generator instances of that
104type, as for static storage in an object type. All the kinds of storage are
105at play with semantics that is virtually the same as in other languages.
106
107
108    > but such evaluation isn't appropriate for garbage-collected or JITTed
109    > languages like Java or Go.
110    For JITTed languages in particular, reporting peak performance needs to
111    "warm up" the JIT with a number of iterators before beginning
112    measurement. Actually for JIT's its even worse: see Edd Barrett et al
113    OOPSLA 2017.
114   
115Of our testing languages, only Java and Node.js are JITTED. To ensure the Java
116test-programs correctly measured the specific feature, we consulted with Dave
117Dice at Oracle who works directly on the development of the Oracle JVM
118Just-in-Time Compiler. We modified our test programs based on his advise, and
119he validated our programs as correctly measuring the specified language
120feature. Hence, we have taken into account all issues related to performing
121benchmarks in JITTED languages.  Dave's help is recognized in the
122Acknowledgment section. Also, all the benchmark programs are publicly available
123for independent verification.
124
125Similarly, we verified our Node.js programs with Gregor Richards, an expert in
126just-in-time compilation for dynamic typing.
127
128
129   * footnote A - I've looked at various other papers & the website to try to
130     understand how "object-oriented" Cforall is - I'm still not sure.  This
131     footnote says Cforall has "virtuals" - presumably virtual functions,
132     i.e. dynamic dispatch - and inheritance: that really is OO as far as I
133     (and most OO people) are concerned.  For example Haskell doesn't have
134     inheritance, so it's not OO; while CLOS (the Common Lisp *Object* System)
135     or things like Cecil and Dylan are considered OO even though they have
136     "multiple function parameters as receivers", lack "lexical binding between
137     a structure and set of functions", and don't have explicit receiver
138     invocation syntax.  Python has receiver syntax, but unlike Java or
139     Smalltalk or C++, method declarations still need to have an explicit
140     "self" receiver parameter.  Seems to me that Go, for example, is
141     more-or-less OO with interfaces, methods, and dynamic dispatch (yes also
142     and an explicit receiver syntax but that's not determinative); while Rust
143     lacks dynamic dispatch built-in.  C is not OO as a language, but as you
144     say given it supports function pointers with structures, it does support
145     an OO programming style.
146   
147     This is why I again recommend just not buying into this fight: not making
148     any claims about whether Cforall is OO or is not - because as I see it,
149     the rest of the paper doesn't depend on whether Cforall is OO or not.
150     That said: this is just a recommendation, and I won't quibble over this
151     any further.
152
153We believe it is important to identify Cforall as a non-OO language because it
154heavily influences the syntax and semantics used to build its concurrency.
155Since many aspects of Cforall are not OO, the rest of the paper *does* depend
156on Cforall being identified as non-OO, otherwise readers would have
157significantly different expectations for the design. We believe your definition
158of OO is too broad, such as including C. Just because a programming language
159can support aspects of the OO programming style, does not make it OO. (Just
160because a duck can swim does not make it a fish.)
161
162Our definition of non-OO follows directly from the Wikipedia entry:
163
164  Object-oriented programming (OOP) is a programming paradigm based on the
165  concept of "objects", which can contain data, in the form of fields (often
166  known as attributes or properties), and code, in the form of procedures
167  (often known as methods). A feature of objects is an object's procedures that
168  can access and often modify the data fields of the object with which they are
169  associated (objects have a notion of "this" or "self").
170  https://en.wikipedia.org/wiki/Object-oriented_programming
171
172Cforall fails this definition as code cannot appear in an "object" and there is
173no implicit receiver. As well, Cforall, Go, and Rust do not have nominal
174inheritance and they not considered OO languages, e.g.:
175
176 "**Is Go an object-oriented language?** Yes and no. Although Go has types and
177 methods and allows an object-oriented style of programming, there is no type
178 hierarchy. The concept of "interface" in Go provides a different approach
179 that we believe is easy to use and in some ways more general. There are also
180 ways to embed types in other types to provide something analogous-but not
181 identical-to subclassing. Moreover, methods in Go are more general than in
182 C++ or Java: they can be defined for any sort of data, even built-in types
183 such as plain, "unboxed" integers. They are not restricted to structs (classes).
184 https://golang.org/doc/faq#Is_Go_an_object-oriented_language
185
186
187   * is a "monitor function" the same as a "mutex function"?
188     if so the paper should pick one term; if not, make the distinction clear.
189
190Fixed. Picked "mutex". Changed the language and all places in the paper.
191
192
193   * "As stated on line 1 because state declarations from the generator
194      type can be moved out of the coroutine type into the coroutine main"
195 
196      OK sure, but again: *why* would a programmer want to do that?
197      (Other than, I guess, to show the difference between coroutines &
198      generators?)  Perhaps another way to put this is that the first
199      para of 3.2 gives the disadvantages of coroutines vs-a-vs
200      generators, briefly describes the extended semantics, but never
201      actually says why a programmer may want those extended semantics,
202      or how they would benefit.  I don't mean to belabour the point,
203      but (generalist?) readers like me would generally benefit from
204      those kinds of discussions about each feature throughout the
205      paper: why might a programmer want to use them?
206
207On page 8, it states:
208
209  Having to manually create the generator closure by moving local-state
210  variables into the generator type is an additional programmer burden (removed
211  by the coroutine in Section 3.2). ...
212
213also these variables can now be refactored into helper function, where the
214helper function suspends at arbitrary call depth. So imagine a coroutine helper
215function that is only called occasionally within the coroutine but it has a
216large array that is retained across suspends within the helper function. For a
217generator, this large array has to be declared in the generator type enlarging
218each generator instance even through the array is only used occasionally.
219Whereas, the coroutine only needs the array allocated when needed. Now a
220coroutine has a stack which occupies storage, but the maximum stack size only
221needs to be the call chain allocating the most storage, where as the generator
222has a maximum size of all variable that could be created.
223
224
225    > p17 if the multiple-monitor entry procedure really is novel, write a paper
226    > about that, and only about that.
227    > We do not believe this is a practical suggestion.
228    * I'm honestly not trying to be snide here: I'm not an expert on monitor or
229      concurrent implementations. Brinch Hansen's original monitors were single
230      acquire; this draft does not cite any other previous work that I could
231      see. I'm not suggesting that the brief mention of this mechanism
232      necessarily be removed from this paper, but if this is novel (and a clear
233      advance over a classical OO monitor a-la Java which only acquires the
234      distinguished receiver) then that would be worth another paper in itself.
235
236First, to explain multiple-monitor entry in Cforall as a separate paper would
237require significant background on Cforall concurrency, which means repeating
238large sections of this paper. Second, it feels like a paper just on
239multiple-monitor entry would be a little thin, even if the capability is novel
240to the best of our knowledge. Third, we feel multiple-monitor entry springs
241naturally from the overarching tone in the paper that Cforall is a non-OO
242programming language, allowing multiple-mutex receivers.
243
244
245    My typo: the paper's conclusion should come at the end, after the
246    future work section.
247
248Combined into a Conclusions and Future Work section.
249
250
251
252Reviewing: 2
253
254    on the Boehm paper and whether code is "all sequential to the compiler": I
255    now understand the authors' position better and suspect we are in violent
256    agreement, except for whether it's appropriate to use the rather breezy
257    phrase "all sequential to the compiler". It would be straightforward to
258    clarify that code not using the atomics features is optimized *as if* it
259    were sequential, i.e. on the assumption of a lack of data races.
260
261Fixed, "as inline and library code is compiled as sequential without any
262explicit concurrent directive."
263
264
265    on the distinction between "mutual exclusion" and "synchronization": the
266    added citation does help, in that it makes a coherent case for the
267    definition the authors prefer. However, the text could usefully clarify
268    that this is a matter of definition not of fact, given especially that in
269    my assessment the authors' preferred definition is not the most common
270    one. (Although the mention of Hoare's apparent use of this definition is
271    one data point, countervailing ones are found in many contemporaneous or
272    later papers, e.g. Habermann's 1972 "Synchronization of Communicating
273    Processes" (CACM 15(3)), Reed & Kanodia's 1979 "Synchronization with
274    eventcounts and sequencers" (CACM (22(2)) and so on.)
275
276Fixed, "We contend these two properties are independent, ...".
277
278With respect to the two papers, Habermann fundamentally agrees with our
279definitions, where the term mutual exclusion is the same but Habermann uses the
280term "communication" for our synchronization. However, the term "communication"
281is rarely used to mean synchronization. The fact that Habermann collectively
282calls these two mechanisms synchronization is the confusion.  Reed & Kanodia
283state:
284
285  By mutual exclusion we mean any mechanism that forces the time ordering of
286  execution of pieces of code, called critical regions, in a system of
287  concurrent processes to be a total ordering.
288
289But there is no timing order for a critical region (which I assume means
290critical section); threads can arrive at any time in any order, where the
291mutual exclusion for the critical section ensures one thread executes at a
292time. Interestingly, Reed & Kanodia's critical region is Habermann's
293communication not critical section. These papers only buttress our contention
294about the confusion of these terms in the literature.
295
296
297    section 2 (an expanded version of what was previously section 5.9) lacks
298    examples and is generally obscure and allusory ("the most advanced feature"
299    -- name it! "in triplets" -- there is only one triplet!;
300
301Fixed.
302
303 These independent properties can be used to compose different language
304 features, forming a compositional hierarchy, where the combination of all
305 three is the most advanced feature, called a thread/task/process. While it is
306 possible for a language to only provide threads for composing
307 programs~\cite{Hermes90}, this unnecessarily complicates and makes inefficient
308 solutions to certain classes of problems.
309
310
311    what are "execution locations"? "initialize" and "de-initialize"
312
313Fixed through an example at the end of the sentence.
314
315 \item[\newterm{execution state}:] is the state information needed by a
316 control-flow feature to initialize, manage compute data and execution
317 location(s), and de-initialize, \eg calling a function initializes a stack
318 frame including contained objects with constructors, manages local data in
319 blocks and return locations during calls, and de-initializes the frame by
320 running any object destructors and management operations.
321
322
323    what? "borrowed from the invoker" is a concept in need of explaining or at
324    least a fully explained example -- in what sense does a plain function
325    borrow" its stack frame?
326
327A function has no storage except for static storage; it gets its storage from
328the program stack. For a function to run it must borrow storage from somewhere.
329When the function returns, the borrowed storage is returned. Do you have a more
330appropriate word?
331
332    "computation only" as opposed to what?
333
334This is a term in concurrency for operations that compute without blocking,
335i.e., the operation starts with everything it needs to compute its result and
336runs to completion, blocking only when it is done and returns its result.
337
338
339    in 2.2, in what way is a "request" fundamental to "synchronization"?
340
341I assume you are referring to the last bullet.
342
343 Synchronization must be able to control the service order of requests
344 including prioritizing selection from different kinds of outstanding requests,
345 and postponing a request for an unspecified time while continuing to accept
346 new requests.
347
348Habermann states on page 173
349
350 Looking at deposit we see that sender and receiver should be synchronized with
351 respect to buffer overflow: deposit of another message must be delayed if
352 there is no empty message frame, and such delay should be removed when the
353 first empty frame becomes available.  This is programmed as
354 synchronization(frame) : deposit is preceded by wait(frame), accept is
355 followed by signal( frame), and the constant C[frame] is set equal to
356 "bufsize."
357
358Here synchronization is controlling the service order of requests: when the
359buffer is full, requests for insert (sender) are postponed until a request to
360remove (receiver) occurs.  Without the ability to control service order among
361requests, the producer/consumer problem with a bounded buffer cannot be
362solved. Hence, this capability is fundamental.
363
364
365    and the "implicitly" versus "explicitly" point needs stating as elsewhere,
366    with a concrete example e.g. Java built-in mutexes versus
367    java.util.concurrent).
368
369Fixed.
370
371 MES must be available implicitly in language constructs, \eg Java built-in
372 monitors, as well as explicitly for specialized requirements, \eg
373 @java.util.concurrent@, because requiring programmers to build MES using
374 low-level locks often leads to incorrect programs.
375
376
377    section 6: 6.2 omits the most important facts in preference for otherwise
378    inscrutable detail: "identify the kind of parameter" (first say *that there
379    are* kinds of parameter, and what "kinds" means!); "mutex parameters are
380    documentation" is misleading (they are also semantically significant!) and
381    fails to say *what* they mean; the most important thing is surely that
382    'mutex' is a language feature for performing lock/unlock operations at
383    function entry/exit. So say it!
384
385These sections have been rewritten address the comments.
386
387
388    The meanings of examples f3 and f4 remain unclear.
389
390Rewrote the paragraph.
391
392
393    Meanwhile in 6.3, "urgent" is not introduced (we are supposed to infer its
394    meaning from Figure 12,
395
396Defined Hoare's urgent list at the start of the paragraph.
397
398
399    but that Figure is incomprehensible to me), and we are told of "external
400    scheduling"'s long history in Ada but not clearly what it actually means;
401
402We do not know how to address your comment because the description is clear to
403us and other non-reviewers who have read the paper. I had forgotten an
404important citation to prior work on this topic, which is now referenced:
405
406   Buhr Peter A., Fortier Michel, Coffin Michael H.. Monitor
407   Classification. ACM Computing Surveys. 1995; 27(1):63-107.
408
409This citation is a 45 page paper expanding on the topic of internal scheduling.
410I also added a citation to Hoare's monitor paper where signal_block is defined:
411
412  When a process signals a condition on which another process is waiting, the
413  signalling process must wait until the resumed process permits it to
414  proceed.
415
416External scheduling from Ada was subsequently added to uC++ and Cforall.
417Furthermore, Figure 20 shows a direct comparison of CFA procedure-call
418external-scheduling with Go channel external-scheduling. So external scheduling
419exists in languages beyond Ada.
420
421
422    6.4's description of "waitfor" tells us it is different from an if-else
423    chain but tries to use two *different* inputs to tell us that the behavior
424    is different; tell us an instance where *the same* values of C1 and C2 give
425    different behavior (I even wrote out a truth table and still don't see the
426    semantic difference)
427
428Again, it is unclear what is the problem. For the if-statement, if C1 is true,
429it only waits for a call to mem1, even if C2 is true and there is an
430outstanding call to mem2. For the waitfor, if both C1 and C2 are true and there
431is a call to mem2 but not mem1, it accepts the call to mem2, which cannot
432happen with the if-statement. So for all true when clauses, any outstanding
433call is immediately accepted. If there are no outstanding calls, the waitfor
434blocks until the next call to any of the true when clauses occurs. I added the
435following sentence to further clarify.
436
437 Hence, the @waitfor@ has parallel semantics, accepting any true @when@ clause.
438
439Note, the same parallel semantics exists for the Go select statement with
440respect to waiting for a set of channels to receive data. While Go select does
441not have a when clause, it would be trivial to add it, making the select more
442expressive.
443
444
445    The authors frequently use bracketed phrases, and sometimes slashes "/", in
446    ways that are confusing and/or detrimental to readability.  Page 13 line
447    2's "forward (backward)" is one particularly egregious example.  In general
448    I would recommend the the authors try to limit their use of parentheses and
449    slashes as a means of forcing a clearer wording to emerge.
450
451Many of the slashes and parentheticals have been removed. Some are retained to
452express joined concepts: call/return, suspend/resume, resume/resume, I/O.
453
454
455    Also, the use of "eg." is often cursory and does not explain the examples
456    given, which are frequently a one- or two-word phrase of unclear referent.
457
458A few of these are fixed.
459
460
461    Considering the revision more broadly, none of the more extensive or
462    creative rewrites I suggested in my previous review have been attempted,
463    nor any equivalent efforts to improve its readability.
464
465If you reread the previous response, we addressed all of your suggestions except
466
467        An expositional idea occurs: start the paper with a strawman
468        naive/limited realisation of coroutines -- say, Simon Tatham's popular
469        "Coroutines in C" web page -- and identify point by point what the
470        limitations are and how C\/ overcomes them. Currently the presentation
471        is often flat (lacking motivating contrasts) and backwards (stating
472        solutions before problems). The foregoing approach might fix both of
473        these.
474   
475    We prefer the current structure of our paper and believe the paper does
476    explain basic coding limitations and how they are overcome in using
477    high-level control-floe mechanisms.
478
479and we have addressed readability issues in this version.
480
481
482    The hoisting of the former section 5.9 is a good idea, but the newly added
483    material accompanying it (around Table 1) suffers fresh deficiencies in
484    clarity. Overall the paper is longer than before, even though (as my
485    previous review stated), I believe a shorter paper is required in order to
486    serve the likely purpose of publication. (Indeed, the authors' letter
487    implies that a key goal of publication is to build community and gain
488    external users.)
489
490This comment is the referee's opinion, which we do not agree with it. Our
491previous 35 page SP&E paper on Cforall:
492
493  Moss Aaron, Schluntz Robert, Buhr Peter A.. Cforall: Adding Modern
494  Programming Language Features to C. Softw. Pract. Exper. 2018;
495  48(12):2111-2146.
496
497has a similar structure and style to this paper, and it received an award from
498John Wiley & Sons:
499
500 Software: Practice & Experience for articles published between January 2017
501 and December 2018, "most downloads in the 12 months following online
502 publication showing the work generated immediate impact and visibility,
503 contributing significantly to advancement in the field".
504
505So we have demonstrated an ability to build community and gain external users.
506
507 
508    Given this trajectory, I no longer see a path to an acceptable revision of
509    the present submission. Instead I suggest the authors consider splitting
510    the paper in two: one half about coroutines and stack management, the other
511    about mutexes, monitors and the runtime. (A briefer presentation of the
512    runtime may be helpful in the first paper also, and a brief recap of the
513    generator and coroutine support is obviously needed in the second too.)
514
515Again we disagree with the referee's suggestion to vastly restructure the
516paper. What advantage is there is presenting exactly the same material across
517two papers, which will likely end up longer than a single paper? The current
518paper has a clear theme that fundamental execution properties generate a set of
519basic language mechanisms, and we then proceed to show how these mechanisms can
520be designed into the programing language Cforall.
521
522
523    I do not buy the authors' defense of the limited practical experience or
524    "non-micro" benchmarking presented. Yes, gaining external users is hard and
525    I am sympathetic on that point. But building something at least *somewhat*
526    substantial with your own system should be within reach, and without it the
527    "practice and experience" aspects of the work have not been explored.
528    Clearly C\/ is the product of a lot of work over an extended period, so it
529    is a surprise that no such experience is readily available for inclusion.
530
531Understood. There are no agreed-upon concurrency benchmarks, which is why
532micro-benchmarks are often used. Currently, the entire Cforall runtime is
533written in Cforall (10,000+ lines of code (LOC)). This runtime is designed to
534be thread safe, automatically detects the use of concurrency features at link
535time, and bootstraps into a threaded runtime almost immediately at program
536startup so threads can be declared as global variables and may run to
537completion before the program main starts. The concurrent core of the runtime
538is 3,500 LOC and bootstraps from low-level atomic primitives into Cforall locks
539and high-level features. In other words, the concurrent core uses itself as
540quickly as possible to start using high-level concurrency. There are 12,000+
541LOC in the Cforall tests-suite used to verify language features, which are run
542nightly. Of theses, there are 2000+ LOC running standard concurrent tests, such
543as aggressively testing each language feature, and classical examples such as
544bounded buffer, dating service, matrix summation, quickSort, binary insertion
545sort, etc.  More experience will be available soon, based on ongoing work in
546the "future works" section. Specifically, non-blocking I/O is working with the
547new Linux io_uring interface and a new high-performance ready-queue is under
548construction to take into account this change. With non-blocking I/O, it will
549be possible to write applications like high-performance web servers, as is now
550done in Rust and Go. Also, completed is Java-style executors for work-based
551concurrent programming and futures. Under construction is a high-performance
552actor system.
553
554
555    It does not seem right to state that a stack is essential to Von Neumann
556    architectures -- since the earliest Von Neumann machines (and indeed early
557    Fortran) did not use one.
558
559Reference Manual Fortran II for the IBM 704 Data Processing System, 1958 IBM, page 2
560https://archive.computerhistory.org/resources/text/Fortran/102653989.05.01.acc.pdf
561
562  Since a subprogram may call for other subprograms to any desired depth, a
563  particular CALL statement may be defined by a pyramid of multi-level
564  subprograms.
565
566I think we may be differing on the meaning of stack. You may be imagining a
567modern stack that grows and shrinks dynamically. Whereas early Fortran
568preallocated a stack frame for each function, like Python allocates a frame for
569a generator.  Within each preallocated Fortran function is a frame for local
570variables and a pointer to store the return value for a call.  The Fortran
571call/return mechanism uses these frames to build a traditional call stack
572linked by the return pointer. The only restriction is that a function stack
573frame can only be used once, implying no direct or indirect recursion.  Hence,
574without a stack mechanism, there can be no call/return to "any desired depth",
575where the maximum desired depth is limited by the number of functions. So
576call/return requires some form of a stack, virtually all programming language
577have call/return, past and present, and these languages run on Von Neumann
578machines that do not distinguish between program and memory space, have mutable
579state, and the concept of a pointer to data or code.
580
581
582    To elaborate on something another reviewer commented on: it is a surprise
583    to find a "Future work" section *after* the "Conclusion" section. A
584    "Conclusions and future work" section often works well.
585
586Done.
587
588
589
590Reviewing: 3
591
592    but it remains really difficult to have a good sense of which idea I should
593    use and when. This applies in different ways to different features from the
594    language:
595
596    * coroutines/generators/threads: here there is some discussion, but it can
597      be improved.
598    * interal/external scheduling: I didn't find any direct comparison between
599      these features, except by way of example.
600
601See changes below.
602
603
604    I would have preferred something more like a table or a few paragraphs
605    highlighting the key reasons one would pick one construct or the other.
606
607Section 2.1 is changed to address this suggestion.
608
609
610    The discussion of clusters and pre-emption in particular feels quite rushed.
611
612We believe a brief introduction to the Cforall runtime structure is important
613because clustering within a user-level versus distributed system is unusual,
614Furthermore, the explanation of preemption is important because several new
615languages, like Go and Rust tokio, are not preemptive. Rust threads are
616preemptive only because it uses kernel threads, which UNIX preempts.
617
618
619    * Recommend to shorten the comparison on coroutine/generator/threads in
620      Section 2 to a paragraph with a few examples, or possibly a table
621      explaining the trade-offs between the constructs
622
623Not done, see below.
624
625
626    * Recommend to clarify the relationship between internal/external
627      scheduling -- is one more general but more error-prone or low-level?
628
629Done, see below.
630
631
632    There is obviously a lot of overlap between these features, and in
633    particular between coroutines and generators. As noted in the previous
634    review, many languages have chosen to offer *only* generators, and to
635    build coroutines by stacks of generators invoking one another.
636
637Coroutines built from stacks of generators have problems, such as no symmetric
638control-flow. Furthermore, stacks of generators have a problem with the
639following programming pattern.  logger is a library function called from a
640function or a coroutine, where the doit for the coroutine suspends. With stacks
641of generators, there has to be a function and generator version of logger to
642support these two scenarios. If logger is a library function, it may be
643impossible to create the generator logger because the logger function is
644opaque.
645
646  #include <fstream.hfa>
647  #include <coroutine.hfa>
648
649  forall( otype T | { void doit( T ); } )
650  void logger( T & t ) {
651      doit( t );
652  }
653
654  coroutine C {};
655  void main( C & c ) with( c ) {
656      void doit( C & c ) { suspend; }
657      logger( c );
658  }
659  void mem( C & c ) {
660      resume( c );
661  }
662
663  int main() {
664      C c;
665      mem( c );
666      mem( c );
667 
668      struct S {};
669      S s;
670      void doit( S & s ) {}
671      logger( s );
672  }
673
674Additonal text has been added to the start of Section 3.2 address this comment.
675
676
677    In fact, the end of Section 2.1 (on page 5) contains a particular paragraph
678    that embodies this "top down" approach. It starts, "programmers can now
679    answer three basic questions", and thus gives some practical advice for
680    which construct you should use and when. I think giving some examples of
681    specific applications that this paragraph, combined with some examples of
682    cases where each construct was needed, would be a better approach.
683
684    I don't think this comparison needs to be very long. It seems clear enough
685    that one would
686
687    * prefer generators for simple computations that yield up many values,
688    * prefer coroutines for more complex processes that have significant
689      internal structure,
690
691We do not believe the number of values yielded is an important factor is
692choosing between a generator or coroutine; either form can receive (input) or
693return (output) millions of values. As well, simple versus complex computations
694is also not an absolute criterion for selection as both forms can be simple or
695complex.
696
697As stated in the paper, a coroutine brings generality to the implementation
698because of the addition stack, whereas generators have restrictions on standard
699software-engineering practices: variable placement, no helper functions without
700creating an explicit generator stack, and no symmetric control-flow.  It is
701true that simple generators probably suffer less from restrictions. Also,
702simplicity means the amount of work done is small so the cost of the
703suspend/resume becomes a factor, where experiments show suspend/resume for a
704generator is 10 times faster than for a coroutines.
705
706
707    * prefer threads for cases where parallel execution is desired or needed.
708
709Agreed, but this description does not mention mutual exclusion and
710synchronization, which is essential in any meaningful concurrent program.
711
712Our point here is to illustrate that a casual "top down" explanation is
713insufficient to explain the complexity of the underlying execution properties.
714We presented some rule-of-thumbs at the end of Section 2 but programmers must
715understand all the underlying mechanisms and their interactions to exploit the
716execution properties to their fullest, and to understand when a programming
717language does or does not provide a desired mechanism.
718
719
720    I did appreciate the comparison in Section 2.3 between async-await in
721    JS/Java and generators/coroutines. I agree with its premise that those
722    mechanisms are a poor replacement for generators (and, indeed, JS has a
723    distinct generator mechanism, for example, in part for this reason).  I
724    believe I may have asked for this in a previous review, but having read it,
725    I wonder if it is really necessary, since those mechanisms are so different
726    in purpose.
727
728Given that you asked about this before, I believe other readers might also ask
729the same question because async-await is very popular. So I think this section
730does help to position the work in the paper among other work, and hence, it is
731appropriate to keep it in the paper.
732
733
734    I find the motivation for supporting both internal and external scheduling
735    to be fairly implicit. After several reads through the section, I came to
736    the conclusion that internal scheduling is more expressive than external
737    scheduling, but sometimes less convenient or clear. Is this correct? If
738    not, it'd be useful to clarify where external scheduling is more
739    expressive.
740
741    I would find it very interesting to try and capture some of the properties
742    that make internal vs external scheduling the better choice.
743
744    For example, it seems to me that external scheduling works well if there
745    are only a few "key" operations, but that internal scheduling might be
746    better otherwise, simply because it would be useful to have the ability to
747    name a signal that can be referenced by many methods.
748
749To address this point, the last paragraph on page 22 (now page 23) has been
750augmented to the following:
751
752 Given external and internal scheduling, what guidelines can a programmer use
753 to select between them?  In general, external scheduling is easier to
754 understand and code because only the next logical action (mutex function(s))
755 is stated, and the monitor implicitly handles all the details.  Therefore,
756 there are no condition variables, and hence, no wait and signal, which reduces
757 coding complexity and synchronization errors.  If external scheduling is
758 simpler than internal, why not use it all the time?  Unfortunately, external
759 scheduling cannot be used if: scheduling depends on parameter value(s) or
760 scheduling must block across an unknown series of calls on a condition
761 variable (\ie internal scheduling).  For example, the dating service cannot be
762 written using external scheduling.  First, scheduling requires knowledge of
763 calling parameters to make matching decisions and parameters of calling
764 threads are unavailable within the monitor.  Specifically, a thread within the
765 monitor cannot examine the @ccode@ of threads waiting on the calling queue to
766 determine if there is a matching partner.  (Similarly, if the bounded buffer
767 or readers/writer are restructured with a single interface function with a
768 parameter denoting producer/consumer or reader/write, they cannot be solved
769 with external scheduling.)  Second, a scheduling decision may be delayed
770 across an unknown number of calls when there is no immediate match so the
771 thread in the monitor must block on a condition.  Specifically, if a thread
772 determines there is no opposite calling thread with the same @ccode@, it must
773 wait an unknown period until a matching thread arrives.  For complex
774 synchronization, both external and internal scheduling can be used to take
775 advantage of best of properties of each.
776
777
778    Consider the bounded buffer from Figure 13: if it had multiple methods for
779    removing elements, and not just `remove`, then the `waitfor(remove)` call
780    in `insert` might not be sufficient.
781
782Section 6.4 Extended waitfor shows the waitfor is very powerful and can handle
783your request:
784
785  waitfor( remove : buffer ); or waitfor( remove2 : buffer );
786
787and its shorthand form (not shown in the paper)
788
789  waitfor( remove, remove2 : t );
790
791A call to one these remove functions satisfies the waitfor (exact selection
792details are discussed in Section 6.4).
793
794
795    The same is true, I think, of the `signal_block` function, which I
796    have not encountered before;
797
798In Tony Hoare's seminal paper on Monitors "Monitors: An Operating System
799Structuring Concept", it states on page 551:
800
801 When a process signals a condition on which another process is waiting, the
802 signalling process must wait until the resumed process permits it to
803 proceed. We therefore introduce for each monitor a second semaphore "urgent"
804 (initialized to 0), on which signalling processes suspend themselves by the
805 operation P(urgent).
806
807Hence, the original definition of signal is in fact signal_block, i.e., the
808signaller blocks. Later implementations of monitor switched to signaller
809nonblocking because most signals occur before returns, which allows the
810signaller to continue execution, exit the monitor, and run concurrently with
811the signalled thread that restarts in the monitor. When the signaller is not
812going to exit immediately, signal_block is appropriate.
813
814
815    it seems like its behavior can be modeled with multiple condition
816    variables, but that's clearly more complex.
817
818Yes. Buhr, Fortier and Coffin show in Monitor Classification, ACM Computing
819Surveys, 27(1):63-107, that all extant monitors with different signalling
820semantics can be transformed into each other. However, some transformations are
821complex and runtime expensive.
822
823
824    One question I had about `signal_block`: what happens if one signals
825    but no other thread is waiting? Does it block until some other thread
826    waits? Or is that user error?
827
828On page 20, it states:
829
830  Signalling is unconditional because signalling an empty condition queue does
831  nothing.
832
833To the best of our knowledge, all monitors have the same semantics for
834signalling an empty condition queue, regardless of the kind of signal, i.e.,
835signal or signal_block.
836
837    I believe that one difference between the Go program and the Cforall
838    equivalent is that the Goroutine has an associated queue, so that
839    multiple messages could be enqueued, whereas the Cforall equivalent is
840    effectively a "bounded buffer" of length 1. Is that correct?
841
842Actually, the buffer length is 0 for the Cforall call and the Go unbuffered
843send so both are synchronous communication.
844
845    I think this should be stated explicitly. (Presumably, one could modify the
846    Cforall program to include an explicit vector of queued messages if
847    desired, but you would also be reimplementing the channel abstraction.)
848
849Fixed, by adding the following sentences:
850
851  The different between call and channel send occurs for buffered channels
852  making the send asynchronous.  In \CFA, asynchronous call and multiple
853  buffers is provided using an administrator and worker
854  threads~\cite{Gentleman81} and/or futures (not discussed).
855
856
857    Also, in Figure 20, I believe that there is a missing `mutex` keyword.
858   
859Fixed.
860
861
862    I was glad to see that the paper acknowledged that Cforall still had
863    low-level atomic operations, even if their use is discouraged in favor of
864    higher-level alternatives.
865
866There was never an attempt to not acknowledged that Cforall had low-level
867atomic operations. The original version of the paper stated:
868
869  6.6 Low-level Locks
870  For completeness and efficiency, Cforall provides a standard set of low-level
871  locks: recursive mutex, condition, semaphore, barrier, etc., and atomic
872  instructions: fetchAssign, fetchAdd, testSet, compareSet, etc.
873
874and that section is still in the paper. In fact, we use these low-level
875mechanisms to build all of the high-level concurrency constructs in Cforall.
876 
877
878    However, I still feel that the conclusion overstates the value of the
879    contribution here when it says that "Cforall high-level race-free monitors
880    and threads provide the core mechanisms for mutual exclusion and
881    synchronization, without the need for volatile and atomics". I feel
882    confident that Java programmers, for example, would be advised to stick
883    with synchronized methods whenever possible, and it seems to me that they
884    offer similar advantages -- but they sometimes wind up using volatiles for
885    performance reasons.
886
887I think we are agreeing violently. 99.9% of Java/Cforall/Go/Rust concurrent
888programs can achieve very good performance without volatile or atomics because
889the runtime system has already used these low-level capabilities to build an
890efficient set of high-level concurrency constructs.
891
8920.1% of the time programmers need to build their own locks and synchronization
893mechanisms. This need also occurs for storage management. Both of these
894mechanisms are allowed in Cforall but are fraught with danger and should be
895discouraged. Specially, it takes a 7th dan Black Belt programmer to understand
896fencing for a WSO memory model, such as on the ARM. It doesn't help that the
897C++ atomics are baroque and incomprehensible. I'm sure Hans Boehm, Doug Lea,
898Dave Dice and me would agree that 99% of hand-crafted locks created by
899programmers are broken and/or non-portable.
900
901
902    I was also confused by the term "race-free" in that sentence. In
903    particular, I don't think that Cforall has any mechanisms for preventing
904    *data races*, and it clearly doesn't prevent "race conditions" (which would
905    bar all sorts of useful programs). I suppose that "race free" here might be
906    referring to the improvements such as removing barging behavior.
907
908We use the term "race free" to mean the same as Boehm/Adve's "data-race
909freedom" in
910
911  Boehm Hans-J., Adve Sarita V., You Don't Know Jack About Shared Variables or
912  Memory Models. Communications ACM. 2012; 55(2):48-54.
913  https://queue.acm.org/detail.cfm?id=2088916
914
915which is cited in the paper. Furthermore, we never said that Cforall has
916mechanisms for preventing *all* data races. We said Cforall high-level
917race-free monitors and threads[, when used with mutex access function] (added
918to the paper), implies no data races within these constructs, unless a
919programmer directly publishes shared state. This approach is exactly what
920Boehm/Adve advocate for the vast majority of concurrent programming.
921
922
923    It would perhaps be more interesting to see a comparison built using [tokio] or
924    [async-std], two of the more prominent user-space threading libraries that
925    build on Rust's async-await feature (which operates quite differently than
926    Javascript's async-await, in that it doesn't cause every aync function call to
927    schedule a distinct task).
928
929Done.
930
931
932    Several figures used the `with` keyword. I deduced that `with(foo)` permits
933    one to write `bar` instead of `foo.bar`. It seems worth introducing.
934    Apologies if this is stated in the paper, if so I missed it.
935
936Page 6, footnote F states:
937
938  The Cforall "with" clause opens an aggregate scope making its fields directly
939  accessible, like Pascal "with", but using parallel semantics; multiple
940  aggregates may be opened.
941
942
943    On page 20, section 6.3, "external scheduling and vice versus" should be
944    "external scheduling and vice versa".
945
946Fixed.
947
948
949    On page 5, section 2.3, the paper states "we content" but it should be "we
950    contend".
951
952Fixed.
953
954
955    Page 1. I don't believe that it s fair to imply that Scala is "research
956    vehicle" as it is used by major players, Twitter being the most prominent
957    example.
958
959Fixed. Removed "research".
960
961
962    Page 15. Must Cforall threads start after construction (e.g. see your
963    example on page 15, line 21)?
964
965Yes. Our experience in Java, uC++ and Cforall is that 95% of the time
966programmers want threads to start immediately. (Most Java programs have no code
967between a thread declaration and the call to start the thread.)  Therefore,
968this semantic should be the default because (see page 13):
969
970  Alternatives, such as explicitly starting threads as in Java, are repetitive
971  and forgetting to call start is a common source of errors.
972
973To handle the other 5% of the cases, there is a trivial Cforall pattern
974providing Java-style start/join. The additional cost for this pattern is small
975in comparison to the work performed between the start and join.
976
977  thread T {};
978  void start( T & mutex ) {} // any function name
979  void join( T & mutex ) {} // any function name
980  void main( T & t ) {
981      sout | "start";
982      waitfor( start : t ); // wait to be started
983      sout | "restart"; // perform work
984      waitfor( join : t ); // wait to be joined
985      sout | "join";
986  }
987  int main() {
988      T t[3]; // threads start and delay
989      sout | "need to start";
990      for ( i; 3 ) start( t[i] );
991      sout | "need to join";
992      for ( i; 3 ) join( t[i] );
993      sout | "threads stopped";
994  } // threads deleted from stack
995
996  $ a.out
997  need to start
998  start
999  start
1000  start
1001  need to join
1002  restart
1003  restart
1004  restart
1005  threads stopped
1006  join
1007  join
1008  join
1009
1010
1011    Page 18, line 17: is using
1012
1013Fixed.
Note: See TracBrowser for help on using the repository browser.