| 1 |     A revised version of your manuscript that takes into account the comments
 | 
|---|
| 2 |     of the referees will be reconsidered for publication.
 | 
|---|
| 3 | 
 | 
|---|
| 4 | We have attempted to address all the referee's comments in the revised version
 | 
|---|
| 5 | of the paper, with notes below for each comment.
 | 
|---|
| 6 | 
 | 
|---|
| 7 | =============================================================================
 | 
|---|
| 8 | 
 | 
|---|
| 9 |     Reviewing: 1
 | 
|---|
| 10 | 
 | 
|---|
| 11 |     As far as I can tell, the article contains three main ideas: an
 | 
|---|
| 12 |     asynchronous execution / threading model; a model for monitors to provide
 | 
|---|
| 13 |     mutual exclusion; and an implementation.  The first two ideas are drawn
 | 
|---|
| 14 |     together in Table 1: unfortunately this is on page 25 of 30 pages of
 | 
|---|
| 15 |     text. Implementation choices and descriptions are scattered throughout the
 | 
|---|
| 16 |     paper - and the sectioning of the paper seems almost arbitrary.
 | 
|---|
| 17 | 
 | 
|---|
| 18 | Fixed, Table 1 is moved to the start and explained in detail.
 | 
|---|
| 19 | 
 | 
|---|
| 20 |     The article is about its contributions.  Simply adding feature X to
 | 
|---|
| 21 |     language Y isn't by itself a contribution, (when feature X isn't already a
 | 
|---|
| 22 |     contribution).
 | 
|---|
| 23 | 
 | 
|---|
| 24 | C++ (Y) added object-oriented programming (X) to C, where OO programming (X)
 | 
|---|
| 25 | was not a contribution.
 | 
|---|
| 26 | 
 | 
|---|
| 27 |     For example: why support two kinds of generators as well as user-level
 | 
|---|
| 28 |     threads?  Why support both low and high level synchronization constructs?
 | 
|---|
| 29 | 
 | 
|---|
| 30 | Fixed, as part of discussing Table 1.
 | 
|---|
| 31 | 
 | 
|---|
| 32 |     Similarly I would have found the article easier to follow if it was written
 | 
|---|
| 33 |     top down, presenting the design principles, present the space of language
 | 
|---|
| 34 |     features, justify chosen language features (and rationale) and those
 | 
|---|
| 35 |     excluded, and then present implementation, and performance.
 | 
|---|
| 36 | 
 | 
|---|
| 37 | Fixed, the paper is now restructured in this form.
 | 
|---|
| 38 | 
 | 
|---|
| 39 |     Then the writing of the article is often hard to follow, to say the
 | 
|---|
| 40 |     least. Two examples: section 3 "stateful functions" - I've some idea
 | 
|---|
| 41 |     what that is (a function with Algol's "own" or C's "static" variables?
 | 
|---|
| 42 |     but in fact the paper has a rather more specific idea than that.
 | 
|---|
| 43 | 
 | 
|---|
| 44 | Fixed, at the start of this section.
 | 
|---|
| 45 | 
 | 
|---|
| 46 |     The top of page 3 throws a whole lot of definitions at the reader
 | 
|---|
| 47 |     "generator" "coroutine" "stackful" "stackless" "symmetric" "asymmetric"
 | 
|---|
| 48 |     without every stopping to define each one
 | 
|---|
| 49 | 
 | 
|---|
| 50 | Hopefully fixed by moving Table 1 forward.
 | 
|---|
| 51 | 
 | 
|---|
| 52 |     --- but then in footnote "C" takes the time to explain what C's "main"
 | 
|---|
| 53 |     function is? I cannot imagine a reader of this paper who doesn't know what
 | 
|---|
| 54 |     "main" is in C; especially if they understand the other concepts already
 | 
|---|
| 55 |     presented in the paper.
 | 
|---|
| 56 | 
 | 
|---|
| 57 | Fixed by shortening.
 | 
|---|
| 58 | 
 | 
|---|
| 59 |     The start of section 3 then does the same
 | 
|---|
| 60 |     thing: putting up a whole lot of definitions, making distinctions and
 | 
|---|
| 61 |     comparisons, even talking about some runtime details, but the critical
 | 
|---|
| 62 |     definition of a monitor doesn't appear until three pages later, at the
 | 
|---|
| 63 |     start of section 5 on p15, lines 29-34 are a good, clear, description
 | 
|---|
| 64 |     of what a monitor actually is.  That needs to come first, rather than
 | 
|---|
| 65 |     being buried again after two sections of comparisons, discussions,
 | 
|---|
| 66 |     implementations, and options that are ungrounded because they haven't
 | 
|---|
| 67 |     told the reader what they are actually talking about.  First tell the
 | 
|---|
| 68 |     reader what something is, then how they might use it (as programmers:
 | 
|---|
| 69 |     what are the rules and restrictions) and only then start comparison
 | 
|---|
| 70 |     with other things, other approaches, other languages, or
 | 
|---|
| 71 |     implementations.
 | 
|---|
| 72 | 
 | 
|---|
| 73 | Hopefully fixed by moving Table 1 forward.
 | 
|---|
| 74 | 
 | 
|---|
| 75 |     The description of the implementation is similarly lost in the trees
 | 
|---|
| 76 |     without ever really seeing the wood. Figure 19 is crucial here, but
 | 
|---|
| 77 |     it's pretty much at the end of the paper, and comments about
 | 
|---|
| 78 |     implementations are threaded throughout the paper without the context
 | 
|---|
| 79 |     (fig 19) to understand what's going on.
 | 
|---|
| 80 | 
 | 
|---|
| 81 | We have to agree to disagree on the location of Fig 19. Early discussion about
 | 
|---|
| 82 | implementation for the various control structures are specific to that feature.
 | 
|---|
| 83 | Fig 19 shows the global runtime structure, which manages only the threading
 | 
|---|
| 84 | aspect of the control structures and their global organization.
 | 
|---|
| 85 | 
 | 
|---|
| 86 |     The protocol for performance testing may just about suffice for C (although
 | 
|---|
| 87 |     is N constantly ten million, or does it vary for each benchmark)
 | 
|---|
| 88 | 
 | 
|---|
| 89 | Fixed, the paper states N varies per language/benchmark so the benchmark runs
 | 
|---|
| 90 | long enough to get a good average per operation.
 | 
|---|
| 91 | 
 | 
|---|
| 92 |     but such evaluation isn't appropriate for garbage-collected or JITTed
 | 
|---|
| 93 |     languages like Java or Go.
 | 
|---|
| 94 | 
 | 
|---|
| 95 | Please explain. All the actions in the benchmarks occur independently of the
 | 
|---|
| 96 | storage-management scheme, e.g., acquiring a lock is an aspect of execution not
 | 
|---|
| 97 | storage. In fact, garbage-collected or JITTed languages cheat on benchmarks and
 | 
|---|
| 98 | we had to take great care to prevent cheating and measure the actual operation.
 | 
|---|
| 99 | 
 | 
|---|
| 100 |     p1 only a subset of C-forall extensions?
 | 
|---|
| 101 | 
 | 
|---|
| 102 | Fixed, removed.
 | 
|---|
| 103 | 
 | 
|---|
| 104 |     p1 "has features often associated with object-oriented programming
 | 
|---|
| 105 |     languages, such as constructors, destructors, virtuals and simple
 | 
|---|
| 106 |     inheritance."  There's no need to quibble about this. Once a language has
 | 
|---|
| 107 |     inheritance, it's hard to claim it's not object-oriented.
 | 
|---|
| 108 | 
 | 
|---|
| 109 | We have to agree to disagree. Object languages are defined by the notion of
 | 
|---|
| 110 | nested functions in a aggregate structure with a special receiver parameter
 | 
|---|
| 111 | "this", not by inheritance.  Inheritance is a polymorphic mechanism, e.g,
 | 
|---|
| 112 | Plan-9 C has simple inheritance but is not object-oriented. Because Cforall
 | 
|---|
| 113 | does not have a specific receiver, it is possible to have multiple function
 | 
|---|
| 114 | parameters as receivers, which introduces new concepts like bulk acquire for
 | 
|---|
| 115 | monitors.
 | 
|---|
| 116 | 
 | 
|---|
| 117 |     p2 barging? signals-as-hints?
 | 
|---|
| 118 | 
 | 
|---|
| 119 | Added a footnote for barging. We feel these terms are well known in the
 | 
|---|
| 120 | concurrency literature, especially in pthreads and Java, and both terms have
 | 
|---|
| 121 | citations with extensive explanations and further citations.
 | 
|---|
| 122 | 
 | 
|---|
| 123 |     p3 start your discussion of generations with a simple example of a
 | 
|---|
| 124 |     C-forall generator.  Fig 1(b) might do: but put it inline instead of
 | 
|---|
| 125 |     the python example - and explain the key rules and restrictions on the
 | 
|---|
| 126 |     construct.  Then don't even start to compare with coroutines until
 | 
|---|
| 127 |     you've presented, described and explained your coroutines...
 | 
|---|
| 128 |     p3 I'd probably leave out the various "C" versions unless there are
 | 
|---|
| 129 |     key points to make you can't make in C-forall. All the alternatives
 | 
|---|
| 130 |     are just confusing.
 | 
|---|
| 131 | 
 | 
|---|
| 132 | Hopefully fixed as this block of text has been rewritten.
 | 
|---|
| 133 | 
 | 
|---|
| 134 |     p4 but what's that "with" in Fig 1(B)
 | 
|---|
| 135 | 
 | 
|---|
| 136 | Footnote D explains the semantic of "with", which is like unqualified access
 | 
|---|
| 137 | for the receiver to the fields of a class from member routines, i.e., no
 | 
|---|
| 138 | "this->".
 | 
|---|
| 139 | 
 | 
|---|
| 140 |     p5 start with the high level features of C-forall generators...
 | 
|---|
| 141 | 
 | 
|---|
| 142 | Hopefully fixed by moving Table 1 forward.
 | 
|---|
| 143 | 
 | 
|---|
| 144 |     p5 why is the paper explaining networking protocols?
 | 
|---|
| 145 | 
 | 
|---|
| 146 | Fixed, added discussion on this point.
 | 
|---|
| 147 | 
 | 
|---|
| 148 |     p7 lines 1-9 (transforming generator to coroutine - why would I do any of
 | 
|---|
| 149 |     this? Why would I want one instead of the other (do not use "stack" in your
 | 
|---|
| 150 |     answer!)
 | 
|---|
| 151 | 
 | 
|---|
| 152 | As stated on line 1 because state declarations from the generator type can be
 | 
|---|
| 153 | moved out of the coroutine type into the coroutine main
 | 
|---|
| 154 | 
 | 
|---|
| 155 |     p10 last para "A coroutine must retain its last resumer to suspend back
 | 
|---|
| 156 |     because the resumer is on a different stack. These reverse pointers allow
 | 
|---|
| 157 |     suspend to cycle backwards, " I've no idea what is going on here?  why
 | 
|---|
| 158 |     should I care?  Shouldn't I just be using threads instead?  why not?
 | 
|---|
| 159 | 
 | 
|---|
| 160 | Hopefully fixed by moving Table 1 forward.
 | 
|---|
| 161 | 
 | 
|---|
| 162 |     p16 for the same reasons - what reasons?
 | 
|---|
| 163 | 
 | 
|---|
| 164 | Hopefully fixed by moving Table 1 forward.
 | 
|---|
| 165 | 
 | 
|---|
| 166 |     p17 if the multiple-monitor entry procedure really is novel, write a paper
 | 
|---|
| 167 |     about that, and only about that.
 | 
|---|
| 168 | 
 | 
|---|
| 169 | We do not believe this is a practical suggestion.
 | 
|---|
| 170 | 
 | 
|---|
| 171 |     p23 "Loose Object Definitions" - no idea what that means.  in that
 | 
|---|
| 172 |     section: you can't leave out JS-style dynamic properties.  Even in
 | 
|---|
| 173 |     OOLs that (one way or another) allow separate definitions of methods
 | 
|---|
| 174 |     (like Objective-C, Swift, Ruby, C#) at any time a runtime class has a
 | 
|---|
| 175 |     fixed definition.  Quite why the detail about bit mask implementation
 | 
|---|
| 176 |     is here anyway, I've no idea.
 | 
|---|
| 177 | 
 | 
|---|
| 178 | Fixed by rewriting the section.
 | 
|---|
| 179 | 
 | 
|---|
| 180 |     p25 this cluster isn't a CLU cluster then?
 | 
|---|
| 181 | 
 | 
|---|
| 182 | No. A CLU cluster is like a class in an object-oriented programming language.
 | 
|---|
| 183 | A CFA cluster is a runtime organizational mechanism.
 | 
|---|
| 184 | 
 | 
|---|
| 185 |     * conclusion should conclude the paper, not the related.
 | 
|---|
| 186 | 
 | 
|---|
| 187 | We do not understand this comment.
 | 
|---|
| 188 | 
 | 
|---|
| 189 | =============================================================================
 | 
|---|
| 190 | 
 | 
|---|
| 191 |     Reviewing: 2
 | 
|---|
| 192 | 
 | 
|---|
| 193 |     There is much description of the system and its details, but nothing about
 | 
|---|
| 194 |     (non-artificial) uses of it. Although the microbenchmark data is
 | 
|---|
| 195 |     encouraging, arguably not enough practical experience with the system has
 | 
|---|
| 196 |     been reported here to say much about either its usability advantages or its
 | 
|---|
| 197 |     performance.
 | 
|---|
| 198 | 
 | 
|---|
| 199 | We have a Catch-22 problem. Without publicity, there is no user community;
 | 
|---|
| 200 | without a user community, there are no publications for publicity.
 | 
|---|
| 201 | 
 | 
|---|
| 202 |     p2: lines 4--9 are a little sloppy. It is not the languages but their
 | 
|---|
| 203 |     popular implementations which "adopt" the 1:1 kernel threading model.
 | 
|---|
| 204 | 
 | 
|---|
| 205 | Fixed.
 | 
|---|
| 206 | 
 | 
|---|
| 207 |     line 10: "medium work" -- "medium-sized work"?
 | 
|---|
| 208 | 
 | 
|---|
| 209 | Fixed.
 | 
|---|
| 210 | 
 | 
|---|
| 211 |     line 18: "is all sequential to the compiler" -- not true in modern
 | 
|---|
| 212 |     compilers, and in 2004 H-J Boehm wrote a tech report describing exactly why
 | 
|---|
| 213 |     ("Threads cannot be implemented as a library", HP Labs).
 | 
|---|
| 214 | 
 | 
|---|
| 215 | We will have to disagree on this point. First, I am aware of Hans's 2004 paper
 | 
|---|
| 216 | because in that paper Hans cites my seminal work on this topic from 1995, which
 | 
|---|
| 217 | we cite in this paper.  Second, while modern memory-models have been added to
 | 
|---|
| 218 | languages like Java/C/C++ and new languages usually start with a memory model,
 | 
|---|
| 219 | it is still the programmer's responsibility to use them for racy code. Only
 | 
|---|
| 220 | when the programing language provides race-free constructs is the language
 | 
|---|
| 221 | aware of the concurrency; otherwise the code is sequential. Hans's paper "You
 | 
|---|
| 222 | Don't Know Jack About Shared Variables or Memory Models" talks about these
 | 
|---|
| 223 | issues, and is also cited in the paper.
 | 
|---|
| 224 | 
 | 
|---|
| 225 |     line 20: "knows the optimization boundaries" -- I found this vague. What's
 | 
|---|
| 226 |     an example?
 | 
|---|
| 227 | 
 | 
|---|
| 228 | Fixed.
 | 
|---|
| 229 | 
 | 
|---|
| 230 |     line 31: this paragraph has made a lot of claims. Perhaps forward-reference
 | 
|---|
| 231 |     to the parts of the paper that discuss each one.
 | 
|---|
| 232 | 
 | 
|---|
| 233 | Fixed by adding a road-map paragraph at the end of the introduction.
 | 
|---|
| 234 | 
 | 
|---|
| 235 |     line 33: "so the reader can judge if" -- this reads rather
 | 
|---|
| 236 |     passive-aggressively. Perhaps better: "... to support our argument that..."
 | 
|---|
| 237 | 
 | 
|---|
| 238 | Fixed.
 | 
|---|
| 239 | 
 | 
|---|
| 240 |     line 41: "a dynamic partitioning mechanism" -- I couldn't tell what this
 | 
|---|
| 241 |     meant
 | 
|---|
| 242 | 
 | 
|---|
| 243 | Fixed.
 | 
|---|
| 244 | 
 | 
|---|
| 245 |     p3. Presenting concept of a "stateful function" as a new language feature
 | 
|---|
| 246 |     seems odd. In C, functions often have local state thanks to static local
 | 
|---|
| 247 |     variables (or globals, indeed). Of course, that has several
 | 
|---|
| 248 |     limitations. Can you perhaps present your contributions by enumerating
 | 
|---|
| 249 |     these limitations? See also my suggestion below about a possible framing
 | 
|---|
| 250 |     centred on a strawman.
 | 
|---|
| 251 | 
 | 
|---|
| 252 | Fixed, at the start of this section.
 | 
|---|
| 253 | 
 | 
|---|
| 254 |     line 2: "an old idea that is new again" -- this is too oblique
 | 
|---|
| 255 | 
 | 
|---|
| 256 | Fixed, removed.
 | 
|---|
| 257 | 
 | 
|---|
| 258 |     lines 2--15: I found this to be a word/concept soup. Stacks, closures,
 | 
|---|
| 259 |     generators, stackless stackful, coroutine, symmetric, asymmetric,
 | 
|---|
| 260 |     resume/suspend versus resume/resume... there needs to be a more gradual and
 | 
|---|
| 261 |     structured way to introduce all this, and ideally one that minimises
 | 
|---|
| 262 |     redundancy. Maybe present it as a series of "definitions" each with its own
 | 
|---|
| 263 |     heading, e.g. "A closure is stackless if its local state has statically
 | 
|---|
| 264 |     known fixed size"; "A generator simply means a stackless closure." And so
 | 
|---|
| 265 |     on. Perhaps also strongly introduce the word "activate" as a direct
 | 
|---|
| 266 |     contrast with resume and suspend. These are just a flavour of the sort of
 | 
|---|
| 267 |     changes that might make this paragraph into something readable.
 | 
|---|
| 268 | 
 | 
|---|
| 269 |     Continuing the thought: I found it confusing that by these definitions, a
 | 
|---|
| 270 |     stackful closure is not a stack, even though logically the stack *is* a
 | 
|---|
| 271 |     kind of closure (it is a representation of the current thread's
 | 
|---|
| 272 |     continuation).
 | 
|---|
| 273 | 
 | 
|---|
| 274 | Fixed. Rewrote paragraph and moved Table 1 forward.
 | 
|---|
| 275 | 
 | 
|---|
| 276 |     lines 24--27: without explaining what the boost functor types mean, I don't
 | 
|---|
| 277 |     think the point here comes across.
 | 
|---|
| 278 | 
 | 
|---|
| 279 | Replaced with uC++ example because boost appears to have dropped symmetric
 | 
|---|
| 280 | coroutines.
 | 
|---|
| 281 | 
 | 
|---|
| 282 |     line 34: "semantically coupled" -- I wasn't sure what this meant
 | 
|---|
| 283 | 
 | 
|---|
| 284 | Fixed.
 | 
|---|
| 285 | 
 | 
|---|
| 286 |     p4: the point of Figure 1 (C) was not immediately clear. It seem to be
 | 
|---|
| 287 |     showing how one might "compile down" Figure 1 (B). Or is that Figure 1 (A)?
 | 
|---|
| 288 | 
 | 
|---|
| 289 | Fixed. Rewrote sentence.
 | 
|---|
| 290 | 
 | 
|---|
| 291 |     It's right that the incidental language features of the system are not
 | 
|---|
| 292 |     front-and-centre, but I'd appreciate some brief glossing of non-C languages
 | 
|---|
| 293 |     features as they appear. Examples are the square bracket notation, the pipe
 | 
|---|
| 294 |     notation and the constructor syntax. These explanations could go in the
 | 
|---|
| 295 |     caption of the figure which first uses them, perhaps. Overall I found the
 | 
|---|
| 296 |     figure captions to be terse, and a missed opportunity to explain clearly
 | 
|---|
| 297 |     what was going on.
 | 
|---|
| 298 | 
 | 
|---|
| 299 | Fixed, added descriptive footnote about Cforall. We prefer to put text in the
 | 
|---|
| 300 | body of the paper and keep captions short.
 | 
|---|
| 301 | 
 | 
|---|
| 302 |     p5 line 23: "This restriction is removed..." -- give us some up-front
 | 
|---|
| 303 |     summary of your contributions and the elements of the language design that
 | 
|---|
| 304 |     will be talked about, so that this isn't an aside. This will reduce the
 | 
|---|
| 305 |     "twisty passages" feeling that characterises much of the paper.
 | 
|---|
| 306 | 
 | 
|---|
| 307 | Fixed, remove parenthesis.
 | 
|---|
| 308 | 
 | 
|---|
| 309 |     line 40: "a killer asymmetric generator" -- this is stylistically odd, and
 | 
|---|
| 310 |     the sentence about failures doesn't convincingly argue that C\/ will help
 | 
|---|
| 311 |     with them. Have you any experience writing device drivers using C\/? Or any
 | 
|---|
| 312 |     argument that the kinds of failures can be traced to the "stack-ripping"
 | 
|---|
| 313 |     style that one is forced to use without coroutines ?
 | 
|---|
| 314 | 
 | 
|---|
| 315 | Fixed, added new paragraph.
 | 
|---|
| 316 | 
 | 
|---|
| 317 |     Also, a typo on line
 | 
|---|
| 318 |     41: "device drives". And saying "Windows/Linux" is sloppy... what does the
 | 
|---|
| 319 |     cited paper actually say?
 | 
|---|
| 320 | 
 | 
|---|
| 321 | Fixed.
 | 
|---|
| 322 | 
 | 
|---|
| 323 |     p6 lines 13--23: this paragraph is difficult to understand. It seems to be
 | 
|---|
| 324 |     talking about a control-flow pattern roughly equivalent to tail recursion.
 | 
|---|
| 325 |     What is the high-level point, other than that this is possible?
 | 
|---|
| 326 | 
 | 
|---|
| 327 | Fixed, rewrote start of the paragraph.
 | 
|---|
| 328 | 
 | 
|---|
| 329 |     line 34: "which they call coroutines" -- a better way to make this point is
 | 
|---|
| 330 |     presumably that the C++20 proposal only provides a specialised kind of
 | 
|---|
| 331 |     coroutine, namely generators, despite its use of the more general word.
 | 
|---|
| 332 | 
 | 
|---|
| 333 | Fixed.
 | 
|---|
| 334 | 
 | 
|---|
| 335 |     line 47: "... due to dynamic stack allocation, execution..." -- this
 | 
|---|
| 336 |     sentence doesn't scan. I suggest adding "and for" in the relevant places
 | 
|---|
| 337 |     where currently there are only commas.
 | 
|---|
| 338 | 
 | 
|---|
| 339 | Fixed.
 | 
|---|
| 340 | 
 | 
|---|
| 341 |     p8 / Figure 5 (B) -- the GNU C extension of unary "&&" needs to be
 | 
|---|
| 342 |     explained.
 | 
|---|
| 343 | 
 | 
|---|
| 344 | Fixed, added explanation at first usage in Figure 1(C) and reference.
 | 
|---|
| 345 | 
 | 
|---|
| 346 |     The whole figure needs a better explanation, in fact.
 | 
|---|
| 347 | 
 | 
|---|
| 348 | Fixed, rewrote start of the paragraph.
 | 
|---|
| 349 | 
 | 
|---|
| 350 |     p9, lines 1--10: I wasn't sure this stepping-through really added much
 | 
|---|
| 351 |     value. What are the truly important points to note about this code?
 | 
|---|
| 352 | 
 | 
|---|
| 353 | Fixed, shortened and merged with previous paragraph.
 | 
|---|
| 354 | 
 | 
|---|
| 355 |     p10: similarly, lines 3--27 again are somewhere between tedious and
 | 
|---|
| 356 |     confusing. I'm sure the motivation and details of "starter semantics" can
 | 
|---|
| 357 |     both be stated much more pithily.
 | 
|---|
| 358 | 
 | 
|---|
| 359 | Fixed, shortened these paragraphs.
 | 
|---|
| 360 | 
 | 
|---|
| 361 |     line 32: "a self-resume does not overwrite the last resumer" -- is this a
 | 
|---|
| 362 |     hack or a defensible principled decision?
 | 
|---|
| 363 | 
 | 
|---|
| 364 | Fixed, removed but it is a defensible principled decision.
 | 
|---|
| 365 | 
 | 
|---|
| 366 |     p11: "a common source of errors" -- among beginners or among production
 | 
|---|
| 367 |     code? Presumably the former.
 | 
|---|
| 368 | 
 | 
|---|
| 369 | Forgetting is not specific to beginners.
 | 
|---|
| 370 | 
 | 
|---|
| 371 |     line 23: "with builtin and library" -- not sure what this means
 | 
|---|
| 372 | 
 | 
|---|
| 373 | Fixed.
 | 
|---|
| 374 | 
 | 
|---|
| 375 |     lines 31--36: these can be much briefer. The only important point here
 | 
|---|
| 376 |     seems to be that coroutines cannot be copied.
 | 
|---|
| 377 | 
 | 
|---|
| 378 | Fixed, shortened.
 | 
|---|
| 379 | 
 | 
|---|
| 380 |     p12: line 1: what is a "task"? Does it matter?
 | 
|---|
| 381 | 
 | 
|---|
| 382 | Fixed, "task" has been changed to "thread" throughout the paper.
 | 
|---|
| 383 | 
 | 
|---|
| 384 |      line 7: calling it "heap stack" seems to be a recipe for
 | 
|---|
| 385 |      confusion. "Stack-and-heap" might be better, and contrast with
 | 
|---|
| 386 |      "stack-and-VLS" perhaps. When "VLS" is glossed, suggest actually expanding
 | 
|---|
| 387 |      its initials: say "length" not "size".
 | 
|---|
| 388 | 
 | 
|---|
| 389 | Fixed, make correction and rewrote some of the text.
 | 
|---|
| 390 | 
 | 
|---|
| 391 |      line 21: are you saying "cooperative threading" is the same as
 | 
|---|
| 392 |      "non-preemptive scheduling", or that one is a special case (kind) of the
 | 
|---|
| 393 |      other? Both are defensible, but be clear.
 | 
|---|
| 394 | 
 | 
|---|
| 395 | Fixed, clarified the definitions.
 | 
|---|
| 396 | 
 | 
|---|
| 397 |     line 27: "mutual exclusion and synchronization" -- the former is a kind of
 | 
|---|
| 398 |     the latter, so I suggest "and other forms of synchronization".
 | 
|---|
| 399 | 
 | 
|---|
| 400 | We have to agree to disagree. Included a citation that explains the
 | 
|---|
| 401 | differences.
 | 
|---|
| 402 | 
 | 
|---|
| 403 |     line 30: "can either be a stackless or stackful" -- stray "a", but also,
 | 
|---|
| 404 |     this seems to be switching from generic/background terminology to
 | 
|---|
| 405 |     C\/-specific terminology.
 | 
|---|
| 406 | 
 | 
|---|
| 407 | Fixed, but the terms stackless or stackful are not specific to Cforall; they
 | 
|---|
| 408 | are well known in the literature.
 | 
|---|
| 409 | 
 | 
|---|
| 410 |     An expositional idea occurs: start the paper with a strawman naive/limited
 | 
|---|
| 411 |     realisation of coroutines -- say, Simon Tatham's popular "Coroutines in C"
 | 
|---|
| 412 |     web page -- and identify point by point what the limitations are and how
 | 
|---|
| 413 |     C\/ overcomes them. Currently the presentation is often flat (lacking
 | 
|---|
| 414 |     motivating contrasts) and backwards (stating solutions before
 | 
|---|
| 415 |     problems). The foregoing approach might fix both of these.
 | 
|---|
| 416 | 
 | 
|---|
| 417 | We prefer the current structure of our paper and believe the paper does
 | 
|---|
| 418 | explain basic coding limitations and how they are overcome in using high-level
 | 
|---|
| 419 | control-floe mechanisms.
 | 
|---|
| 420 | 
 | 
|---|
| 421 |     page 13: line 23: it seems a distraction to mention the Python feature
 | 
|---|
| 422 |     here.
 | 
|---|
| 423 | 
 | 
|---|
| 424 | Why? It is the first location in the paper where dynamic allocation and
 | 
|---|
| 425 | initialization are mentioned.
 | 
|---|
| 426 | 
 | 
|---|
| 427 |     p14 line 5: it seems odd to describe these as "stateless" just because they
 | 
|---|
| 428 |     lack shared mutable state. It means the code itself is even more
 | 
|---|
| 429 |     stateful. Maybe the "stack ripping" argument could usefully be given here.
 | 
|---|
| 430 | 
 | 
|---|
| 431 | Fixed, changed "stateless" to "non-shared".
 | 
|---|
| 432 | 
 | 
|---|
| 433 |     line 16: "too restrictive" -- would be good to have a reference to justify
 | 
|---|
| 434 |     this, or at least give a sense of what the state-of-the-art performance in
 | 
|---|
| 435 |     transactional memory systems is (both software and hardware)
 | 
|---|
| 436 | 
 | 
|---|
| 437 | Fixed, added 2 citations.
 | 
|---|
| 438 | 
 | 
|---|
| 439 |     line 22: "simulate monitors" -- what about just *implementing* monitors?
 | 
|---|
| 440 |     isn't that what these systems do? or is the point more about refining them
 | 
|---|
| 441 |     somehow into something more specialised?
 | 
|---|
| 442 | 
 | 
|---|
| 443 | Fixed, changed "simulate monitors" to "manually implement a monitor".
 | 
|---|
| 444 | 
 | 
|---|
| 445 |     p15: sections 4.1 and 4.2 seem adrift and misplaced. Split them into basic
 | 
|---|
| 446 |     parts (which go earlier) and more advanced parts (e.g. barging, which can
 | 
|---|
| 447 |     be explained later).
 | 
|---|
| 448 | 
 | 
|---|
| 449 | Fixed, removed them by shortening and merging with previous section.
 | 
|---|
| 450 | 
 | 
|---|
| 451 |     line 31: "acquire/release" -- misses an opportunity to contrast the
 | 
|---|
| 452 |     monitor's "enter/exit" abstraction with the less structured acquire/release
 | 
|---|
| 453 |     of locks.
 | 
|---|
| 454 | 
 | 
|---|
| 455 | Fixed, added "by call/return" in sentence.
 | 
|---|
| 456 | 
 | 
|---|
| 457 |     p16 line 12: the "implicit" versus "explicit" point is unclear. Is it
 | 
|---|
| 458 |     perhaps about the contract between an opt-in *discipline* and a
 | 
|---|
| 459 |     language-enforced *guarantee*?
 | 
|---|
| 460 | 
 | 
|---|
| 461 | Fixed.
 | 
|---|
| 462 | 
 | 
|---|
| 463 |     line 28: no need to spend ages dithering about which one is default and
 | 
|---|
| 464 |     which one is the explicit qualifier. Tell us what you decided, briefly
 | 
|---|
| 465 |     justify it, and move on.
 | 
|---|
| 466 | 
 | 
|---|
| 467 | Fixed, shortened paragraph.
 | 
|---|
| 468 | 
 | 
|---|
| 469 |     p17: Figure 11: since the main point seems to be to highlight bulk acquire,
 | 
|---|
| 470 |     include a comment which identifies the line where this is happening.
 | 
|---|
| 471 | 
 | 
|---|
| 472 | Fixed.
 | 
|---|
| 473 | 
 | 
|---|
| 474 |     line 2: "impossible to statically..." -- or dynamically. Doing it
 | 
|---|
| 475 |     dynamically would be perfectly acceptable (locking is a dynamic operation
 | 
|---|
| 476 |     after all)
 | 
|---|
| 477 | 
 | 
|---|
| 478 | Fixed, clarified the "statically" applied to the unknown-sized pointer types.
 | 
|---|
| 479 | 
 | 
|---|
| 480 |     "guarantees acquisition order is consistent" -- assuming it's done in a
 | 
|---|
| 481 |     single bulk acquire.
 | 
|---|
| 482 | 
 | 
|---|
| 483 | Fixed.
 | 
|---|
| 484 | 
 | 
|---|
| 485 |     p18: section 5.3: the text here is a mess. The explanations of "internal"
 | 
|---|
| 486 |     versus "external" scheduling are unclear, and "signals as hints" is not
 | 
|---|
| 487 |     explained. "... can cause thread starvation" -- means including a while
 | 
|---|
| 488 |     loop, or not doing so? "There are three signalling mechanisms.." but the
 | 
|---|
| 489 |     text does not follow that by telling us what they are. My own scribbled
 | 
|---|
| 490 |     attempt at unpicking the internal/external thing: "threads already in the
 | 
|---|
| 491 |     monitor, albeit waiting, have priority over those trying to enter".
 | 
|---|
| 492 | 
 | 
|---|
| 493 | Fixed, rewrote and shortened paragraphs.
 | 
|---|
| 494 | 
 | 
|---|
| 495 |     p19: line 3: "empty condition" -- explain that condition variables don't
 | 
|---|
| 496 |     store anything. So being "empty" means that the queue of waiting threads
 | 
|---|
| 497 |     (threads waiting to be signalled that the condition has become true) is
 | 
|---|
| 498 |     empty.
 | 
|---|
| 499 | 
 | 
|---|
| 500 | Fixed, changed condition variable to condition queue throughout the paper.
 | 
|---|
| 501 | 
 | 
|---|
| 502 |     line 6: "... can be transformed into external scheduling..." -- OK, but
 | 
|---|
| 503 |     give some motivation.
 | 
|---|
| 504 | 
 | 
|---|
| 505 | The paper states that it removes the condition queues and signal/wait. Changed
 | 
|---|
| 506 | "transform" to "simplified".
 | 
|---|
| 507 | 
 | 
|---|
| 508 |     p20: line 6: "mechnaism"
 | 
|---|
| 509 | 
 | 
|---|
| 510 | Fixed.
 | 
|---|
| 511 | 
 | 
|---|
| 512 |     lines 16--20: this is dense and can probably only be made clear with an
 | 
|---|
| 513 |     example
 | 
|---|
| 514 |     
 | 
|---|
| 515 | Fixed, rewrote and added example.
 | 
|---|
| 516 | 
 | 
|---|
| 517 |     p21 line 21: clarify that nested monitor deadlock was describe earlier (in
 | 
|---|
| 518 |     5.2). (Is the repetition necessary?)
 | 
|---|
| 519 | 
 | 
|---|
| 520 | Fixed, put in a forward reference, and the point bears repeating because
 | 
|---|
| 521 | releasing a subset of acquired monitors in unique to Cforall concurrency.
 | 
|---|
| 522 | 
 | 
|---|
| 523 |     line 27: "locks, and by extension monitors" -- this is true but the "by
 | 
|---|
| 524 |     extension" argument is faulty. It is perfectly possible to use locks as a
 | 
|---|
| 525 |     primitive and build a compositional mechanism out of them,
 | 
|---|
| 526 |     e.g. transactions.
 | 
|---|
| 527 | 
 | 
|---|
| 528 | True, but that is not what we said. Locks are not composable, monitors are
 | 
|---|
| 529 | built using locks not transactions, so by extension monitors are not composable.
 | 
|---|
| 530 | 
 | 
|---|
| 531 |     p22 line 2: should say "restructured"
 | 
|---|
| 532 | 
 | 
|---|
| 533 | Fixed.
 | 
|---|
| 534 | 
 | 
|---|
| 535 |     line 33: "Implementing a fast subset check..." -- make clear that the
 | 
|---|
| 536 |     following section explains how to do this. Restructuring the sections
 | 
|---|
| 537 |     themselves could do this, or noting in the text.
 | 
|---|
| 538 | 
 | 
|---|
| 539 | Fixed, added a forward reference to the following sections.
 | 
|---|
| 540 | 
 | 
|---|
| 541 |     p23: line 3: "dynamic member adding, eg, JavaScript" -- needs to say "as
 | 
|---|
| 542 |     permitted in JavaScript", and "dynamically adding members" is stylistically
 | 
|---|
| 543 |     better
 | 
|---|
| 544 | 
 | 
|---|
| 545 | Fixed.
 | 
|---|
| 546 | 
 | 
|---|
| 547 |     p23: line 18: "urgent stack" -- back-reference to where this was explained
 | 
|---|
| 548 |     before
 | 
|---|
| 549 | 
 | 
|---|
| 550 | Fixed.
 | 
|---|
| 551 | 
 | 
|---|
| 552 |     p24 line 7: I did not understand what was more "direct" about "direct
 | 
|---|
| 553 |     communication". Also, what is a "passive monitor" -- just a monitor, given
 | 
|---|
| 554 |     that monitors are passive by design?
 | 
|---|
| 555 | 
 | 
|---|
| 556 | The back half of line 7 defines "direct". For example, Go, Java, pthread
 | 
|---|
| 557 | threads cannot directly call/communicate with one another, where they can in
 | 
|---|
| 558 | Ada, uC++, and Cforall threads. Figure 18 show this exact difference.
 | 
|---|
| 559 | 
 | 
|---|
| 560 | A monitor object is *passive* because it does not have a thread, while a Go,
 | 
|---|
| 561 | Java, Cforall "thread" object is *active* because it has a thread.
 | 
|---|
| 562 | 
 | 
|---|
| 563 |     line 14 / section 5.9: this table was useful and it (or something like it)
 | 
|---|
| 564 |     could be used much earlier on to set the structure of the rest of the
 | 
|---|
| 565 |     paper.
 | 
|---|
| 566 | 
 | 
|---|
| 567 | Fixed, Table 1 is moved to the start and explained in detail.
 | 
|---|
| 568 | 
 | 
|---|
| 569 |     The explanation at present is too brief, e.g. I did not really understand
 | 
|---|
| 570 |     the point about cases 7 and 8. Table 1: what does "No / Yes" mean?
 | 
|---|
| 571 | 
 | 
|---|
| 572 | Fixed, expanded the explanation.
 | 
|---|
| 573 | 
 | 
|---|
| 574 |     p25 line 2: instead of casually dropping in a terse explanation for the
 | 
|---|
| 575 |     newly introduced term "virtual processor", introduce it
 | 
|---|
| 576 |     properly. Presumably the point is to give a less ambiguous meaning to
 | 
|---|
| 577 |     "thread" by reserving it only for C\/'s green threads.
 | 
|---|
| 578 | 
 | 
|---|
| 579 | Fixed.
 | 
|---|
| 580 | 
 | 
|---|
| 581 |     p26 line 15: "transforms user threads into fibres" -- a reference is needed
 | 
|---|
| 582 |     to explain what "fibres" means... guessing it's in the sense of Adya et al.
 | 
|---|
| 583 | 
 | 
|---|
| 584 | Fixed. In a prior correct, the term fibre from Adya is defined.
 | 
|---|
| 585 | 
 | 
|---|
| 586 |     line 20: "Microsoft runtime" -- means Windows?
 | 
|---|
| 587 | 
 | 
|---|
| 588 | Fixed.
 | 
|---|
| 589 | 
 | 
|---|
| 590 |     lines 21--26: don't say "interrupt" to mean "signal", especially not
 | 
|---|
| 591 |     without clear introduction. You can use "POSIX signal" to disambiguate from
 | 
|---|
| 592 |     condition variables' "signal".
 | 
|---|
| 593 | 
 | 
|---|
| 594 | We have to agree to disagree on this terminology. Interrupt is the action of
 | 
|---|
| 595 | stopping the CPU while a signal is a specific kind of interrupt. The two terms
 | 
|---|
| 596 | seem to be well understood in the literature.
 | 
|---|
| 597 | 
 | 
|---|
| 598 |     p27 line 3: "frequency is usually long" -- that's a "time period" or
 | 
|---|
| 599 |     "interval", not a frequency
 | 
|---|
| 600 | 
 | 
|---|
| 601 | Fixed.
 | 
|---|
| 602 | 
 | 
|---|
| 603 |     line 5: the lengthy quotation is not really necessary; just paraphrase the
 | 
|---|
| 604 |     first sentence and move on.
 | 
|---|
| 605 | 
 | 
|---|
| 606 | Fixed.
 | 
|---|
| 607 | 
 | 
|---|
| 608 |     line 20: "to verify the implementation" -- I don't think that means what is
 | 
|---|
| 609 |     intended
 | 
|---|
| 610 | 
 | 
|---|
| 611 | Fixed, changed "verify" to "test".
 | 
|---|
| 612 | 
 | 
|---|
| 613 |     Tables in section 7 -- too many significant figures. How many overall runs
 | 
|---|
| 614 |     are described? What is N in each case?
 | 
|---|
| 615 | 
 | 
|---|
| 616 | Fixed. As stated, N=31.
 | 
|---|
| 617 | 
 | 
|---|
| 618 |     p29 line 2: "to eliminate this cost" -- arguably confusing since nowadays
 | 
|---|
| 619 |     on commodity CPUs most of the benefits of inlining are not to do with call
 | 
|---|
| 620 |     overheads, but from later optimizations enabled as a consequence of the
 | 
|---|
| 621 |     inlining
 | 
|---|
| 622 | 
 | 
|---|
| 623 | Fixed.
 | 
|---|
| 624 | 
 | 
|---|
| 625 |     line 41: "a hierarchy" -- are they a hierarchy? If so, this could be
 | 
|---|
| 626 |     explained earlier. Also, to say these make up "an integrated set... of
 | 
|---|
| 627 |     control-flow features" verges on the tautologous.
 | 
|---|
| 628 | 
 | 
|---|
| 629 | Fixed, rewrote sentence.
 | 
|---|
| 630 | 
 | 
|---|
| 631 |     p30 line 15: "a common case being web servers and XaaS" -- that's two cases
 | 
|---|
| 632 | 
 | 
|---|
| 633 | Fixed.
 | 
|---|
| 634 | 
 | 
|---|
| 635 | ============================================================================
 | 
|---|
| 636 | 
 | 
|---|
| 637 |     Reviewing: 3
 | 
|---|
| 638 | 
 | 
|---|
| 639 |     * Expand on the motivations for including both generator and coroutines, vs
 | 
|---|
| 640 |       trying to build one atop the other
 | 
|---|
| 641 | 
 | 
|---|
| 642 | Fixed, Table 1 is moved to the start and explained in detail.
 | 
|---|
| 643 | 
 | 
|---|
| 644 |     * Expand on the motivations for having both symmetric and asymmetric
 | 
|---|
| 645 |       coroutines?
 | 
|---|
| 646 | 
 | 
|---|
| 647 | A coroutine is not marked as symmetric or asymmetric, it is a coroutine.
 | 
|---|
| 648 | Symmetric or asymmetric is a stylistic use of a coroutine. By analogy, a
 | 
|---|
| 649 | function is not marked as recursive or non-recursive. Recursion is a style of
 | 
|---|
| 650 | programming with a function. So there is no notion of motivation for having
 | 
|---|
| 651 | both symmetric and asymmetric as they follow from how a programmer uses suspend
 | 
|---|
| 652 | and resume.
 | 
|---|
| 653 | 
 | 
|---|
| 654 |     * Comparison to async-await model adopted by other languages
 | 
|---|
| 655 | 
 | 
|---|
| 656 | Fixed, added a new section on this topic.
 | 
|---|
| 657 | 
 | 
|---|
| 658 |     * Consider performance comparisons against node.js and Rust frameworks
 | 
|---|
| 659 | 
 | 
|---|
| 660 | Fixed.
 | 
|---|
| 661 | 
 | 
|---|
| 662 |     * Discuss performance of monitors vs finer-grained memory models and atomic
 | 
|---|
| 663 |       operations found in other languages
 | 
|---|
| 664 | 
 | 
|---|
| 665 | The paper never suggested high-level concurrency constructs can or should
 | 
|---|
| 666 | replace race programming or hardware atomics. The paper suggests programmers
 | 
|---|
| 667 | use high-level constructs when and where is it feasible because they are easy
 | 
|---|
| 668 | and safer to use. The monitor example of an atomic counter is just that, an
 | 
|---|
| 669 | example, not the way it should be done if maximal performance is required.  We
 | 
|---|
| 670 | have tried to make this point clear in the paper.
 | 
|---|
| 671 | 
 | 
|---|
| 672 |     * Why both internal/external scheduling for synchronization?
 | 
|---|
| 673 | 
 | 
|---|
| 674 | Some additional motivation has been added.
 | 
|---|
| 675 | 
 | 
|---|
| 676 |     * Generators are not exposed as a "function" that returns a generator
 | 
|---|
| 677 |       object, but rather as a kind of struct, with communication happening via
 | 
|---|
| 678 |       mutable state instead of "return values".
 | 
|---|
| 679 | 
 | 
|---|
| 680 | Yes, Cforall uses an object-style of coroutine, which allows multiple interface
 | 
|---|
| 681 | functions that pass and return values through a structure. This approach allows
 | 
|---|
| 682 | a generator function to have different kinds of return values and different
 | 
|---|
| 683 | kinds of parameters to produce those values. Our generators can provide this
 | 
|---|
| 684 | capability via multiple interface functions to the generator/coroutine state,
 | 
|---|
| 685 | which is discussed on page 5, lines 13-21.
 | 
|---|
| 686 | 
 | 
|---|
| 687 |       That is, the generator must be manually resumed and (if I understood) it
 | 
|---|
| 688 |       is expected to store values that can then later be read (perhaps via
 | 
|---|
| 689 |       methods), instead of having a `yield <Expr>` statement that yields up a
 | 
|---|
| 690 |       value explicitly.
 | 
|---|
| 691 | 
 | 
|---|
| 692 | All generators are manually resumed, e.g., Python/nodejs use "next" to resume a
 | 
|---|
| 693 | generator. Yes, yield <Expr> has a single interface with one input/return type,
 | 
|---|
| 694 | versus the Cforall approach allowing arbitrary number of interfaces of
 | 
|---|
| 695 | arbitrary types.
 | 
|---|
| 696 | 
 | 
|---|
| 697 |     * Both "symmetric" and "asymmetric" generators are supported, instead of
 | 
|---|
| 698 |       only asymmetric.
 | 
|---|
| 699 | 
 | 
|---|
| 700 | Yes, because they support different functionality as discussed in Chris
 | 
|---|
| 701 | Marlin's seminal work and both forms are implemented in Simula67. We did not
 | 
|---|
| 702 | invent symmetric and asymmetric generators/coroutines, we took them from the
 | 
|---|
| 703 | literature.
 | 
|---|
| 704 | 
 | 
|---|
| 705 |     * Coroutines (multi-frame generators) are an explicit mechanism.
 | 
|---|
| 706 | 
 | 
|---|
| 707 |     In most other languages, coroutines are rather built by layering
 | 
|---|
| 708 |     single-frame generators atop one another (e.g., using a mechanism like
 | 
|---|
| 709 |     async-await),
 | 
|---|
| 710 | 
 | 
|---|
| 711 | We disagree. Node.js has async-await but has a separate coroutine feature.
 | 
|---|
| 712 | While there are claims that coroutines can be built from async-await and/or
 | 
|---|
| 713 | continuations, in actuality they cannot.
 | 
|---|
| 714 | 
 | 
|---|
| 715 |     and symmetric coroutines are basically not supported. I'd like to see a bit
 | 
|---|
| 716 |     more justification for Cforall including all the above mechanisms -- it
 | 
|---|
| 717 |     seemed like symmetric coroutines were a useful building block for some of
 | 
|---|
| 718 |     the user-space threading and custom scheduler mechanisms that were briefly
 | 
|---|
| 719 |     mentioned later in the paper.
 | 
|---|
| 720 | 
 | 
|---|
| 721 | Hopefully fixed by moving Table 1 forward.
 | 
|---|
| 722 | 
 | 
|---|
| 723 |     In the discussion of coroutines, I would have expected a bit more of a
 | 
|---|
| 724 |     comparison to the async-await mechanism offered in other languages.
 | 
|---|
| 725 | 
 | 
|---|
| 726 | We added a new section at the start to point out there is no comparison between
 | 
|---|
| 727 | coroutines and async-await.
 | 
|---|
| 728 | 
 | 
|---|
| 729 |     Certainly the semantics of async-await in JavaScript implies
 | 
|---|
| 730 |     significantly more overhead (because each async fn is a distinct heap
 | 
|---|
| 731 |     object). [Rust's approach avoids this overhead][zc], however, and might be
 | 
|---|
| 732 |     worthy of a comparison (see the Performance section).
 | 
|---|
| 733 | 
 | 
|---|
| 734 | We could not get Rust async-await to work, and when reading the description of
 | 
|---|
| 735 | rust async-await, it appears to be Java-style executors with futures (possibly
 | 
|---|
| 736 | fast futures).
 | 
|---|
| 737 | 
 | 
|---|
| 738 |     There are several sections in the paper that compare against atomics -- for
 | 
|---|
| 739 |     example, on page 15, the paper shows a simple monitor that encapsulates an
 | 
|---|
| 740 |     integer and compares that to C++ atomics. Later, the paper compares the
 | 
|---|
| 741 |     simplicity of monitors against the `volatile` quantifier from Java. The
 | 
|---|
| 742 |     conclusion in section 8 also revisits this point.
 | 
|---|
| 743 |     While I agree that monitors are simpler, they are obviously also
 | 
|---|
| 744 |     significantly different from a performance perspective -- the paper doesn't
 | 
|---|
| 745 |     seem to address this at all. It's plausible that (e.g.) the `Aint` monitor
 | 
|---|
| 746 |     type described in the paper can be compiled and mapped to the specialized
 | 
|---|
| 747 |     instructions offered by hardware, but I didn't see any mention of how this
 | 
|---|
| 748 |     would be done.
 | 
|---|
| 749 | 
 | 
|---|
| 750 | Fixed, see response above.
 | 
|---|
| 751 | 
 | 
|---|
| 752 |     There is also no mention of the more nuanced memory ordering
 | 
|---|
| 753 |     relations offered by C++11 and how one might achieve similar performance
 | 
|---|
| 754 |     characteristics in Cforall (perhaps the answer is that one simply doesn't
 | 
|---|
| 755 |     need to; I think that's defensible, but worth stating explicitly).
 | 
|---|
| 756 | 
 | 
|---|
| 757 | Cforall is built on C, and therefore has full access to all the gcc atomics,
 | 
|---|
| 758 | and automatically gets any gcc updates.  Furthermore, section 6.9 states that
 | 
|---|
| 759 | Cforall provides the full panoply of low-level locks, as does Java, Go, C++,
 | 
|---|
| 760 | for performance programming.
 | 
|---|
| 761 | 
 | 
|---|
| 762 |     Cforall includes both internal and external scheduling; I found the
 | 
|---|
| 763 |     explanation for the external scheduling mechanism to be lacking in
 | 
|---|
| 764 |     justification. Why include both mechanisms when most languages seem to make
 | 
|---|
| 765 |     do with only internal scheduling? It would be useful to show some scenarios
 | 
|---|
| 766 |     where external scheduling is truly more powerful.
 | 
|---|
| 767 | 
 | 
|---|
| 768 | Fixed. Pointed out external scheduling is simpler as part of rewriting in that
 | 
|---|
| 769 | section, and added additional examples.
 | 
|---|
| 770 | 
 | 
|---|
| 771 |     I would have liked to see some more discussion of external scheduling and
 | 
|---|
| 772 |     how it interacts with software engineering best practices. It seems
 | 
|---|
| 773 |     somewhat similar to AOP in certain regards. It seems to add a bit of "extra
 | 
|---|
| 774 |     semantics" to monitor methods, in that any method may now also become a
 | 
|---|
| 775 |     kind of synchronization point.
 | 
|---|
| 776 | 
 | 
|---|
| 777 | Fixed somewhat. Pointed out that external scheduling has been around for a long
 | 
|---|
| 778 | time (40 years) in Ada, so there is a body of the software-engineering
 | 
|---|
| 779 | experience using it. As well, I have been teaching it for 30 years in the
 | 
|---|
| 780 | concurrency course at Waterloo. We don't know what software engineering best
 | 
|---|
| 781 | practices you imagine it interacting with. Yes, monitor functions are
 | 
|---|
| 782 | synchronization points with external scheduling.
 | 
|---|
| 783 | 
 | 
|---|
| 784 |     The "open-ended" nature of this feels like it could easily lead to subtle
 | 
|---|
| 785 |     bugs, particularly when code refactoring occurs (which may e.g. split an
 | 
|---|
| 786 |     existing method into two).
 | 
|---|
| 787 | 
 | 
|---|
| 788 | Any time a public interface is refactored, it invalids existing calls, so there
 | 
|---|
| 789 | is always an issue. For mutex routines and external scheduling, the waitfor
 | 
|---|
| 790 | statements may have to be updated, but that update is part of the refactoring.
 | 
|---|
| 791 | 
 | 
|---|
| 792 |     This seems particularly true if external scheduling can occur across
 | 
|---|
| 793 |     compilation units -- the paper suggested that this is true, but I wasn't
 | 
|---|
| 794 |     entirely clear.
 | 
|---|
| 795 | 
 | 
|---|
| 796 | Every aspect of Cforall allows separate compilation. The function prototypes
 | 
|---|
| 797 | necessary for separate compilation provide all the information necessary to
 | 
|---|
| 798 | compile any aspect of a program.
 | 
|---|
| 799 | 
 | 
|---|
| 800 |     I would have also appreciated a few more details on how external scheduling
 | 
|---|
| 801 |     is implemented. It seems to me that there must be some sort of "hooks" on
 | 
|---|
| 802 |     mutex methods so that they can detect whether some other function is
 | 
|---|
| 803 |     waiting on them and awaken those blocked threads. I'm not sure how such
 | 
|---|
| 804 |     hooks are inserted, particularly across compilation units.
 | 
|---|
| 805 | 
 | 
|---|
| 806 | Hooks are inserted by the Cforall translator, in the same way that Java
 | 
|---|
| 807 | inserted hooks into a "synchronized" member of a monitor. As for Java, as long
 | 
|---|
| 808 | as the type information is consistent across compilation units, the correct
 | 
|---|
| 809 | code is inserted.
 | 
|---|
| 810 | 
 | 
|---|
| 811 |     The material in Section 5.6 didn't quite clarify the matter for me. For
 | 
|---|
| 812 |     example, it left me somewhat confused about whether the `f` and `g`
 | 
|---|
| 813 |     functions declared were meant to be local to a translation unit, or shared
 | 
|---|
| 814 |     with other unit.
 | 
|---|
| 815 | 
 | 
|---|
| 816 | There are no restrictions with respect to static or external mutex functions.
 | 
|---|
| 817 | Cforall is C. Any form of access or separate compilation in C applies to
 | 
|---|
| 818 | Cforall. As in C, function prototypes carry all necessary information to
 | 
|---|
| 819 | compile the code.
 | 
|---|
| 820 | 
 | 
|---|
| 821 |     To start, I did not realize that the `mutex_opt` notation was a keyword, I
 | 
|---|
| 822 |     thought it was a type annotation. I think this could be called out more
 | 
|---|
| 823 |     explicitly.
 | 
|---|
| 824 | 
 | 
|---|
| 825 | Fixed, indicated "mutex" is a C-style parameter-only declaration type-qualifier.
 | 
|---|
| 826 | 
 | 
|---|
| 827 |     Later, in section 5.2, the paper discusses `nomutex` annotations, which
 | 
|---|
| 828 |     initially threw me, as they had not been introduced (now I realize that
 | 
|---|
| 829 |     this paragraph is there to justify why there is no such keyword). The
 | 
|---|
| 830 |     paragraph might be rearranged to make that clearer, perhaps by leading with
 | 
|---|
| 831 |     the choice that Cforall made.
 | 
|---|
| 832 | 
 | 
|---|
| 833 | Fixed, rewrote paragraph removing nomutex.
 | 
|---|
| 834 | 
 | 
|---|
| 835 |     On page 17, the paper states that "acquiring multiple monitors is safe from
 | 
|---|
| 836 |     deadlock", but this could be stated a bit more precisely: acquiring
 | 
|---|
| 837 |     multiple monitors in a bulk-acquire is safe from deadlock (deadlock can
 | 
|---|
| 838 |     still result from nested acquires).
 | 
|---|
| 839 | 
 | 
|---|
| 840 | Fixed.
 | 
|---|
| 841 | 
 | 
|---|
| 842 |     On page 18, the paper states that wait states do not have to be enclosed in
 | 
|---|
| 843 |     loops, as there is no concern of barging. This seems true but there are
 | 
|---|
| 844 |     also other reasons to use loops (e.g., if there are multiple reasons to
 | 
|---|
| 845 |     notify on the same condition). Thus the statement initially surprised me,
 | 
|---|
| 846 |     as barging is only one of many reasons that I typically employ loops around
 | 
|---|
| 847 |     waits.
 | 
|---|
| 848 | 
 | 
|---|
| 849 | Fixed. Rewrote the sentence. Note, for all non-barging cases where you employ a
 | 
|---|
| 850 | loop around a wait, the unblocking task must change state before blocking
 | 
|---|
| 851 | again.  In the barging case, the unblocking thread blocks again without
 | 
|---|
| 852 | changing state.
 | 
|---|
| 853 | 
 | 
|---|
| 854 |     I did not understand the diagram in Figure 12 for some time. Initially, I
 | 
|---|
| 855 |     thought that it was generic to all monitors, and I could not understand the
 | 
|---|
| 856 |     state space. It was only later that I realized it was specific to your
 | 
|---|
| 857 |     example. Updating the caption from "Monitor scheduling to "Monitor
 | 
|---|
| 858 |     scheduling in the example from Fig 13" might have helped me quite a bit.
 | 
|---|
| 859 | 
 | 
|---|
| 860 | Fixed, updated text to clarify. Did not change the caption because the
 | 
|---|
| 861 | signal_block does not apply to Figure 13.
 | 
|---|
| 862 | 
 | 
|---|
| 863 |     I spent quite some time reading the boy/girl dating example (\*) and I
 | 
|---|
| 864 |     admit I found it somewhat confusing. For example, I couldn't tell whether
 | 
|---|
| 865 |     there were supposed to be many "girl" threads executing at once, or if
 | 
|---|
| 866 |     there was only supposed to be one girl and one boy thread executing in a
 | 
|---|
| 867 |     loop.
 | 
|---|
| 868 | 
 | 
|---|
| 869 | The paper states:
 | 
|---|
| 870 | 
 | 
|---|
| 871 |   The dating service matches girl and boy threads with matching compatibility
 | 
|---|
| 872 |   codes so they can exchange phone numbers.
 | 
|---|
| 873 | 
 | 
|---|
| 874 | so there are many girl/boy threads. There is nothing preventing an individual
 | 
|---|
| 875 | girl/boy from arranging multiple dates.
 | 
|---|
| 876 | 
 | 
|---|
| 877 |     Are the girl/boy threads supposed to invoke the girl/boy methods or vice
 | 
|---|
| 878 |     versa?
 | 
|---|
| 879 | 
 | 
|---|
| 880 | As long as the girls/boys are consistent in the calls, it does not matter. The
 | 
|---|
| 881 | goal is to find a partner and exchange phone numbers.
 | 
|---|
| 882 | 
 | 
|---|
| 883 |     Surely there is some easier way to set this up?
 | 
|---|
| 884 | 
 | 
|---|
| 885 | There are some other solutions using monitors but they all have a similar
 | 
|---|
| 886 | structure.
 | 
|---|
| 887 | 
 | 
|---|
| 888 |     The paper offered a number of comparisons to Go, C#, Scala, and so forth,
 | 
|---|
| 889 |     but seems to have overlooked another recent language, Rust. In many ways,
 | 
|---|
| 890 |     Rust seems to be closest in philosophy to Cforall, so it seems like an odd
 | 
|---|
| 891 |     omission. I already mentioned above that Rust is in the process of shipping
 | 
|---|
| 892 |     [async-await syntax][aa], which is definitely an alternative to the
 | 
|---|
| 893 |     generator/coroutine approach in Cforall (though one with clear pros/cons).
 | 
|---|
| 894 | 
 | 
|---|
| 895 | We cannot get rust async-await example programs to compile nor does the select!
 | 
|---|
| 896 | macro compile.
 | 
|---|
| 897 | 
 | 
|---|
| 898 |   @plg2[1]% rustc --version
 | 
|---|
| 899 |   rustc 1.40.0 (73528e339 2019-12-16)
 | 
|---|
| 900 |   
 | 
|---|
| 901 |   @plg2[2]% cat future.rs 
 | 
|---|
| 902 |   use futures::executor::block_on;
 | 
|---|
| 903 |   
 | 
|---|
| 904 |   async fn hello_world() {
 | 
|---|
| 905 |       println!("hello, world!");
 | 
|---|
| 906 |   }
 | 
|---|
| 907 |   
 | 
|---|
| 908 |   fn main() {
 | 
|---|
| 909 |       let future = hello_world(); // Nothing is printed
 | 
|---|
| 910 |       block_on(future); // `future` is run and "hello, world!" is printed
 | 
|---|
| 911 |   }
 | 
|---|
| 912 |   
 | 
|---|
| 913 |   @plg2[3]% rustc -C opt-level=3 future.rs
 | 
|---|
| 914 |   error[E0670]: `async fn` is not permitted in the 2015 edition
 | 
|---|
| 915 |    --> future.rs:3:1
 | 
|---|
| 916 |     |
 | 
|---|
| 917 |   3 | async fn hello_world() {
 | 
|---|
| 918 |     | ^^^^^
 | 
|---|
| 919 |   
 | 
|---|
| 920 |   error[E0433]: failed to resolve: maybe a missing crate `futures`?
 | 
|---|
| 921 |    --> future.rs:1:5
 | 
|---|
| 922 |     |
 | 
|---|
| 923 |   1 | use futures::executor::block_on;
 | 
|---|
| 924 |     |     ^^^^^^^ maybe a missing crate `futures`?
 | 
|---|
| 925 |   
 | 
|---|
| 926 |   error[E0425]: cannot find function `block_on` in this scope
 | 
|---|
| 927 |    --> future.rs:9:5
 | 
|---|
| 928 |     |
 | 
|---|
| 929 |   9 |     block_on(future); // `future` is run and "hello, world!" is printed
 | 
|---|
| 930 |     |     ^^^^^^^^ not found in this scope
 | 
|---|
| 931 |   
 | 
|---|
| 932 |   error: aborting due to 3 previous errors
 | 
|---|
| 933 |   
 | 
|---|
| 934 |   Some errors have detailed explanations: E0425, E0433, E0670.
 | 
|---|
| 935 |   For more information about an error, try `rustc --explain E0425`.
 | 
|---|
| 936 | 
 | 
|---|
| 937 | 
 | 
|---|
| 938 |     In the performance section in particular, you might consider comparing
 | 
|---|
| 939 |     against some of the Rust web servers and threading systems.
 | 
|---|
| 940 | 
 | 
|---|
| 941 | This paper is not about building web-servers. Nor are web-servers a reasonable
 | 
|---|
| 942 | benchmark for language concurrency. Web-servers are a benchmark for
 | 
|---|
| 943 | non-blocking I/O library efficiency accessed in the underlying operating
 | 
|---|
| 944 | system. Our prior work on web-server performance:
 | 
|---|
| 945 | 
 | 
|---|
| 946 | @inproceedings{Pariag07,
 | 
|---|
| 947 |     author      = {David Pariag and Tim Brecht and Ashif Harji and Peter Buhr and Amol Shukla},
 | 
|---|
| 948 |     title       = {Comparing the Performance of Web Server Architectures},
 | 
|---|
| 949 |     booktitle   = {Proceedings of the 2007 Eurosys conference},
 | 
|---|
| 950 |     month       = mar,
 | 
|---|
| 951 |     year        = 2007,
 | 
|---|
| 952 |     pages       = {231--243},
 | 
|---|
| 953 | }
 | 
|---|
| 954 | 
 | 
|---|
| 955 | @inproceedings{Harji12,
 | 
|---|
| 956 |     author      = {Ashif S. Harji and Peter A. Buhr and Tim Brecht},
 | 
|---|
| 957 |     title       = {Comparing High-Performance Multi-core Web-Server Architectures},
 | 
|---|
| 958 |     booktitle   = {Proceedings of the 5th Annual International Systems and Storage Conference},
 | 
|---|
| 959 |     series      = {SYSTOR '12},
 | 
|---|
| 960 |     publisher   = {ACM},
 | 
|---|
| 961 |     address     = {New York, NY, USA},
 | 
|---|
| 962 |     location    = {Haifa, Israel},
 | 
|---|
| 963 |     month       = jun,
 | 
|---|
| 964 |     year        = 2012,
 | 
|---|
| 965 |     articleno   = 1,
 | 
|---|
| 966 |     pages       = {1:1--1:12},
 | 
|---|
| 967 | }
 | 
|---|
| 968 | 
 | 
|---|
| 969 | shows the steps to build a high-performance web-server, which are largely
 | 
|---|
| 970 | independent of the server architecture and programing language.
 | 
|---|
| 971 | 
 | 
|---|
| 972 |     It would seem worth trying to compare their "context switching" costs as
 | 
|---|
| 973 |     well -- I believe both actix and tokio have a notion of threads that could
 | 
|---|
| 974 |     be readily compared.
 | 
|---|
| 975 | 
 | 
|---|
| 976 | Again, context-switching speed is largely irrelevant because the amount of code
 | 
|---|
| 977 | to process an http request is large enough to push any concurrency costs into
 | 
|---|
| 978 | the background.
 | 
|---|
| 979 | 
 | 
|---|
| 980 |     Another addition that might be worth considering is to compare against
 | 
|---|
| 981 |     node.js promises, although I think the comparison to process creation is
 | 
|---|
| 982 |     not as clean.
 | 
|---|
| 983 | 
 | 
|---|
| 984 | Done.
 | 
|---|