| [2aab69b] | 1 | A revised version of your manuscript that takes into account the comments | 
|---|
|  | 2 | of the referees will be reconsidered for publication. | 
|---|
|  | 3 |  | 
|---|
|  | 4 | We have attempted to address all the referee's comments in the revised version | 
|---|
|  | 5 | of the paper, with notes below for each comment. | 
|---|
|  | 6 |  | 
|---|
|  | 7 | ============================================================================= | 
|---|
|  | 8 |  | 
|---|
|  | 9 | Reviewing: 1 | 
|---|
|  | 10 |  | 
|---|
|  | 11 | As far as I can tell, the article contains three main ideas: an | 
|---|
|  | 12 | asynchronous execution / threading model; a model for monitors to provide | 
|---|
|  | 13 | mutual exclusion; and an implementation.  The first two ideas are drawn | 
|---|
|  | 14 | together in Table 1: unfortunately this is on page 25 of 30 pages of | 
|---|
|  | 15 | text. Implementation choices and descriptions are scattered throughout the | 
|---|
|  | 16 | paper - and the sectioning of the paper seems almost arbitrary. | 
|---|
|  | 17 |  | 
|---|
|  | 18 | Fixed, Table 1 is moved to the start and explained in detail. | 
|---|
|  | 19 |  | 
|---|
|  | 20 | The article is about its contributions.  Simply adding feature X to | 
|---|
|  | 21 | language Y isn't by itself a contribution, (when feature X isn't already a | 
|---|
|  | 22 | contribution). | 
|---|
|  | 23 |  | 
|---|
|  | 24 | C++ (Y) added object-oriented programming (X) to C, where OO programming (X) | 
|---|
|  | 25 | was not a contribution. | 
|---|
|  | 26 |  | 
|---|
|  | 27 | For example: why support two kinds of generators as well as user-level | 
|---|
|  | 28 | threads?  Why support both low and high level synchronization constructs? | 
|---|
|  | 29 |  | 
|---|
|  | 30 | Fixed, as part of discussing Table 1. | 
|---|
|  | 31 |  | 
|---|
|  | 32 | Similarly I would have found the article easier to follow if it was written | 
|---|
|  | 33 | top down, presenting the design principles, present the space of language | 
|---|
|  | 34 | features, justify chosen language features (and rationale) and those | 
|---|
|  | 35 | excluded, and then present implementation, and performance. | 
|---|
|  | 36 |  | 
|---|
|  | 37 | Fixed, the paper is now restructured in this form. | 
|---|
|  | 38 |  | 
|---|
|  | 39 | Then the writing of the article is often hard to follow, to say the | 
|---|
|  | 40 | least. Two examples: section 3 "stateful functions" - I've some idea | 
|---|
|  | 41 | what that is (a function with Algol's "own" or C's "static" variables? | 
|---|
|  | 42 | but in fact the paper has a rather more specific idea than that. | 
|---|
|  | 43 |  | 
|---|
|  | 44 | Fixed, at the start of this section. | 
|---|
|  | 45 |  | 
|---|
|  | 46 | The top of page 3 throws a whole lot of definitions at the reader | 
|---|
|  | 47 | "generator" "coroutine" "stackful" "stackless" "symmetric" "asymmetric" | 
|---|
|  | 48 | without every stopping to define each one | 
|---|
|  | 49 |  | 
|---|
|  | 50 | Hopefully fixed by moving Table 1 forward. | 
|---|
|  | 51 |  | 
|---|
|  | 52 | --- but then in footnote "C" takes the time to explain what C's "main" | 
|---|
|  | 53 | function is? I cannot imagine a reader of this paper who doesn't know what | 
|---|
|  | 54 | "main" is in C; especially if they understand the other concepts already | 
|---|
|  | 55 | presented in the paper. | 
|---|
|  | 56 |  | 
|---|
|  | 57 | Fixed by shortening. | 
|---|
|  | 58 |  | 
|---|
|  | 59 | The start of section 3 then does the same | 
|---|
|  | 60 | thing: putting up a whole lot of definitions, making distinctions and | 
|---|
|  | 61 | comparisons, even talking about some runtime details, but the critical | 
|---|
|  | 62 | definition of a monitor doesn't appear until three pages later, at the | 
|---|
|  | 63 | start of section 5 on p15, lines 29-34 are a good, clear, description | 
|---|
|  | 64 | of what a monitor actually is.  That needs to come first, rather than | 
|---|
|  | 65 | being buried again after two sections of comparisons, discussions, | 
|---|
|  | 66 | implementations, and options that are ungrounded because they haven't | 
|---|
|  | 67 | told the reader what they are actually talking about.  First tell the | 
|---|
|  | 68 | reader what something is, then how they might use it (as programmers: | 
|---|
|  | 69 | what are the rules and restrictions) and only then start comparison | 
|---|
|  | 70 | with other things, other approaches, other languages, or | 
|---|
|  | 71 | implementations. | 
|---|
|  | 72 |  | 
|---|
|  | 73 | Hopefully fixed by moving Table 1 forward. | 
|---|
|  | 74 |  | 
|---|
|  | 75 | The description of the implementation is similarly lost in the trees | 
|---|
|  | 76 | without ever really seeing the wood. Figure 19 is crucial here, but | 
|---|
|  | 77 | it's pretty much at the end of the paper, and comments about | 
|---|
|  | 78 | implementations are threaded throughout the paper without the context | 
|---|
|  | 79 | (fig 19) to understand what's going on. | 
|---|
|  | 80 |  | 
|---|
|  | 81 | We have to agree to disagree on the location of Fig 19. Early discussion about | 
|---|
|  | 82 | implementation for the various control structures are specific to that feature. | 
|---|
|  | 83 | Fig 19 shows the global runtime structure, which manages only the threading | 
|---|
|  | 84 | aspect of the control structures and their global organization. | 
|---|
|  | 85 |  | 
|---|
|  | 86 | The protocol for performance testing may just about suffice for C (although | 
|---|
|  | 87 | is N constantly ten million, or does it vary for each benchmark) | 
|---|
|  | 88 |  | 
|---|
|  | 89 | Fixed, the paper states N varies per language/benchmark so the benchmark runs | 
|---|
|  | 90 | long enough to get a good average per operation. | 
|---|
|  | 91 |  | 
|---|
|  | 92 | but such evaluation isn't appropriate for garbage-collected or JITTed | 
|---|
|  | 93 | languages like Java or Go. | 
|---|
|  | 94 |  | 
|---|
|  | 95 | Please explain. All the actions in the benchmarks occur independently of the | 
|---|
|  | 96 | storage-management scheme, e.g., acquiring a lock is an aspect of execution not | 
|---|
|  | 97 | storage. In fact, garbage-collected or JITTed languages cheat on benchmarks and | 
|---|
|  | 98 | we had to take great care to prevent cheating and measure the actual operation. | 
|---|
|  | 99 |  | 
|---|
|  | 100 | p1 only a subset of C-forall extensions? | 
|---|
|  | 101 |  | 
|---|
|  | 102 | Fixed, removed. | 
|---|
|  | 103 |  | 
|---|
|  | 104 | p1 "has features often associated with object-oriented programming | 
|---|
|  | 105 | languages, such as constructors, destructors, virtuals and simple | 
|---|
|  | 106 | inheritance."  There's no need to quibble about this. Once a language has | 
|---|
|  | 107 | inheritance, it's hard to claim it's not object-oriented. | 
|---|
|  | 108 |  | 
|---|
|  | 109 | We have to agree to disagree. Object languages are defined by the notion of | 
|---|
|  | 110 | nested functions in a aggregate structure with a special receiver parameter | 
|---|
|  | 111 | "this", not by inheritance.  Inheritance is a polymorphic mechanism, e.g, | 
|---|
|  | 112 | Plan-9 C has simple inheritance but is not object-oriented. Because Cforall | 
|---|
|  | 113 | does not have a specific receiver, it is possible to have multiple function | 
|---|
|  | 114 | parameters as receivers, which introduces new concepts like bulk acquire for | 
|---|
|  | 115 | monitors. | 
|---|
|  | 116 |  | 
|---|
|  | 117 | p2 barging? signals-as-hints? | 
|---|
|  | 118 |  | 
|---|
|  | 119 | Added a footnote for barging. We feel these terms are well known in the | 
|---|
|  | 120 | concurrency literature, especially in pthreads and Java, and both terms have | 
|---|
|  | 121 | citations with extensive explanations and further citations. | 
|---|
|  | 122 |  | 
|---|
|  | 123 | p3 start your discussion of generations with a simple example of a | 
|---|
|  | 124 | C-forall generator.  Fig 1(b) might do: but put it inline instead of | 
|---|
|  | 125 | the python example - and explain the key rules and restrictions on the | 
|---|
|  | 126 | construct.  Then don't even start to compare with coroutines until | 
|---|
|  | 127 | you've presented, described and explained your coroutines... | 
|---|
|  | 128 | p3 I'd probably leave out the various "C" versions unless there are | 
|---|
|  | 129 | key points to make you can't make in C-forall. All the alternatives | 
|---|
|  | 130 | are just confusing. | 
|---|
|  | 131 |  | 
|---|
|  | 132 | Hopefully fixed as this block of text has been rewritten. | 
|---|
|  | 133 |  | 
|---|
|  | 134 | p4 but what's that "with" in Fig 1(B) | 
|---|
|  | 135 |  | 
|---|
|  | 136 | Footnote D explains the semantic of "with", which is like unqualified access | 
|---|
|  | 137 | for the receiver to the fields of a class from member routines, i.e., no | 
|---|
|  | 138 | "this->". | 
|---|
|  | 139 |  | 
|---|
|  | 140 | p5 start with the high level features of C-forall generators... | 
|---|
|  | 141 |  | 
|---|
|  | 142 | Hopefully fixed by moving Table 1 forward. | 
|---|
|  | 143 |  | 
|---|
|  | 144 | p5 why is the paper explaining networking protocols? | 
|---|
|  | 145 |  | 
|---|
|  | 146 | Fixed, added discussion on this point. | 
|---|
|  | 147 |  | 
|---|
|  | 148 | p7 lines 1-9 (transforming generator to coroutine - why would I do any of | 
|---|
|  | 149 | this? Why would I want one instead of the other (do not use "stack" in your | 
|---|
|  | 150 | answer!) | 
|---|
|  | 151 |  | 
|---|
|  | 152 | As stated on line 1 because state declarations from the generator type can be | 
|---|
|  | 153 | moved out of the coroutine type into the coroutine main | 
|---|
|  | 154 |  | 
|---|
|  | 155 | p10 last para "A coroutine must retain its last resumer to suspend back | 
|---|
|  | 156 | because the resumer is on a different stack. These reverse pointers allow | 
|---|
|  | 157 | suspend to cycle backwards, " I've no idea what is going on here?  why | 
|---|
|  | 158 | should I care?  Shouldn't I just be using threads instead?  why not? | 
|---|
|  | 159 |  | 
|---|
|  | 160 | Hopefully fixed by moving Table 1 forward. | 
|---|
|  | 161 |  | 
|---|
|  | 162 | p16 for the same reasons - what reasons? | 
|---|
|  | 163 |  | 
|---|
|  | 164 | Hopefully fixed by moving Table 1 forward. | 
|---|
|  | 165 |  | 
|---|
|  | 166 | p17 if the multiple-monitor entry procedure really is novel, write a paper | 
|---|
|  | 167 | about that, and only about that. | 
|---|
|  | 168 |  | 
|---|
|  | 169 | We do not believe this is a practical suggestion. | 
|---|
|  | 170 |  | 
|---|
|  | 171 | p23 "Loose Object Definitions" - no idea what that means.  in that | 
|---|
|  | 172 | section: you can't leave out JS-style dynamic properties.  Even in | 
|---|
|  | 173 | OOLs that (one way or another) allow separate definitions of methods | 
|---|
|  | 174 | (like Objective-C, Swift, Ruby, C#) at any time a runtime class has a | 
|---|
|  | 175 | fixed definition.  Quite why the detail about bit mask implementation | 
|---|
|  | 176 | is here anyway, I've no idea. | 
|---|
|  | 177 |  | 
|---|
|  | 178 | Fixed by rewriting the section. | 
|---|
|  | 179 |  | 
|---|
|  | 180 | p25 this cluster isn't a CLU cluster then? | 
|---|
|  | 181 |  | 
|---|
|  | 182 | No. A CLU cluster is like a class in an object-oriented programming language. | 
|---|
|  | 183 | A CFA cluster is a runtime organizational mechanism. | 
|---|
|  | 184 |  | 
|---|
|  | 185 | * conclusion should conclude the paper, not the related. | 
|---|
|  | 186 |  | 
|---|
|  | 187 | We do not understand this comment. | 
|---|
|  | 188 |  | 
|---|
|  | 189 | ============================================================================= | 
|---|
|  | 190 |  | 
|---|
|  | 191 | Reviewing: 2 | 
|---|
|  | 192 |  | 
|---|
|  | 193 | There is much description of the system and its details, but nothing about | 
|---|
|  | 194 | (non-artificial) uses of it. Although the microbenchmark data is | 
|---|
|  | 195 | encouraging, arguably not enough practical experience with the system has | 
|---|
|  | 196 | been reported here to say much about either its usability advantages or its | 
|---|
|  | 197 | performance. | 
|---|
|  | 198 |  | 
|---|
|  | 199 | We have a Catch-22 problem. Without publicity, there is no user community; | 
|---|
|  | 200 | without a user community, there are no publications for publicity. | 
|---|
|  | 201 |  | 
|---|
|  | 202 | p2: lines 4--9 are a little sloppy. It is not the languages but their | 
|---|
|  | 203 | popular implementations which "adopt" the 1:1 kernel threading model. | 
|---|
|  | 204 |  | 
|---|
|  | 205 | Fixed. | 
|---|
|  | 206 |  | 
|---|
|  | 207 | line 10: "medium work" -- "medium-sized work"? | 
|---|
|  | 208 |  | 
|---|
|  | 209 | Fixed. | 
|---|
|  | 210 |  | 
|---|
|  | 211 | line 18: "is all sequential to the compiler" -- not true in modern | 
|---|
|  | 212 | compilers, and in 2004 H-J Boehm wrote a tech report describing exactly why | 
|---|
|  | 213 | ("Threads cannot be implemented as a library", HP Labs). | 
|---|
|  | 214 |  | 
|---|
|  | 215 | We will have to disagree on this point. First, I am aware of Hans's 2004 paper | 
|---|
|  | 216 | because in that paper Hans cites my seminal work on this topic from 1995, which | 
|---|
|  | 217 | we cite in this paper.  Second, while modern memory-models have been added to | 
|---|
|  | 218 | languages like Java/C/C++ and new languages usually start with a memory model, | 
|---|
|  | 219 | it is still the programmer's responsibility to use them for racy code. Only | 
|---|
|  | 220 | when the programing language provides race-free constructs is the language | 
|---|
|  | 221 | aware of the concurrency; otherwise the code is sequential. Hans's paper "You | 
|---|
|  | 222 | Don't Know Jack About Shared Variables or Memory Models" talks about these | 
|---|
|  | 223 | issues, and is also cited in the paper. | 
|---|
|  | 224 |  | 
|---|
|  | 225 | line 20: "knows the optimization boundaries" -- I found this vague. What's | 
|---|
|  | 226 | an example? | 
|---|
|  | 227 |  | 
|---|
|  | 228 | Fixed. | 
|---|
|  | 229 |  | 
|---|
|  | 230 | line 31: this paragraph has made a lot of claims. Perhaps forward-reference | 
|---|
|  | 231 | to the parts of the paper that discuss each one. | 
|---|
|  | 232 |  | 
|---|
|  | 233 | Fixed by adding a road-map paragraph at the end of the introduction. | 
|---|
|  | 234 |  | 
|---|
|  | 235 | line 33: "so the reader can judge if" -- this reads rather | 
|---|
|  | 236 | passive-aggressively. Perhaps better: "... to support our argument that..." | 
|---|
|  | 237 |  | 
|---|
|  | 238 | Fixed. | 
|---|
|  | 239 |  | 
|---|
|  | 240 | line 41: "a dynamic partitioning mechanism" -- I couldn't tell what this | 
|---|
|  | 241 | meant | 
|---|
|  | 242 |  | 
|---|
|  | 243 | Fixed. | 
|---|
|  | 244 |  | 
|---|
|  | 245 | p3. Presenting concept of a "stateful function" as a new language feature | 
|---|
|  | 246 | seems odd. In C, functions often have local state thanks to static local | 
|---|
|  | 247 | variables (or globals, indeed). Of course, that has several | 
|---|
|  | 248 | limitations. Can you perhaps present your contributions by enumerating | 
|---|
|  | 249 | these limitations? See also my suggestion below about a possible framing | 
|---|
|  | 250 | centred on a strawman. | 
|---|
|  | 251 |  | 
|---|
|  | 252 | Fixed, at the start of this section. | 
|---|
|  | 253 |  | 
|---|
|  | 254 | line 2: "an old idea that is new again" -- this is too oblique | 
|---|
|  | 255 |  | 
|---|
|  | 256 | Fixed, removed. | 
|---|
|  | 257 |  | 
|---|
|  | 258 | lines 2--15: I found this to be a word/concept soup. Stacks, closures, | 
|---|
|  | 259 | generators, stackless stackful, coroutine, symmetric, asymmetric, | 
|---|
|  | 260 | resume/suspend versus resume/resume... there needs to be a more gradual and | 
|---|
|  | 261 | structured way to introduce all this, and ideally one that minimises | 
|---|
|  | 262 | redundancy. Maybe present it as a series of "definitions" each with its own | 
|---|
|  | 263 | heading, e.g. "A closure is stackless if its local state has statically | 
|---|
|  | 264 | known fixed size"; "A generator simply means a stackless closure." And so | 
|---|
|  | 265 | on. Perhaps also strongly introduce the word "activate" as a direct | 
|---|
|  | 266 | contrast with resume and suspend. These are just a flavour of the sort of | 
|---|
|  | 267 | changes that might make this paragraph into something readable. | 
|---|
|  | 268 |  | 
|---|
|  | 269 | Continuing the thought: I found it confusing that by these definitions, a | 
|---|
|  | 270 | stackful closure is not a stack, even though logically the stack *is* a | 
|---|
|  | 271 | kind of closure (it is a representation of the current thread's | 
|---|
|  | 272 | continuation). | 
|---|
|  | 273 |  | 
|---|
|  | 274 | Fixed. Rewrote paragraph and moved Table 1 forward. | 
|---|
|  | 275 |  | 
|---|
|  | 276 | lines 24--27: without explaining what the boost functor types mean, I don't | 
|---|
|  | 277 | think the point here comes across. | 
|---|
|  | 278 |  | 
|---|
|  | 279 | Replaced with uC++ example because boost appears to have dropped symmetric | 
|---|
|  | 280 | coroutines. | 
|---|
|  | 281 |  | 
|---|
|  | 282 | line 34: "semantically coupled" -- I wasn't sure what this meant | 
|---|
|  | 283 |  | 
|---|
|  | 284 | Fixed. | 
|---|
|  | 285 |  | 
|---|
|  | 286 | p4: the point of Figure 1 (C) was not immediately clear. It seem to be | 
|---|
|  | 287 | showing how one might "compile down" Figure 1 (B). Or is that Figure 1 (A)? | 
|---|
|  | 288 |  | 
|---|
|  | 289 | Fixed. Rewrote sentence. | 
|---|
|  | 290 |  | 
|---|
|  | 291 | It's right that the incidental language features of the system are not | 
|---|
|  | 292 | front-and-centre, but I'd appreciate some brief glossing of non-C languages | 
|---|
|  | 293 | features as they appear. Examples are the square bracket notation, the pipe | 
|---|
|  | 294 | notation and the constructor syntax. These explanations could go in the | 
|---|
|  | 295 | caption of the figure which first uses them, perhaps. Overall I found the | 
|---|
|  | 296 | figure captions to be terse, and a missed opportunity to explain clearly | 
|---|
|  | 297 | what was going on. | 
|---|
|  | 298 |  | 
|---|
|  | 299 | Fixed, added descriptive footnote about Cforall. We prefer to put text in the | 
|---|
|  | 300 | body of the paper and keep captions short. | 
|---|
|  | 301 |  | 
|---|
|  | 302 | p5 line 23: "This restriction is removed..." -- give us some up-front | 
|---|
|  | 303 | summary of your contributions and the elements of the language design that | 
|---|
|  | 304 | will be talked about, so that this isn't an aside. This will reduce the | 
|---|
|  | 305 | "twisty passages" feeling that characterises much of the paper. | 
|---|
|  | 306 |  | 
|---|
|  | 307 | Fixed, remove parenthesis. | 
|---|
|  | 308 |  | 
|---|
|  | 309 | line 40: "a killer asymmetric generator" -- this is stylistically odd, and | 
|---|
|  | 310 | the sentence about failures doesn't convincingly argue that C\/ will help | 
|---|
|  | 311 | with them. Have you any experience writing device drivers using C\/? Or any | 
|---|
|  | 312 | argument that the kinds of failures can be traced to the "stack-ripping" | 
|---|
|  | 313 | style that one is forced to use without coroutines ? | 
|---|
|  | 314 |  | 
|---|
|  | 315 | Fixed, added new paragraph. | 
|---|
|  | 316 |  | 
|---|
|  | 317 | Also, a typo on line | 
|---|
|  | 318 | 41: "device drives". And saying "Windows/Linux" is sloppy... what does the | 
|---|
|  | 319 | cited paper actually say? | 
|---|
|  | 320 |  | 
|---|
|  | 321 | Fixed. | 
|---|
|  | 322 |  | 
|---|
|  | 323 | p6 lines 13--23: this paragraph is difficult to understand. It seems to be | 
|---|
|  | 324 | talking about a control-flow pattern roughly equivalent to tail recursion. | 
|---|
|  | 325 | What is the high-level point, other than that this is possible? | 
|---|
|  | 326 |  | 
|---|
|  | 327 | Fixed, rewrote start of the paragraph. | 
|---|
|  | 328 |  | 
|---|
|  | 329 | line 34: "which they call coroutines" -- a better way to make this point is | 
|---|
|  | 330 | presumably that the C++20 proposal only provides a specialised kind of | 
|---|
|  | 331 | coroutine, namely generators, despite its use of the more general word. | 
|---|
|  | 332 |  | 
|---|
|  | 333 | Fixed. | 
|---|
|  | 334 |  | 
|---|
|  | 335 | line 47: "... due to dynamic stack allocation, execution..." -- this | 
|---|
|  | 336 | sentence doesn't scan. I suggest adding "and for" in the relevant places | 
|---|
|  | 337 | where currently there are only commas. | 
|---|
|  | 338 |  | 
|---|
|  | 339 | Fixed. | 
|---|
|  | 340 |  | 
|---|
|  | 341 | p8 / Figure 5 (B) -- the GNU C extension of unary "&&" needs to be | 
|---|
|  | 342 | explained. | 
|---|
|  | 343 |  | 
|---|
|  | 344 | Fixed, added explanation at first usage in Figure 1(C) and reference. | 
|---|
|  | 345 |  | 
|---|
|  | 346 | The whole figure needs a better explanation, in fact. | 
|---|
|  | 347 |  | 
|---|
|  | 348 | Fixed, rewrote start of the paragraph. | 
|---|
|  | 349 |  | 
|---|
|  | 350 | p9, lines 1--10: I wasn't sure this stepping-through really added much | 
|---|
|  | 351 | value. What are the truly important points to note about this code? | 
|---|
|  | 352 |  | 
|---|
|  | 353 | Fixed, shortened and merged with previous paragraph. | 
|---|
|  | 354 |  | 
|---|
|  | 355 | p10: similarly, lines 3--27 again are somewhere between tedious and | 
|---|
|  | 356 | confusing. I'm sure the motivation and details of "starter semantics" can | 
|---|
|  | 357 | both be stated much more pithily. | 
|---|
|  | 358 |  | 
|---|
|  | 359 | Fixed, shortened these paragraphs. | 
|---|
|  | 360 |  | 
|---|
|  | 361 | line 32: "a self-resume does not overwrite the last resumer" -- is this a | 
|---|
|  | 362 | hack or a defensible principled decision? | 
|---|
|  | 363 |  | 
|---|
|  | 364 | Fixed, removed but it is a defensible principled decision. | 
|---|
|  | 365 |  | 
|---|
|  | 366 | p11: "a common source of errors" -- among beginners or among production | 
|---|
|  | 367 | code? Presumably the former. | 
|---|
|  | 368 |  | 
|---|
|  | 369 | Forgetting is not specific to beginners. | 
|---|
|  | 370 |  | 
|---|
|  | 371 | line 23: "with builtin and library" -- not sure what this means | 
|---|
|  | 372 |  | 
|---|
|  | 373 | Fixed. | 
|---|
|  | 374 |  | 
|---|
|  | 375 | lines 31--36: these can be much briefer. The only important point here | 
|---|
|  | 376 | seems to be that coroutines cannot be copied. | 
|---|
|  | 377 |  | 
|---|
|  | 378 | Fixed, shortened. | 
|---|
|  | 379 |  | 
|---|
|  | 380 | p12: line 1: what is a "task"? Does it matter? | 
|---|
|  | 381 |  | 
|---|
|  | 382 | Fixed, "task" has been changed to "thread" throughout the paper. | 
|---|
|  | 383 |  | 
|---|
|  | 384 | line 7: calling it "heap stack" seems to be a recipe for | 
|---|
|  | 385 | confusion. "Stack-and-heap" might be better, and contrast with | 
|---|
|  | 386 | "stack-and-VLS" perhaps. When "VLS" is glossed, suggest actually expanding | 
|---|
|  | 387 | its initials: say "length" not "size". | 
|---|
|  | 388 |  | 
|---|
|  | 389 | Fixed, make correction and rewrote some of the text. | 
|---|
|  | 390 |  | 
|---|
|  | 391 | line 21: are you saying "cooperative threading" is the same as | 
|---|
|  | 392 | "non-preemptive scheduling", or that one is a special case (kind) of the | 
|---|
|  | 393 | other? Both are defensible, but be clear. | 
|---|
|  | 394 |  | 
|---|
|  | 395 | Fixed, clarified the definitions. | 
|---|
|  | 396 |  | 
|---|
|  | 397 | line 27: "mutual exclusion and synchronization" -- the former is a kind of | 
|---|
|  | 398 | the latter, so I suggest "and other forms of synchronization". | 
|---|
|  | 399 |  | 
|---|
|  | 400 | We have to agree to disagree. Included a citation that explains the | 
|---|
|  | 401 | differences. | 
|---|
|  | 402 |  | 
|---|
|  | 403 | line 30: "can either be a stackless or stackful" -- stray "a", but also, | 
|---|
|  | 404 | this seems to be switching from generic/background terminology to | 
|---|
|  | 405 | C\/-specific terminology. | 
|---|
|  | 406 |  | 
|---|
|  | 407 | Fixed, but the terms stackless or stackful are not specific to Cforall; they | 
|---|
|  | 408 | are well known in the literature. | 
|---|
|  | 409 |  | 
|---|
|  | 410 | An expositional idea occurs: start the paper with a strawman naive/limited | 
|---|
|  | 411 | realisation of coroutines -- say, Simon Tatham's popular "Coroutines in C" | 
|---|
|  | 412 | web page -- and identify point by point what the limitations are and how | 
|---|
|  | 413 | C\/ overcomes them. Currently the presentation is often flat (lacking | 
|---|
|  | 414 | motivating contrasts) and backwards (stating solutions before | 
|---|
|  | 415 | problems). The foregoing approach might fix both of these. | 
|---|
|  | 416 |  | 
|---|
|  | 417 | We prefer the current structure of our paper and believe the paper does | 
|---|
|  | 418 | explain basic coding limitations and how they are overcome in using high-level | 
|---|
|  | 419 | control-floe mechanisms. | 
|---|
|  | 420 |  | 
|---|
|  | 421 | page 13: line 23: it seems a distraction to mention the Python feature | 
|---|
|  | 422 | here. | 
|---|
|  | 423 |  | 
|---|
|  | 424 | Why? It is the first location in the paper where dynamic allocation and | 
|---|
|  | 425 | initialization are mentioned. | 
|---|
|  | 426 |  | 
|---|
|  | 427 | p14 line 5: it seems odd to describe these as "stateless" just because they | 
|---|
|  | 428 | lack shared mutable state. It means the code itself is even more | 
|---|
|  | 429 | stateful. Maybe the "stack ripping" argument could usefully be given here. | 
|---|
|  | 430 |  | 
|---|
|  | 431 | Fixed, changed "stateless" to "non-shared". | 
|---|
|  | 432 |  | 
|---|
|  | 433 | line 16: "too restrictive" -- would be good to have a reference to justify | 
|---|
|  | 434 | this, or at least give a sense of what the state-of-the-art performance in | 
|---|
|  | 435 | transactional memory systems is (both software and hardware) | 
|---|
|  | 436 |  | 
|---|
|  | 437 | Fixed, added 2 citations. | 
|---|
|  | 438 |  | 
|---|
|  | 439 | line 22: "simulate monitors" -- what about just *implementing* monitors? | 
|---|
|  | 440 | isn't that what these systems do? or is the point more about refining them | 
|---|
|  | 441 | somehow into something more specialised? | 
|---|
|  | 442 |  | 
|---|
|  | 443 | Fixed, changed "simulate monitors" to "manually implement a monitor". | 
|---|
|  | 444 |  | 
|---|
|  | 445 | p15: sections 4.1 and 4.2 seem adrift and misplaced. Split them into basic | 
|---|
|  | 446 | parts (which go earlier) and more advanced parts (e.g. barging, which can | 
|---|
|  | 447 | be explained later). | 
|---|
|  | 448 |  | 
|---|
|  | 449 | Fixed, removed them by shortening and merging with previous section. | 
|---|
|  | 450 |  | 
|---|
|  | 451 | line 31: "acquire/release" -- misses an opportunity to contrast the | 
|---|
|  | 452 | monitor's "enter/exit" abstraction with the less structured acquire/release | 
|---|
|  | 453 | of locks. | 
|---|
|  | 454 |  | 
|---|
|  | 455 | Fixed, added "by call/return" in sentence. | 
|---|
|  | 456 |  | 
|---|
|  | 457 | p16 line 12: the "implicit" versus "explicit" point is unclear. Is it | 
|---|
|  | 458 | perhaps about the contract between an opt-in *discipline* and a | 
|---|
|  | 459 | language-enforced *guarantee*? | 
|---|
|  | 460 |  | 
|---|
|  | 461 | Fixed. | 
|---|
|  | 462 |  | 
|---|
|  | 463 | line 28: no need to spend ages dithering about which one is default and | 
|---|
|  | 464 | which one is the explicit qualifier. Tell us what you decided, briefly | 
|---|
|  | 465 | justify it, and move on. | 
|---|
|  | 466 |  | 
|---|
|  | 467 | Fixed, shortened paragraph. | 
|---|
|  | 468 |  | 
|---|
|  | 469 | p17: Figure 11: since the main point seems to be to highlight bulk acquire, | 
|---|
|  | 470 | include a comment which identifies the line where this is happening. | 
|---|
|  | 471 |  | 
|---|
|  | 472 | Fixed. | 
|---|
|  | 473 |  | 
|---|
|  | 474 | line 2: "impossible to statically..." -- or dynamically. Doing it | 
|---|
|  | 475 | dynamically would be perfectly acceptable (locking is a dynamic operation | 
|---|
|  | 476 | after all) | 
|---|
|  | 477 |  | 
|---|
|  | 478 | Fixed, clarified the "statically" applied to the unknown-sized pointer types. | 
|---|
|  | 479 |  | 
|---|
|  | 480 | "guarantees acquisition order is consistent" -- assuming it's done in a | 
|---|
|  | 481 | single bulk acquire. | 
|---|
|  | 482 |  | 
|---|
|  | 483 | Fixed. | 
|---|
|  | 484 |  | 
|---|
|  | 485 | p18: section 5.3: the text here is a mess. The explanations of "internal" | 
|---|
|  | 486 | versus "external" scheduling are unclear, and "signals as hints" is not | 
|---|
|  | 487 | explained. "... can cause thread starvation" -- means including a while | 
|---|
|  | 488 | loop, or not doing so? "There are three signalling mechanisms.." but the | 
|---|
|  | 489 | text does not follow that by telling us what they are. My own scribbled | 
|---|
|  | 490 | attempt at unpicking the internal/external thing: "threads already in the | 
|---|
|  | 491 | monitor, albeit waiting, have priority over those trying to enter". | 
|---|
|  | 492 |  | 
|---|
|  | 493 | Fixed, rewrote and shortened paragraphs. | 
|---|
|  | 494 |  | 
|---|
|  | 495 | p19: line 3: "empty condition" -- explain that condition variables don't | 
|---|
|  | 496 | store anything. So being "empty" means that the queue of waiting threads | 
|---|
|  | 497 | (threads waiting to be signalled that the condition has become true) is | 
|---|
|  | 498 | empty. | 
|---|
|  | 499 |  | 
|---|
|  | 500 | Fixed, changed condition variable to condition queue throughout the paper. | 
|---|
|  | 501 |  | 
|---|
|  | 502 | line 6: "... can be transformed into external scheduling..." -- OK, but | 
|---|
|  | 503 | give some motivation. | 
|---|
|  | 504 |  | 
|---|
|  | 505 | The paper states that it removes the condition queues and signal/wait. Changed | 
|---|
|  | 506 | "transform" to "simplified". | 
|---|
|  | 507 |  | 
|---|
|  | 508 | p20: line 6: "mechnaism" | 
|---|
|  | 509 |  | 
|---|
|  | 510 | Fixed. | 
|---|
|  | 511 |  | 
|---|
|  | 512 | lines 16--20: this is dense and can probably only be made clear with an | 
|---|
|  | 513 | example | 
|---|
|  | 514 |  | 
|---|
|  | 515 | Fixed, rewrote and added example. | 
|---|
|  | 516 |  | 
|---|
|  | 517 | p21 line 21: clarify that nested monitor deadlock was describe earlier (in | 
|---|
|  | 518 | 5.2). (Is the repetition necessary?) | 
|---|
|  | 519 |  | 
|---|
|  | 520 | Fixed, put in a forward reference, and the point bears repeating because | 
|---|
|  | 521 | releasing a subset of acquired monitors in unique to Cforall concurrency. | 
|---|
|  | 522 |  | 
|---|
|  | 523 | line 27: "locks, and by extension monitors" -- this is true but the "by | 
|---|
|  | 524 | extension" argument is faulty. It is perfectly possible to use locks as a | 
|---|
|  | 525 | primitive and build a compositional mechanism out of them, | 
|---|
|  | 526 | e.g. transactions. | 
|---|
|  | 527 |  | 
|---|
|  | 528 | True, but that is not what we said. Locks are not composable, monitors are | 
|---|
|  | 529 | built using locks not transactions, so by extension monitors are not composable. | 
|---|
|  | 530 |  | 
|---|
|  | 531 | p22 line 2: should say "restructured" | 
|---|
|  | 532 |  | 
|---|
|  | 533 | Fixed. | 
|---|
|  | 534 |  | 
|---|
|  | 535 | line 33: "Implementing a fast subset check..." -- make clear that the | 
|---|
|  | 536 | following section explains how to do this. Restructuring the sections | 
|---|
|  | 537 | themselves could do this, or noting in the text. | 
|---|
|  | 538 |  | 
|---|
|  | 539 | Fixed, added a forward reference to the following sections. | 
|---|
|  | 540 |  | 
|---|
|  | 541 | p23: line 3: "dynamic member adding, eg, JavaScript" -- needs to say "as | 
|---|
|  | 542 | permitted in JavaScript", and "dynamically adding members" is stylistically | 
|---|
|  | 543 | better | 
|---|
|  | 544 |  | 
|---|
|  | 545 | Fixed. | 
|---|
|  | 546 |  | 
|---|
|  | 547 | p23: line 18: "urgent stack" -- back-reference to where this was explained | 
|---|
|  | 548 | before | 
|---|
|  | 549 |  | 
|---|
|  | 550 | Fixed. | 
|---|
|  | 551 |  | 
|---|
|  | 552 | p24 line 7: I did not understand what was more "direct" about "direct | 
|---|
|  | 553 | communication". Also, what is a "passive monitor" -- just a monitor, given | 
|---|
|  | 554 | that monitors are passive by design? | 
|---|
|  | 555 |  | 
|---|
|  | 556 | The back half of line 7 defines "direct". For example, Go, Java, pthread | 
|---|
|  | 557 | threads cannot directly call/communicate with one another, where they can in | 
|---|
|  | 558 | Ada, uC++, and Cforall threads. Figure 18 show this exact difference. | 
|---|
|  | 559 |  | 
|---|
|  | 560 | A monitor object is *passive* because it does not have a thread, while a Go, | 
|---|
|  | 561 | Java, Cforall "thread" object is *active* because it has a thread. | 
|---|
|  | 562 |  | 
|---|
|  | 563 | line 14 / section 5.9: this table was useful and it (or something like it) | 
|---|
|  | 564 | could be used much earlier on to set the structure of the rest of the | 
|---|
|  | 565 | paper. | 
|---|
|  | 566 |  | 
|---|
|  | 567 | Fixed, Table 1 is moved to the start and explained in detail. | 
|---|
|  | 568 |  | 
|---|
|  | 569 | The explanation at present is too brief, e.g. I did not really understand | 
|---|
|  | 570 | the point about cases 7 and 8. Table 1: what does "No / Yes" mean? | 
|---|
|  | 571 |  | 
|---|
|  | 572 | Fixed, expanded the explanation. | 
|---|
|  | 573 |  | 
|---|
|  | 574 | p25 line 2: instead of casually dropping in a terse explanation for the | 
|---|
|  | 575 | newly introduced term "virtual processor", introduce it | 
|---|
|  | 576 | properly. Presumably the point is to give a less ambiguous meaning to | 
|---|
|  | 577 | "thread" by reserving it only for C\/'s green threads. | 
|---|
|  | 578 |  | 
|---|
|  | 579 | Fixed. | 
|---|
|  | 580 |  | 
|---|
|  | 581 | p26 line 15: "transforms user threads into fibres" -- a reference is needed | 
|---|
|  | 582 | to explain what "fibres" means... guessing it's in the sense of Adya et al. | 
|---|
|  | 583 |  | 
|---|
|  | 584 | Fixed. In a prior correct, the term fibre from Adya is defined. | 
|---|
|  | 585 |  | 
|---|
|  | 586 | line 20: "Microsoft runtime" -- means Windows? | 
|---|
|  | 587 |  | 
|---|
|  | 588 | Fixed. | 
|---|
|  | 589 |  | 
|---|
|  | 590 | lines 21--26: don't say "interrupt" to mean "signal", especially not | 
|---|
|  | 591 | without clear introduction. You can use "POSIX signal" to disambiguate from | 
|---|
|  | 592 | condition variables' "signal". | 
|---|
|  | 593 |  | 
|---|
|  | 594 | We have to agree to disagree on this terminology. Interrupt is the action of | 
|---|
|  | 595 | stopping the CPU while a signal is a specific kind of interrupt. The two terms | 
|---|
|  | 596 | seem to be well understood in the literature. | 
|---|
|  | 597 |  | 
|---|
|  | 598 | p27 line 3: "frequency is usually long" -- that's a "time period" or | 
|---|
|  | 599 | "interval", not a frequency | 
|---|
|  | 600 |  | 
|---|
|  | 601 | Fixed. | 
|---|
|  | 602 |  | 
|---|
|  | 603 | line 5: the lengthy quotation is not really necessary; just paraphrase the | 
|---|
|  | 604 | first sentence and move on. | 
|---|
|  | 605 |  | 
|---|
|  | 606 | Fixed. | 
|---|
|  | 607 |  | 
|---|
|  | 608 | line 20: "to verify the implementation" -- I don't think that means what is | 
|---|
|  | 609 | intended | 
|---|
|  | 610 |  | 
|---|
|  | 611 | Fixed, changed "verify" to "test". | 
|---|
|  | 612 |  | 
|---|
|  | 613 | Tables in section 7 -- too many significant figures. How many overall runs | 
|---|
|  | 614 | are described? What is N in each case? | 
|---|
|  | 615 |  | 
|---|
|  | 616 | Fixed. As stated, N=31. | 
|---|
|  | 617 |  | 
|---|
|  | 618 | p29 line 2: "to eliminate this cost" -- arguably confusing since nowadays | 
|---|
|  | 619 | on commodity CPUs most of the benefits of inlining are not to do with call | 
|---|
|  | 620 | overheads, but from later optimizations enabled as a consequence of the | 
|---|
|  | 621 | inlining | 
|---|
|  | 622 |  | 
|---|
|  | 623 | Fixed. | 
|---|
|  | 624 |  | 
|---|
|  | 625 | line 41: "a hierarchy" -- are they a hierarchy? If so, this could be | 
|---|
|  | 626 | explained earlier. Also, to say these make up "an integrated set... of | 
|---|
|  | 627 | control-flow features" verges on the tautologous. | 
|---|
|  | 628 |  | 
|---|
|  | 629 | Fixed, rewrote sentence. | 
|---|
|  | 630 |  | 
|---|
|  | 631 | p30 line 15: "a common case being web servers and XaaS" -- that's two cases | 
|---|
|  | 632 |  | 
|---|
|  | 633 | Fixed. | 
|---|
|  | 634 |  | 
|---|
|  | 635 | ============================================================================ | 
|---|
|  | 636 |  | 
|---|
|  | 637 | Reviewing: 3 | 
|---|
|  | 638 |  | 
|---|
|  | 639 | * Expand on the motivations for including both generator and coroutines, vs | 
|---|
|  | 640 | trying to build one atop the other | 
|---|
|  | 641 |  | 
|---|
|  | 642 | Fixed, Table 1 is moved to the start and explained in detail. | 
|---|
|  | 643 |  | 
|---|
|  | 644 | * Expand on the motivations for having both symmetric and asymmetric | 
|---|
|  | 645 | coroutines? | 
|---|
|  | 646 |  | 
|---|
|  | 647 | A coroutine is not marked as symmetric or asymmetric, it is a coroutine. | 
|---|
|  | 648 | Symmetric or asymmetric is a stylistic use of a coroutine. By analogy, a | 
|---|
|  | 649 | function is not marked as recursive or non-recursive. Recursion is a style of | 
|---|
|  | 650 | programming with a function. So there is no notion of motivation for having | 
|---|
|  | 651 | both symmetric and asymmetric as they follow from how a programmer uses suspend | 
|---|
|  | 652 | and resume. | 
|---|
|  | 653 |  | 
|---|
|  | 654 | * Comparison to async-await model adopted by other languages | 
|---|
|  | 655 |  | 
|---|
|  | 656 | Fixed, added a new section on this topic. | 
|---|
|  | 657 |  | 
|---|
|  | 658 | * Consider performance comparisons against node.js and Rust frameworks | 
|---|
|  | 659 |  | 
|---|
|  | 660 | Fixed. | 
|---|
|  | 661 |  | 
|---|
|  | 662 | * Discuss performance of monitors vs finer-grained memory models and atomic | 
|---|
|  | 663 | operations found in other languages | 
|---|
|  | 664 |  | 
|---|
|  | 665 | The paper never suggested high-level concurrency constructs can or should | 
|---|
|  | 666 | replace race programming or hardware atomics. The paper suggests programmers | 
|---|
|  | 667 | use high-level constructs when and where is it feasible because they are easy | 
|---|
|  | 668 | and safer to use. The monitor example of an atomic counter is just that, an | 
|---|
|  | 669 | example, not the way it should be done if maximal performance is required.  We | 
|---|
|  | 670 | have tried to make this point clear in the paper. | 
|---|
|  | 671 |  | 
|---|
|  | 672 | * Why both internal/external scheduling for synchronization? | 
|---|
|  | 673 |  | 
|---|
|  | 674 | Some additional motivation has been added. | 
|---|
|  | 675 |  | 
|---|
|  | 676 | * Generators are not exposed as a "function" that returns a generator | 
|---|
|  | 677 | object, but rather as a kind of struct, with communication happening via | 
|---|
|  | 678 | mutable state instead of "return values". | 
|---|
|  | 679 |  | 
|---|
|  | 680 | Yes, Cforall uses an object-style of coroutine, which allows multiple interface | 
|---|
|  | 681 | functions that pass and return values through a structure. This approach allows | 
|---|
|  | 682 | a generator function to have different kinds of return values and different | 
|---|
|  | 683 | kinds of parameters to produce those values. Our generators can provide this | 
|---|
|  | 684 | capability via multiple interface functions to the generator/coroutine state, | 
|---|
|  | 685 | which is discussed on page 5, lines 13-21. | 
|---|
|  | 686 |  | 
|---|
|  | 687 | That is, the generator must be manually resumed and (if I understood) it | 
|---|
|  | 688 | is expected to store values that can then later be read (perhaps via | 
|---|
|  | 689 | methods), instead of having a `yield <Expr>` statement that yields up a | 
|---|
|  | 690 | value explicitly. | 
|---|
|  | 691 |  | 
|---|
|  | 692 | All generators are manually resumed, e.g., Python/nodejs use "next" to resume a | 
|---|
|  | 693 | generator. Yes, yield <Expr> has a single interface with one input/return type, | 
|---|
|  | 694 | versus the Cforall approach allowing arbitrary number of interfaces of | 
|---|
|  | 695 | arbitrary types. | 
|---|
|  | 696 |  | 
|---|
|  | 697 | * Both "symmetric" and "asymmetric" generators are supported, instead of | 
|---|
|  | 698 | only asymmetric. | 
|---|
|  | 699 |  | 
|---|
|  | 700 | Yes, because they support different functionality as discussed in Chris | 
|---|
|  | 701 | Marlin's seminal work and both forms are implemented in Simula67. We did not | 
|---|
|  | 702 | invent symmetric and asymmetric generators/coroutines, we took them from the | 
|---|
|  | 703 | literature. | 
|---|
|  | 704 |  | 
|---|
|  | 705 | * Coroutines (multi-frame generators) are an explicit mechanism. | 
|---|
|  | 706 |  | 
|---|
|  | 707 | In most other languages, coroutines are rather built by layering | 
|---|
|  | 708 | single-frame generators atop one another (e.g., using a mechanism like | 
|---|
|  | 709 | async-await), | 
|---|
|  | 710 |  | 
|---|
|  | 711 | We disagree. Node.js has async-await but has a separate coroutine feature. | 
|---|
|  | 712 | While there are claims that coroutines can be built from async-await and/or | 
|---|
|  | 713 | continuations, in actuality they cannot. | 
|---|
|  | 714 |  | 
|---|
|  | 715 | and symmetric coroutines are basically not supported. I'd like to see a bit | 
|---|
|  | 716 | more justification for Cforall including all the above mechanisms -- it | 
|---|
|  | 717 | seemed like symmetric coroutines were a useful building block for some of | 
|---|
|  | 718 | the user-space threading and custom scheduler mechanisms that were briefly | 
|---|
|  | 719 | mentioned later in the paper. | 
|---|
|  | 720 |  | 
|---|
|  | 721 | Hopefully fixed by moving Table 1 forward. | 
|---|
|  | 722 |  | 
|---|
|  | 723 | In the discussion of coroutines, I would have expected a bit more of a | 
|---|
|  | 724 | comparison to the async-await mechanism offered in other languages. | 
|---|
|  | 725 |  | 
|---|
|  | 726 | We added a new section at the start to point out there is no comparison between | 
|---|
|  | 727 | coroutines and async-await. | 
|---|
|  | 728 |  | 
|---|
|  | 729 | Certainly the semantics of async-await in JavaScript implies | 
|---|
|  | 730 | significantly more overhead (because each async fn is a distinct heap | 
|---|
|  | 731 | object). [Rust's approach avoids this overhead][zc], however, and might be | 
|---|
|  | 732 | worthy of a comparison (see the Performance section). | 
|---|
|  | 733 |  | 
|---|
|  | 734 | We could not get Rust async-await to work, and when reading the description of | 
|---|
|  | 735 | rust async-await, it appears to be Java-style executors with futures (possibly | 
|---|
|  | 736 | fast futures). | 
|---|
|  | 737 |  | 
|---|
|  | 738 | There are several sections in the paper that compare against atomics -- for | 
|---|
|  | 739 | example, on page 15, the paper shows a simple monitor that encapsulates an | 
|---|
|  | 740 | integer and compares that to C++ atomics. Later, the paper compares the | 
|---|
|  | 741 | simplicity of monitors against the `volatile` quantifier from Java. The | 
|---|
|  | 742 | conclusion in section 8 also revisits this point. | 
|---|
|  | 743 | While I agree that monitors are simpler, they are obviously also | 
|---|
|  | 744 | significantly different from a performance perspective -- the paper doesn't | 
|---|
|  | 745 | seem to address this at all. It's plausible that (e.g.) the `Aint` monitor | 
|---|
|  | 746 | type described in the paper can be compiled and mapped to the specialized | 
|---|
|  | 747 | instructions offered by hardware, but I didn't see any mention of how this | 
|---|
|  | 748 | would be done. | 
|---|
|  | 749 |  | 
|---|
|  | 750 | Fixed, see response above. | 
|---|
|  | 751 |  | 
|---|
|  | 752 | There is also no mention of the more nuanced memory ordering | 
|---|
|  | 753 | relations offered by C++11 and how one might achieve similar performance | 
|---|
|  | 754 | characteristics in Cforall (perhaps the answer is that one simply doesn't | 
|---|
|  | 755 | need to; I think that's defensible, but worth stating explicitly). | 
|---|
|  | 756 |  | 
|---|
|  | 757 | Cforall is built on C, and therefore has full access to all the gcc atomics, | 
|---|
|  | 758 | and automatically gets any gcc updates.  Furthermore, section 6.9 states that | 
|---|
|  | 759 | Cforall provides the full panoply of low-level locks, as does Java, Go, C++, | 
|---|
|  | 760 | for performance programming. | 
|---|
|  | 761 |  | 
|---|
|  | 762 | Cforall includes both internal and external scheduling; I found the | 
|---|
|  | 763 | explanation for the external scheduling mechanism to be lacking in | 
|---|
|  | 764 | justification. Why include both mechanisms when most languages seem to make | 
|---|
|  | 765 | do with only internal scheduling? It would be useful to show some scenarios | 
|---|
|  | 766 | where external scheduling is truly more powerful. | 
|---|
|  | 767 |  | 
|---|
|  | 768 | Fixed. Pointed out external scheduling is simpler as part of rewriting in that | 
|---|
|  | 769 | section, and added additional examples. | 
|---|
|  | 770 |  | 
|---|
|  | 771 | I would have liked to see some more discussion of external scheduling and | 
|---|
|  | 772 | how it interacts with software engineering best practices. It seems | 
|---|
|  | 773 | somewhat similar to AOP in certain regards. It seems to add a bit of "extra | 
|---|
|  | 774 | semantics" to monitor methods, in that any method may now also become a | 
|---|
|  | 775 | kind of synchronization point. | 
|---|
|  | 776 |  | 
|---|
|  | 777 | Fixed somewhat. Pointed out that external scheduling has been around for a long | 
|---|
|  | 778 | time (40 years) in Ada, so there is a body of the software-engineering | 
|---|
|  | 779 | experience using it. As well, I have been teaching it for 30 years in the | 
|---|
|  | 780 | concurrency course at Waterloo. We don't know what software engineering best | 
|---|
|  | 781 | practices you imagine it interacting with. Yes, monitor functions are | 
|---|
|  | 782 | synchronization points with external scheduling. | 
|---|
|  | 783 |  | 
|---|
|  | 784 | The "open-ended" nature of this feels like it could easily lead to subtle | 
|---|
|  | 785 | bugs, particularly when code refactoring occurs (which may e.g. split an | 
|---|
|  | 786 | existing method into two). | 
|---|
|  | 787 |  | 
|---|
|  | 788 | Any time a public interface is refactored, it invalids existing calls, so there | 
|---|
|  | 789 | is always an issue. For mutex routines and external scheduling, the waitfor | 
|---|
|  | 790 | statements may have to be updated, but that update is part of the refactoring. | 
|---|
|  | 791 |  | 
|---|
|  | 792 | This seems particularly true if external scheduling can occur across | 
|---|
|  | 793 | compilation units -- the paper suggested that this is true, but I wasn't | 
|---|
|  | 794 | entirely clear. | 
|---|
|  | 795 |  | 
|---|
|  | 796 | Every aspect of Cforall allows separate compilation. The function prototypes | 
|---|
|  | 797 | necessary for separate compilation provide all the information necessary to | 
|---|
|  | 798 | compile any aspect of a program. | 
|---|
|  | 799 |  | 
|---|
|  | 800 | I would have also appreciated a few more details on how external scheduling | 
|---|
|  | 801 | is implemented. It seems to me that there must be some sort of "hooks" on | 
|---|
|  | 802 | mutex methods so that they can detect whether some other function is | 
|---|
|  | 803 | waiting on them and awaken those blocked threads. I'm not sure how such | 
|---|
|  | 804 | hooks are inserted, particularly across compilation units. | 
|---|
|  | 805 |  | 
|---|
|  | 806 | Hooks are inserted by the Cforall translator, in the same way that Java | 
|---|
|  | 807 | inserted hooks into a "synchronized" member of a monitor. As for Java, as long | 
|---|
|  | 808 | as the type information is consistent across compilation units, the correct | 
|---|
|  | 809 | code is inserted. | 
|---|
|  | 810 |  | 
|---|
|  | 811 | The material in Section 5.6 didn't quite clarify the matter for me. For | 
|---|
|  | 812 | example, it left me somewhat confused about whether the `f` and `g` | 
|---|
|  | 813 | functions declared were meant to be local to a translation unit, or shared | 
|---|
|  | 814 | with other unit. | 
|---|
|  | 815 |  | 
|---|
|  | 816 | There are no restrictions with respect to static or external mutex functions. | 
|---|
|  | 817 | Cforall is C. Any form of access or separate compilation in C applies to | 
|---|
|  | 818 | Cforall. As in C, function prototypes carry all necessary information to | 
|---|
|  | 819 | compile the code. | 
|---|
|  | 820 |  | 
|---|
|  | 821 | To start, I did not realize that the `mutex_opt` notation was a keyword, I | 
|---|
|  | 822 | thought it was a type annotation. I think this could be called out more | 
|---|
|  | 823 | explicitly. | 
|---|
|  | 824 |  | 
|---|
|  | 825 | Fixed, indicated "mutex" is a C-style parameter-only declaration type-qualifier. | 
|---|
|  | 826 |  | 
|---|
|  | 827 | Later, in section 5.2, the paper discusses `nomutex` annotations, which | 
|---|
|  | 828 | initially threw me, as they had not been introduced (now I realize that | 
|---|
|  | 829 | this paragraph is there to justify why there is no such keyword). The | 
|---|
|  | 830 | paragraph might be rearranged to make that clearer, perhaps by leading with | 
|---|
|  | 831 | the choice that Cforall made. | 
|---|
|  | 832 |  | 
|---|
|  | 833 | Fixed, rewrote paragraph removing nomutex. | 
|---|
|  | 834 |  | 
|---|
|  | 835 | On page 17, the paper states that "acquiring multiple monitors is safe from | 
|---|
|  | 836 | deadlock", but this could be stated a bit more precisely: acquiring | 
|---|
|  | 837 | multiple monitors in a bulk-acquire is safe from deadlock (deadlock can | 
|---|
|  | 838 | still result from nested acquires). | 
|---|
|  | 839 |  | 
|---|
|  | 840 | Fixed. | 
|---|
|  | 841 |  | 
|---|
|  | 842 | On page 18, the paper states that wait states do not have to be enclosed in | 
|---|
|  | 843 | loops, as there is no concern of barging. This seems true but there are | 
|---|
|  | 844 | also other reasons to use loops (e.g., if there are multiple reasons to | 
|---|
|  | 845 | notify on the same condition). Thus the statement initially surprised me, | 
|---|
|  | 846 | as barging is only one of many reasons that I typically employ loops around | 
|---|
|  | 847 | waits. | 
|---|
|  | 848 |  | 
|---|
|  | 849 | Fixed. Rewrote the sentence. Note, for all non-barging cases where you employ a | 
|---|
|  | 850 | loop around a wait, the unblocking task must change state before blocking | 
|---|
|  | 851 | again.  In the barging case, the unblocking thread blocks again without | 
|---|
|  | 852 | changing state. | 
|---|
|  | 853 |  | 
|---|
|  | 854 | I did not understand the diagram in Figure 12 for some time. Initially, I | 
|---|
|  | 855 | thought that it was generic to all monitors, and I could not understand the | 
|---|
|  | 856 | state space. It was only later that I realized it was specific to your | 
|---|
|  | 857 | example. Updating the caption from "Monitor scheduling to "Monitor | 
|---|
|  | 858 | scheduling in the example from Fig 13" might have helped me quite a bit. | 
|---|
|  | 859 |  | 
|---|
|  | 860 | Fixed, updated text to clarify. Did not change the caption because the | 
|---|
|  | 861 | signal_block does not apply to Figure 13. | 
|---|
|  | 862 |  | 
|---|
|  | 863 | I spent quite some time reading the boy/girl dating example (\*) and I | 
|---|
|  | 864 | admit I found it somewhat confusing. For example, I couldn't tell whether | 
|---|
|  | 865 | there were supposed to be many "girl" threads executing at once, or if | 
|---|
|  | 866 | there was only supposed to be one girl and one boy thread executing in a | 
|---|
|  | 867 | loop. | 
|---|
|  | 868 |  | 
|---|
|  | 869 | The paper states: | 
|---|
|  | 870 |  | 
|---|
|  | 871 | The dating service matches girl and boy threads with matching compatibility | 
|---|
|  | 872 | codes so they can exchange phone numbers. | 
|---|
|  | 873 |  | 
|---|
|  | 874 | so there are many girl/boy threads. There is nothing preventing an individual | 
|---|
|  | 875 | girl/boy from arranging multiple dates. | 
|---|
|  | 876 |  | 
|---|
|  | 877 | Are the girl/boy threads supposed to invoke the girl/boy methods or vice | 
|---|
|  | 878 | versa? | 
|---|
|  | 879 |  | 
|---|
|  | 880 | As long as the girls/boys are consistent in the calls, it does not matter. The | 
|---|
|  | 881 | goal is to find a partner and exchange phone numbers. | 
|---|
|  | 882 |  | 
|---|
|  | 883 | Surely there is some easier way to set this up? | 
|---|
|  | 884 |  | 
|---|
|  | 885 | There are some other solutions using monitors but they all have a similar | 
|---|
|  | 886 | structure. | 
|---|
|  | 887 |  | 
|---|
|  | 888 | The paper offered a number of comparisons to Go, C#, Scala, and so forth, | 
|---|
|  | 889 | but seems to have overlooked another recent language, Rust. In many ways, | 
|---|
|  | 890 | Rust seems to be closest in philosophy to Cforall, so it seems like an odd | 
|---|
|  | 891 | omission. I already mentioned above that Rust is in the process of shipping | 
|---|
|  | 892 | [async-await syntax][aa], which is definitely an alternative to the | 
|---|
|  | 893 | generator/coroutine approach in Cforall (though one with clear pros/cons). | 
|---|
|  | 894 |  | 
|---|
|  | 895 | We cannot get rust async-await example programs to compile nor does the select! | 
|---|
|  | 896 | macro compile. | 
|---|
|  | 897 |  | 
|---|
|  | 898 | @plg2[1]% rustc --version | 
|---|
|  | 899 | rustc 1.40.0 (73528e339 2019-12-16) | 
|---|
|  | 900 |  | 
|---|
|  | 901 | @plg2[2]% cat future.rs | 
|---|
|  | 902 | use futures::executor::block_on; | 
|---|
|  | 903 |  | 
|---|
|  | 904 | async fn hello_world() { | 
|---|
|  | 905 | println!("hello, world!"); | 
|---|
|  | 906 | } | 
|---|
|  | 907 |  | 
|---|
|  | 908 | fn main() { | 
|---|
|  | 909 | let future = hello_world(); // Nothing is printed | 
|---|
|  | 910 | block_on(future); // `future` is run and "hello, world!" is printed | 
|---|
|  | 911 | } | 
|---|
|  | 912 |  | 
|---|
|  | 913 | @plg2[3]% rustc -C opt-level=3 future.rs | 
|---|
|  | 914 | error[E0670]: `async fn` is not permitted in the 2015 edition | 
|---|
|  | 915 | --> future.rs:3:1 | 
|---|
|  | 916 | | | 
|---|
|  | 917 | 3 | async fn hello_world() { | 
|---|
|  | 918 | | ^^^^^ | 
|---|
|  | 919 |  | 
|---|
|  | 920 | error[E0433]: failed to resolve: maybe a missing crate `futures`? | 
|---|
|  | 921 | --> future.rs:1:5 | 
|---|
|  | 922 | | | 
|---|
|  | 923 | 1 | use futures::executor::block_on; | 
|---|
|  | 924 | |     ^^^^^^^ maybe a missing crate `futures`? | 
|---|
|  | 925 |  | 
|---|
|  | 926 | error[E0425]: cannot find function `block_on` in this scope | 
|---|
|  | 927 | --> future.rs:9:5 | 
|---|
|  | 928 | | | 
|---|
|  | 929 | 9 |     block_on(future); // `future` is run and "hello, world!" is printed | 
|---|
|  | 930 | |     ^^^^^^^^ not found in this scope | 
|---|
|  | 931 |  | 
|---|
|  | 932 | error: aborting due to 3 previous errors | 
|---|
|  | 933 |  | 
|---|
|  | 934 | Some errors have detailed explanations: E0425, E0433, E0670. | 
|---|
|  | 935 | For more information about an error, try `rustc --explain E0425`. | 
|---|
|  | 936 |  | 
|---|
|  | 937 |  | 
|---|
|  | 938 | In the performance section in particular, you might consider comparing | 
|---|
|  | 939 | against some of the Rust web servers and threading systems. | 
|---|
|  | 940 |  | 
|---|
|  | 941 | This paper is not about building web-servers. Nor are web-servers a reasonable | 
|---|
|  | 942 | benchmark for language concurrency. Web-servers are a benchmark for | 
|---|
|  | 943 | non-blocking I/O library efficiency accessed in the underlying operating | 
|---|
|  | 944 | system. Our prior work on web-server performance: | 
|---|
|  | 945 |  | 
|---|
|  | 946 | @inproceedings{Pariag07, | 
|---|
|  | 947 | author      = {David Pariag and Tim Brecht and Ashif Harji and Peter Buhr and Amol Shukla}, | 
|---|
|  | 948 | title       = {Comparing the Performance of Web Server Architectures}, | 
|---|
|  | 949 | booktitle   = {Proceedings of the 2007 Eurosys conference}, | 
|---|
|  | 950 | month       = mar, | 
|---|
|  | 951 | year        = 2007, | 
|---|
|  | 952 | pages       = {231--243}, | 
|---|
|  | 953 | } | 
|---|
|  | 954 |  | 
|---|
|  | 955 | @inproceedings{Harji12, | 
|---|
|  | 956 | author      = {Ashif S. Harji and Peter A. Buhr and Tim Brecht}, | 
|---|
|  | 957 | title       = {Comparing High-Performance Multi-core Web-Server Architectures}, | 
|---|
|  | 958 | booktitle   = {Proceedings of the 5th Annual International Systems and Storage Conference}, | 
|---|
|  | 959 | series      = {SYSTOR '12}, | 
|---|
|  | 960 | publisher   = {ACM}, | 
|---|
|  | 961 | address     = {New York, NY, USA}, | 
|---|
|  | 962 | location    = {Haifa, Israel}, | 
|---|
|  | 963 | month       = jun, | 
|---|
|  | 964 | year        = 2012, | 
|---|
|  | 965 | articleno   = 1, | 
|---|
|  | 966 | pages       = {1:1--1:12}, | 
|---|
|  | 967 | } | 
|---|
|  | 968 |  | 
|---|
|  | 969 | shows the steps to build a high-performance web-server, which are largely | 
|---|
|  | 970 | independent of the server architecture and programing language. | 
|---|
|  | 971 |  | 
|---|
|  | 972 | It would seem worth trying to compare their "context switching" costs as | 
|---|
|  | 973 | well -- I believe both actix and tokio have a notion of threads that could | 
|---|
|  | 974 | be readily compared. | 
|---|
|  | 975 |  | 
|---|
|  | 976 | Again, context-switching speed is largely irrelevant because the amount of code | 
|---|
|  | 977 | to process an http request is large enough to push any concurrency costs into | 
|---|
|  | 978 | the background. | 
|---|
|  | 979 |  | 
|---|
|  | 980 | Another addition that might be worth considering is to compare against | 
|---|
|  | 981 | node.js promises, although I think the comparison to process creation is | 
|---|
|  | 982 | not as clean. | 
|---|
|  | 983 |  | 
|---|
|  | 984 | Done. | 
|---|