Reviewing: 1

I still have a couple of issues --- perhaps the largest is that it's
still not clear at this point in the paper what some of these options
are, or crucially how they would be used. I don't know if it's
possible to give high-level examples or use cases to be clear about
these up front - or if that would duplicate too much information from
later in the paper - either way expanding out the discussion - even if
just a couple of sentences for each row - would help me more.

Section 2.1 is changed to address this suggestion.


* 1st para section 2 begs the question: why not support each
dimension independently, and let the programmer or library designer
combine features?

As seen in Table 1, not all of the combinations work, and having programmers
directly use these low-level mechanisms is error prone. Accessing these
fundamental mechanisms through higher-level constructs has always been the
purpose of a programming language.


* Why must there "be language mechanisms to create, block/unblock, and join
with a thread"?  There aren't in Smalltalk (although there are in the
runtime).  Especially given that in Cforall those mechanisms are *implicit* on
thread creation and destruction?

Fixed, changed sentence to:

A programmer needs mechanisms to create, block and unblock, and join with a
thread, even if these basic mechanisms are supplied indirectly through
high-level features.


* "Case 1 is a function that borrows storage for its state (stack
frame/activation) and a thread from its invoker"

this much makes perfect sense to me, but I don't understand how a
non-stateful, non-threaded function can then retain

"this state across callees, ie, function local-variables are
retained on the stack across calls."

how can it retain function-local values *across calls* when it
doesn't have any function-local state?

In the following example:
void foo() {
    // local variables and code
}
void bar() {
    // local variables
    foo();
}

bar is the caller and foo is the callee. bar borrows the program stack and
thread to make the call to foo. When foo, the callee, is executing, bar's local
variables (state) are retained on the *borrowed* stack across the call. (Note,
I added *borrowed* to that sentence in the paper to help clarify.)  Furthermore,
foo's local variables are also retained on the borrowed stack. When foo and bar
return, all of their local state is gone (not retained). This behaviour is
standard call/return semantics in an imperative language.


I'm not sure if I see two separate cases here - roughly equivalent
to C functions without static storage, and then C functions *with*
static storage.

Yes, but there is only one instance of the static storage across all
activations of the C function. For generators and coroutines, each instance has
its own state, like an object in an OO language.


I assumed that was the distinction between cases 1 & 3; but perhaps the
actual distinction is that 3 has a suspend/resume point, and so the
"state" in figure 1 is this component of execution state (viz figs 1 & 2),
not the state representing the cross-call variables?

So case 3 is like an object with the added ability to retain where it was last
executing.  When a generator is resumed, the generator object (structure
instance) is passed as an explicit reference, and within this object is the
restart location in the generator's "main". When the generator main executes,
it uses the borrowed stack for its local variables and any functions it calls,
just like an object's member function borrows the stack for its local variables
but also has an implicit receiver to the object state.  A generator can have
static storage, too, which is a single instance across all generator instances
of that type, as for static storage in an object type. All the kinds of storage
are at play with the same semantics as in other languages.


> but such evaluation isn't appropriate for garbage-collected or JITted
> languages like Java or Go.
For JITted languages in particular, reporting peak performance needs to
"warm up" the JIT with a number of iterations before beginning
measurement. Actually for JITs it's even worse: see Edd Barrett et al
OOPSLA 2017.

Of our testing languages, only Java and Node.js are JITted. To ensure the Java
test-programs correctly measured the specific feature, we consulted with Dave
Dice at Oracle who works directly on the development of the Oracle JVM
Just-in-Time Compiler. We modified our test programs based on his advice, and
he validated our programs as correctly measuring the specified language
feature. Dave's help is recognized in the Acknowledgment section.  Similarly,
we verified our Node.js programs with Gregor Richards, an expert in
just-in-time compilation for dynamic typing.  Hence, we have taken into account
all issues related to performing benchmarks in JITted languages.  Also, all the
benchmark programs are publicly available for independent verification.


* footnote A - I've looked at various other papers & the website to try to
understand how "object-oriented" Cforall is - I'm still not sure.  This
footnote says Cforall has "virtuals" - presumably virtual functions,
i.e. dynamic dispatch - and inheritance: that really is OO as far as I
(and most OO people) are concerned.  For example Haskell doesn't have
inheritance, so it's not OO; while CLOS (the Common Lisp *Object* System)
or things like Cecil and Dylan are considered OO even though they have
"multiple function parameters as receivers", lack "lexical binding between
a structure and set of functions", and don't have explicit receiver
invocation syntax.  Python has receiver syntax, but unlike Java or
Smalltalk or C++, method declarations still need to have an explicit
"self" receiver parameter.  Seems to me that Go, for example, is
more-or-less OO with interfaces, methods, and dynamic dispatch (yes, also
an explicit receiver syntax, but that's not determinative); while Rust
lacks built-in dynamic dispatch.  C is not OO as a language, but as you
say given it supports function pointers with structures, it does support
an OO programming style.

This is why I again recommend just not buying into this fight: not making
any claims about whether Cforall is OO or is not - because as I see it,
the rest of the paper doesn't depend on whether Cforall is OO or not.
That said: this is just a recommendation, and I won't quibble over this
any further.

We believe it is important to identify Cforall as a non-OO language because it
heavily influences the syntax and semantics used to build its concurrency.
Since many aspects of Cforall are not OO, the rest of the paper *does* depend
on Cforall being identified as non-OO, otherwise readers would have
significantly different expectations for the design. We did not mean to suggest
that languages that support function pointers with structures support an OO
style. We revised the footnote to avoid this interpretation. Finally, Golang
does not identify itself as an OO language.
https://golang.org/doc/faq#Is_Go_an_object-oriented_language


* is a "monitor function" the same as a "mutex function"?
if so the paper should pick one term; if not, make the distinction clear.

Fixed. Picked "mutex". Changed the language and all places in the paper.


* "As stated on line 1 because state declarations from the generator
type can be moved out of the coroutine type into the coroutine main"

OK sure, but again: *why* would a programmer want to do that?
(Other than, I guess, to show the difference between coroutines &
generators?)  Perhaps another way to put this is that the first
para of 3.2 gives the disadvantages of coroutines vis-a-vis
generators, briefly describes the extended semantics, but never
actually says why a programmer may want those extended semantics,
or how they would benefit.  I don't mean to belabour the point,
but (generalist?) readers like me would generally benefit from
those kinds of discussions about each feature throughout the
paper: why might a programmer want to use them?

On page 8, it states:

Having to manually create the generator closure by moving local-state
variables into the generator type is an additional programmer burden (removed
by the coroutine in Section 3.2). ...

Also, these variables can now be refactored into a helper function, where the
helper function suspends at arbitrary call depth. So imagine a coroutine helper
function that is only called occasionally within the coroutine but has a
large array that is retained across suspends within the helper function. For a
generator, this large array has to be declared in the generator type, enlarging
each generator instance even though the array is only used occasionally.
Whereas, the coroutine only needs the array allocated when needed. Now a
coroutine has a stack which occupies storage, but the maximum stack size only
needs to be the call chain allocating the most storage, whereas the generator
has a maximum size of all variables that could be created.
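
To illustrate, a minimal sketch of the storage difference, assuming the
generator/coroutine declaration style used elsewhere in the paper (all names
are hypothetical):

generator G {
    int arr[10000];  // must live in the type: enlarges every G instance,
                     // even though the array is used only occasionally
};
void main( G & g ) with( g ) {
    // ... occasionally use arr ...
    suspend;         // restart location is stored in g
}

coroutine C {};
void helper( C & c ) {
    int arr[10000];  // lives on c's stack only while helper is active
    // ... use arr ...
    suspend;         // suspend at arbitrary call depth, arr is retained
}
void main( C & c ) {
    // ... occasionally ...
    helper( c );
}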


> p17 if the multiple-monitor entry procedure really is novel, write a paper
> about that, and only about that.
> We do not believe this is a practical suggestion.
* I'm honestly not trying to be snide here: I'm not an expert on monitor or
concurrent implementations. Brinch Hansen's original monitors were single
acquire; this draft does not cite any other previous work that I could
see. I'm not suggesting that the brief mention of this mechanism
necessarily be removed from this paper, but if this is novel (and a clear
advance over a classical OO monitor a-la Java which only acquires the
distinguished receiver) then that would be worth another paper in itself.

First, to explain multiple-monitor entry in Cforall as a separate paper would
require significant background on Cforall concurrency, which means repeating
large sections of this paper. Second, it feels like a paper just on
multiple-monitor entry would be a little thin, even if the capability is novel
to the best of our knowledge. Third, we feel multiple-monitor entry springs
naturally from the overarching tone in the paper that Cforall is a non-OO
programming language, allowing multiple-mutex receivers.
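
For reference, a minimal sketch of multiple-mutex receivers, along the lines
of the paper's bank-transfer example (names are illustrative):

monitor BankAccount { int balance; };
void deposit( BankAccount & mutex b, int amount ) with( b ) {
    balance += amount;  // b's monitor lock is held for the function body
}
void transfer( BankAccount & mutex from, BankAccount & mutex to, int amount ) {
    // both monitor locks are acquired atomically at entry, in a
    // deadlock-free order chosen by the runtime, and released at exit
    deposit( from, -amount );
    deposit( to, amount );
}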


My typo: the paper's conclusion should come at the end, after the
future work section.

Combined into a Conclusions and Future Work section.



Reviewing: 2

on the Boehm paper and whether code is "all sequential to the compiler": I
now understand the authors' position better and suspect we are in violent
agreement, except for whether it's appropriate to use the rather breezy
phrase "all sequential to the compiler". It would be straightforward to
clarify that code not using the atomics features is optimized *as if* it
were sequential, i.e. on the assumption of a lack of data races.

Fixed, "as inline and library code is compiled as sequential without any
explicit concurrent directive."


on the distinction between "mutual exclusion" and "synchronization": the
added citation does help, in that it makes a coherent case for the
definition the authors prefer. However, the text could usefully clarify
that this is a matter of definition not of fact, given especially that in
my assessment the authors' preferred definition is not the most common
one. (Although the mention of Hoare's apparent use of this definition is
one data point, countervailing ones are found in many contemporaneous or
later papers, e.g. Habermann's 1972 "Synchronization of Communicating
Processes" (CACM 15(3)), Reed & Kanodia's 1979 "Synchronization with
eventcounts and sequencers" (CACM 22(2)) and so on.)

Fixed, "We contend these two properties are independent, ...".

With respect to the two papers, Habermann fundamentally agrees with our
definitions, where the term mutual exclusion is the same but Habermann uses the
term "communication" for our synchronization. However, the term "communication"
is rarely used to mean synchronization. The fact that Habermann collectively
calls these two mechanisms synchronization is the source of the confusion.
Reed & Kanodia state:

By mutual exclusion we mean any mechanism that forces the time ordering of
execution of pieces of code, called critical regions, in a system of
concurrent processes to be a total ordering.

But there is no timing order for a critical region (which I assume means
critical section); threads can arrive at any time in any order, where the
mutual exclusion for the critical section ensures one thread executes at a
time. Interestingly, Reed & Kanodia's critical region is Habermann's
communication, not critical section. These papers only buttress our contention
about the confusion of these terms in the literature.


section 2 (an expanded version of what was previously section 5.9) lacks
examples and is generally obscure and allusory ("the most advanced feature"
-- name it! "in triplets" -- there is only one triplet!;

Fixed.

These independent properties can be used to compose different language
features, forming a compositional hierarchy, where the combination of all
three is the most advanced feature, called a thread/task/process. While it is
possible for a language to only provide threads for composing
programs~\cite{Hermes90}, this unnecessarily complicates and makes inefficient
the solutions to certain classes of problems.


what are "execution locations"? "initialize" and "de-initialize"

Fixed through an example at the end of the sentence.

\item[\newterm{execution state}:] is the state information needed by a
control-flow feature to initialize and manage both compute data and execution
location(s), and de-initialize. For example, calling a function initializes a
stack frame including contained objects with constructors, manages local data
in blocks and return locations during calls, and de-initializes the frame by
running any object destructors and management operations.


what? "borrowed from the invoker" is a concept in need of explaining or at
least a fully explained example -- in what sense does a plain function
"borrow" its stack frame?

A function has no storage except for static storage; it gets its storage from
the program stack. For a function to run it must borrow storage from somewhere.
When the function returns, the borrowed storage is returned. Do you have a more
appropriate word?


"computation only" as opposed to what?

This is a term in concurrency for operations that compute without blocking,
i.e., the operation starts with everything it needs to compute its result and
runs to completion, blocking only when it is done and returns its result.
Computation only occurs in "embarrassingly parallel" problems.
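
For instance, the matrix-summation test mentioned below is computation only; a
rough sketch, with the field, constructor, and loop details approximate:

#include <thread.hfa>

thread Adder {
    int * row;  int cols;  long long int * subtotal;
};
void ?{}( Adder & adder, int row[], int cols, long long int * subtotal ) {
    adder.[ row, cols, subtotal ] = [ row, cols, subtotal ];
}
void main( Adder & adder ) with( adder ) {
    *subtotal = 0;
    for ( c; cols ) *subtotal += row[c];  // starts with all inputs, never blocks
}   // thread ends; joined implicitly when the Adder is destroyed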


in 2.2, in what way is a "request" fundamental to "synchronization"?

I assume you are referring to the last bullet.

Synchronization must be able to control the service order of requests
including prioritizing selection from different kinds of outstanding requests,
and postponing a request for an unspecified time while continuing to accept
new requests.

Habermann states on page 173

Looking at deposit we see that sender and receiver should be synchronized with
respect to buffer overflow: deposit of another message must be delayed if
there is no empty message frame, and such delay should be removed when the
first empty frame becomes available.  This is programmed as
synchronization(frame) : deposit is preceded by wait(frame), accept is
followed by signal(frame), and the constant C[frame] is set equal to
"bufsize."

Here synchronization is controlling the service order of requests: when the
buffer is full, requests for insert (sender) are postponed until a request to
remove (receiver) occurs.  Without the ability to control service order among
requests, the producer/consumer problem with a bounded buffer cannot be
solved. Hence, this capability is fundamental.
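
This service-order control is what the paper's bounded buffer (Figure 13)
expresses; a condensed sketch, assuming the waitfor syntax discussed later,
with initialization and details elided:

monitor Buffer { int elems[10], front, back, count; };
void insert( Buffer & mutex buf, int elem );
int  remove( Buffer & mutex buf );

void insert( Buffer & mutex buf, int elem ) with( buf ) {
    if ( count == 10 ) waitfor( remove : buf );  // full: postpone insert until a remove occurs
    elems[back] = elem;  back = (back + 1) % 10;  count += 1;
}
int remove( Buffer & mutex buf ) with( buf ) {
    if ( count == 0 ) waitfor( insert : buf );   // empty: postpone remove until an insert occurs
    int elem = elems[front];  front = (front + 1) % 10;  count -= 1;
    return elem;
}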


and the "implicitly" versus "explicitly" point needs stating as elsewhere,
with a concrete example (e.g. Java built-in mutexes versus
java.util.concurrent).

Fixed.

MES must be available implicitly in language constructs, \eg Java built-in
monitors, as well as explicitly for specialized requirements, \eg
@java.util.concurrent@, because requiring programmers to build MES using
low-level locks often leads to incorrect programs.


section 6: 6.2 omits the most important facts in preference for otherwise
inscrutable detail: "identify the kind of parameter" (first say *that there
are* kinds of parameter, and what "kinds" means!); "mutex parameters are
documentation" is misleading (they are also semantically significant!) and
fails to say *what* they mean; the most important thing is surely that
'mutex' is a language feature for performing lock/unlock operations at
function entry/exit. So say it!

These sections have been rewritten to address the comments.


The meanings of examples f3 and f4 remain unclear.

Rewrote the paragraph.


Meanwhile in 6.3, "urgent" is not introduced (we are supposed to infer its
meaning from Figure 12,

Defined Hoare's urgent list at the start of the paragraph.


but that Figure is incomprehensible to me), and we are told of "external
scheduling"'s long history in Ada but not clearly what it actually means;

We do not know how to address your comment because the description is clear to
us and other non-reviewers who have read the paper. I had forgotten an
important citation to prior work on this topic, which is now referenced:

Buhr Peter A., Fortier Michel, Coffin Michael H.. Monitor
Classification. ACM Computing Surveys. 1995; 27(1):63-107.

This citation is a 45 page paper expanding on the topic of internal scheduling.
I also added a citation to Hoare's monitor paper where signal_block is defined:

When a process signals a condition on which another process is waiting, the
signalling process must wait until the resumed process permits it to
proceed.

External scheduling from Ada was subsequently added to uC++ and Cforall.
Furthermore, Figure 20 shows a direct comparison of CFA procedure-call
external-scheduling with Go channel external-scheduling. So external scheduling
exists in languages beyond Ada.


6.4's description of "waitfor" tells us it is different from an if-else
chain but tries to use two *different* inputs to tell us that the behavior
is different; tell us an instance where *the same* values of C1 and C2 give
different behavior (I even wrote out a truth table and still don't see the
semantic difference)

Again, it is unclear what the problem is. For the if-statement, if C1 is true,
it only waits for a call to mem1, even if C2 is true and there is an
outstanding call to mem2. For the waitfor, if both C1 and C2 are true and there
is a call to mem2 but not mem1, it accepts the call to mem2, which cannot
happen with the if-statement. So for all true when clauses, any outstanding
call is immediately accepted. If there are no outstanding calls, the waitfor
blocks until the next call to any of the true when clauses occurs. I added the
following sentence to further clarify.

Hence, the @waitfor@ has parallel semantics, accepting any true @when@ clause.

Note, the same parallel semantics exists for the Go select statement with
respect to waiting for a set of channels to receive data. While Go select does
not have a when clause, it would be trivial to add it, making the select more
expressive.
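
To spell out the difference in code, a sketch assuming the extended
waitfor/when syntax of Section 6.4, for a monitor m with mutex functions mem1
and mem2:

// if-statement: sequential test; with C1 and C2 both true, only mem1 can be
// accepted, even if a caller to mem2 is already waiting
if ( C1 ) waitfor( mem1 : m );
else if ( C2 ) waitfor( mem2 : m );

// waitfor: parallel semantics; with C1 and C2 both true, an outstanding call
// to either mem1 or mem2 is accepted immediately
when ( C1 ) waitfor( mem1 : m );
or when ( C2 ) waitfor( mem2 : m );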


The authors frequently use bracketed phrases, and sometimes slashes "/", in
ways that are confusing and/or detrimental to readability.  Page 13 line
2's "forward (backward)" is one particularly egregious example.  In general
I would recommend the authors try to limit their use of parentheses and
slashes as a means of forcing a clearer wording to emerge.

Many of the slashes and parentheticals have been removed. Some are retained to
express joined concepts: call/return, suspend/resume, resume/resume, I/O.


Also, the use of "eg." is often cursory and does not explain the examples
given, which are frequently a one- or two-word phrase of unclear referent.

A few of these are fixed.


Considering the revision more broadly, none of the more extensive or
creative rewrites I suggested in my previous review have been attempted,
nor any equivalent efforts to improve its readability.

If you reread the previous response, we addressed all of your suggestions except:

An expositional idea occurs: start the paper with a strawman
naive/limited realisation of coroutines -- say, Simon Tatham's popular
"Coroutines in C" web page -- and identify point by point what the
limitations are and how C\/ overcomes them. Currently the presentation
is often flat (lacking motivating contrasts) and backwards (stating
solutions before problems). The foregoing approach might fix both of
these.

We prefer the current structure of our paper and believe the paper does
explain basic coding limitations and how they are overcome by using
high-level control-flow mechanisms.

and we have addressed readability issues in this version.


The hoisting of the former section 5.9 is a good idea, but the newly added
material accompanying it (around Table 1) suffers fresh deficiencies in
clarity. Overall the paper is longer than before, even though (as my
previous review stated), I believe a shorter paper is required in order to
serve the likely purpose of publication. (Indeed, the authors' letter
implies that a key goal of publication is to build community and gain
external users.)

This comment is the referee's opinion, which we do not agree with. Our
previous 35 page SP&E paper on Cforall:

Moss Aaron, Schluntz Robert, Buhr Peter A.. Cforall: Adding Modern
Programming Language Features to C. Softw. Pract. Exper. 2018;
48(12):2111-2146.

has a similar structure and style to this paper, and it received an award from
John Wiley & Sons:

Software: Practice & Experience for articles published between January 2017
and December 2018, "most downloads in the 12 months following online
publication showing the work generated immediate impact and visibility,
contributing significantly to advancement in the field".

So we have demonstrated an ability to build community and gain external users.


Given this trajectory, I no longer see a path to an acceptable revision of
the present submission. Instead I suggest the authors consider splitting
the paper in two: one half about coroutines and stack management, the other
about mutexes, monitors and the runtime. (A briefer presentation of the
runtime may be helpful in the first paper also, and a brief recap of the
generator and coroutine support is obviously needed in the second too.)

Again we disagree with the referee's suggestion to vastly restructure the
paper. What advantage is there in presenting exactly the same material across
two papers, which will likely end up longer than a single paper? The current
paper has a clear theme that fundamental execution properties generate a set of
basic language mechanisms, and we then proceed to show how these mechanisms can
be designed into the programming language Cforall.


I do not buy the authors' defense of the limited practical experience or
"non-micro" benchmarking presented. Yes, gaining external users is hard and
I am sympathetic on that point. But building something at least *somewhat*
substantial with your own system should be within reach, and without it the
"practice and experience" aspects of the work have not been explored.
Clearly C\/ is the product of a lot of work over an extended period, so it
is a surprise that no such experience is readily available for inclusion.

Understood. There are no agreed-upon concurrency benchmarks, which is why
micro-benchmarks are often used. Currently, the entire Cforall runtime is
written in Cforall (10,000+ lines of code (LOC)). This runtime is designed to
be thread safe, automatically detects the use of concurrency features at link
time, and bootstraps into a threaded runtime almost immediately at program
startup so threads can be declared as global variables and may run to
completion before the program main starts. The concurrent core of the runtime
is 3,500 LOC and bootstraps from low-level atomic primitives into Cforall locks
and high-level features. In other words, the concurrent core uses itself as
quickly as possible to start using high-level concurrency. There are 12,000+
LOC in the Cforall test-suite used to verify language features, which are run
nightly. Of these, there are 2000+ LOC running standard concurrent tests, such
as aggressively testing each language feature, and classical examples such as
bounded buffer, dating service, matrix summation, quickSort, binary insertion
sort, etc.  More experience will be available soon, based on ongoing work in
the "future works" section. Specifically, non-blocking I/O is working with the
new Linux io_uring interface and a new high-performance ready-queue is under
construction to take into account this change. With non-blocking I/O, it will
be possible to write applications like high-performance web servers, as is now
done in Rust and Go. Also completed are Java-style executors for work-based
concurrent programming and futures. Under construction is a high-performance
actor system.


It does not seem right to state that a stack is essential to Von Neumann
architectures -- since the earliest Von Neumann machines (and indeed early
Fortran) did not use one.

Reference Manual Fortran II for the IBM 704 Data Processing System, 1958 IBM, page 2
https://archive.computerhistory.org/resources/text/Fortran/102653989.05.01.acc.pdf

Since a subprogram may call for other subprograms to any desired depth, a
particular CALL statement may be defined by a pyramid of multi-level
subprograms.

I think we may be differing on the meaning of stack. You may be imagining a
modern stack that grows and shrinks dynamically, whereas early Fortran
preallocated a stack frame for each function, like Python allocates a frame for
a generator.  Within each preallocated Fortran function is a frame for local
variables and a pointer to store the return value for a call.  The Fortran
call/return mechanism uses these frames to build a traditional call stack
linked by the return pointer. The only restriction is that a function stack
frame can only be used once, implying no direct or indirect recursion.  Hence,
without a stack mechanism, there can be no call/return to "any desired depth",
where the maximum desired depth is limited by the number of functions. So
call/return requires some form of a stack. Virtually all programming languages
have call/return, past and present, and these languages run on Von Neumann
machines that do not distinguish between program and memory space, have mutable
state, and the concept of a pointer to data or code.


To elaborate on something another reviewer commented on: it is a surprise
to find a "Future work" section *after* the "Conclusion" section. A
"Conclusions and future work" section often works well.

Done.



Reviewing: 3

but it remains really difficult to have a good sense of which idea I should
use and when. This applies in different ways to different features from the
language:

* coroutines/generators/threads: here there is some discussion, but it can
be improved.
* internal/external scheduling: I didn't find any direct comparison between
these features, except by way of example.

See changes below.


I would have preferred something more like a table or a few paragraphs
highlighting the key reasons one would pick one construct or the other.

Section 2.1 is changed to address this suggestion.


The discussion of clusters and pre-emption in particular feels quite rushed.

We believe a brief introduction to the Cforall runtime structure is important
because clustering within a user-level versus distributed system is unusual.
Furthermore, the explanation of preemption is important because several new
languages, like Go and Rust tokio, are not preemptive. Rust threads are
preemptive only because Rust uses kernel threads, which UNIX preempts.


* Recommend to shorten the comparison on coroutine/generator/threads in
Section 2 to a paragraph with a few examples, or possibly a table
explaining the trade-offs between the constructs

Not done, see below.


* Recommend to clarify the relationship between internal/external
scheduling -- is one more general but more error-prone or low-level?

Done, see below.


There is obviously a lot of overlap between these features, and in
particular between coroutines and generators. As noted in the previous
review, many languages have chosen to offer *only* generators, and to
build coroutines by stacks of generators invoking one another.

Coroutines built from stacks of generators have problems, such as no symmetric
control-flow. Furthermore, stacks of generators have a problem with the
following programming pattern: logger is a library function called from a
function or a coroutine, where the doit called by the coroutine version
suspends. With stacks of generators, there has to be a function and generator
version of logger to support these two scenarios. If logger is a library
function, it may be impossible to create the generator logger because the
logger function is opaque.

#include <fstream.hfa>
#include <coroutine.hfa>

forall( otype T | { void doit( T ); } )
void logger( T & t ) {
    doit( t );
}

coroutine C {};
void main( C & c ) with( c ) {
    void doit( C & c ) { suspend; }
    logger( c );
}
void mem( C & c ) {
    resume( c );
}

int main() {
    C c;
    mem( c );
    mem( c );

    struct S {};
    S s;
    void doit( S & s ) {}
    logger( s );
}

Additional text has been added to the start of Section 3.2 addressing this
comment.


In fact, the end of Section 2.1 (on page 5) contains a particular paragraph
that embodies this "top down" approach. It starts, "programmers can now
answer three basic questions", and thus gives some practical advice for
which construct you should use and when. I think giving some examples of
specific applications that this paragraph, combined with some examples of
cases where each construct was needed, would be a better approach.

I don't think this comparison needs to be very long. It seems clear enough
that one would

* prefer generators for simple computations that yield up many values,
* prefer coroutines for more complex processes that have significant
internal structure,

We do not believe the number of values yielded is an important factor in
choosing between a generator or coroutine; either form can receive (input) or
return (output) millions of values. As well, simple versus complex computations
is also not an absolute criterion for selection, as both forms can be simple or
complex.

As stated in the paper, a coroutine brings generality to the implementation
because of the additional stack, whereas generators have restrictions on
standard software-engineering practices: variable placement, no helper
functions without creating an explicit generator stack, and no symmetric
control-flow.  It is true that simple generators probably suffer less from
these restrictions. Also, simplicity means the amount of work done is small, so
the cost of the suspend/resume becomes a factor, where experiments show
suspend/resume for a generator is 10 times faster than for a coroutine.


* prefer threads for cases where parallel execution is desired or needed.

Agreed, but this description does not mention mutual exclusion and
synchronization, which are essential in any meaningful concurrent program.

Our point here is to illustrate that a casual "top down" explanation is
insufficient to explain the complexity of the underlying execution properties.
We presented some rules of thumb at the end of Section 2, but programmers must
understand all the underlying mechanisms and their interactions to exploit the
execution properties to their fullest, and to understand when a programming
language does or does not provide a desired mechanism.


I did appreciate the comparison in Section 2.3 between async-await in
JS/Java and generators/coroutines. I agree with its premise that those
mechanisms are a poor replacement for generators (and, indeed, JS has a
distinct generator mechanism, for example, in part for this reason).  I
believe I may have asked for this in a previous review, but having read it,
I wonder if it is really necessary, since those mechanisms are so different
in purpose.

Given that you asked about this before, I believe other readers might also ask
the same question because async-await is very popular. So I think this section
does help to position the work in the paper among other work, and hence, it is
appropriate to keep it in the paper.


I find the motivation for supporting both internal and external scheduling
to be fairly implicit. After several reads through the section, I came to
the conclusion that internal scheduling is more expressive than external
scheduling, but sometimes less convenient or clear. Is this correct? If
not, it'd be useful to clarify where external scheduling is more
expressive.

I would find it very interesting to try and capture some of the properties
that make internal vs external scheduling the better choice.

For example, it seems to me that external scheduling works well if there
are only a few "key" operations, but that internal scheduling might be
better otherwise, simply because it would be useful to have the ability to
name a signal that can be referenced by many methods.

To address this point, the last paragraph on page 22 (now page 23) has been
augmented to the following:

Given external and internal scheduling, what guidelines can a programmer use
to select between them?  In general, external scheduling is easier to
understand and code because only the next logical action (mutex function(s))
is stated, and the monitor implicitly handles all the details.  Therefore,
there are no condition variables, and hence, no wait and signal, which reduces
coding complexity and synchronization errors.  If external scheduling is
simpler than internal, why not use it all the time?  Unfortunately, external
scheduling cannot be used if: scheduling depends on parameter value(s), or
scheduling must block across an unknown series of calls on a condition
variable (\ie internal scheduling).  For example, the dating service cannot be
written using external scheduling.  First, scheduling requires knowledge of
calling parameters to make matching decisions, and parameters of calling
threads are unavailable within the monitor.  Specifically, a thread within the
monitor cannot examine the @ccode@ of threads waiting on the calling queue to
determine if there is a matching partner.  (Similarly, if the bounded buffer
or readers/writer are restructured with a single interface function with a
parameter denoting producer/consumer or reader/writer, they cannot be solved
with external scheduling.)  Second, a scheduling decision may be delayed
across an unknown number of calls when there is no immediate match, so the
thread in the monitor must block on a condition.  Specifically, if a thread
determines there is no opposite calling thread with the same @ccode@, it must
wait an unknown period until a matching thread arrives.  For complex
synchronization, both external and internal scheduling can be used to take
advantage of the best properties of each.
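
To make the dating-service case concrete, a rough sketch of its internal
scheduling (condition, constant, and helper names approximate, using
signal_block, which is discussed below):

monitor DatingService {
    condition girls[CCODES], boys[CCODES];   // one condition per compatibility code
    int girlPhoneNo, boyPhoneNo;
};
int girl( DatingService & mutex ds, int phoneNo, int ccode ) with( ds ) {
    if ( empty( boys[ccode] ) ) {      // no matching boy: block for an unknown period
        wait( girls[ccode] );
        girlPhoneNo = phoneNo;
    } else {
        girlPhoneNo = phoneNo;
        signal_block( boys[ccode] );   // restart matching boy, wait until he proceeds
    }
    return boyPhoneNo;                 // the boy function is symmetric
}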


Consider the bounded buffer from Figure 13: if it had multiple methods for
removing elements, and not just `remove`, then the `waitfor(remove)` call
in `insert` might not be sufficient.

Section 6.4 Extended waitfor shows the waitfor is very powerful and can handle
your request:

waitfor( remove : buffer ); or waitfor( remove2 : buffer );

and its shorthand form (not shown in the paper)

waitfor( remove, remove2 : buffer );

A call to either of these remove functions satisfies the waitfor (exact
selection details are discussed in Section 6.4).


The same is true, I think, of the `signal_block` function, which I
have not encountered before;

In Tony Hoare's seminal paper on monitors, "Monitors: An Operating System
Structuring Concept", it states on page 551:

When a process signals a condition on which another process is waiting, the
signalling process must wait until the resumed process permits it to
proceed. We therefore introduce for each monitor a second semaphore "urgent"
(initialized to 0), on which signalling processes suspend themselves by the
operation P(urgent).

Hence, the original definition of signal is in fact signal_block, i.e., the
signaller blocks. Later implementations of monitors switched to signaller
nonblocking because most signals occur before returns, which allows the
signaller to continue execution, exit the monitor, and run concurrently with
the signalled thread that restarts in the monitor. When the signaller is not
going to exit immediately, signal_block is appropriate.
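
A minimal sketch of the two signalling forms (illustrative names):

monitor M { condition c; };
void waiter( M & mutex m ) with( m ) {
    wait( c );            // block until signalled
}
void signal_nonblocking( M & mutex m ) with( m ) {
    signal( c );          // signaller continues; waiter restarts after signaller exits
    // ... more monitor work before exit ...
}
void signal_blocking( M & mutex m ) with( m ) {
    signal_block( c );    // waiter runs first; signaller blocks until waiter permits it
    // ... resumes here after the waiter proceeds ...
}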


it seems like its behavior can be modeled with multiple condition
variables, but that's clearly more complex.

Yes. Buhr, Fortier and Coffin show in Monitor Classification, ACM Computing
Surveys, 27(1):63-107, that all extant monitors with different signalling
semantics can be transformed into each other. However, some transformations are
complex and runtime expensive.


One question I had about `signal_block`: what happens if one signals
but no other thread is waiting? Does it block until some other thread
waits? Or is that user error?

On page 20, it states:

Signalling is unconditional because signalling an empty condition queue does
nothing.

To the best of our knowledge, all monitors have the same semantics for
signalling an empty condition queue, regardless of the kind of signal, i.e.,
signal or signal_block.


I believe that one difference between the Go program and the Cforall
equivalent is that the Goroutine has an associated queue, so that
multiple messages could be enqueued, whereas the Cforall equivalent is
effectively a "bounded buffer" of length 1. Is that correct?

Actually, the buffer length is 0 for the Cforall call and the Go unbuffered
send, so both are synchronous communication.


I think this should be stated explicitly. (Presumably, one could modify the
Cforall program to include an explicit vector of queued messages if
desired, but you would also be reimplementing the channel abstraction.)

Fixed, by adding the following sentences:

The difference between call and channel send occurs for buffered channels,
making the send asynchronous.  In \CFA, asynchronous call and multiple
buffers are provided using an administrator and worker
threads~\cite{Gentleman81} and/or futures (not discussed).


Also, in Figure 20, I believe that there is a missing `mutex` keyword.

Fixed.


I was glad to see that the paper acknowledged that Cforall still had
low-level atomic operations, even if their use is discouraged in favor of
higher-level alternatives.

There was never an attempt to avoid acknowledging that Cforall has low-level
atomic operations. The original version of the paper stated:

6.6 Low-level Locks
For completeness and efficiency, Cforall provides a standard set of low-level
locks: recursive mutex, condition, semaphore, barrier, etc., and atomic
instructions: fetchAssign, fetchAdd, testSet, compareSet, etc.

and that section is still in the paper. In fact, we use these low-level
mechanisms to build all of the high-level concurrency constructs in Cforall.
|  | 842 |  | 
|---|


However, I still feel that the conclusion overstates the value of the
contribution here when it says that "Cforall high-level race-free monitors
and threads provide the core mechanisms for mutual exclusion and
synchronization, without the need for volatile and atomics". I feel
confident that Java programmers, for example, would be advised to stick
with synchronized methods whenever possible, and it seems to me that they
offer similar advantages -- but they sometimes wind up using volatiles for
performance reasons.

I think we are agreeing violently. 99.9% of Java/Cforall/Go/Rust concurrent
programs can achieve very good performance without volatile or atomics because
the runtime system has already used these low-level capabilities to build an
efficient set of high-level concurrency constructs.

0.1% of the time programmers need to build their own locks and synchronization
mechanisms. This need also occurs for storage management. Both of these
mechanisms are allowed in Cforall but are fraught with danger and should be
discouraged. Specifically, it takes a 7th dan Black Belt programmer to
understand fencing for a WSO memory model, such as on the ARM. It doesn't help
that the C++ atomics are baroque and incomprehensible. I'm sure Hans Boehm,
Doug Lea, Dave Dice, and I would agree that 99% of hand-crafted locks created
by programmers are broken and/or non-portable.


I was also confused by the term "race-free" in that sentence. In
particular, I don't think that Cforall has any mechanisms for preventing
*data races*, and it clearly doesn't prevent "race conditions" (which would
bar all sorts of useful programs). I suppose that "race free" here might be
referring to the improvements such as removing barging behavior.

We use the term "race free" to mean the same as Boehm/Adve's "data-race
freedom" in:

Hans-J. Boehm and Sarita V. Adve. You Don't Know Jack About Shared Variables
or Memory Models. Communications of the ACM, 55(2):48-54, 2012.
https://queue.acm.org/detail.cfm?id=2088916

which is cited in the paper. Furthermore, we never said that Cforall has
mechanisms for preventing *all* data races. We said Cforall high-level
race-free monitors and threads[, when used with mutex access functions] (added
to the paper) imply no data races within these constructs, unless a
programmer directly publishes shared state. This approach is exactly what
Boehm/Adve advocate for the vast majority of concurrent programming.
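
Concretely, the guarantee and its limit can be seen in a small sketch (our own
example): fields accessed only through mutex functions are race free, while
returning a pointer to a field publishes the shared state and voids the
guarantee.

monitor Counter { int count; };
void inc( Counter & mutex c ) { c.count += 1; }      // serialized by the monitor: no data race
int  get( Counter & mutex c ) { return c.count; }    // serialized by the monitor: no data race
int * leak( Counter & mutex c ) { return &c.count; } // publishes shared state: callers may now race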


It would perhaps be more interesting to see a comparison built using [tokio] or
[async-std], two of the more prominent user-space threading libraries that
build on Rust's async-await feature (which operates quite differently than
Javascript's async-await, in that it doesn't cause every async function call to
schedule a distinct task).

Done.


Several figures used the `with` keyword. I deduced that `with(foo)` permits
one to write `bar` instead of `foo.bar`. It seems worth introducing.
Apologies if this is stated in the paper; if so, I missed it.

Page 6, footnote F states:

The Cforall "with" clause opens an aggregate scope making its fields directly
accessible, like Pascal "with", but using parallel semantics; multiple
aggregates may be opened.
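
For example (our own illustration of the footnote's semantics; the types and
field names are arbitrary):

struct S { int i, j; };
struct T { int j, k; };
void f( S & s, T & t ) with( s, t ) {  // open both aggregates in parallel
    i = 1;      // s.i, unqualified
    k = 2;      // t.k, unqualified
    s.j = t.j;  // j occurs in both aggregates, so it must remain qualified
}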


On page 20, section 6.3, "external scheduling and vice versus" should be
"external scheduling and vice versa".

Fixed.


On page 5, section 2.3, the paper states "we content" but it should be "we
contend".

Fixed.


Page 1. I don't believe that it's fair to imply that Scala is a "research
vehicle", as it is used by major players, Twitter being the most prominent
example.

Fixed. Removed "research".


Page 15. Must Cforall threads start after construction (e.g., see your
example on page 15, line 21)?

Yes. Our experience in Java, uC++, and Cforall is that 95% of the time
programmers want threads to start immediately. (Most Java programs have no code
between a thread declaration and the call to start the thread.)  Therefore,
this semantic should be the default because (see page 13):

Alternatives, such as explicitly starting threads as in Java, are repetitive,
and forgetting to call start is a common source of errors.

To handle the other 5% of the cases, there is a trivial Cforall pattern
providing Java-style start/join. The additional cost for this pattern is small
in comparison to the work performed between the start and join.

thread T {};
void start( T & mutex ) {}  // any function name
void join( T & mutex ) {}   // any function name
void main( T & t ) {
    sout | "start";
    waitfor( start : t );   // wait to be started
    sout | "restart";       // perform work
    waitfor( join : t );    // wait to be joined
    sout | "join";
}

int main() {
    T t[3];                 // threads start and delay
    sout | "need to start";
    for ( i; 3 ) start( t[i] );
    sout | "need to join";
    for ( i; 3 ) join( t[i] );
    sout | "threads stopped";
} // threads deleted from stack

$ a.out
need to start
start
start
start
need to join
restart
restart
restart
threads stopped
join
join
join


Page 18, line 17: is using

Fixed.
|---|