Reviewing: 1

    I still have a couple of issues --- perhaps the largest is that it's
    still not clear at this point in the paper what some of these options
    are, or crucially how they would be used. I don't know if it's
    possible to give high-level examples or use cases to be clear about
    these up front - or if that would duplicate too much information from
    later in the paper - either way expanding out the discussion - even if
    just two a couple of sentences for each row - would help me more.

Section 2.1 is changed to address this suggestion.


    * 1st para section 2 begs the question: why not support each
      dimension independently, and let the programmer or library designer
      combine features?

As seen in Table 1, not all of the combinations work, and having programmers
directly use these low-level mechanisms is error prone. Accessing these
fundamental mechanisms through higher-level constructs has always been the
purpose of a programming language.


    * Why must there "be language mechanisms to create, block/unblock, and join
      with a thread"?  There aren't in Smalltalk (although there are in the
      runtime).  Especially given in Cforall those mechanisms are *implicit* on
      thread creation and destruction?

Fixed, changed sentence to:

 A programmer needs mechanisms to create, block and unblock, and join with a
 thread, even if these basic mechanisms are supplied indirectly through
 high-level features.
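To illustrate for this letter how the mechanisms can be supplied indirectly,
here is a minimal sketch of ours (not from the paper): creation and join are
implicit in declaration and scope exit, yet the create/join mechanisms still
exist underneath.

  thread T {};
  void main( T & t ) {          // thread body, started implicitly
      sout | "worker";
  }
  int main() {
      T t;                      // implicit create and start
      sout | "main";            // runs concurrently with t
  }                             // implicit join when t is destroyed at scope exit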


   * "Case 1 is a function that borrows storage for its state (stack
     frame/activation) and a thread from its invoker"

     this much makes perfect sense to me, but I don't understand how a
     non-stateful, non-threaded function can then retain

     "this state across callees, ie, function local-variables are
     retained on the stack across calls."

     how can it retain function-local values *across calls* when it
     doesn't have any functional-local state?

In the following example:

  void foo() {
     // local variables and code
  }
  void bar() {
     // local variables
     foo();
  }

bar is the caller and foo is the callee. bar borrows the program stack and
thread to make the call to foo. When foo, the callee, is executing, bar's local
variables (state) are retained on the *borrowed* stack across the call. (Note, I
added *borrowed* to that sentence in the paper to help clarify.)  Furthermore,
foo's local variables are also retained on the borrowed stack. When foo and bar
return, all of their local state is gone (not retained). This behaviour is
standard call/return semantics in an imperative language.


     I'm not sure if I see two separate cases here - roughly equivalent
     to C functions without static storage, and then C functions *with*
     static storage.

Yes, but there is only one instance of the static storage across all
activations of the C function. For generators and coroutines, each instance has
its own state, like an object in an OO language.


     I assumed that was the distinction between cases 1 & 3; but perhaps the
     actual distinction is that 3 has a suspend/resume point, and so the
     "state" in figure 1 is this component of execution state (viz figs 1 & 2),
     not the state representing the cross-call variables?

So case 3 is like an object with the added ability to retain where it was last
executing.  When a generator is resumed, the generator object (structure
instance) is passed as an explicit reference, and within this object is the
restart location in the generator's "main". When the generator main executes,
it uses the borrowed stack for its local variables and any functions it calls,
just like an object member borrows the stack for its local variables but also
has an implicit receiver to the object state.  A generator can have static
storage, too, which is a single instance across all generator instances of that
type, as for static storage in an object type. All the kinds of storage are at
play, with the same semantics as in other languages.
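As a concrete sketch for this letter (ours, not quoted from the paper,
assuming the paper's generator declaration syntax), a Fibonacci-style
generator makes the kinds of storage visible: cross-resume state lives in the
generator type, locals live on the borrowed stack, and the restart location is
stored implicitly in the instance.

  generator Fib {
      int fn1, fn;                    // cross-resume state: one copy per instance
  };
  void main( Fib & f ) with( f ) {    // generator main; restart location kept in f
      fn1 = 1; fn = 0;
      for () {
          suspend;                    // control returns to the resumer
          int fn2 = fn1; fn1 = fn; fn = fn + fn2;  // locals live on the borrowed stack
      }
  }
  int main() {
      Fib f;                          // object-like instance
      for ( i; 5 ) { resume( f ); sout | f.fn; }   // explicit receiver, like an object call
  }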


    > but such evaluation isn't appropriate for garbage-collected or JITTed
    > languages like Java or Go.
    For JITTed languages in particular, reporting peak performance needs to
    "warm up" the JIT with a number of iterators before beginning
    measurement. Actually for JIT's its even worse: see Edd Barrett et al
    OOPSLA 2017.

Of our testing languages, only Java and Node.js are JITted. To ensure the Java
test programs correctly measured the specific feature, we consulted with Dave
Dice at Oracle, who works directly on the development of the Oracle JVM
Just-in-Time Compiler. We modified our test programs based on his advice, and
he validated our programs as correctly measuring the specified language
feature. Dave's help is recognized in the Acknowledgment section.  Similarly,
we verified our Node.js programs with Gregor Richards, an expert in
just-in-time compilation for dynamic typing.  Hence, we have taken into account
the issues related to benchmarking JITted languages.  Also, all the benchmark
programs are publicly available for independent verification.


   * footnote A - I've looked at various other papers & the website to try to
     understand how "object-oriented" Cforall is - I'm still not sure.  This
     footnote says Cforall has "virtuals" - presumably virtual functions,
     i.e. dynamic dispatch - and inheritance: that really is OO as far as I
     (and most OO people) are concerned.  For example Haskell doesn't have
     inheritance, so it's not OO; while CLOS (the Common Lisp *Object* System)
     or things like Cecil and Dylan are considered OO even though they have
     "multiple function parameters as receivers", lack "lexical binding between
     a structure and set of functions", and don't have explicit receiver
     invocation syntax.  Python has receiver syntax, but unlike Java or
     Smalltalk or C++, method declarations still need to have an explicit
     "self" receiver parameter.  Seems to me that Go, for example, is
     more-or-less OO with interfaces, methods, and dynamic dispatch (yes also
     and an explicit receiver syntax but that's not determinative); while Rust
     lacks dynamic dispatch built-in.  C is not OO as a language, but as you
     say given it supports function pointers with structures, it does support
     an OO programming style.

     This is why I again recommend just not buying into this fight: not making
     any claims about whether Cforall is OO or is not - because as I see it,
     the rest of the paper doesn't depend on whether Cforall is OO or not.
     That said: this is just a recommendation, and I won't quibble over this
     any further.

We believe it is important to identify Cforall as a non-OO language because it
heavily influences the syntax and semantics used to build its concurrency.
Since many aspects of Cforall are not OO, the rest of the paper *does* depend
on Cforall being identified as non-OO, otherwise readers would have
significantly different expectations for the design. We did not mean to suggest
that languages that support function pointers with structures support an OO
style. We revised the footnote to avoid this interpretation. Finally, Golang
does not identify itself as an OO language.
https://golang.org/doc/faq#Is_Go_an_object-oriented_language


   * is a "monitor function" the same as a "mutex function"?
     if so the paper should pick one term; if not, make the distinction clear.

Fixed. Picked "mutex" and changed the term throughout the paper.


   * "As stated on line 1 because state declarations from the generator
      type can be moved out of the coroutine type into the coroutine main"

      OK sure, but again: *why* would a programmer want to do that?
      (Other than, I guess, to show the difference between coroutines &
      generators?)  Perhaps another way to put this is that the first
      para of 3.2 gives the disadvantages of coroutines vs-a-vs
      generators, briefly describes the extended semantics, but never
      actually says why a programmer may want those extended semantics,
      or how they would benefit.  I don't mean to belabour the point,
      but (generalist?) readers like me would generally benefit from
      those kinds of discussions about each feature throughout the
      paper: why might a programmer want to use them?

On page 8, it states:

  Having to manually create the generator closure by moving local-state
  variables into the generator type is an additional programmer burden (removed
  by the coroutine in Section 3.2). ...

Also, these variables can now be refactored into a helper function, where the
helper function suspends at an arbitrary call depth. So imagine a coroutine
helper function that is only called occasionally within the coroutine but has
a large array that is retained across suspends within the helper function. For
a generator, this large array has to be declared in the generator type,
enlarging each generator instance even though the array is only used
occasionally. Whereas, the coroutine only needs the array allocated when
needed. Now a coroutine has a stack that occupies storage, but the maximum
stack size only needs to be the call chain allocating the most storage,
whereas the generator must be sized for all variables that could be created.
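A sketch of the pattern for this letter (hypothetical names, following the
nested-routine style used in the logger example later in this response):

  coroutine C {};                    // no large array declared in the type
  void main( C & c ) {
      void helper() {
          char buf[64 * 1024];       // on the coroutine stack, only while helper is active
          buf[0] = 'x';
          suspend;                   // buf is retained across this suspend
          buf[1] = 'y';              // still valid after resumption
      }
      suspend;                       // fast path: no buffer allocated
      helper();                      // occasional slow path pays for the buffer
  }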


    > p17 if the multiple-monitor entry procedure really is novel, write a paper
    > about that, and only about that.
    > We do not believe this is a practical suggestion.
    * I'm honestly not trying to be snide here: I'm not an expert on monitor or
      concurrent implementations. Brinch Hansen's original monitors were single
      acquire; this draft does not cite any other previous work that I could
      see. I'm not suggesting that the brief mention of this mechanism
      necessarily be removed from this paper, but if this is novel (and a clear
      advance over a classical OO monitor a-la Java which only acquires the
      distinguished receiver) then that would be worth another paper in itself.

First, to explain multiple-monitor entry in Cforall as a separate paper would
require significant background on Cforall concurrency, which means repeating
large sections of this paper. Second, it feels like a paper just on
multiple-monitor entry would be a little thin, even if the capability is novel
to the best of our knowledge. Third, we feel multiple-monitor entry springs
naturally from the overarching theme in the paper that Cforall is a non-OO
programming language, allowing multiple-mutex receivers.
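For context in this letter, the mechanism at issue is a mutex function that
atomically acquires more than one monitor, e.g. (a condensed sketch in the
paper's notation, not verbatim from it):

  monitor Account { int balance; };
  void transfer( Account & mutex from, Account & mutex to, int amount ) {
      // both monitors are acquired atomically for the duration of the call,
      // so the transfer needs no explicit locking by the programmer
      from.balance -= amount;
      to.balance += amount;
  }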


    My typo: the paper's conclusion should come at the end, after the
    future work section.

Combined into a Conclusions and Future Work section.



Reviewing: 2

    on the Boehm paper and whether code is "all sequential to the compiler": I
    now understand the authors' position better and suspect we are in violent
    agreement, except for whether it's appropriate to use the rather breezy
    phrase "all sequential to the compiler". It would be straightforward to
    clarify that code not using the atomics features is optimized *as if* it
    were sequential, i.e. on the assumption of a lack of data races.

Fixed, "as inline and library code is compiled as sequential without any
explicit concurrent directive."


    on the distinction between "mutual exclusion" and "synchronization": the
    added citation does help, in that it makes a coherent case for the
    definition the authors prefer. However, the text could usefully clarify
    that this is a matter of definition not of fact, given especially that in
    my assessment the authors' preferred definition is not the most common
    one. (Although the mention of Hoare's apparent use of this definition is
    one data point, countervailing ones are found in many contemporaneous or
    later papers, e.g. Habermann's 1972 "Synchronization of Communicating
    Processes" (CACM 15(3)), Reed & Kanodia's 1979 "Synchronization with
    eventcounts and sequencers" (CACM 22(2)) and so on.)

Fixed, "We contend these two properties are independent, ...".

With respect to the two papers, Habermann fundamentally agrees with our
definitions, where the term mutual exclusion is the same but Habermann uses the
term "communication" for our synchronization. However, the term "communication"
is rarely used to mean synchronization. The fact that Habermann collectively
calls these two mechanisms synchronization is the source of the confusion.
Reed & Kanodia state:

  By mutual exclusion we mean any mechanism that forces the time ordering of
  execution of pieces of code, called critical regions, in a system of
  concurrent processes to be a total ordering.

But there is no timing order for a critical region (which I assume means
critical section); threads can arrive at any time in any order, where the
mutual exclusion for the critical section ensures one thread executes at a
time. Interestingly, Reed & Kanodia's critical region is Habermann's
communication, not critical section. These papers only buttress our contention
about the confusion of these terms in the literature.


    section 2 (an expanded version of what was previously section 5.9) lacks
    examples and is generally obscure and allusory ("the most advanced feature"
    -- name it! "in triplets" -- there is only one triplet!;

Fixed.

 These independent properties can be used to compose different language
 features, forming a compositional hierarchy, where the combination of all
 three is the most advanced feature, called a thread/task/process. While it is
 possible for a language to only provide threads for composing
 programs~\cite{Hermes90}, this unnecessarily complicates and makes inefficient
 solutions to certain classes of problems.


    what are "execution locations"? "initialize" and "de-initialize"

Fixed through an example at the end of the sentence.

 \item[\newterm{execution state}:] is the state information needed by a
 control-flow feature to initialize and manage both compute data and execution
 location(s), and de-initialize. For example, calling a function initializes a
 stack frame including contained objects with constructors, manages local data
 in blocks and return locations during calls, and de-initializes the frame by
 running any object destructors and management operations.


    what? "borrowed from the invoker" is a concept in need of explaining or at
    least a fully explained example -- in what sense does a plain function
    borrow" its stack frame?

A function has no storage except for static storage; it gets its storage from
the program stack. For a function to run, it must borrow storage from
somewhere. When the function returns, the borrowed storage is returned. Do you
have a more appropriate word?


    "computation only" as opposed to what?

This is a term in concurrency for operations that compute without blocking,
i.e., the operation starts with everything it needs to compute its result and
runs to completion, blocking only when it is done and returns its result.
Computation-only work occurs in "embarrassingly parallel" problems.


    in 2.2, in what way is a "request" fundamental to "synchronization"?

I assume you are referring to the last bullet.

 Synchronization must be able to control the service order of requests
 including prioritizing selection from different kinds of outstanding requests,
 and postponing a request for an unspecified time while continuing to accept
 new requests.

Habermann states on page 173:

 Looking at deposit we see that sender and receiver should be synchronized with
 respect to buffer overflow: deposit of another message must be delayed if
 there is no empty message frame, and such delay should be removed when the
 first empty frame becomes available.  This is programmed as
 synchronization(frame) : deposit is preceded by wait(frame), accept is
 followed by signal( frame), and the constant C[frame] is set equal to
 "bufsize."

Here synchronization is controlling the service order of requests: when the
buffer is full, requests for insert (sender) are postponed until a request to
remove (receiver) occurs.  Without the ability to control service order among
requests, the producer/consumer problem with a bounded buffer cannot be
solved. Hence, this capability is fundamental.


    and the "implicitly" versus "explicitly" point needs stating as elsewhere,
    with a concrete example (e.g. Java built-in mutexes versus
    java.util.concurrent).

Fixed.

 MES must be available implicitly in language constructs, \eg Java built-in
 monitors, as well as explicitly for specialized requirements, \eg
 @java.util.concurrent@, because requiring programmers to build MES using
 low-level locks often leads to incorrect programs.


    section 6: 6.2 omits the most important facts in preference for otherwise
    inscrutable detail: "identify the kind of parameter" (first say *that there
    are* kinds of parameter, and what "kinds" means!); "mutex parameters are
    documentation" is misleading (they are also semantically significant!) and
    fails to say *what* they mean; the most important thing is surely that
    'mutex' is a language feature for performing lock/unlock operations at
    function entry/exit. So say it!

These sections have been rewritten to address the comments.


    The meanings of examples f3 and f4 remain unclear.

Rewrote the paragraph.


    Meanwhile in 6.3, "urgent" is not introduced (we are supposed to infer its
    meaning from Figure 12,

Defined Hoare's urgent list at the start of the paragraph.


    but that Figure is incomprehensible to me), and we are told of "external
    scheduling"'s long history in Ada but not clearly what it actually means;

We do not know how to address your comment because the description is clear to
us and other non-reviewers who have read the paper. I had forgotten an
important citation to prior work on this topic, which is now referenced:

   Buhr, Peter A., Fortier, Michel, Coffin, Michael H. Monitor
   Classification. ACM Computing Surveys. 1995; 27(1):63-107.

This citation is a 45-page paper expanding on the topic of internal scheduling.
I also added a citation to Hoare's monitor paper where signal_block is defined:

  When a process signals a condition on which another process is waiting, the
  signalling process must wait until the resumed process permits it to
  proceed.

External scheduling from Ada was subsequently added to uC++ and Cforall.
Furthermore, Figure 20 shows a direct comparison of CFA procedure-call
external-scheduling with Go channel external-scheduling. So external scheduling
exists in languages beyond Ada.


    6.4's description of "waitfor" tells us it is different from an if-else
    chain but tries to use two *different* inputs to tell us that the behavior
    is different; tell us an instance where *the same* values of C1 and C2 give
    different behavior (I even wrote out a truth table and still don't see the
    semantic difference)

Again, it is unclear what the problem is. For the if-statement, if C1 is true,
it only waits for a call to mem1, even if C2 is true and there is an
outstanding call to mem2. For the waitfor, if both C1 and C2 are true and there
is a call to mem2 but not mem1, it accepts the call to mem2, which cannot
happen with the if-statement. So for all true when clauses, any outstanding
call is immediately accepted. If there are no outstanding calls, the waitfor
blocks until the next call to any of the true when clauses occurs. I added the
following sentence to further clarify.

 Hence, the @waitfor@ has parallel semantics, accepting any true @when@ clause.

Note, the same parallel semantics exists for the Go select statement with
respect to waiting for a set of channels to receive data. While Go select does
not have a when clause, it would be trivial to add one, making the select more
expressive.
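To make the contrast concrete for this letter, a sketch (assuming code inside
a mutex function of a monitor m with mutex members mem1 and mem2):

  // if-statement: sequential choice; with C1 true, only mem1 is accepted,
  // even if a call to mem2 is already outstanding
  if ( C1 ) waitfor( mem1 : m );
  else if ( C2 ) waitfor( mem2 : m );

  // waitfor with when guards: parallel choice; with C1 and C2 true, the first
  // outstanding (or next arriving) call to either mem1 or mem2 is accepted
  when ( C1 ) waitfor( mem1 : m );
  or when ( C2 ) waitfor( mem2 : m );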


    The authors frequently use bracketed phrases, and sometimes slashes "/", in
    ways that are confusing and/or detrimental to readability.  Page 13 line
    2's "forward (backward)" is one particularly egregious example.  In general
    I would recommend the the authors try to limit their use of parentheses and
    slashes as a means of forcing a clearer wording to emerge.

Many of the slashes and parentheticals have been removed. Some are retained to
express joined concepts: call/return, suspend/resume, resume/resume, I/O.


    Also, the use of "eg." is often cursory and does not explain the examples
    given, which are frequently a one- or two-word phrase of unclear referent.

A few of these are fixed.


    Considering the revision more broadly, none of the more extensive or
    creative rewrites I suggested in my previous review have been attempted,
    nor any equivalent efforts to improve its readability.

If you reread the previous response, we addressed all of your suggestions
except:

        An expositional idea occurs: start the paper with a strawman
        naive/limited realisation of coroutines -- say, Simon Tatham's popular
        "Coroutines in C" web page -- and identify point by point what the
        limitations are and how C\/ overcomes them. Currently the presentation
        is often flat (lacking motivating contrasts) and backwards (stating
        solutions before problems). The foregoing approach might fix both of
        these.

    We prefer the current structure of our paper and believe the paper does
    explain basic coding limitations and how they are overcome in using
    high-level control-flow mechanisms.

and we have addressed readability issues in this version.


    The hoisting of the former section 5.9 is a good idea, but the newly added
    material accompanying it (around Table 1) suffers fresh deficiencies in
    clarity. Overall the paper is longer than before, even though (as my
    previous review stated), I believe a shorter paper is required in order to
    serve the likely purpose of publication. (Indeed, the authors' letter
    implies that a key goal of publication is to build community and gain
    external users.)

This comment is the referee's opinion, with which we do not agree. Our
previous 35-page SP&E paper on Cforall:

  Moss, Aaron, Schluntz, Robert, Buhr, Peter A. Cforall: Adding Modern
  Programming Language Features to C. Softw. Pract. Exper. 2018;
  48(12):2111-2146.

has a similar structure and style to this paper, and it received an award from
John Wiley & Sons:

 Software: Practice & Experience for articles published between January 2017
 and December 2018, "most downloads in the 12 months following online
 publication showing the work generated immediate impact and visibility,
 contributing significantly to advancement in the field".

So we have demonstrated an ability to build community and gain external users.


    Given this trajectory, I no longer see a path to an acceptable revision of
    the present submission. Instead I suggest the authors consider splitting
    the paper in two: one half about coroutines and stack management, the other
    about mutexes, monitors and the runtime. (A briefer presentation of the
    runtime may be helpful in the first paper also, and a brief recap of the
    generator and coroutine support is obviously needed in the second too.)

Again we disagree with the referee's suggestion to vastly restructure the
paper. What advantage is there in presenting exactly the same material across
two papers, which will likely end up longer than a single paper? The current
paper has a clear theme that fundamental execution properties generate a set of
basic language mechanisms, and we then proceed to show how these mechanisms can
be designed into the programming language Cforall.


    I do not buy the authors' defense of the limited practical experience or
    "non-micro" benchmarking presented. Yes, gaining external users is hard and
    I am sympathetic on that point. But building something at least *somewhat*
    substantial with your own system should be within reach, and without it the
    "practice and experience" aspects of the work have not been explored.
    Clearly C\/ is the product of a lot of work over an extended period, so it
    is a surprise that no such experience is readily available for inclusion.

Understood. There are no agreed-upon concurrency benchmarks, which is why
micro-benchmarks are often used. Currently, the entire Cforall runtime is
written in Cforall (10,000+ lines of code (LOC)). This runtime is designed to
be thread safe, automatically detects the use of concurrency features at link
time, and bootstraps into a threaded runtime almost immediately at program
startup, so threads can be declared as global variables and may run to
completion before the program main starts. The concurrent core of the runtime
is 3,500 LOC and bootstraps from low-level atomic primitives into Cforall locks
and high-level features. In other words, the concurrent core uses itself as
quickly as possible to start using high-level concurrency. There are 12,000+
LOC in the Cforall test-suite used to verify language features, which are run
nightly. Of these, there are 2,000+ LOC of standard concurrent tests, such as
aggressive tests of each language feature, and classical examples such as
bounded buffer, dating service, matrix summation, quicksort, binary insertion
sort, etc.  More experience will be available soon, based on ongoing work in
the "future works" section. Specifically, non-blocking I/O is working with the
new Linux io_uring interface, and a new high-performance ready-queue is under
construction to take this change into account. With non-blocking I/O, it will
be possible to write applications like high-performance web servers, as is now
done in Rust and Go. Also completed are Java-style executors for work-based
concurrent programming and futures. Under construction is a high-performance
actor system.


    It does not seem right to state that a stack is essential to Von Neumann
    architectures -- since the earliest Von Neumann machines (and indeed early
    Fortran) did not use one.

Reference Manual, Fortran II for the IBM 704 Data Processing System, IBM, 1958, page 2
https://archive.computerhistory.org/resources/text/Fortran/102653989.05.01.acc.pdf

  Since a subprogram may call for other subprograms to any desired depth, a
  particular CALL statement may be defined by a pyramid of multi-level
  subprograms.

I think we may be differing on the meaning of stack. You may be imagining a
modern stack that grows and shrinks dynamically, whereas early Fortran
preallocated a stack frame for each function, like Python allocates a frame for
a generator.  Within each preallocated Fortran function is a frame for local
variables and a pointer to store the return value for a call.  The Fortran
call/return mechanism uses these frames to build a traditional call stack
linked by the return pointer. The only restriction is that a function stack
frame can only be used once, implying no direct or indirect recursion.  Hence,
without a stack mechanism, there can be no call/return to "any desired depth",
where the maximum desired depth is limited by the number of functions. So
call/return requires some form of stack; virtually all programming languages,
past and present, have call/return; and these languages run on Von Neumann
machines that do not distinguish between program and memory space, have mutable
state, and the concept of a pointer to data or code.


    To elaborate on something another reviewer commented on: it is a surprise
    to find a "Future work" section *after* the "Conclusion" section. A
    "Conclusions and future work" section often works well.

Done.


Reviewing: 3

    but it remains really difficult to have a good sense of which idea I should
    use and when. This applies in different ways to different features from the
    language:

    * coroutines/generators/threads: here there is some discussion, but it can
      be improved.
    * internal/external scheduling: I didn't find any direct comparison between
      these features, except by way of example.

See changes below.


    I would have preferred something more like a table or a few paragraphs
    highlighting the key reasons one would pick one construct or the other.

Section 2.1 is changed to address this suggestion.


    The discussion of clusters and pre-emption in particular feels quite rushed.

We believe a brief introduction to the Cforall runtime structure is important
because clustering within a user-level versus distributed system is unusual.
Furthermore, the explanation of preemption is important because several new
languages, like Go and Rust tokio, are not preemptive. Rust threads are
preemptive only because Rust uses kernel threads, which UNIX preempts.


    * Recommend to shorten the comparison on coroutine/generator/threads in
      Section 2 to a paragraph with a few examples, or possibly a table
      explaining the trade-offs between the constructs

Not done, see below.


    * Recommend to clarify the relationship between internal/external
      scheduling -- is one more general but more error-prone or low-level?

Done, see below.


    There is obviously a lot of overlap between these features, and in
    particular between coroutines and generators. As noted in the previous
    review, many languages have chosen to offer *only* generators, and to
    build coroutines by stacks of generators invoking one another.

Coroutines built from stacks of generators have problems, such as no symmetric
control-flow. Furthermore, stacks of generators have a problem with the
following programming pattern: logger is a library function called from a
function or a coroutine, where the doit for the coroutine suspends. With stacks
of generators, there has to be a function and a generator version of logger to
support these two scenarios. If logger is a library function, it may be
impossible to create the generator logger because the logger function is
opaque.

  #include <fstream.hfa>
  #include <coroutine.hfa>

  forall( otype T | { void doit( T ); } )
  void logger( T & t ) {
      doit( t );
  }

  coroutine C {};
  void main( C & c ) with( c ) {
      void doit( C & c ) { suspend; }
      logger( c );
  }
  void mem( C & c ) {
      resume( c );
  }

  int main() {
      C c;
      mem( c );
      mem( c );

      struct S {};
      S s;
      void doit( S & s ) {}
      logger( s );
  }

Additional text has been added to the start of Section 3.2 addressing this
comment.


    In fact, the end of Section 2.1 (on page 5) contains a particular paragraph
    that embodies this "top down" approach. It starts, "programmers can now
    answer three basic questions", and thus gives some practical advice for
    which construct you should use and when. I think giving some examples of
    specific applications that this paragraph, combined with some examples of
    cases where each construct was needed, would be a better approach.

    I don't think this comparison needs to be very long. It seems clear enough
    that one would

    * prefer generators for simple computations that yield up many values,
    * prefer coroutines for more complex processes that have significant
      internal structure,

We do not believe the number of values yielded is an important factor in
choosing between a generator or coroutine; either form can receive (input) or
return (output) millions of values. As well, simple versus complex computations
is also not an absolute criterion for selection, as both forms can be simple or
complex.

As stated in the paper, a coroutine brings generality to the implementation
because of the additional stack, whereas generators have restrictions on
standard software-engineering practices: variable placement, no helper
functions without creating an explicit generator stack, and no symmetric
control-flow.  It is true that simple generators probably suffer less from
these restrictions. Also, simplicity means the amount of work done is small,
so the cost of the suspend/resume becomes a factor, where experiments show
suspend/resume for a generator is 10 times faster than for a coroutine.


    * prefer threads for cases where parallel execution is desired or needed.

Agreed, but this description does not mention mutual exclusion and
synchronization, which are essential in any meaningful concurrent program.

Our point here is to illustrate that a casual "top down" explanation is
insufficient to explain the complexity of the underlying execution properties.
We presented some rules of thumb at the end of Section 2, but programmers must
understand all the underlying mechanisms and their interactions to exploit the
execution properties to their fullest, and to understand when a programming
language does or does not provide a desired mechanism.


    I did appreciate the comparison in Section 2.3 between async-await in
    JS/Java and generators/coroutines. I agree with its premise that those
    mechanisms are a poor replacement for generators (and, indeed, JS has a
    distinct generator mechanism, for example, in part for this reason).  I
    believe I may have asked for this in a previous review, but having read it,
    I wonder if it is really necessary, since those mechanisms are so different
    in purpose.

Given that you asked about this before, I believe other readers might also ask
the same question because async-await is very popular. So I think this section
does help to position the work in the paper among other work, and hence, it is
appropriate to keep it in the paper.


    I find the motivation for supporting both internal and external scheduling
    to be fairly implicit. After several reads through the section, I came to
    the conclusion that internal scheduling is more expressive than external
    scheduling, but sometimes less convenient or clear. Is this correct? If
    not, it'd be useful to clarify where external scheduling is more
    expressive.

    I would find it very interesting to try and capture some of the properties
    that make internal vs external scheduling the better choice.

    For example, it seems to me that external scheduling works well if there
    are only a few "key" operations, but that internal scheduling might be
    better otherwise, simply because it would be useful to have the ability to
    name a signal that can be referenced by many methods.

To address this point, the last paragraph on page 22 (now page 23) has been
augmented to the following:

 Given external and internal scheduling, what guidelines can a programmer use
 to select between them?  In general, external scheduling is easier to
 understand and code because only the next logical action (mutex function(s))
 is stated, and the monitor implicitly handles all the details.  Therefore,
 there are no condition variables, and hence, no wait and signal, which reduces
 coding complexity and synchronization errors.  If external scheduling is
 simpler than internal, why not use it all the time?  Unfortunately, external
 scheduling cannot be used if: scheduling depends on parameter value(s) or
 scheduling must block across an unknown series of calls on a condition
 variable (\ie internal scheduling).  For example, the dating service cannot be
 written using external scheduling.  First, scheduling requires knowledge of
 calling parameters to make matching decisions and parameters of calling
 threads are unavailable within the monitor.  Specifically, a thread within the
 monitor cannot examine the @ccode@ of threads waiting on the calling queue to
 determine if there is a matching partner.  (Similarly, if the bounded buffer
 or readers/writer are restructured with a single interface function with a
 parameter denoting producer/consumer or reader/writer, they cannot be solved
 with external scheduling.)  Second, a scheduling decision may be delayed
 across an unknown number of calls when there is no immediate match, so the
 thread in the monitor must block on a condition.  Specifically, if a thread
 determines there is no opposite calling thread with the same @ccode@, it must
 wait an unknown period until a matching thread arrives.  For complex
 synchronization, both external and internal scheduling can be used to take
 advantage of the best properties of each.
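For this letter, a condensed bounded-buffer fragment contrasting the two
styles (abbreviated in the spirit of the paper's Figure 13, not verbatim from
it):

  monitor Buffer { condition nonFull; int elems[10], front, back, count; };
  int remove( Buffer & mutex buf );                 // forward declaration

  // external scheduling: name the next admissible action; no condition variable
  void insert( Buffer & mutex buf, int elem ) with( buf ) {
      if ( count == 10 ) waitfor( remove : buf );   // when full, accept only remove
      elems[back] = elem; back = ( back + 1 ) % 10; count += 1;
  }
  // internal scheduling: same protocol with explicit wait/signal
  void insert2( Buffer & mutex buf, int elem ) with( buf ) {
      if ( count == 10 ) wait( nonFull );           // block until a remover signals
      elems[back] = elem; back = ( back + 1 ) % 10; count += 1;
  }
  int remove( Buffer & mutex buf ) with( buf ) {
      int elem = elems[front]; front = ( front + 1 ) % 10; count -= 1;
      signal( nonFull );                            // no-op if no inserter is waiting
      return elem;
  }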


    Consider the bounded buffer from Figure 13: if it had multiple methods for
    removing elements, and not just `remove`, then the `waitfor(remove)` call
    in `insert` might not be sufficient.

Section 6.4 Extended waitfor shows the waitfor is very powerful and can handle
your request:

  waitfor( remove : buffer ); or waitfor( remove2 : buffer );

and its shorthand form (not shown in the paper)

  waitfor( remove, remove2 : buffer );

A call to either of these remove functions satisfies the waitfor (exact
selection details are discussed in Section 6.4).


    The same is true, I think, of the `signal_block` function, which I
    have not encountered before;

In Tony Hoare's seminal paper on monitors, "Monitors: An Operating System
Structuring Concept", it states on page 551:

 When a process signals a condition on which another process is waiting, the
 signalling process must wait until the resumed process permits it to
 proceed. We therefore introduce for each monitor a second semaphore "urgent"
 (initialized to 0), on which signalling processes suspend themselves by the
 operation P(urgent).

Hence, the original definition of signal is in fact signal_block, i.e., the
signaller blocks. Later monitor implementations switched to nonblocking
signalling because most signals occur before returns, which allows the
signaller to continue execution, exit the monitor, and run concurrently with
the signalled thread that restarts in the monitor. When the signaller is not
going to exit immediately, signal_block is appropriate.
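A small sketch of the difference for this letter (hypothetical monitor, using
the Cforall condition operations):

  monitor M { condition c; };
  void waiter( M & mutex m ) with( m ) { wait( c ); }

  void signaller1( M & mutex m ) with( m ) {
      signal( c );        // nonblocking: the waiter runs after signaller1 exits
  }
  void signaller2( M & mutex m ) with( m ) {
      signal_block( c );  // blocking: the waiter runs now; signaller2 continues after
  }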


    it seems like its behavior can be modeled with multiple condition
    variables, but that's clearly more complex.

Yes. Buhr, Fortier and Coffin show in Monitor Classification, ACM Computing
Surveys, 27(1):63-107, that all extant monitors with different signalling
semantics can be transformed into each other. However, some transformations are
complex and expensive at runtime.


    One question I had about `signal_block`: what happens if one signals
    but no other thread is waiting? Does it block until some other thread
    waits? Or is that user error?

On page 20, it states:

  Signalling is unconditional because signalling an empty condition queue does
  nothing.

To the best of our knowledge, all monitors have the same semantics for
signalling an empty condition queue, regardless of the kind of signal, i.e.,
signal or signal_block.


    I believe that one difference between the Go program and the Cforall
    equivalent is that the Goroutine has an associated queue, so that
    multiple messages could be enqueued, whereas the Cforall equivalent is
    effectively a "bounded buffer" of length 1. Is that correct?

Actually, the buffer length is 0 for the Cforall call and the Go unbuffered
send, so both are synchronous communication.


    I think this should be stated explicitly. (Presumably, one could modify the
    Cforall program to include an explicit vector of queued messages if
    desired, but you would also be reimplementing the channel abstraction.)

Fixed, by adding the following sentences:

  The difference between call and channel send occurs for buffered channels,
  making the send asynchronous.  In \CFA, asynchronous call and multiple
  buffers are provided using an administrator and worker
  threads~\cite{Gentleman81} and/or futures (not discussed).


    Also, in Figure 20, I believe that there is a missing `mutex` keyword.

Fixed.

    I was glad to see that the paper acknowledged that Cforall still had
    low-level atomic operations, even if their use is discouraged in favor of
    higher-level alternatives.

There was never an attempt to not acknowledge that Cforall has low-level
atomic operations. The original version of the paper stated:

  6.6 Low-level Locks
  For completeness and efficiency, Cforall provides a standard set of low-level
  locks: recursive mutex, condition, semaphore, barrier, etc., and atomic
  instructions: fetchAssign, fetchAdd, testSet, compareSet, etc.

and that section is still in the paper. In fact, we use these low-level
mechanisms to build all of the high-level concurrency constructs in Cforall.


    However, I still feel that the conclusion overstates the value of the
    contribution here when it says that "Cforall high-level race-free monitors
    and threads provide the core mechanisms for mutual exclusion and
    synchronization, without the need for volatile and atomics". I feel
    confident that Java programmers, for example, would be advised to stick
    with synchronized methods whenever possible, and it seems to me that they
    offer similar advantages -- but they sometimes wind up using volatiles for
    performance reasons.

I think we are agreeing violently. 99.9% of Java/Cforall/Go/Rust concurrent
programs can achieve very good performance without volatile or atomics because
the runtime system has already used these low-level capabilities to build an
efficient set of high-level concurrency constructs.

0.1% of the time programmers need to build their own locks and synchronization
mechanisms. This need also occurs for storage management. Both of these
mechanisms are allowed in Cforall but are fraught with danger and should be
discouraged. Specifically, it takes a 7th dan Black Belt programmer to
understand fencing for a WSO memory model, such as on the ARM. It doesn't help
that the C++ atomics are baroque and incomprehensible. I'm sure Hans Boehm,
Doug Lea, Dave Dice and I would agree that 99% of hand-crafted locks created
by programmers are broken and/or non-portable.


    I was also confused by the term "race-free" in that sentence. In
    particular, I don't think that Cforall has any mechanisms for preventing
    *data races*, and it clearly doesn't prevent "race conditions" (which would
    bar all sorts of useful programs). I suppose that "race free" here might be
    referring to the improvements such as removing barging behavior.

We use the term "race free" to mean the same as Boehm/Adve's "data-race
freedom" in

  Boehm, Hans-J., Adve, Sarita V. You Don't Know Jack About Shared Variables
  or Memory Models. Communications of the ACM. 2012; 55(2):48-54.
  https://queue.acm.org/detail.cfm?id=2088916

which is cited in the paper. Furthermore, we never said that Cforall has
mechanisms for preventing *all* data races. We said Cforall high-level
race-free monitors and threads[, when used with mutex access functions] (added
to the paper) imply no data races within these constructs, unless a programmer
directly publishes shared state. This approach is exactly what Boehm/Adve
advocate for the vast majority of concurrent programming.


    It would perhaps be more interesting to see a comparison built using [tokio] or
    [async-std], two of the more prominent user-space threading libraries that
    build on Rust's async-await feature (which operates quite differently than
    Javascript's async-await, in that it doesn't cause every aync function call to
    schedule a distinct task).

Done.

    Several figures used the `with` keyword. I deduced that `with(foo)` permits
    one to write `bar` instead of `foo.bar`. It seems worth introducing.
    Apologies if this is stated in the paper, if so I missed it.

Page 6, footnote F states:

  The Cforall "with" clause opens an aggregate scope making its fields directly
  accessible, like Pascal "with", but using parallel semantics; multiple
  aggregates may be opened.
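For example (a two-line illustration of ours, not from the paper):

  struct S { int i, j; };
  void f( S & s ) with( s ) { i = 1; j = 2; }  // i and j abbreviate s.i and s.j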


    On page 20, section 6.3, "external scheduling and vice versus" should be
    "external scheduling and vice versa".

Fixed.


    On page 5, section 2.3, the paper states "we content" but it should be "we
    contend".

Fixed.


    Page 1. I don't believe that it s fair to imply that Scala is "research
    vehicle" as it is used by major players, Twitter being the most prominent
    example.

Fixed. Removed "research".


    Page 15. Must Cforall threads start after construction (e.g. see your
    example on page 15, line 21)?

Yes. Our experience in Java, uC++ and Cforall is that 95% of the time
programmers want threads to start immediately. (Most Java programs have no code
between a thread declaration and the call to start the thread.)  Therefore,
this semantics should be the default because (see page 13):

  Alternatives, such as explicitly starting threads as in Java, are repetitive
  and forgetting to call start is a common source of errors.

To handle the other 5% of the cases, there is a trivial Cforall pattern
providing Java-style start/join. The additional cost for this pattern is small
in comparison to the work performed between the start and join.

  thread T {};
  void start( T & mutex ) {} // any function name
  void join( T & mutex ) {} // any function name
  void main( T & t ) {
      sout | "start";
      waitfor( start : t ); // wait to be started
      sout | "restart"; // perform work
      waitfor( join : t ); // wait to be joined
      sout | "join";
  }

  int main() {
      T t[3]; // threads start and delay
      sout | "need to start";
      for ( i; 3 ) start( t[i] );
      sout | "need to join";
      for ( i; 3 ) join( t[i] );
      sout | "threads stopped";
  } // threads deleted from stack

  $ a.out
  need to start
  start
  start
  start
  need to join
  restart
  restart
  restart
  threads stopped
  join
  join
  join


    Page 18, line 17: is using

Fixed.