source: doc/papers/concurrency/mail @ a573c22

Last change on this file since a573c22 was 97a1544, checked in by Peter A. Buhr <pabuhr@…>, 5 years ago

email related to paper

Date: Sun, 1 Jul 2018 12:31:39 +0000
From: Kevin Flores <>
Subject: SPE-18-0205 successfully submitted

Dear Dr Buhr,

Your manuscript entitled "Concurrency in C∀" has been received by Software:
Practice and Experience. It will be given full consideration for publication in
the journal.

Your manuscript number is SPE-18-0205.  Please mention this number in all
future correspondence regarding this submission.

You can view the status of your manuscript at any time by checking your Author
Center after logging into  If you have
difficulty using this site, please click the 'Get Help Now' link at the top
right corner of the site.

Thank you for submitting your manuscript to Software: Practice and Experience.

Software: Practice and Experience Editorial Office

Date: Wed, 3 Oct 2018 21:25:28 +0000
From: Richard Jones <>
Subject: Software: Practice and Experience - Decision on Manuscript ID
 SPE-18-0205

Dear Dr Buhr,

Many thanks for submitting SPE-18-0205 entitled "Concurrency in C∀" to Software: Practice and Experience.

In view of the comments of the referees found at the bottom of this letter, I cannot accept your paper for publication in Software: Practice and Experience. I hope that you find the referees' very detailed comments helpful.

Thank you for considering Software: Practice and Experience for the publication of your research.  I hope the outcome of this specific submission will not discourage you from submitting future manuscripts.

Yours sincerely,

Prof. Richard Jones
Editor, Software: Practice and Experience

Referee(s)' Comments to Author:

Reviewing: 1

Comments to the Author
"Concurrency in Cforall" presents a design and implementation of a set of standard concurrency features, including coroutines, user-space and kernel-space threads, mutexes, monitors, and a scheduler, for a polymorphic derivation of C called Cforall.

Section 2 is an overview of sequential Cforall that does not materially contribute to the paper. A brief syntax explanation where necessary in examples would be plenty.

Section 3 begins with an extensive discussion of concurrency that also does not materially contribute to the paper. A brief mention of whether a particular approach implements cooperative or preemptive scheduling would be sufficient. Section 3 also makes some unfortunate claims, such as C not having threads -- C does in fact define threads, and this is noted as being true in a footnote, immediately after claiming that it does not. The question remains why the C11 parallelism design is insufficient and in what way this paper proposes to augment it. While I am personally a proponent of parallel programming languages, backing the assertion that all modern languages must have threading with citations from 2005 ignores the massive popularity of modern non-parallel languages (JavaScript, Node.js, TypeScript, Python, Ruby, etc.) and parallel languages that are not thread based, although the authors are clearly aware of such approaches.

Sections 3.1 and 3.2 discuss asymmetric and symmetric coroutines. This also does not seem to materially contribute to a paper that is ostensibly about concurrency in a modern systems programming language. The area of coroutines, continuations, and generators is already well explored in the context of systems languages, including compilation techniques for these constructs that are more advanced than the stack instantiation model discussed in the paper.

Section 3.3 describes threads in Cforall, briefly touching on user-space vs. kernel-space thread implementations without detailing the extensive practical differences. It is unclear how the described interface differs from C++11 threads, as the description seems to center on an RAII style approach to joining in the destructor.
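
For context, the RAII style of joining that the referee mentions can be sketched in C++ as below. This is an illustrative assumption about the pattern, not code from the paper or from Cforall: the names `JoiningThread` and `run_two_workers` are hypothetical. A plain C++11 `std::thread` calls `std::terminate` if it is destroyed while still joinable, so the join-in-destructor behaviour has to be added by a wrapper like this one (C++20's `std::jthread` provides it directly).

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Hypothetical wrapper: the destructor joins, so leaving the enclosing
// scope waits for the thread to finish.
class JoiningThread {
public:
    template <typename F>
    explicit JoiningThread(F&& work) : worker_(std::forward<F>(work)) {}

    JoiningThread(const JoiningThread&) = delete;
    JoiningThread& operator=(const JoiningThread&) = delete;

    ~JoiningThread() {
        if (worker_.joinable()) worker_.join();  // implicit join at scope exit
    }

private:
    std::thread worker_;
};

// Starts two workers that each bump a shared counter, then returns the total.
int run_two_workers() {
    std::atomic<int> counter{0};
    {
        JoiningThread a([&counter] { counter.fetch_add(1); });
        JoiningThread b([&counter] { counter.fetch_add(1); });
    }  // both destructors have run here, so both threads have finished
    assert(counter.load() == 2);
    return counter.load();
}
```

Calling `run_two_workers()` returns 2, because the block cannot be left before both threads are joined.
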
Section 4 briefly touches on a collection of well known synchronisation primitives. Again, this discussion does not materially contribute to the paper.

Section 5 describes monitors, which are a well known and well researched technique. The Cforall implementation is unsurprising. The "multi-acquire semantics" described are not a contribution of this paper, as establishing a stable order for lock acquisition is a well known technique, one example of which is the C++ std::scoped_lock.
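
As a point of reference for the comparison the referee draws, here is a minimal C++17 sketch of acquiring two mutexes together with `std::scoped_lock`. The `Account` type and both function names are hypothetical and chosen only for illustration. One caveat: the C++ standard requires only that `std::scoped_lock` avoid deadlock when given several mutexes; it does not specify the algorithm, so an implementation may use try-and-back-off rather than a fixed acquisition order.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical account type: one mutex guards one balance.
struct Account {
    std::mutex guard;
    long balance = 0;
};

// Moves `amount` between accounts while holding both mutexes.
void transfer(Account& from, Account& to, long amount) {
    if (&from == &to) return;  // locking the same mutex twice is undefined
    std::scoped_lock lock(from.guard, to.guard);  // deadlock-avoiding acquire
    from.balance -= amount;
    to.balance += amount;
}

// Two threads transfer in opposite directions; with naive nested locking
// this pattern can deadlock. Returns the combined balance afterwards.
long run_opposing_transfers(int rounds) {
    Account a, b;
    a.balance = 1000;
    b.balance = 1000;
    std::thread t1([&] { for (int i = 0; i < rounds; ++i) transfer(a, b, 1); });
    std::thread t2([&] { for (int i = 0; i < rounds; ++i) transfer(b, a, 1); });
    t1.join();
    t2.join();
    assert(a.balance + b.balance == 2000);  // money is conserved
    return a.balance + b.balance;
}
```
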
Section 6 is a discussion of scheduling that does not appear to be informed by the literature. There is no discussion of work-stealing vs. work-scheduling, static vs. dynamic priorities, priority inversion, or fairness. There is a claim in section 6.1 for a novel technique, partial signalling, that appears to be a form of dynamic priority, but no comparison is made. In section 6.6, a very brief mention of other synchronisation techniques is made, without reference to current techniques such as array-based locks, CLH or MCS queue locks, RCU and other epoch-based mechanisms, etc. Perhaps these are considered out of scope.

Section 7 discusses parallelism, but does not materially contribute to the paper. It is claimed that preemption is necessary to implement spinning, which is not correct, since two cores can implement a spinning based approach without preemption. It is claimed that with thread pools "concurrency errors return", but no approach to removing concurrency errors with either preemptive or cooperatively scheduled user threads has been proposed in the paper that would not also apply to thread pools.
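
The spinning pattern the referee has in mind can be sketched in C++ as follows; the function name `spin_handoff` is hypothetical. One thread busy-waits on an atomic flag while another sets it. If the two threads run on different cores, the spinner is released by the store from the other core and never needs to be preempted. Note the limits of the sketch: ordinary operating systems preempt kernel threads anyway, so this code also terminates on a single core and cannot by itself demonstrate the absence of preemption.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// One thread spins until another publishes a value; returns what the
// spinning thread observed.
int spin_handoff() {
    std::atomic<bool> ready{false};
    int payload = 0;
    int observed = -1;

    std::thread consumer([&] {
        // Busy-wait: no blocking call, no yield.
        while (!ready.load(std::memory_order_acquire)) { }
        observed = payload;  // safe: ordered after the release store below
    });
    std::thread producer([&] {
        payload = 42;
        ready.store(true, std::memory_order_release);
    });

    consumer.join();
    producer.join();
    assert(observed == 42);
    return observed;
}
```
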
Section 8 is intended to describe the Cforall runtime structure, but does so in a way that uses terminology in an unfamiliar way. The word cluster is more usually used in distributed systems, but here refers to a process. The term virtual processor is more usually used in hardware virtualisation, but here refers to a kernel thread. The term debug kernel is more usually used in operating systems to refer to kernels that have both debug info and a method for using a debugger in kernel space, but here refers to a debug build of a user-space process. This section does not materially contribute to the paper.

Section 9 is intended to describe the Cforall runtime implementation. It makes some unusual claims, such as C libraries migrating to stack chaining (stack chaining was an experimental GCC feature that has been abandoned, much as it has been abandoned in both Go and Rust).

The performance measurements in section 10 are difficult to evaluate. While I appreciate that comparable concurrency benchmarks are very difficult to write, and the corpus of existing benchmarks primarily boils down to the parallel programs in the Computer Language Benchmark Game, the lack of detail as to what is being measured in these benchmarks (particularly when implemented in other languages) is unfortunate. For example, in table 3, the benchmark appears to measure uncontended lock access, which is not a useful micro-benchmark.

It is not clear what the contributions of this paper are intended to be. A concise listing of the intended contributions would be helpful. Currently, it appears that the paper makes neither PL contributions in terms of novel features in Cforall, nor does it make systems contributions in terms of novel features in the runtime.

Reviewing: 2

Comments to the Author
This article presents the design and rationale behind the concurrency
features of C-forall, a new low-level programming language.  After an
introduction that defines a selection of standard terminology, section
2 gives crucial background on the design of the C-forall language.
Section 3 then starts the core of the article, discussing the
language's support for "concurrency" which in this case means
coroutines and threads; a very brief Section 4 builds on section 3
with a discussion of lower level synchronizations.  Section 5 then
presents the main features of concurrency control in C-forall:
monitors and mutexes. Section 6 then extends monitors with condition
variables to support scheduling, and a very brief section 7
discusses preemption and pooling. Section 8 discusses the runtime
conceptual model, section 9 gives implementation detail, and section
10 briefly evaluates C-forall's performance via five concurrent
micro benchmarks. Finally section 11 concludes the article, and then
section 12 presents some future work.

At the start of section 7, the article lays out its rationale: that
"historically, computer performance was about processor speeds" but
"Now, high-performance applications must care about parallelism,
which requires concurrency". The doomsayers trumpeting the death of
Moore's law have been proved correct at last, with CPUs' sequential
performance increasing much more slowly than the number of cores
within each die. This means programmers --- especially low-level,
systems programmers --- must somehow manage the essential complexity
of writing concurrent programs to run in parallel in multiple threads
across multiple cores. Unfortunately, the most venerable widely used
systems programming language, C, supports parallelism only via a
library, e.g. the threads library.  This article aims to integrate concurrent
programming mechanisms more closely into a novel low-level C-based
programming language, C-forall. The article gives an outline of much of
C-forall, presents a series of concurrency mechanisms, and finally
some microbenchmark results.  The article is detailed, comprehensive,
and generally well written in understandable English.

My main concern about the article is indicated by the fact that the
best summary of the problem the design of concurrent C-forall sets
out to solve is buried more than halfway through the article in section
7, as above, and then the best overview of the proposed solution is
given in the 2nd, 4th and 5th sentence of the conclusion:

   "The approach provides concurrency based on a preemptive M:N
    user-level threading-system, executing in clusters, which
    encapsulate scheduling of work on multiple kernel threads
    providing parallelism... High-level objects (monitor/task) are the
    core mechanism for mutual exclusion and synchronization. A novel
    aspect is allowing multiple mutex-objects to be accessed
    simultaneously reducing the potential for deadlock for this
    complex scenario."

That is, in my reading of the article, it proceeds bottom up rather
than top down, and so my main recommendation is to essentially reverse
the order of the article, proceeding from the problem to be solved,
the high level architecture of the proposed solutions, and then going
down to the low-level mechanisms.  My biggest problem reading the
article was for explanations of why a particular decision was taken,
or why a particular mechanism may be used --- often this description
is actually later in the article, but at that point it's too late for
the reader.  I have tried to point out most of these places in the
detailed comments below.

My second concern is that the article makes several claims that are
not really justified by the design or implementation in the article.
These include claims that this approach meets the expectations of C
programmers, is minimal, is implemented in itself, etc.  The article
doesn't generally offer evidence to support these assertions (for many
of them, that would require empirical studies of programmers, or at
least corpus studies). The solution here is to talk about motivations
for the design choices "we made these decisions hoping that C
programmers would be comfortable" rather than claims of fact "C
programmers are comfortable".  Again I attempt to point these out below.

* abstract: needs to characterize the work top down, and not make
  claims "features respect the expectations of C programmers" that
  are not supported empirically.

* p1 line 14 "integrated"

* introduction needs to introduce the big ideas and scope of the
  article, not define terms.  Some of the terms / distinctions are
  non-standard (e.g. the distinction between "concurrency" and
  "parallelism") and can be avoided by using more specific terms
  (mutual exclusion, synchronization, parallel execution, etc.).

* to me this article introduces novel language features, not just an
  API.  Similarly, it doesn't talk about any additions "to the
  language translator" - i.e. compiler changes! - rather about language
  features.

* section 2 lines 6-9 why buy this fight against object-orientation?
  this article doesn't need to make this argument, but needs to do a
  better job of it if it does (see other comments below)

* sec 2.1 - are these the same as C++. If so, say so, if not, say why
  not.

* 2.2 calling it a "with statement" was confusing, given that a with
  clause can appear in a routine declaration with a shorthand syntax.

* 2.3 again compare with C++ and Java (as well as Ada)

* line 9 "as we will see in section 3"

* 2.4 I really quite like this syntax for operators, destructors not
  so much.

* 2.5 and many places elsewhere. Always first describe the semantics
  of your language constructs, then describe their properties, then
  compare with e.g. related languages (mostly C++ & Java?).  E.g. in
  this case, something like:

  "C-forall includes constructors, which are called to initialize
  newly allocated objects, and destructors, which are called when
  objects are deallocated. Constructors and destructors are written as
  functions returning void, under the special names "?{}" for
  constructors and "^{}" for destructors: constructors may be
  overridden, but destructors may not be.  The semantics of C-forall's
  constructors and destructors are essentially those of C++."

  this problem repeats many times throughout the article and should be
  fixed everywhere.

* 2.6 again, first describe then properties then comparison.
   in this case, compare e.g. with C++ templates, Java/Ada generics
   etc.

* why special case forward declarations? It's not 1970 any more.

* what are traits?  structural interfaces (like Go interfaces) or
  nominal bindings?

* section 3 - lines 2-30, also making very specific global definitions
  as in the introduction. The article does not need to take on this
  fight either, rather make clear that this is the conceptual model in
  C-forall. (If the article starts at the top and works down, that may
  well follow anyway).

* "in modern programming languages... unacceptable"; "in a
  system-level language.. concurrent programs should be written with
  high-level features" - again, no need to take on these fights.

* 3.1 onwards; I found all this "building" up hard to follow.
  also it's not clear a "minimal" API must separately support
  coroutines, threads, fibres, etc

* FIG 2B - where's the output?
  syntax "sout | next(f1) | next(f2) | endl" nowhere explained
    why not use C++'s << and >>

* FIG 3 be clearer, earlier about the coroutine constructor syntax

** ensure all figures are placed *after* their first mention in the
   text. consider interleaving smaller snippets of text rather than
   just referring to large figures

* sec 3.1 p7 etc. need more context / comparison e.g. Python
  generators etc.

* FIGURE 4 is this right?  should there be a constructor for Cons taking
  a Prod?

* sec 3.2 order of constructors depends on the language.  more
  generally, if the article is going to make arguments against OO
  (e.g. section 2) then the article needs to explain, in detail, why
  e.g. coroutine, thread, etc *cannot* be classes / objects.

* "type coroutine_t must be an abstract handle.. descriptor and its
  stack are non-copyable" - too many assumptions in here (and other
  similar passages) that are not really spelled out in detail.

* p10 line 4 introduces "coroutine" keyword. needs to give its
  semantics. also needs to introduce and define properties and compare
  before all the examples using coroutines.

* p10 again, trait semantics need to be better defined

* 3.3 should be an introduction to this section. Note that section
  titles are not part of the text of the article.

* what's the difference between "coroutines" and "user threads" (and
  "fibres?")

* what's a "task type" or an "interface routine"  or "underlying
  thread"

* section 4 - "... meaningless". nope some semantics are possible
  e.g. if there's a memory model.

* what are "call/return based languages"

* p12 - what if a programmer wants to join e.g. "1st of N" or "1st 3 of N"
  threads rather than all threads in order

* 4.1 p12 13-25, again it's not clear where this is going.  presenting the model
  top down may hopefully resolve this

* section 4 should be merged e.g. into sec 3 (or 5)

* section 5 p13 what's "routine" scope. "call/return paradigm"

* thread/ coroutine declarations, traits etc, all look pretty close to
  inheritance. why wouldn't inheritance work?

* open/closed locks = free/acquired free locks?

* testability?

* p14 lines 14-20 I had trouble following this.  e.g. what's the
  difference between "a type that is a monitor" and "a type that looks
  like a monitor"?  why?

* line 39 - what's an "object-oriented monitor"?    Java?
    there is no one OO model of such things.

* line 47 significant asset - how do you know?

* how could this e.g. build a reader/writer lock

* p15 what's the "bank account transfer problem"

* p16 lines 6-10  why? explain?

* p17 semantics of arrays of conditions is unclear
     given e.g. previous comments about arrays of mutexes.

* p18 define "spurious wakeup"

* p18 line 44 - "a number of approaches were examined"?  which
 approaches? examined by whom?  if this is a novel contribution, needs
 rather more there, and more comparison with related work

* FIG 8 consider e.g. sequence diagrams rather than code to show these
  cases

* 6.2 p19 line 5 "similarly, monitor routines can be added at any
  time" really?  I thought C-forall was compiled? there's a big
  difference between "static" and "dynamic" inheritance. which is this
  closer to?

* line 25 "Figure 9 (B) shows the monitor implementation"
   I didn't understand this, especially not as an implementation.

* section 6.6 - if the article is to make claims about completeness,
  about supporting low and high level operations, then this must be
  expanded to give enough detail to support that argument

* "truest realization" huh?

* section 7 should be merged into 6 or 8.
  it's not clear if this is exploring rejected alternatives,
  or outlining different features offered by C-forall, or what.

* sec 7.2 how do the other threads in sections 5 & 6 relate to the
  user threads, fibres, etc here;

* sec 8.1 I found these sections hard to follow. how is a cluster a
  "collection of threads and virtual processors... like a virtual
  machine"? Where do the thread pools from 7.3 fit in?

* sec 8.3 is out of place, probably unneeded in the paper

* section 9 dives straight into details with no overview.  Section 9
  seems very detailed, and depends on assumptions or details that are
  not in the article.

* section 10 covers only microbenchmarks. are there any moderate sized
  macrobenchmarks that can compare across the different systems?
  (e.g the Erlang Ring?)

* sec 11 claims that "the entire C-forall runtime system are written
  in C-forall". The article doesn't

* future work should precede conclusion, not follow it

* the article should have a related work section (2-3 pages) comparing
  the design overall with various competing designs (C++, Java, go,
  Rust,...)

To encourage accountability, I'm signing my reviews in 2018. For the record, I am James Noble,

Reviewing: 3

Comments to the Author
This paper describes the design and implementation of coroutine- and thread-based concurrency in the C-for-all (I will write "C\/") system, a considerably extended form of the C language with many concurrency features.

It first provides an overview of the non-concurrency-related aspects of the host language (references, operator overloading, generics, etc.), then addresses several technical issues around concurrency, including the multi-monitor design, bulk acquiring of locks (including deadlock-avoiding management of acquisition order), solutions to difficult scheduling problems around these, and implementation of monitors in the presence of separate compilation. It also presents empirical data showing the execution times of several microbenchmarks in comparison with other threaded concurrency systems, in support of the claim that the implementation is competitive with them.

Overall the impression I gained is that this is a substantial system into which have gone much thought and effort.

However, the present paper is not written so as to communicate sufficiently clearly the novel practices or experiences that emerged from that effort. This manifests itself in several ways.

The system is described in general, rather than with a focus on novel insights or experiences. It was not until page 18 that I found a statement that hinted at a possible core contribution: "Supporting barging prevention as well as extending internal scheduling to multiple monitors is the main source of complexity in design and implementation of C\/ concurrency." Even then, it is unclear whether such challenges have already been surmounted in prior systems, or what other challenges the paper may also be covering. The most complete list of claims appears to be in the Conclusion (section 11; oddly not the last section), although not everything listed is a novel feature of the work (e.g. N:M threading models are an old idea). This presentation needs to be completely inverted, to focus from the outset on the claimed novel/noteworthy experiences that the work embodies.

The text describing the system's motivation is unconvincing on one point: the claim that library support for threading in C is "far from widespread" (p5, footnote A). The pthreads library API is standardised, albeit not in the C language specification but rather in POSIX -- a widespread standard indeed. (With systems languages, even if the language does not define a feature, it of course does not follow that that feature is not available -- since such languages permit extension of their own runtime and/or toolchain.) Of course, the combination of C and pthreads does not provide close to the full complement of C\/-supported features, so it is easy to make a case for C\/'s targeted "gap in the market". But again, a presentation focused on novel aspects would bring this out and enable the reader to learn from the authors' efforts much more readily.

Certain sections of the text read like a tutorial on concurrency... which is potentially valuable, but does not seem to belong here. For example, much effort is spent introducing the notions of "synchronization" and "mutual exclusion", including the whole of Section 4.2. Presently it is unclear how this content supports the findings/experiences that the paper is detailing.

Similarly, section 8 reads mostly as a basic introduction to user versus kernel threading implementations (including hybrid models such as N:M scheduling), and appears superfluous to this paper. Mixed into this are details of C\/'s specific approach. These could instead be stated directly, with references to handle the unlikely case where the reader is unfamiliar.

I also found the definitions of certain terms through the paper a bit non-standard, for unclear reasons. For example, why "condition lock" rather than the standard "condition variable" (if indeed that is what is intended)? To say that "synchronisation" is about "timing" strikes me as potentially confusing, since in truth synchronisation concerns only relative timing, i.e. ordering. (Even ordering is something of a derived concept -- since of course, most commonly, control over ordering is built atop synchronisation primitives, rather than being provided directly by them.)
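
The referee's point that synchronisation is about ordering rather than absolute timing, and that ordering is built on top of primitives, can be illustrated with a standard condition variable. The following C++ sketch is an assumption-laden illustration, not taken from the paper; `ordered_log` and the variable names are hypothetical. Whichever thread the scheduler runs first, step "A" is always recorded before step "B".

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>

// Forces step "A" to be logged before step "B", independent of how the
// two threads are scheduled. Returns the resulting log.
std::string ordered_log() {
    std::mutex m;
    std::condition_variable first_done;
    bool done = false;  // the condition itself; the variable only signals it
    std::string log;

    std::thread second([&] {
        std::unique_lock<std::mutex> lock(m);
        // The predicate form re-checks `done`, so a wakeup that occurs
        // without the condition being true does not let "B" run early.
        first_done.wait(lock, [&] { return done; });
        log += "B";
    });
    std::thread first([&] {
        {
            std::lock_guard<std::mutex> lock(m);
            log += "A";
            done = true;
        }
        first_done.notify_one();
    });

    first.join();
    second.join();
    assert(log == "AB");
    return log;
}
```

Nothing in the code refers to clock time: the mutex and the condition variable are the primitives, and the A-before-B ordering is derived from them.
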

The empirical data presented is a reasonable start at characterising the implementation's performance. However, it currently suffers certain flaws.

Firstly, it is not clear what is being claimed. The data cannot really be said to "verify the implementation" (section 10). Presumably the claim is that the system is competitive with other systems offering reasonably high-level concurrency constructs (Java monitors, Go channels, etc.) and/or on low-level facilities (mutexes, coroutines). A claim of this form, emphasising the latter, does eventually appear in the Conclusion, but it needs to be made explicitly during the presentation of the experiments. Shifting the focus towards higher-level features may be a better target, since this appears to be C\/'s main advance over pthreads and similar libraries.

It appears some additional or alternative competitor systems might be a better match. For example, many green-thread or N:M libraries for C exist (libdill/libmill, Marcel, even GNU Pth). It would be instructive to compare with these.

It would help greatly if the "functionally identical" benchmark code that was run on the competing systems were made available somewhere. Omitting it from the main text of the paper is understandable, since it would take too much space, but its details may still have a critical bearing on the results.

In some cases it simply wasn't clear what is being compared. In Table 3, what are "FetchAdd + FetchSub"? I'm guessing this is some open-coded mutex using C++ atomics, but (unless I'm missing something) I cannot see an explanation in the text.

The reports of variance (or, rather, standard deviation) are not always plausible. Is there really no observable variation in three of Table 3's cases? At the least, I would appreciate more detail on the measures taken to reduce run-time variance (e.g. disabling CPU throttling perhaps?).

The text habitually asserts the benefits of C\/'s design without convincing argument. For example, in 2.1, do C\/'s references really reduce "syntactic noise"? I am sympathetic to the problem here, because many design trade-offs simply cannot be evaluated without very large-scale or long-term studies. However, the authors could easily refrain from extrapolating to a grand claim that cannot be substantiated. For example, instead of saying C\/ is "expressive" or "flexible" or "natural", or (say) that fork/join concurrency is "awkward and unnecessary" (p11), it would be preferable simply to give examples of the cases that are captured well in the C\/ design (ideally together with any less favourable examples that illustrate the design trade-off in question) and let them speak for themselves.

One thing I found confusing in the presentation of coroutines is that it elides the distinction between "coroutines" (i.e. their definitions) and activations thereof. It would be helpful to make this clearer, since at present this makes some claims/statements hard to understand. For example, much of 3.2 talks about "adding fields", which implies that a coroutine's activation state exists as fields in a structured object -- as, indeed, it does in C\/. This is non-obvious because in a more classical presentation of coroutines, their state would live not in "fields" but in local variables. Similarly, the text also talks about composition of "coroutines" as fields within other "coroutines", and so on, whereas if I understand correctly, these are also activations. (By later on in the text, the "C\/ style" of such constructs is clear, but not at first.)
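
The definition/activation distinction can be made concrete with a small C++ emulation. This is neither C\/ code nor a real coroutine mechanism; it is a hand-written generator, assumed here only to show activation state living in the fields of an object. The names `Fib`, `next`, `f1` and `f2` echo the Fibonacci expression from the paper that is quoted elsewhere in these reviews, but the code itself is hypothetical.

```cpp
#include <cassert>

// The "definition": a type. Each object of this type is one activation.
struct Fib {
    unsigned long prev = 0;  // state a classical coroutine would keep in
    unsigned long curr = 1;  // local variables across suspensions
};

// Resumes one activation: returns its current value and advances its state.
unsigned long next(Fib& f) {
    unsigned long result = f.prev;
    unsigned long sum = f.prev + f.curr;
    f.prev = f.curr;
    f.curr = sum;
    return result;
}

// Two activations of the same definition advance independently.
void demo_independent_activations() {
    Fib f1, f2;
    assert(next(f1) == 0);
    assert(next(f1) == 1);
    assert(next(f1) == 1);
    assert(next(f1) == 2);
    assert(next(f2) == 0);  // f2 is unaffected by the calls on f1
}
```

Here `Fib` plays the role of the coroutine definition, while `f1` and `f2` are separate activations, which is why "adding fields" to the type changes what every activation carries.
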

I was expecting a reference to Adya et al.'s 2002 Usenix ATC paper, on the topic of "fibers" and cooperative threading generally but also for its illustrative examples of stack ripping (maybe around "linearized code is the bane of device drivers", p7, which seems to be making a similar observation).

Minor comments:

The writing is rather patchy. It has many typos, and also some cases of "not meaning what is said", unclear allusions, etc. The following is a non-exhaustive list.

- p2 line 7: "C has a notion of objects" -- true, but this is not intended as "object" in anything like the same sense as "object-oriented", so raising it here is somewhere between confusing and meaningless.

- lots of extraneous hyphenation e.g. "inheritance-relationships", "critical-section", "mutual-exclusion", "shared-state" (as a general rule, only hyphenate noun phrases when making an adjective out of them)

- p4 "impossible in most type systems" -- this is not a property of the "type system" as usually understood, merely the wider language design

- p17: "release all acquired mutex types in the parameter list" should just say "release all acquired mutexes that are designated in the parameter list" (it is not "types" that are being released or acquired);

- p19: "a class includes an exhaustive list of operations" -- except it is definitively *not* exhaustive, for the reasons given immediately afterwards. I do see the problem here, about separate compilation meaning that the space of functions using a particular type is not bounded at compile time, but that needs to be identified clearly as the problem. (Incidentally, one idea is that perhaps this mapping onto a dense space could be solved at link- or load-time, in preference to run-time indirection.)

- p22: in 6.5, the significance of this design decision ("threads... are monitors") was still not clear to me.

- p22: [user threads are] "the truest realization of concurrency" sounds like unnecessary editorializing (many systems can exist that can also encode all others, without necessarily giving one supremacy... e.g. actors can be used to encode shared-state concurrency).

- p24: on line 19, the necessary feature is not "garbage collection" but precise pointer identification (which is distinct; not all GCs have it, and it has other applications besides GC)

- p24: lines 32-39 are very dense and of unclear significance; an example, including code, would be much clearer.

- p25: "current UNIX systems" seems to mean "Linux", so please say that or give the behaviour of some other modern Unix (I believe Solaris is somewhat different, and possibly the BSDs too). Also, in the explanation of signal dynamics, it would be useful to adopt the quotation's own terminology of "process-directed" signals. Presumably the "internal" thread-directed signals were generated using tgkill()? And presumably the timer expiry signal is left unblocked only on the thread (virtual processor) running the "simulation"? (Calling it a "simulation" is a bit odd, although I realise it is borrowing the concept of a discrete event queue.)