Date: Sun, 1 Jul 2018 12:31:39 +0000
From: Kevin Flores <onbehalfof@manuscriptcentral.com>
Reply-To: kflores@wiley.com
To: tdelisle@uwaterloo.ca, pabuhr@uwaterloo.ca
Subject: SPE-18-0205 successfully submitted
Feedback-ID: 1.us-west-2.y8789kd/oyDlGffYrP88IUZ2JiqaxJLDOSJ1+/ZSPI4=:AmazonSES

01-Jul-2018

Dear Dr Buhr,

Your manuscript entitled "Concurrency in C∀" has been received by Software:
Practice and Experience. It will be given full consideration for publication in
the journal.

Your manuscript number is SPE-18-0205.  Please mention this number in all
future correspondence regarding this submission.

You can view the status of your manuscript at any time by checking your Author
Center after logging into https://mc.manuscriptcentral.com/spe.  If you have
difficulty using this site, please click the 'Get Help Now' link at the top
right corner of the site.

Thank you for submitting your manuscript to Software: Practice and Experience.

Sincerely,

Software: Practice and Experience Editorial Office


Date: Wed, 3 Oct 2018 21:25:28 +0000
From: Richard Jones <onbehalfof@manuscriptcentral.com>
Reply-To: R.E.Jones@kent.ac.uk
To: tdelisle@uwaterloo.ca, pabuhr@uwaterloo.ca
Subject: Software: Practice and Experience - Decision on Manuscript ID
 SPE-18-0205

03-Oct-2018

Dear Dr Buhr,

Many thanks for submitting SPE-18-0205 entitled "Concurrency in C∀" to Software: Practice and Experience.

In view of the comments of the referees found at the bottom of this letter, I cannot accept your paper for publication in Software: Practice and Experience. I hope that you find the referees' very detailed comments helpful.

Thank you for considering Software: Practice and Experience for the publication of your research.  I hope the outcome of this specific submission will not discourage you from submitting future manuscripts.

Yours sincerely,


Prof. Richard Jones
Editor, Software: Practice and Experience
R.E.Jones@kent.ac.uk

Referee(s)' Comments to Author:

Reviewing: 1

Comments to the Author
"Concurrency in Cforall" presents a design and implementation of a set of standard concurrency features, including coroutines, user-space and kernel-space threads, mutexes, monitors, and a scheduler, for a polymorphic derivation of C called Cforall.

Section 2 is an overview of sequential Cforall that does not materially contribute to the paper. A brief syntax explanation where necessary in examples would be plenty.

Section 3 begins with an extensive discussion of concurrency that also does not materially contribute to the paper. A brief mention of whether a particular approach implements cooperative or preemptive scheduling would be sufficient. Section 3 also makes some unfortunate claims, such as C not having threads -- C does in fact define threads, and this is noted as being true in a footnote, immediately after claiming that it does not. The question remains why the C11 parallelism design is insufficient and in what way this paper proposes to augment it. While I am personally a proponent of parallel programming languages, backing the assertion that all modern languages must have threading with citations from 2005 ignores the massive popularity of modern non-parallel languages (JavaScript, Node.js, TypeScript, Python, Ruby, etc.) and parallel languages that are not thread based, although the authors are clearly aware of such approaches.

Sections 3.1 and 3.2 discuss asymmetric and symmetric coroutines. This also does not seem to materially contribute to a paper that is ostensibly about concurrency in a modern systems programming language. The area of coroutines, continuations, and generators is already well explored in the context of systems languages, including compilation techniques for these constructs that are more advanced than the stack instantiation model discussed in the paper.

Section 3.3 describes threads in Cforall, briefly touching on user-space vs. kernel-space thread implementations without detailing the extensive practical differences. It is unclear how the described interface differs from C++11 threads, as the description seems to center on an RAII style approach to joining in the destructor.
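
For readers unfamiliar with the idiom, the RAII-style join mentioned above can be sketched in C++ as follows. This is only an illustration under stated assumptions: `JoiningThread` and `count_with_two_workers` are hypothetical names that do not come from the paper or the review. Note that `std::thread` itself does not join in its destructor (destroying a joinable `std::thread` calls `std::terminate`), so C++11 code typically needs a wrapper like this one; C++20's `std::jthread` joins automatically.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <utility>

// Hypothetical wrapper (not from the paper): joins its thread on destruction.
class JoiningThread {
    std::thread t_;
public:
    template <typename F>
    explicit JoiningThread(F&& f) : t_(std::forward<F>(f)) {
        assert(t_.joinable());  // a thread started with a callable is joinable
    }
    JoiningThread(const JoiningThread&) = delete;
    JoiningThread& operator=(const JoiningThread&) = delete;
    ~JoiningThread() {
        if (t_.joinable()) t_.join();  // the implicit "join at scope exit"
    }
};

// Starts two workers and returns how many finished before the scope closed.
int count_with_two_workers() {
    std::atomic<int> done{0};
    {
        JoiningThread a([&] { done.fetch_add(1); });
        JoiningThread b([&] { done.fetch_add(1); });
    }  // destructors of b, then a, run here and block until both workers end
    return done.load();
}
```

Because the joins happen in the destructors, the count is read only after both workers have terminated.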

Section 4 briefly touches on a collection of well known synchronisation primitives. Again, this discussion does not materially contribute to the paper.

Section 5 describes monitors, which are a well known and well researched technique. The Cforall implementation is unsurprising. The "multi-acquire semantics" described are not a contribution of this paper, as establishing a stable order for lock acquisition is a well known technique, one example of which is the C++ std::scoped_lock.
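
As a point of comparison for the remark above, a minimal C++17 sketch of acquiring two locks at once with `std::scoped_lock` might look like this. `Account`, `transfer`, and `stress_opposite_transfers` are hypothetical names chosen for illustration. The C++ standard only requires that `std::scoped_lock` avoid deadlock when locking several mutexes; it does not promise a particular acquisition order, so this shows the effect rather than a specific ordering algorithm.

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical account type: a balance protected by its own mutex.
struct Account {
    std::mutex m;
    long balance = 0;
};

// Moves `amount` between two accounts while holding both locks.
void transfer(Account& from, Account& to, long amount) {
    assert(amount >= 0);
    if (&from == &to) return;  // locking the same mutex twice is undefined
    std::scoped_lock lock(from.m, to.m);  // deadlock-free multi-acquire
    from.balance -= amount;
    to.balance += amount;
}

// Two threads transfer in opposite directions; returns the combined balance.
long stress_opposite_transfers(int rounds) {
    Account a, b;
    a.balance = 1000;
    b.balance = 1000;
    std::thread t1([&] { for (int i = 0; i < rounds; ++i) transfer(a, b, 1); });
    std::thread t2([&] { for (int i = 0; i < rounds; ++i) transfer(b, a, 1); });
    t1.join();
    t2.join();
    return a.balance + b.balance;
}
```

With naive nested locking, the two threads above could each hold one mutex and wait forever for the other; with the multi-lock acquisition the run completes and the combined balance is conserved.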

Section 6 is a discussion of scheduling that does not appear to be informed by the literature. There is no discussion of work-stealing vs. work-scheduling, static vs. dynamic priorities, priority inversion, or fairness. There is a claim in section 6.1 for a novel technique, partial signalling, that appears to be a form of dynamic priority, but no comparison is made. In section 6.6, a very brief mention of other synchronisation techniques is made, without reference to current techniques such as array-based locks, CLH or MCS queue locks, RCU and other epoch-based mechanisms, etc. Perhaps these are considered out of scope.

Section 7 discusses parallelism, but does not materially contribute to the paper. It is claimed that preemption is necessary to implement spinning, which is not correct, since two cores can implement a spinning based approach without preemption. It is claimed that with thread pools "concurrency errors return", but no approach to removing concurrency errors with either preemptive or cooperatively scheduled user threads has been proposed in the paper that would not also apply to thread pools.
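
The point about spinning can be illustrated with a small C++ sketch, offered as an assumption-laden example rather than anything taken from the paper. `spin_handoff` is a hypothetical helper in which one thread busy-waits on an atomic flag that a second thread sets. When the two threads occupy distinct cores, the spinner makes progress without any preemption; the sketch uses ordinary kernel threads, which the operating system may still preempt, so it demonstrates the spin loop itself and not a preemption-free runtime.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical helper: the caller spins until a second thread publishes data.
int spin_handoff() {
    std::atomic<bool> ready{false};
    int payload = 0;  // plain data, published via release/acquire on `ready`

    std::thread producer([&] {
        payload = 42;
        ready.store(true, std::memory_order_release);
    });

    // Busy-wait: this loop never blocks and never yields the core.
    while (!ready.load(std::memory_order_acquire)) {
    }

    int seen = payload;  // safe to read after the acquire load observed `true`
    producer.join();
    assert(ready.load());
    return seen;
}
```

The release store paired with the acquire load is what makes the plain write to `payload` visible to the spinning thread.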

Section 8 is intended to describe the Cforall runtime structure, but does so in a way that uses terminology in an unfamiliar way. The word cluster is more usually used in distributed systems, but here refers to a process. The term virtual processor is more usually used in hardware virtualisation, but here refers to a kernel thread. The term debug kernel is more usually used in operating systems to refer to kernels that have both debug info and a method for using a debugger in kernel space, but here refers to a debug build of a user-space process. This section does not materially contribute to the paper.

Section 9 is intended to describe the Cforall runtime implementation. It makes some unusual claims, such as C libraries migrating to stack chaining (stack chaining was an experimental GCC feature that has been abandoned, much as it has been abandoned in both Go and Rust).

The performance measurements in section 10 are difficult to evaluate. While I appreciate that comparable concurrency benchmarks are very difficult to write, and the corpus of existing benchmarks primarily boils down to the parallel programs in the Computer Language Benchmark Game, the lack of detail as to what is being measured in these benchmarks (particularly when implemented in other languages) is unfortunate. For example, in table 3, the benchmark appears to measure uncontended lock access, which is not a useful micro-benchmark.

It is not clear what the contributions of this paper are intended to be. A concise listing of the intended contributions would be helpful. Currently, it appears that the paper makes neither PL contributions in terms of novel features in Cforall, nor does it make systems contributions in terms of novel features in the runtime.


Reviewing: 2

Comments to the Author
This article presents the design and rationale behind the concurrency
features of C-forall, a new low-level programming language.  After an
introduction that defines a selection of standard terminology, section
2 gives crucial background on the design of the C-forall language.
Section 3 then starts the core of the article, discussing the
language's support for "concurrency" which in this case means
coroutines and threads; a very brief Section 4 builds on section 3
with a discussion of lower level synchronizations.  Section 5 then
presents the main features of concurrency control in C-forall:
monitors and mutexes. Section 6 then extends monitors with condition
variables to support scheduling, and a very brief section 7
discusses preemption and pooling. Section 8 discusses the runtime
conceptual model, section 9 gives implementation detail, and section
10 briefly evaluates C-forall's performance via five concurrent
micro benchmarks. Finally section 11 concludes the article, and then
section 12 presents some future work.  


At the start of section 7, the article lays out its rationale: while
"historically, computer performance was about processor speeds",
"Now, high-performance applications must care about parallelism,
which requires concurrency". The doomsayers trumpeting the death of
Moore's law have been proved correct at last, with CPUs' sequential
performance increasing much more slowly than the number of cores
within each die. This means programmers --- especially low-level,
systems programmers --- must somehow manage the essential complexity
of writing concurrent programs to run in parallel in multiple threads
across multiple cores. Unfortunately, the most venerable widely used
systems programming language, C, supports parallelism only via a
library, e.g. the threads library.  This article aims to integrate concurrent
programming mechanisms more closely into a novel low-level C-based
programming language, C-forall. The article gives an outline of much of
C-forall, presents a series of concurrency mechanisms, and finally
some microbenchmark results.  The article is detailed, comprehensive,
and generally well written in understandable English.

My main concern about the article is indicated by the fact that the
best summary of the problem the design of concurrent C-forall sets
out to solve is buried more than halfway through the article in section
7, as above, and then the best overview of the proposed solution is
given in the 2nd, 4th and 5th sentences of the conclusion:

   "The approach provides concurrency based on a preemptive M:N
    user-level threading-system, executing in clusters, which
    encapsulate scheduling of work on multiple kernel threads
    providing parallelism... High-level objects (monitor/task) are the
    core mechanism for mutual exclusion and synchronization. A novel
    aspect is allowing multiple mutex-objects to be accessed
    simultaneously reducing the potential for deadlock for this
    complex scenario."

That is, in my reading of the article, it proceeds bottom up rather
than top down, and so my main recommendation is to essentially reverse
the order of the article, proceeding from the problem to be solved,
the high level architecture of the proposed solutions, and then going
down to the low-level mechanisms.  My biggest problem reading the
article was finding explanations of why a particular decision was taken,
or why a particular mechanism may be used --- often this description
is actually later in the article, but at that point it's too late for
the reader.  I have tried to point out most of these places in the
detailed comments below.

My second concern is that the article makes several claims that are
not really justified by the design or implementation in the article.
These include claims that this approach meets the expectations of C
programmers, is minimal, is implemented in itself, etc.  The article
doesn't generally offer evidence to support these assertions (for many
of them, that would require empirical studies of programmers, or at
least corpus studies). The solution here is to talk about motivations
for the design choices "we made these decisions hoping that C
programmers would be comfortable" rather than claims of fact "C
programmers are comfortable".  Again I attempt to point these out below.

* abstract: needs to characterize the work top down, and not make
  claims "features respect the expectations of C programmers" that
  are not supported empirically.

* p1 line 14 "integrated"

* introduction needs to introduce the big ideas and scope of the
  article, not define terms.  Some of the terms / distinctions are
  non-standard (e.g. the distinction between "concurrency" and
  "parallelism") and can be avoided by using more specific terms
  (mutual exclusion, synchronization, parallel execution, etc.).

* to me this article introduces novel language features, not just an
  API.  Similarly, it doesn't talk about any additions "to the
  language translator" - i.e. compiler changes! - rather about language
  features.


* section 2 lines 6-9 why buy this fight against object-orientation?
  this article doesn't need to make this argument, but needs to do a
  better job of it if it does (see other comments below)

* sec 2.1 - are these the same as C++? If so, say so; if not, say why
  not.

* 2.2 calling it a "with statement" was confusing, given that a with
  clause can appear in a routine declaration with a shorthand syntax.

* 2.3 again compare with C++ and Java (as well as Ada)

* line 9 "as we will see in section 3"

* 2.4 I really quite like this syntax for operators, destructors not
  so much.

* 2.5 and many places elsewhere. Always first describe the semantics
  of your language constructs, then describe their properties, then
  compare with e.g. related languages (mostly C++ & Java?).  E.g in
  this case, something like:

  "C-forall includes constructors, which are called to initialize
  newly allocated objects, and destructors, which are called when
  objects are deallocated. Constructors and destructors are written as
  functions returning void, under the special names "?{}" for
  constructors and "^{}" for destructors: constructors may be
  overridden, but destructors may not be.  The semantics of C-forall's
  constructors and destructors are essentially those of C++."

  this problem repeats many times throughout the article and should be
  fixed everywhere.


* 2.6 again, first describe then properties then comparison.
   in this case, compare e.g. with C++ templates, Java/Ada generics
   etc.

* why special case forward declarations? It's not 1970 any more.

* what are traits?  structural interfaces (like Go interfaces) or
  nominal bindings?

* section 3 - lines 2-30, also making very specific global definitions
  as in the introduction. The article does not need to take on this
  fight either, rather make clear that this is the conceptual model in
  C-forall. (If the article starts at the top and works down, that may
  well follow anyway).

* "in modern programming languages... unacceptable"; "in a
  system-level language.. concurrent programs should be written with
  high-level features" - again, no need to take on these fights.

* 3.1 onwards; I found all this "building" up hard to follow.
  also it's not clear a "minimal" API must separately support
  coroutines, threads, fibres, etc

* FIG 2B - where's the output?
  syntax "sout | next(f1) | next(f2) | endl" nowhere explained
    why not use C++'s << and >>?

* FIG 3 be clearer, earlier about the "coroutine" constructor syntax

** ensure all figures are placed *after* their first mention in the
   text. consider interleaving smaller snippets of text rather than
   just referring to large figures

* sec 3.1 p7 etc., need more context / comparison e.g. Python
  generators etc.

* FIGURE 4 is this right?  should there be a constructor for Cons taking
  a Prod?


* sec 3.2 order of constructors depends on the language.  more
  generally, if the article is going to make arguments against OO
  (e.g. section 2) then the article needs to explain, in detail, why
  e.g. coroutine, thread, etc *cannot* be classes / objects.

* "type coroutine_t must be an abstract handle.. descriptor and its
  stack are non-copyable" - too many assumptions in here (and other
  similar passages) that are not really spelled out in detail.

* p10 line 4 introduces "coroutine" keyword. needs to give its
  semantics. also needs to introduce and define properties and compare
  before all the examples using coroutines.

* p10 again, trait semantics need to be better defined 

* 3.3 should be an introduction to this section. Note that section
  titles are not part of the text of the article.

* what's the difference between "coroutines" and "user threads" (and
  "fibres?")

* what's a "task type" or an "interface routine"  or "underlying
  thread"

* section 4 - "... meaningless". nope some semantics are possible
  e.g. if there's a memory model.

* what are "call/return based languages"?

* p12 - what if a programmer wants to join e.g. "1st of N" or "1st 3 of N"
  threads rather than all threads in order

* 4.1 p12 13-25, again it's not clear where this is going.  presenting the model
  top down may hopefully resolve this

* section 4 should be merged e.g. into sec 3 (or 5)


* section 5 p13 what's "routine" scope. "call/return paradigm"

* thread/ coroutine declarations, traits etc, all look pretty close to
  inheritance. why wouldn't inheritance work?

* open/closed locks = free/acquired locks?

* testability?

* p14 lines 14-20 I had trouble following this.  e.g. what's the
  difference between "a type that is a monitor" and "a type that looks
  like a monitor"?  why?

* line 39 - what's an "object-oriented monitor"?    Java?
    there is no one OO model of such things.

* line 47 significant asset - how do you know?

* how could this e.g. build a reader/writer lock

* p15 what's the "bank account transfer problem"?

* p16 lines 6-10: why? explain?

* p17 semantics of arrays of conditions is unclear
  given e.g. previous comments about arrays of mutexes.

* p18 define "spurious wakeup"

* p18 line 44 - "a number of approaches were examined"?  which
  approaches? examined by whom?  if this is a novel contribution, needs
  rather more there, and more comparison with related work

* FIG 8 consider e.g. sequence diagrams rather than code to show these
  cases

* 6.2 p19 line 5 "similarly, monitor routines can be added at any
  time" really?  I thought C-forall was compiled? there's a big
  difference between "static" and "dynamic" inheritance. which is this
  closer to?

* line 25 "Figure 9 (B) shows the monitor implementation"
   I didn't understand this, especially not as an implementation.

* section 6.6 - if the article is to make claims about completeness,
  about supporting low and high level operations, then this must be
  expanded to give enough detail to support that argument

* "truest realization" huh?

* section 7 should be merged into 6 or 8.
  it's not clear if this is exploring rejected alternatives,
  or outlining different features offered by C-forall, or what.


* sec 7.2 how do the other threads in sections 5 & 6 relate to the
  user threads, fibres, etc here;

* sec 8.1 I found these sections hard to follow. how is a cluster a
  "collection of threads and virtual processors... like a virtual
  machine"? Where do the thread pools from 7.3 fit in?

* sec 8.3 is out of place, probably unneeded in the paper

* section 9 dives straight into details with no overview.  Section 9
  seems very detailed, and depends on assumptions or details that are
  not in the article.

* section 10 covers only microbenchmarks. are there any moderate sized
  macrobenchmarks that can compare across the different systems?
  (e.g the Erlang Ring?)

* sec 11 claims that "the entire C-forall runtime system are written
  in C-forall". The article doesn't


* future work should precede conclusion, not follow it

* the article should have a related work section (2-3 pages) comparing
  the design overall with various competing designs (C++, Java, Go,
  Rust,...)

To encourage accountability, I'm signing my reviews in 2018. For the record, I am James Noble, kjx@ecs.vuw.ac.nz. 

Reviewing: 3

Comments to the Author
This paper describes the design and implementation of coroutine- and thread-based concurrency in the C-for-all (I will write "C\/") system, a considerably extended form of the C language with many concurrency features.

It first provides an overview of the non-concurrency-related aspects of the host language (references, operator overloading, generics, etc.), then addresses several technical issues around concurrency, including the multi-monitor design, bulk acquiring of locks (including deadlock-avoiding management of acquisition order), solutions to difficult scheduling problems around these, and implementation of monitors in the presence of separate compilation. It also presents empirical data showing the execution times of several microbenchmarks in comparison with other threaded concurrency systems, in support of the claim that the implementation is competitive with them.

Overall the impression I gained is that this is a substantial system into which have gone much thought and effort.

However, the present paper is not written so as to communicate sufficiently clearly the novel practices or experiences that emerged from that effort. This manifests itself in several ways.

The system is described in general, rather than with a focus on novel insights or experiences. It was not until page 18 that I found a statement that hinted at a possible core contribution: "Supporting barging prevention as well as extending internal scheduling to multiple monitors is the main source of complexity in design and implementation of C\/ concurrency." Even then, it is unclear whether such challenges have already been surmounted in prior systems, or what other challenges the paper may also be covering. The most complete list of claims appears to be in the Conclusion (section 11; oddly not the last section), although not everything listed is a novel feature of the work (e.g. N:M threading models are an old idea). This presentation needs to be completely inverted, to focus from the outset on the claimed novel/noteworthy experiences that the work embodies.

The text describing the system's motivation is unconvincing on one point: the claim that library support for threading in C is "far from widespread" (p5, footnote A). The pthreads library API is standardised, albeit not in the C language specification but rather in POSIX -- a widespread standard indeed. (With systems languages, even if the language does not define a feature, it of course does not follow that that feature is not available -- since such languages permit extension of their own runtime and/or toolchain.) Of course, the combination of C and pthreads does not provide close to the full complement of C\/-supported features, so it is easy to make a case for C\/'s targeted "gap in the market". But again, a presentation focused on novel aspects would bring this out and enable the reader to learn from the authors' efforts much more readily.

Certain sections of the text read like a tutorial on concurrency... which is potentially valuable, but does not seem to belong here. For example, much effort is spent introducing the notions of "synchronization" and "mutual exclusion", including the whole of Section 4.2. Presently it is unclear how this content supports the findings/experiences that the paper is detailing.

Similarly, section 8 reads mostly as a basic introduction to user versus kernel threading implementations (including hybrid models such as N:M scheduling), and appears superfluous to this paper. Mixed into this are details of C\/'s specific approach. These could instead be stated directly, with references to handle the unlikely case where the reader is unfamiliar.

I also found the definitions of certain terms through the paper a bit non-standard, for unclear reasons. For example, why "condition lock" rather than the standard "condition variable" (if indeed that is what is intended)? To say that "synchronisation" is about "timing" strikes me as potentially confusing, since in truth synchronisation concerns only relative timing, i.e. ordering. (Even ordering is something of a derived concept -- since of course, most commonly, control over ordering is built atop synchronisation primitives, rather than being provided directly by them.)

The empirical data presented is a reasonable start at characterising the implementation's performance. However, it currently suffers certain flaws.

Firstly, it is not clear what is being claimed. The data cannot really be said to "verify the implementation" (section 10). Presumably the claim is that the system is competitive with other systems offering reasonably high-level concurrency constructs (Java monitors, Go channels, etc.) and/or on low-level facilities (mutexes, coroutines). A claim of this form, emphasising the latter, does eventually appear in the Conclusion, but it needs to be made explicitly during the presentation of the experiments. Shifting the focus towards higher-level features may be a better target, since this appears to be C\/'s main advance over pthreads and similar libraries.

It appears some additional or alternative competitor systems might be a better match. For example, many green-thread or N:M libraries for C exist (libdill/libmill, Marcel, even GNU Pth). It would be instructive to compare with these.

It would help greatly if the "functionally identical" benchmark code that was run on the competing systems were made available somewhere. Omitting it from the main text of the paper is understandable, since it would take too much space, but its details may still have a critical bearing on the results.

In some cases it simply wasn't clear what is being compared. In Table 3, what are "FetchAdd + FetchSub"? I'm guessing this is some open-coded mutex using C++ atomics, but (unless I'm missing something) I cannot see an explanation in the text.

The reports of variance (or, rather, standard deviation) are not always plausible. Is there really no observable variation in three of Table 3's cases? At the least, I would appreciate more detail on the measures taken to reduce run-time variance (e.g. disabling CPU throttling perhaps?).

The text habitually asserts the benefits of C\/'s design without convincing argument. For example, in 2.1, do C\/'s references really reduce "syntactic noise"? I am sympathetic to the problem here, because many design trade-offs simply cannot be evaluated without very large-scale or long-term studies. However, the authors could easily refrain from extrapolating to a grand claim that cannot be substantiated. For example, instead of saying C\/ is "expressive" or "flexible" or "natural", or (say) that fork/join concurrency is "awkward and unnecessary" (p11), it would be preferable simply to give examples of the cases that are captured well in the C\/ design (ideally together with any less favourable examples that illustrate the design trade-off in question) and let them speak for themselves.

One thing I found confusing in the presentation of coroutines is that it elides the distinction between "coroutines" (i.e. their definitions) and activations thereof. It would be helpful to make this clearer, since at present this makes some claims/statements hard to understand. For example, much of 3.2 talks about "adding fields", which implies that a coroutine's activation state exists as fields in a structured object -- as, indeed, it does in C\/. This is non-obvious because in a more classical presentation of coroutines, their state would live not in "fields" but in local variables. Similarly, the text also talks about composition of "coroutines" as fields within other "coroutines", and so on, whereas if I understand correctly, these are also activations. (By later on in the text, the "C\/ style" of such constructs is clear, but not at first.)

I was expecting a reference to Adya et al's 2002 Usenix ATC paper, on the topic of "fibers" and cooperative threading generally but also for its illustrative examples of stack ripping (maybe around "linearized code is the bane of device drivers", p7, which seems to be making a similar observation).

Minor comments:

The writing is rather patchy. It has many typos, and also some cases of "not meaning what is said", unclear allusions, etc. The following is a non-exhaustive list.

- p2 line 7: "C has a notion of objects" -- true, but this is not intended as "object" in anything like the same sense as "object-oriented", so raising it here is somewhere between confusing and meaningless.

- lots of extraneous hyphenation e.g. "inheritance-relationships", "critical-section", "mutual-exclusion", "shared-state" (as a general rule, only hyphenate noun phrases when making an adjective out of them)

- p4 "impossible in most type systems" -- this is not a property of the "type system" as usually understood, merely the wider language design

- p17: "release all acquired mutex types in the parameter list" should just say "release all acquired mutexes that are designated in the parameter list" (it is not "types" that are being released or acquired);

- p19: "a class includes an exhaustive list of operations" -- except it is definitively *not* exhaustive, for the reasons given immediately afterwards. I do see the problem here, about separate compilation meaning that the space of functions using a particular type is not bounded at compile time, but that needs to be identified clearly as the problem. (Incidentally, one idea is that perhaps this mapping onto a dense space could be solved at link- or load-time, in preference to run-time indirection.)

- p22: in 6.5, the significance of this design decision ("threads... are monitors") was still not clear to me.

- p22: [user threads are] "the truest realization of concurrency" sounds like unnecessary editorializing (many systems can exist that can also encode all others, without necessarily giving one supremacy... e.g. actors can be used to encode shared-state concurrency).

- p24: on line 19, the necessary feature is not "garbage collection" but precise pointer identification (which is distinct; not all GCs have it, and it has other applications besides GC)

- p24: lines 32-39 are very dense and of unclear significance; an example, including code, would be much clearer.

- p25: "current UNIX systems" seems to mean "Linux", so please say that or give the behaviour of some other modern Unix (I believe Solaris is somewhat different, and possibly the BSDs too). Also, in the explanation of signal dynamics, it would be useful to adopt the quotation's own terminology of "process-directed" signals. Presumably the "internal" thread-directed signals were generated using tgkill()? And presumably the timer expiry signal is left unblocked only on the thread (virtual processor) running the "simulation"? (Calling it a "simulation" is a bit odd, although I realise it is borrowing the concept of a discrete event queue.)