source: doc/papers/concurrency/response2 @ b54118a

Reviewing: 1

    I still have a couple of issues --- perhaps the largest is that it's
    still not clear at this point in the paper what some of these options
    are, or crucially how they would be used. I don't know if it's
    possible to give high-level examples or use cases to be clear about
    these up front - or if that would duplicate too much information from
    later in the paper - either way expanding out the discussion - even if
    just a couple of sentences for each row - would help me more.

Section 2.1 is changed to address this suggestion.


    * 1st para section 2 begs the question: why not support each
      dimension independently, and let the programmer or library designer
      combine features?

As seen in Table 1, not all of the combinations work, and having programmers
directly use these low-level mechanisms is error prone. Accessing these
fundamental mechanisms through higher-level constructs has always been the
purpose of a programming language.


    * Why must there "be language mechanisms to create, block/unblock, and join
      with a thread"?  There aren't in Smalltalk (although there are in the
      runtime).  Especially given in Cforall those mechanisms are *implicit* on
      thread creation and destruction?

The best description of Smalltalk concurrency I can find is in J. Hunt,
Smalltalk and Object Orientation, Springer-Verlag London Limited, 1997, Chapter
31 Concurrency in Smalltalk. It states on page 332:

  For a process to be spawned from the current process there must be some way
  of creating a new process. This is done using one of four messages to a
  block. These messages are:

    aBlock fork: This creates and schedules a process which will execute the
    block. The priority of this process is inherited from the parent process.
    ...

  The Semaphore class provides facilities for achieving simple synchronization,
  it is simple because it only allows for two forms of communication signal and
  wait.

Hence, "aBlock fork" creates, "Semaphore" blocks/unblocks (as does message send
to an aBlock object), and garbage collection of an aBlock joins with its
thread. The fact that a programmer *implicitly* does "fork", "block"/"unblock",
and "join", does not change their fundamental requirement.


   * "Case 1 is a function that borrows storage for its state (stack
     frame/activation) and a thread from its invoker"

     this much makes perfect sense to me, but I don't understand how a
     non-stateful, non-threaded function can then retain

     "this state across callees, ie, function local-variables are
     retained on the stack across calls."

     how can it retain function-local values *across calls* when it
     doesn't have any functional-local state?

In the following example:

  void foo() {
     // local variables and code
  }
  void bar() {
     // local variables
     foo();  // bar's locals are retained on the borrowed stack during this call
  }

bar is the caller and foo is the callee. bar borrows the program stack and
thread to make the call to foo. When foo, the callee, is executing, bar's local
variables (state) are retained on the *borrowed* stack across the call. (Note, I
added *borrowed* to that sentence in the paper to help clarify.)  Furthermore,
foo's local variables are also retained on the borrowed stack. When foo and bar
return, all of their local state is gone (not retained). This behaviour is
standard call/return semantics in an imperative language.


     I'm not sure if I see two separate cases here - roughly equivalent
     to C functions without static storage, and then C functions *with*
     static storage.

Yes, but there is only one instance of the static storage across all
activations of the C function. For generators and coroutines, each instance has
its own state, like an object in an OO language.
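
For example, in plain C (function names are hypothetical):

  int next_auto() {
     int i = 0;         // case 1: storage borrowed from the caller's stack
     return i += 1;     // always returns 1; state is lost on return
  }
  int next_static() {
     static int i = 0;  // case 2: one static instance shared by all calls
     return i += 1;     // returns 1, 2, 3, ... across calls
  }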


     I assumed that was the distinction between cases 1 & 3; but perhaps the
     actual distinction is that 3 has a suspend/resume point, and so the
     "state" in figure 1 is this component of execution state (viz figs 1 & 2),
     not the state representing the cross-call variables?

So case 3 is like an object with the added ability to retain where it was last
executing.  When a generator is resumed, the generator object (structure
instance) is passed as an explicit reference, and within this object is the
restart location in the generator's "main". When the generator main executes,
it uses the borrowed stack for its local variables and any functions it calls,
just like an object member borrows the stack for its local variables but also
has an implicit receiver to the object state.  A generator can have static
storage, too, which is a single instance across all generator instances of that
type, as for static storage in an object type. All the kinds of storage are
at play with semantics that are virtually the same as in other languages.
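
For instance, a minimal Fibonacci generator in the paper's notation (a sketch
only, assuming the syntax of Section 3.1):

  generator Fib {
     int fn1, fn;  // cross-resume state lives in the object, not on a stack
  };
  void main( Fib & f ) with( f ) {
     [fn1, fn] = [0, 1];
     for () {
        suspend;    // restart location is saved in f
        [fn1, fn] = [fn, fn1 + fn];
     }
  }
  int main() {
     Fib f;
     for ( 10 ) { resume( f ); sout | f.fn; }  // resume borrows the caller's stack
  }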


    > but such evaluation isn't appropriate for garbage-collected or JITTed
    > languages like Java or Go.
    For JITTed languages in particular, reporting peak performance needs to
    "warm up" the JIT with a number of iterations before beginning
    measurement. Actually for JITs it's even worse: see Edd Barrett et al
    OOPSLA 2017.

Of our testing languages, only Java and Node.js are JITted. To ensure the Java
test-programs correctly measured the specific feature, we consulted with Dave
Dice at Oracle who works directly on the development of the Oracle JVM
Just-in-Time Compiler. We modified our test programs based on his advice, and
he validated our programs as correctly measuring the specified language
feature. Hence, we have taken into account all issues related to performing
benchmarks in JITted languages.  Dave's help is recognized in the
Acknowledgment section. Also, all the benchmark programs are publicly available
for independent verification.

Similarly, we verified our Node.js programs with Gregor Richards, an expert in
just-in-time compilation for dynamic typing.


   * footnote A - I've looked at various other papers & the website to try to
     understand how "object-oriented" Cforall is - I'm still not sure.  This
     footnote says Cforall has "virtuals" - presumably virtual functions,
     i.e. dynamic dispatch - and inheritance: that really is OO as far as I
     (and most OO people) are concerned.  For example Haskell doesn't have
     inheritance, so it's not OO; while CLOS (the Common Lisp *Object* System)
     or things like Cecil and Dylan are considered OO even though they have
     "multiple function parameters as receivers", lack "lexical binding between
     a structure and set of functions", and don't have explicit receiver
     invocation syntax.  Python has receiver syntax, but unlike Java or
     Smalltalk or C++, method declarations still need to have an explicit
     "self" receiver parameter.  Seems to me that Go, for example, is
     more-or-less OO with interfaces, methods, and dynamic dispatch (yes also
     and an explicit receiver syntax but that's not determinative); while Rust
     lacks dynamic dispatch built-in.  C is not OO as a language, but as you
     say given it supports function pointers with structures, it does support
     an OO programming style.

     This is why I again recommend just not buying into this fight: not making
     any claims about whether Cforall is OO or is not - because as I see it,
     the rest of the paper doesn't depend on whether Cforall is OO or not.
     That said: this is just a recommendation, and I won't quibble over this
     any further.

We believe it is important to identify Cforall as a non-OO language because it
heavily influences the syntax and semantics used to build its concurrency.
Since many aspects of Cforall are not OO, the rest of the paper *does* depend
on Cforall being identified as non-OO, otherwise readers would have
significantly different expectations for the design. We believe your definition
of OO is too broad, such as including C. Just because a programming language
can support aspects of the OO programming style, does not make it OO. (Just
because a duck can swim does not make it a fish.)

Our definition of non-OO follows directly from the Wikipedia entry:

  Object-oriented programming (OOP) is a programming paradigm based on the
  concept of "objects", which can contain data, in the form of fields (often
  known as attributes or properties), and code, in the form of procedures
  (often known as methods). A feature of objects is an object's procedures that
  can access and often modify the data fields of the object with which they are
  associated (objects have a notion of "this" or "self").
  https://en.wikipedia.org/wiki/Object-oriented_programming

Cforall fails this definition as code cannot appear in an "object" and there is
no implicit receiver. As well, Cforall, Go, and Rust do not have nominal
inheritance and they are not considered OO languages, e.g.:

 "**Is Go an object-oriented language?** Yes and no. Although Go has types and
 methods and allows an object-oriented style of programming, there is no type
 hierarchy. The concept of "interface" in Go provides a different approach
 that we believe is easy to use and in some ways more general. There are also
 ways to embed types in other types to provide something analogous-but not
 identical-to subclassing. Moreover, methods in Go are more general than in
 C++ or Java: they can be defined for any sort of data, even built-in types
 such as plain, "unboxed" integers. They are not restricted to structs (classes)."
 https://golang.org/doc/faq#Is_Go_an_object-oriented_language


   * is a "monitor function" the same as a "mutex function"?
     if so the paper should pick one term; if not, make the distinction clear.

Fixed. Picked "mutex". Changed the language and all places in the paper.


   * "As stated on line 1 because state declarations from the generator
      type can be moved out of the coroutine type into the coroutine main"

      OK sure, but again: *why* would a programmer want to do that?
      (Other than, I guess, to show the difference between coroutines &
      generators?)  Perhaps another way to put this is that the first
      para of 3.2 gives the disadvantages of coroutines vs-a-vs
      generators, briefly describes the extended semantics, but never
      actually says why a programmer may want those extended semantics,
      or how they would benefit.  I don't mean to belabour the point,
      but (generalist?) readers like me would generally benefit from
      those kinds of discussions about each feature throughout the
      paper: why might a programmer want to use them?

On page 8, it states:

  Having to manually create the generator closure by moving local-state
  variables into the generator type is an additional programmer burden (removed
  by the coroutine in Section 3.2). ...

Also, these variables can now be refactored into a helper function, where the
helper function suspends at arbitrary call depth. So imagine a coroutine helper
function that is only called occasionally within the coroutine but it has a
large array that is retained across suspends within the helper function. For a
generator, this large array has to be declared in the generator type, enlarging
each generator instance even though the array is only used occasionally.
Whereas, the coroutine only needs the array allocated when needed. Now a
coroutine has a stack which occupies storage, but the maximum stack size only
needs to be the call chain allocating the most storage, whereas the generator
has a maximum size of all variables that could be created.
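
The following sketch illustrates the difference (hypothetical types and array
size, assuming the syntax of Section 3):

  generator G {
     int big[4096];     // must be in every G instance, even if rarely used
  };
  void main( G & g ) with( g ) {
     // ... occasionally use big ...
     suspend;
  }

  coroutine C {};
  void helper( C & c ) {
     int big[4096];     // on the coroutine stack only while helper is active
     suspend;           // big is retained across this suspend
  }
  void main( C & c ) {
     helper( c );       // suspend at arbitrary call depth
  }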


    > p17 if the multiple-monitor entry procedure really is novel, write a paper
    > about that, and only about that.
    > We do not believe this is a practical suggestion.
    * I'm honestly not trying to be snide here: I'm not an expert on monitor or
      concurrent implementations. Brinch Hansen's original monitors were single
      acquire; this draft does not cite any other previous work that I could
      see. I'm not suggesting that the brief mention of this mechanism
      necessarily be removed from this paper, but if this is novel (and a clear
      advance over a classical OO monitor a-la Java which only acquires the
      distinguished receiver) then that would be worth another paper in itself.

First, to explain multiple-monitor entry in Cforall as a separate paper would
require significant background on Cforall concurrency, which means repeating
large sections of this paper. Second, it feels like a paper just on
multiple-monitor entry would be a little thin, even if the capability is novel
to the best of our knowledge. Third, we feel multiple-monitor entry springs
naturally from the overarching tone in the paper that Cforall is a non-OO
programming language, allowing multiple-mutex receivers.
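
For reference, a multiple-mutex receiver looks like the following sketch
(assumed syntax, in the style of the paper's bank-transfer example):

  monitor Account { int balance; };
  void transfer( Account & mutex from, Account & mutex to, int amount ) {
     // both monitors are acquired together before the body runs
     from.balance -= amount;
     to.balance += amount;
  }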


    My typo: the paper's conclusion should come at the end, after the
    future work section.

Combined into a Conclusions and Future Work section.



Reviewing: 2

    on the Boehm paper and whether code is "all sequential to the compiler": I
    now understand the authors' position better and suspect we are in violent
    agreement, except for whether it's appropriate to use the rather breezy
    phrase "all sequential to the compiler". It would be straightforward to
    clarify that code not using the atomics features is optimized *as if* it
    were sequential, i.e. on the assumption of a lack of data races.

Fixed, "as inline and library code is compiled as sequential without any
explicit concurrent directive."


    on the distinction between "mutual exclusion" and "synchronization": the
    added citation does help, in that it makes a coherent case for the
    definition the authors prefer. However, the text could usefully clarify
    that this is a matter of definition not of fact, given especially that in
    my assessment the authors' preferred definition is not the most common
    one. (Although the mention of Hoare's apparent use of this definition is
    one data point, countervailing ones are found in many contemporaneous or
    later papers, e.g. Habermann's 1972 "Synchronization of Communicating
    Processes" (CACM 15(3)), Reed & Kanodia's 1979 "Synchronization with
    eventcounts and sequencers" (CACM 22(2)) and so on.)

Fixed, "We contend these two properties are independent, ...".

With respect to the two papers, Habermann fundamentally agrees with our
definitions, where the term mutual exclusion is the same but Habermann uses the
term "communication" for our synchronization. However, the term "communication"
is rarely used to mean synchronization. The fact that Habermann collectively
calls these two mechanisms synchronization is the confusion.  Reed & Kanodia
state:

  By mutual exclusion we mean any mechanism that forces the time ordering of
  execution of pieces of code, called critical regions, in a system of
  concurrent processes to be a total ordering.

But there is no timing order for a critical region (which I assume means
critical section); threads can arrive at any time in any order, where the
mutual exclusion for the critical section ensures one thread executes at a
time. Interestingly, Reed & Kanodia's critical region is Habermann's
communication not critical section. These papers only buttress our contention
about the confusion of these terms in the literature.


    section 2 (an expanded version of what was previously section 5.9) lacks
    examples and is generally obscure and allusory ("the most advanced feature"
    -- name it! "in triplets" -- there is only one triplet!;

Fixed.

 These independent properties can be used to compose different language
 features, forming a compositional hierarchy, where the combination of all
 three is the most advanced feature, called a thread/task/process. While it is
 possible for a language to only provide threads for composing
 programs~\cite{Hermes90}, this unnecessarily complicates and makes inefficient
 solutions to certain classes of problems.


    what are "execution locations"? "initialize" and "de-initialize"

Fixed through an example at the end of the sentence.

 \item[\newterm{execution state}:] is the state information needed by a
 control-flow feature to initialize, manage compute data and execution
 location(s), and de-initialize, \eg calling a function initializes a stack
 frame including contained objects with constructors, manages local data in
 blocks and return locations during calls, and de-initializes the frame by
 running any object destructors and management operations.


    what? "borrowed from the invoker" is a concept in need of explaining or at
    least a fully explained example -- in what sense does a plain function
    "borrow" its stack frame?

A function has no storage except for static storage; it gets its storage from
the program stack. For a function to run it must borrow storage from somewhere.
When the function returns, the borrowed storage is returned. Do you have a more
appropriate word?

    "computation only" as opposed to what?

This is a term in concurrency for operations that compute without blocking,
i.e., the operation starts with everything it needs to compute its result and
runs to completion, blocking only when it is done and returns its result.


    in 2.2, in what way is a "request" fundamental to "synchronization"?

I assume you are referring to the last bullet.

 Synchronization must be able to control the service order of requests
 including prioritizing selection from different kinds of outstanding requests,
 and postponing a request for an unspecified time while continuing to accept
 new requests.

Habermann states on page 173:

 Looking at deposit we see that sender and receiver should be synchronized with
 respect to buffer overflow: deposit of another message must be delayed if
 there is no empty message frame, and such delay should be removed when the
 first empty frame becomes available.  This is programmed as
 synchronization(frame): deposit is preceded by wait(frame), accept is
 followed by signal(frame), and the constant C[frame] is set equal to
 "bufsize."

Here synchronization is controlling the service order of requests: when the
buffer is full, requests for insert (sender) are postponed until a request to
remove (receiver) occurs.  Without the ability to control service order among
requests, the producer/consumer problem with a bounded buffer cannot be
solved. Hence, this capability is fundamental.
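
As a concrete illustration, a bounded buffer performing exactly this
service-order control might look as follows in Cforall (a sketch only,
assuming the monitor syntax of Section 6; bounds and types are hypothetical):

  monitor Buffer {
     condition notFull, notEmpty;
     int front, back, count;
     int elems[10];
  };
  void insert( Buffer & mutex buf, int elem ) with( buf ) {
     if ( count == 10 ) wait( notFull );  // postpone insert until a slot is free
     elems[back] = elem;  back = ( back + 1 ) % 10;  count += 1;
     signal( notEmpty );                  // service a postponed remove, if any
  }
  int remove( Buffer & mutex buf ) with( buf ) {
     if ( count == 0 ) wait( notEmpty );  // postpone remove until an element arrives
     int elem = elems[front];  front = ( front + 1 ) % 10;  count -= 1;
     signal( notFull );                   // service a postponed insert, if any
     return elem;
  }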


    and the "implicitly" versus "explicitly" point needs stating as elsewhere,
    with a concrete example e.g. Java built-in mutexes versus
    java.util.concurrent).

Fixed.

 MES must be available implicitly in language constructs, \eg Java built-in
 monitors, as well as explicitly for specialized requirements, \eg
 @java.util.concurrent@, because requiring programmers to build MES using
 low-level locks often leads to incorrect programs.
376
377    section 6: 6.2 omits the most important facts in preference for otherwise
378    inscrutable detail: "identify the kind of parameter" (first say *that there
379    are* kinds of parameter, and what "kinds" means!); "mutex parameters are
380    documentation" is misleading (they are also semantically significant!) and
381    fails to say *what* they mean; the most important thing is surely that
382    'mutex' is a language feature for performing lock/unlock operations at
383    function entry/exit. So say it!
384
385These sections have been rewritten address the comments.
386
387
388    The meanings of examples f3 and f4 remain unclear.
389
390Rewrote the paragraph.


    Meanwhile in 6.3, "urgent" is not introduced (we are supposed to infer its
    meaning from Figure 12,

Defined Hoare's urgent list at the start of the paragraph.


    but that Figure is incomprehensible to me), and we are told of "external
    scheduling"'s long history in Ada but not clearly what it actually means;

We do not know how to address your comment because the description is clear to
us and other non-reviewers who have read the paper. I had forgotten an
important citation to prior work on this topic, which is now referenced:

   Buhr Peter A., Fortier Michel, Coffin Michael H.. Monitor
   Classification. ACM Computing Surveys. 1995; 27(1):63-107.

This citation is a 45 page paper expanding on the topic of internal scheduling.
I also added a citation to Hoare's monitor paper where signal_block is defined:

  When a process signals a condition on which another process is waiting, the
  signalling process must wait until the resumed process permits it to
  proceed.

External scheduling from Ada was subsequently added to uC++ and Cforall.
Furthermore, Figure 20 shows a direct comparison of CFA procedure-call
external-scheduling with Go channel external-scheduling. So external scheduling
exists in languages beyond Ada.
421
422    6.4's description of "waitfor" tells us it is different from an if-else
423    chain but tries to use two *different* inputs to tell us that the behavior
424    is different; tell us an instance where *the same* values of C1 and C2 give
425    different behavior (I even wrote out a truth table and still don't see the
426    semantic difference)
427
428Again, it is unclear what is the problem. For the if-statement, if C1 is true,
429it only waits for a call to mem1, even if C2 is true and there is an
430outstanding call to mem2. For the waitfor, if both C1 and C2 are true and there
431is a call to mem2 but not mem1, it accepts the call to mem2, which cannot
432happen with the if-statement. So for all true when clauses, any outstanding
433call is immediately accepted. If there are no outstanding calls, the waitfor
434blocks until the next call to any of the true when clauses occurs. I added the
435following sentence to further clarify.
436
437 Hence, the @waitfor@ has parallel semantics, accepting any true @when@ clause.
438
439Note, the same parallel semantics exists for the Go select statement with
440respect to waiting for a set of channels to receive data. While Go select does
441not have a when clause, it would be trivial to add it, making the select more
442expressive.
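
In code form, the two cases being contrasted are (a sketch, assuming the
extended waitfor syntax of Section 6.4 and hypothetical mutex functions mem1
and mem2 of monitor m):

  if ( C1 ) waitfor( mem1 : m );       // serial: C1 true => wait only for mem1,
  else if ( C2 ) waitfor( mem2 : m );  //   even if a call to mem2 is pending

  when ( C1 ) waitfor( mem1 : m );     // parallel: accept whichever true clause
  or when ( C2 ) waitfor( mem2 : m );  //   has a pending or next call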


    The authors frequently use bracketed phrases, and sometimes slashes "/", in
    ways that are confusing and/or detrimental to readability.  Page 13 line
    2's "forward (backward)" is one particularly egregious example.  In general
    I would recommend that the authors try to limit their use of parentheses and
    slashes as a means of forcing a clearer wording to emerge.

Many of the slashes and parentheticals have been removed. Some are retained to
express joined concepts: call/return, suspend/resume, resume/resume, I/O.


    Also, the use of "eg." is often cursory and does not explain the examples
    given, which are frequently a one- or two-word phrase of unclear referent.

A few of these are fixed.


    Considering the revision more broadly, none of the more extensive or
    creative rewrites I suggested in my previous review have been attempted,
    nor any equivalent efforts to improve its readability.

If you reread the previous response, we addressed all of your suggestions except:

        An expositional idea occurs: start the paper with a strawman
        naive/limited realisation of coroutines -- say, Simon Tatham's popular
        "Coroutines in C" web page -- and identify point by point what the
        limitations are and how C\/ overcomes them. Currently the presentation
        is often flat (lacking motivating contrasts) and backwards (stating
        solutions before problems). The foregoing approach might fix both of
        these.

    We prefer the current structure of our paper and believe the paper does
    explain basic coding limitations and how they are overcome in using
    high-level control-flow mechanisms.

and we have addressed readability issues in this version.


    The hoisting of the former section 5.9 is a good idea, but the newly added
    material accompanying it (around Table 1) suffers fresh deficiencies in
    clarity. Overall the paper is longer than before, even though (as my
    previous review stated), I believe a shorter paper is required in order to
    serve the likely purpose of publication. (Indeed, the authors' letter
    implies that a key goal of publication is to build community and gain
    external users.)

This comment is the referee's opinion, with which we do not agree. Our
previous 35 page SP&E paper on Cforall:

  Moss Aaron, Schluntz Robert, Buhr Peter A.. Cforall: Adding Modern
  Programming Language Features to C. Softw. Pract. Exper. 2018;
  48(12):2111-2146.

has a similar structure and style to this paper, and it received an award from
John Wiley & Sons:

 Software: Practice & Experience for articles published between January 2017
 and December 2018, "most downloads in the 12 months following online
 publication showing the work generated immediate impact and visibility,
 contributing significantly to advancement in the field".

So we have demonstrated an ability to build community and gain external users.


    Given this trajectory, I no longer see a path to an acceptable revision of
    the present submission. Instead I suggest the authors consider splitting
    the paper in two: one half about coroutines and stack management, the other
    about mutexes, monitors and the runtime. (A briefer presentation of the
    runtime may be helpful in the first paper also, and a brief recap of the
    generator and coroutine support is obviously needed in the second too.)

Again we disagree with the referee's suggestion to vastly restructure the
paper. What advantage is there in presenting exactly the same material across
two papers, which will likely end up longer than a single paper? The current
paper has a clear theme that fundamental execution properties generate a set of
basic language mechanisms, and we then proceed to show how these mechanisms can
be designed into the programming language Cforall.


    I do not buy the authors' defense of the limited practical experience or
    "non-micro" benchmarking presented. Yes, gaining external users is hard and
    I am sympathetic on that point. But building something at least *somewhat*
    substantial with your own system should be within reach, and without it the
    "practice and experience" aspects of the work have not been explored.
    Clearly C\/ is the product of a lot of work over an extended period, so it
    is a surprise that no such experience is readily available for inclusion.

Understood. There are no agreed-upon concurrency benchmarks, which is why
micro-benchmarks are often used. Currently, the entire Cforall runtime is
written in Cforall (10,000+ lines of code (LOC)). This runtime is designed to
be thread safe, automatically detects the use of concurrency features at link
time, and bootstraps into a threaded runtime almost immediately at program
startup so threads can be declared as global variables and may run to
completion before the program main starts. The concurrent core of the runtime
is 3,500 LOC and bootstraps from low-level atomic primitives into Cforall locks
and high-level features. In other words, the concurrent core uses itself as
quickly as possible to start using high-level concurrency. There are 12,000+
LOC in the Cforall test-suite used to verify language features, which are run
nightly. Of these, there are 2000+ LOC running standard concurrent tests, such
as aggressively testing each language feature, and classical examples such as
bounded buffer, dating service, matrix summation, quicksort, binary insertion
sort, etc.  More experience will be available soon, based on ongoing work in
the "future works" section. Specifically, non-blocking I/O is working with the
new Linux io_uring interface and a new high-performance ready-queue is under
construction to take into account this change. With non-blocking I/O, it will
be possible to write applications like high-performance web servers, as is now
done in Rust and Go. Also completed is a Java-style executor for work-based
concurrent programming and futures. Under construction is a high-performance
actor system.


    It does not seem right to state that a stack is essential to Von Neumann
    architectures -- since the earliest Von Neumann machines (and indeed early
    Fortran) did not use one.

Reference Manual Fortran II for the IBM 704 Data Processing System, 1958 IBM, page 2
https://archive.computerhistory.org/resources/text/Fortran/102653989.05.01.acc.pdf

  Since a subprogram may call for other subprograms to any desired depth, a
  particular CALL statement may be defined by a pyramid of multi-level
  subprograms.

I think we may be differing on the meaning of stack. You may be imagining a
modern stack that grows and shrinks dynamically. Whereas early Fortran
preallocated a stack frame for each function, like Python allocates a frame for
a generator.  Within each preallocated Fortran function is a frame for local
variables and a pointer to store the return value for a call.  The Fortran
call/return mechanism uses these frames to build a traditional call stack
linked by the return pointer. The only restriction is that a function stack
frame can only be used once, implying no direct or indirect recursion.  Hence,
without a stack mechanism, there can be no call/return to "any desired depth",
where the maximum desired depth is limited by the number of functions. So
call/return requires some form of a stack, virtually all programming languages
have call/return, past and present, and these languages run on Von Neumann
machines that do not distinguish between program and memory space, have mutable
state, and the concept of a pointer to data or code.


    To elaborate on something another reviewer commented on: it is a surprise
    to find a "Future work" section *after* the "Conclusion" section. A
    "Conclusions and future work" section often works well.

Done.



Reviewing: 3

    but it remains really difficult to have a good sense of which idea I should
    use and when. This applies in different ways to different features from the
    language:

    * coroutines/generators/threads: here there is some discussion, but it can
      be improved.
    * internal/external scheduling: I didn't find any direct comparison between
      these features, except by way of example.

See changes below.


    I would have preferred something more like a table or a few paragraphs
    highlighting the key reasons one would pick one construct or the other.

Section 2.1 is changed to address this suggestion.


    The discussion of clusters and pre-emption in particular feels quite rushed.

We believe a brief introduction to the Cforall runtime structure is important
because clustering within a user-level versus distributed system is unusual.
Furthermore, the explanation of preemption is important because several new
languages, like Go and Rust tokio, are not preemptive. Rust threads are
preemptive only because it uses kernel threads, which UNIX preempts.


    * Recommend to shorten the comparison on coroutine/generator/threads in
      Section 2 to a paragraph with a few examples, or possibly a table
      explaining the trade-offs between the constructs

Not done, see below.


    * Recommend to clarify the relationship between internal/external
      scheduling -- is one more general but more error-prone or low-level?

Done, see below.


    There is obviously a lot of overlap between these features, and in
    particular between coroutines and generators. As noted in the previous
    review, many languages have chosen to offer *only* generators, and to
    build coroutines by stacks of generators invoking one another.

Coroutines built from stacks of generators have problems, such as no symmetric
control-flow. Furthermore, stacks of generators have a problem with the
following programming pattern, shown in the example below: logger is a library
function called from a function or a coroutine, where the doit for the
coroutine suspends. With stacks of generators, there has to be a function and a
generator version of logger to support these two scenarios. If logger is a
library function, it may be impossible to create the generator logger because
the logger function is opaque.

  #include <fstream.hfa>
  #include <coroutine.hfa>

  forall( otype T | { void doit( T ); } )
  void logger( T & t ) {  // library function: calls the type's doit
      doit( t );
  }

  coroutine C {};
  void main( C & c ) with( c ) {
      void doit( C & c ) { suspend; }  // coroutine version of doit suspends
      logger( c );
  }
  void mem( C & c ) {
      resume( c );
  }

  int main() {
      C c;
      mem( c );
      mem( c );

      struct S {};
      S s;
      void doit( S & s ) {}  // function version of doit does not suspend
      logger( s );
  }

Additional text has been added to the start of Section 3.2 to address this comment.


    In fact, the end of Section 2.1 (on page 5) contains a particular paragraph
    that embodies this "top down" approach. It starts, "programmers can now
    answer three basic questions", and thus gives some practical advice for
    which construct you should use and when. I think giving some examples of
    specific applications in this paragraph, combined with some examples of
    cases where each construct was needed, would be a better approach.

    I don't think this comparison needs to be very long. It seems clear enough
    that one would

    * prefer generators for simple computations that yield up many values,
    * prefer coroutines for more complex processes that have significant
      internal structure,

We do not believe the number of values yielded is an important factor in
choosing between a generator and a coroutine; either form can receive (input)
or return (output) millions of values. As well, simple versus complex
computations is also not an absolute criterion for selection as both forms can
be simple or complex.

As stated in the paper, a coroutine brings generality to the implementation
because of the additional stack, whereas generators have restrictions on
standard software-engineering practices: variable placement, no helper
functions without creating an explicit generator stack, and no symmetric
control-flow.  It is true that simple generators probably suffer less from
restrictions. Also, simplicity means the amount of work done is small so the
cost of the suspend/resume becomes a factor, where experiments show
suspend/resume for a generator is 10 times faster than for a coroutine.


    * prefer threads for cases where parallel execution is desired or needed.

Agreed, but this description does not mention mutual exclusion and
synchronization, which are essential in any meaningful concurrent program.

Our point here is to illustrate that a casual "top down" explanation is
insufficient to explain the complexity of the underlying execution properties.
We presented some rules-of-thumb at the end of Section 2 but programmers must
understand all the underlying mechanisms and their interactions to exploit the
execution properties to their fullest, and to understand when a programming
language does or does not provide a desired mechanism.


    I did appreciate the comparison in Section 2.3 between async-await in
    JS/Java and generators/coroutines. I agree with its premise that those
    mechanisms are a poor replacement for generators (and, indeed, JS has a
    distinct generator mechanism, for example, in part for this reason).  I
    believe I may have asked for this in a previous review, but having read it,
    I wonder if it is really necessary, since those mechanisms are so different
    in purpose.

Given that you asked about this before, I believe other readers might also ask
the same question because async-await is very popular. So I think this section
does help to position the work in the paper among other work, and hence, it is
appropriate to keep it in the paper.


    I find the motivation for supporting both internal and external scheduling
    to be fairly implicit. After several reads through the section, I came to
    the conclusion that internal scheduling is more expressive than external
    scheduling, but sometimes less convenient or clear. Is this correct? If
    not, it'd be useful to clarify where external scheduling is more
    expressive.

    I would find it very interesting to try and capture some of the properties
    that make internal vs external scheduling the better choice.

    For example, it seems to me that external scheduling works well if there
    are only a few "key" operations, but that internal scheduling might be
    better otherwise, simply because it would be useful to have the ability to
    name a signal that can be referenced by many methods.

To address this point, the last paragraph on page 22 (now page 23) has been
augmented to the following:

 Given external and internal scheduling, what guidelines can a programmer use
 to select between them?  In general, external scheduling is easier to
 understand and code because only the next logical action (mutex function(s))
 is stated, and the monitor implicitly handles all the details.  Therefore,
 there are no condition variables, and hence, no wait and signal, which reduces
 coding complexity and synchronization errors.  If external scheduling is
 simpler than internal, why not use it all the time?  Unfortunately, external
 scheduling cannot be used if: scheduling depends on parameter value(s) or
 scheduling must block across an unknown series of calls on a condition
 variable (\ie internal scheduling).  For example, the dating service cannot be
 written using external scheduling.  First, scheduling requires knowledge of
 calling parameters to make matching decisions and parameters of calling
 threads are unavailable within the monitor.  Specifically, a thread within the
 monitor cannot examine the @ccode@ of threads waiting on the calling queue to
 determine if there is a matching partner.  (Similarly, if the bounded buffer
 or readers/writer are restructured with a single interface function with a
 parameter denoting producer/consumer or reader/writer, they cannot be solved
 with external scheduling.)  Second, a scheduling decision may be delayed
 across an unknown number of calls when there is no immediate match so the
 thread in the monitor must block on a condition.  Specifically, if a thread
 determines there is no opposite calling thread with the same @ccode@, it must
 wait an unknown period until a matching thread arrives.  For complex
 synchronization, both external and internal scheduling can be used to take
 advantage of the best properties of each.


    Consider the bounded buffer from Figure 13: if it had multiple methods for
    removing elements, and not just `remove`, then the `waitfor(remove)` call
    in `insert` might not be sufficient.

Section 6.4 Extended waitfor shows the waitfor is very powerful and can handle
your request:

  waitfor( remove : buffer ); or waitfor( remove2 : buffer );

and its shorthand form (not shown in the paper)

  waitfor( remove, remove2 : buffer );

A call to one of these remove functions satisfies the waitfor (exact selection
details are discussed in Section 6.4).


    The same is true, I think, of the `signal_block` function, which I
    have not encountered before;

In Tony Hoare's seminal paper on monitors, "Monitors: An Operating System
Structuring Concept", it states on page 551:

 When a process signals a condition on which another process is waiting, the
 signalling process must wait until the resumed process permits it to
 proceed. We therefore introduce for each monitor a second semaphore "urgent"
 (initialized to 0), on which signalling processes suspend themselves by the
 operation P(urgent).

Hence, the original definition of signal is in fact signal_block, i.e., the
signaller blocks. Later implementations of monitor switched to signaller
nonblocking because most signals occur before returns, which allows the
signaller to continue execution, exit the monitor, and run concurrently with
the signalled thread that restarts in the monitor. When the signaller is not
going to exit immediately, signal_block is appropriate.
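
A minimal sketch of the two signalling forms (assumed syntax, with a
hypothetical monitor M):

  monitor M { condition c; };
  void waiter( M & mutex m ) with( m ) {
      wait( c );          // block until signalled
  }
  void signaller1( M & mutex m ) with( m ) {
      signal( c );        // nonblocking: waiter restarts after signaller exits
  }
  void signaller2( M & mutex m ) with( m ) {
      signal_block( c );  // blocking: waiter restarts now; signaller waits its turn
  }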


    it seems like its behavior can be modeled with multiple condition
    variables, but that's clearly more complex.

Yes. Buhr, Fortier and Coffin show in Monitor Classification, ACM Computing
Surveys, 27(1):63-107, that all extant monitors with different signalling
semantics can be transformed into each other. However, some transformations are
complex and runtime expensive.


    One question I had about `signal_block`: what happens if one signals
    but no other thread is waiting? Does it block until some other thread
    waits? Or is that user error?

On page 20, it states:

  Signalling is unconditional because signalling an empty condition queue does
  nothing.

To the best of our knowledge, all monitors have the same semantics for
signalling an empty condition queue, regardless of the kind of signal, i.e.,
signal or signal_block.

    I believe that one difference between the Go program and the Cforall
    equivalent is that the Goroutine has an associated queue, so that
    multiple messages could be enqueued, whereas the Cforall equivalent is
    effectively a "bounded buffer" of length 1. Is that correct?

Actually, the buffer length is 0 for the Cforall call and the Go unbuffered
send so both are synchronous communication.

    I think this should be stated explicitly. (Presumably, one could modify the
    Cforall program to include an explicit vector of queued messages if
    desired, but you would also be reimplementing the channel abstraction.)

Fixed, by adding the following sentences:

  The difference between call and channel send occurs for buffered channels
  making the send asynchronous.  In \CFA, asynchronous call and multiple
  buffers are provided using an administrator and worker
  threads~\cite{Gentleman81} and/or futures (not discussed).


    Also, in Figure 20, I believe that there is a missing `mutex` keyword.

Fixed.


    I was glad to see that the paper acknowledged that Cforall still had
    low-level atomic operations, even if their use is discouraged in favor of
    higher-level alternatives.

There was never an attempt to not acknowledge that Cforall has low-level
atomic operations. The original version of the paper stated:

  6.6 Low-level Locks
  For completeness and efficiency, Cforall provides a standard set of low-level
  locks: recursive mutex, condition, semaphore, barrier, etc., and atomic
  instructions: fetchAssign, fetchAdd, testSet, compareSet, etc.

and that section is still in the paper. In fact, we use these low-level
mechanisms to build all of the high-level concurrency constructs in Cforall.


    However, I still feel that the conclusion overstates the value of the
    contribution here when it says that "Cforall high-level race-free monitors
    and threads provide the core mechanisms for mutual exclusion and
    synchronization, without the need for volatile and atomics". I feel
    confident that Java programmers, for example, would be advised to stick
    with synchronized methods whenever possible, and it seems to me that they
    offer similar advantages -- but they sometimes wind up using volatiles for
    performance reasons.

I think we are agreeing violently. 99.9% of Java/Cforall/Go/Rust concurrent
programs can achieve very good performance without volatile or atomics because
the runtime system has already used these low-level capabilities to build an
efficient set of high-level concurrency constructs.

0.1% of the time programmers need to build their own locks and synchronization
mechanisms. This need also occurs for storage management. Both of these
mechanisms are allowed in Cforall but are fraught with danger and should be
discouraged. Specifically, it takes a 7th dan Black Belt programmer to
understand fencing for a WSO memory model, such as on the ARM. It doesn't help
that the C++ atomics are baroque and incomprehensible. I'm sure Hans Boehm,
Doug Lea, Dave Dice and I would agree that 99% of hand-crafted locks created by
programmers are broken and/or non-portable.


    I was also confused by the term "race-free" in that sentence. In
    particular, I don't think that Cforall has any mechanisms for preventing
    *data races*, and it clearly doesn't prevent "race conditions" (which would
    bar all sorts of useful programs). I suppose that "race free" here might be
    referring to the improvements such as removing barging behavior.

We use the term "race free" to mean the same as Boehm/Adve's "data-race
freedom" in

  Boehm Hans-J., Adve Sarita V., You Don't Know Jack About Shared Variables or
  Memory Models. Communications ACM. 2012; 55(2):48-54.
  https://queue.acm.org/detail.cfm?id=2088916

which is cited in the paper. Furthermore, we never said that Cforall has
mechanisms for preventing *all* data races. We said Cforall high-level
race-free monitors and threads[, when used with mutex access functions] (added
to the paper), imply no data races within these constructs, unless a
programmer directly publishes shared state. This approach is exactly what
Boehm/Adve advocate for the vast majority of concurrent programming.


    It would perhaps be more interesting to see a comparison built using [tokio] or
    [async-std], two of the more prominent user-space threading libraries that
    build on Rust's async-await feature (which operates quite differently than
    Javascript's async-await, in that it doesn't cause every async function call to
    schedule a distinct task).

Done.


    Several figures used the `with` keyword. I deduced that `with(foo)` permits
    one to write `bar` instead of `foo.bar`. It seems worth introducing.
    Apologies if this is stated in the paper, if so I missed it.

Page 6, footnote F states:

  The Cforall "with" clause opens an aggregate scope making its fields directly
  accessible, like Pascal "with", but using parallel semantics; multiple
  aggregates may be opened.
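
For example (a hypothetical fragment):

  struct S { int i, j; };
  void f( S & s ) {
      with ( s ) {
          i = 1; j = 2;  // equivalent to s.i = 1; s.j = 2;
      }
  }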


    On page 20, section 6.3, "external scheduling and vice versus" should be
    "external scheduling and vice versa".

Fixed.


    On page 5, section 2.3, the paper states "we content" but it should be "we
    contend".

Fixed.


    Page 1. I don't believe that it is fair to imply that Scala is a "research
    vehicle" as it is used by major players, Twitter being the most prominent
    example.

Fixed. Removed "research".


    Page 15. Must Cforall threads start after construction (e.g. see your
    example on page 15, line 21)?

Yes. Our experience in Java, uC++ and Cforall is that 95% of the time
programmers want threads to start immediately. (Most Java programs have no code
between a thread declaration and the call to start the thread.)  Therefore,
this semantic should be the default because (see page 13):

  Alternatives, such as explicitly starting threads as in Java, are repetitive
  and forgetting to call start is a common source of errors.

To handle the other 5% of the cases, there is a trivial Cforall pattern
providing Java-style start/join. The additional cost for this pattern is small
in comparison to the work performed between the start and join.

  thread T {};
  void start( T & mutex ) {} // any function name
  void join( T & mutex ) {} // any function name
  void main( T & t ) {
      sout | "start";
      waitfor( start : t ); // wait to be started
      sout | "restart"; // perform work
      waitfor( join : t ); // wait to be joined
      sout | "join";
  }
  int main() {
      T t[3]; // threads start and delay
      sout | "need to start";
      for ( i; 3 ) start( t[i] );
      sout | "need to join";
      for ( i; 3 ) join( t[i] );
      sout | "threads stopped";
  } // threads deleted from stack

  $ a.out
  need to start
  start
  start
  start
  need to join
  restart
  restart
  restart
  threads stopped
  join
  join
  join


    Page 18, line 17: is using

Fixed.