Reviewing: 1

I still have a couple of issues --- perhaps the largest is that it's
still not clear at this point in the paper what some of these options
are, or crucially how they would be used. I don't know if it's
possible to give high-level examples or use cases to be clear about
these up front - or if that would duplicate too much information from
later in the paper - either way expanding out the discussion - even if
just a couple of sentences for each row - would help me more.

Section 2.1 is changed to address this suggestion.


* 1st para section 2 begs the question: why not support each
dimension independently, and let the programmer or library designer
combine features?

As seen in Table 1, not all of the combinations work, and having programmers
directly use these low-level mechanisms is error prone. Accessing these
fundamental mechanisms through higher-level constructs has always been the
purpose of a programming language.


* Why must there "be language mechanisms to create, block/unblock, and join
with a thread"?  There aren't in Smalltalk (although there are in the
runtime).  Especially given that in Cforall those mechanisms are *implicit* on
thread creation and destruction?

Fixed, changed sentence to:

A programmer needs mechanisms to create, block and unblock, and join with a
thread, even if these basic mechanisms are supplied indirectly through
high-level features.


* "Case 1 is a function that borrows storage for its state (stack
frame/activation) and a thread from its invoker"

this much makes perfect sense to me, but I don't understand how a
non-stateful, non-threaded function can then retain

"this state across callees, ie, function local-variables are
retained on the stack across calls."

how can it retain function-local values *across calls* when it
doesn't have any function-local state?

In the following example:
void foo() {
    // local variables and code
}
void bar() {
    // local variables
    foo();
}

bar is the caller and foo is the callee. bar borrows the program stack and
thread to make the call to foo. When foo, the callee, is executing, bar's local
variables (state) are retained on the *borrowed* stack across the call. (Note,
I added *borrowed* to that sentence in the paper to help clarify.)  Furthermore,
foo's local variables are also retained on the borrowed stack. When foo and bar
return, all of their local state is gone (not retained). This behaviour is
standard call/return semantics in an imperative language.


I'm not sure if I see two separate cases here - roughly equivalent
to C functions without static storage, and then C functions *with*
static storage.

Yes, but there is only one instance of the static storage across all
activations of the C function. For generators and coroutines, each instance has
its own state, like an object in an OO language.


I assumed that was the distinction between cases 1 & 3; but perhaps the
actual distinction is that 3 has a suspend/resume point, and so the
"state" in figure 1 is this component of execution state (viz figs 1 & 2),
not the state representing the cross-call variables?

So case 3 is like an object with the added ability to retain where it was last
executing.  When a generator is resumed, the generator object (structure
instance) is passed as an explicit reference, and within this object is the
restart location in the generator's "main". When the generator main executes,
it uses the borrowed stack for its local variables and any functions it calls,
just like an object's member function borrows the stack for its local variables
but also has an implicit receiver to the object state.  A generator can have
static storage, too, which is a single instance across all generator instances
of that type, as for static storage in an object type. All the kinds of storage
are at play with the same semantics as in other languages.


> but such evaluation isn't appropriate for garbage-collected or JITted
> languages like Java or Go.
For JITted languages in particular, reporting peak performance needs to
"warm up" the JIT with a number of iterations before beginning
measurement. Actually for JITs it's even worse: see Edd Barrett et al
OOPSLA 2017.

Of our testing languages, only Java and Node.js are JITted. To ensure the Java
test-programs correctly measured the specific feature, we consulted with Dave
Dice at Oracle who works directly on the development of the Oracle JVM
Just-in-Time Compiler. We modified our test programs based on his advice, and
he validated our programs as correctly measuring the specified language
feature. Dave's help is recognized in the Acknowledgment section.  Similarly,
we verified our Node.js programs with Gregor Richards, an expert in
just-in-time compilation for dynamic typing.  Hence, we have taken into account
all issues related to performing benchmarks in JITted languages.  Also, all the
benchmark programs are publicly available for independent verification.


* footnote A - I've looked at various other papers & the website to try to
understand how "object-oriented" Cforall is - I'm still not sure.  This
footnote says Cforall has "virtuals" - presumably virtual functions,
i.e. dynamic dispatch - and inheritance: that really is OO as far as I
(and most OO people) are concerned.  For example Haskell doesn't have
inheritance, so it's not OO; while CLOS (the Common Lisp *Object* System)
or things like Cecil and Dylan are considered OO even though they have
"multiple function parameters as receivers", lack "lexical binding between
a structure and set of functions", and don't have explicit receiver
invocation syntax.  Python has receiver syntax, but unlike Java or
Smalltalk or C++, method declarations still need to have an explicit
"self" receiver parameter.  Seems to me that Go, for example, is
more-or-less OO with interfaces, methods, and dynamic dispatch (yes, also
an explicit receiver syntax, but that's not determinative); while Rust
lacks built-in dynamic dispatch.  C is not OO as a language, but as you
say given it supports function pointers with structures, it does support
an OO programming style.

This is why I again recommend just not buying into this fight: not making
any claims about whether Cforall is OO or is not - because as I see it,
the rest of the paper doesn't depend on whether Cforall is OO or not.
That said: this is just a recommendation, and I won't quibble over this
any further.

We believe it is important to identify Cforall as a non-OO language because it
heavily influences the syntax and semantics used to build its concurrency.
Since many aspects of Cforall are not OO, the rest of the paper *does* depend
on Cforall being identified as non-OO, otherwise readers would have
significantly different expectations for the design. We did not mean to suggest
that languages that support function pointers with structures support an OO
style. We revised the footnote to avoid this interpretation. Finally, Golang
does not identify itself as an OO language.
https://golang.org/doc/faq#Is_Go_an_object-oriented_language


* is a "monitor function" the same as a "mutex function"?
if so the paper should pick one term; if not, make the distinction clear.

Fixed. Picked "mutex". Changed the language and all places in the paper.


* "As stated on line 1 because state declarations from the generator
type can be moved out of the coroutine type into the coroutine main"

OK sure, but again: *why* would a programmer want to do that?
(Other than, I guess, to show the difference between coroutines &
generators?)  Perhaps another way to put this is that the first
para of 3.2 gives the disadvantages of coroutines vis-a-vis
generators, briefly describes the extended semantics, but never
actually says why a programmer may want those extended semantics,
or how they would benefit.  I don't mean to belabour the point,
but (generalist?) readers like me would generally benefit from
those kinds of discussions about each feature throughout the
paper: why might a programmer want to use them?

On page 8, it states:

Having to manually create the generator closure by moving local-state
variables into the generator type is an additional programmer burden (removed
by the coroutine in Section 3.2). ...

Also, these variables can now be refactored into a helper function, where the
helper function suspends at arbitrary call depth. So imagine a coroutine helper
function that is only called occasionally within the coroutine but has a
large array that is retained across suspends within the helper function. For a
generator, this large array has to be declared in the generator type, enlarging
each generator instance even though the array is only used occasionally.
Whereas, the coroutine only needs the array allocated when needed. Now a
coroutine has a stack which occupies storage, but the maximum stack size only
needs to be the call chain allocating the most storage, whereas the generator
has a maximum size of all variables that could be created.
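
To illustrate, a minimal sketch of the storage difference, assuming the
generator/coroutine declaration style used elsewhere in the paper (all names
are hypothetical):

generator G {
    int arr[10000];  // must live in the type: enlarges every G instance,
                     // even though the array is used only occasionally
};
void main( G & g ) with( g ) {
    // ... occasionally use arr ...
    suspend;         // restart location is stored in g
}

coroutine C {};
void helper( C & c ) {
    int arr[10000];  // lives on c's stack only while helper is active
    // ... use arr ...
    suspend;         // suspend at arbitrary call depth, arr is retained
}
void main( C & c ) {
    // ... occasionally ...
    helper( c );
}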


> p17 if the multiple-monitor entry procedure really is novel, write a paper
> about that, and only about that.
> We do not believe this is a practical suggestion.
* I'm honestly not trying to be snide here: I'm not an expert on monitor or
concurrent implementations. Brinch Hansen's original monitors were single
acquire; this draft does not cite any other previous work that I could
see. I'm not suggesting that the brief mention of this mechanism
necessarily be removed from this paper, but if this is novel (and a clear
advance over a classical OO monitor a-la Java which only acquires the
distinguished receiver) then that would be worth another paper in itself.

First, to explain multiple-monitor entry in Cforall as a separate paper would
require significant background on Cforall concurrency, which means repeating
large sections of this paper. Second, it feels like a paper just on
multiple-monitor entry would be a little thin, even if the capability is novel
to the best of our knowledge. Third, we feel multiple-monitor entry springs
naturally from the overarching tone in the paper that Cforall is a non-OO
programming language, allowing multiple-mutex receivers.
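
For reference, a minimal sketch of multiple-mutex receivers, along the lines
of the paper's bank-transfer example (names are illustrative):

monitor BankAccount { int balance; };
void deposit( BankAccount & mutex b, int amount ) with( b ) {
    balance += amount;  // b's monitor lock is held for the function body
}
void transfer( BankAccount & mutex from, BankAccount & mutex to, int amount ) {
    // both monitor locks are acquired atomically at entry, in a
    // deadlock-free order chosen by the runtime, and released at exit
    deposit( from, -amount );
    deposit( to, amount );
}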


My typo: the paper's conclusion should come at the end, after the
future work section.

Combined into a Conclusions and Future Work section.



Reviewing: 2

on the Boehm paper and whether code is "all sequential to the compiler": I
now understand the authors' position better and suspect we are in violent
agreement, except for whether it's appropriate to use the rather breezy
phrase "all sequential to the compiler". It would be straightforward to
clarify that code not using the atomics features is optimized *as if* it
were sequential, i.e. on the assumption of a lack of data races.

Fixed, "as inline and library code is compiled as sequential without any
explicit concurrent directive."


on the distinction between "mutual exclusion" and "synchronization": the
added citation does help, in that it makes a coherent case for the
definition the authors prefer. However, the text could usefully clarify
that this is a matter of definition not of fact, given especially that in
my assessment the authors' preferred definition is not the most common
one. (Although the mention of Hoare's apparent use of this definition is
one data point, countervailing ones are found in many contemporaneous or
later papers, e.g. Habermann's 1972 "Synchronization of Communicating
Processes" (CACM 15(3)), Reed & Kanodia's 1979 "Synchronization with
eventcounts and sequencers" (CACM 22(2)) and so on.)

Fixed, "We contend these two properties are independent, ...".

With respect to the two papers, Habermann fundamentally agrees with our
definitions, where the term mutual exclusion is the same but Habermann uses the
term "communication" for our synchronization. However, the term "communication"
is rarely used to mean synchronization. The fact that Habermann collectively
calls these two mechanisms synchronization is the source of the confusion.
Reed & Kanodia state:

By mutual exclusion we mean any mechanism that forces the time ordering of
execution of pieces of code, called critical regions, in a system of
concurrent processes to be a total ordering.

But there is no timing order for a critical region (which I assume means
critical section); threads can arrive at any time in any order, where the
mutual exclusion for the critical section ensures one thread executes at a
time. Interestingly, Reed & Kanodia's critical region is Habermann's
communication, not critical section. These papers only buttress our contention
about the confusion of these terms in the literature.


section 2 (an expanded version of what was previously section 5.9) lacks
examples and is generally obscure and allusory ("the most advanced feature"
-- name it! "in triplets" -- there is only one triplet!;

Fixed.

These independent properties can be used to compose different language
features, forming a compositional hierarchy, where the combination of all
three is the most advanced feature, called a thread/task/process. While it is
possible for a language to only provide threads for composing
programs~\cite{Hermes90}, this unnecessarily complicates and makes inefficient
the solutions to certain classes of problems.


what are "execution locations"? "initialize" and "de-initialize"

Fixed through an example at the end of the sentence.

\item[\newterm{execution state}:] is the state information needed by a
control-flow feature to initialize and manage both compute data and execution
location(s), and de-initialize. For example, calling a function initializes a
stack frame including contained objects with constructors, manages local data
in blocks and return locations during calls, and de-initializes the frame by
running any object destructors and management operations.


what? "borrowed from the invoker" is a concept in need of explaining or at
least a fully explained example -- in what sense does a plain function
"borrow" its stack frame?

A function has no storage except for static storage; it gets its storage from
the program stack. For a function to run it must borrow storage from somewhere.
When the function returns, the borrowed storage is returned. Do you have a more
appropriate word?


"computation only" as opposed to what?

This is a term in concurrency for operations that compute without blocking,
i.e., the operation starts with everything it needs to compute its result and
runs to completion, blocking only when it is done and returns its result.
Computation only occurs in "embarrassingly parallel" problems.
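
For instance, the matrix-summation test mentioned below is computation only; a
rough sketch, with the field, constructor, and loop details approximate:

#include <thread.hfa>

thread Adder {
    int * row;  int cols;  long long int * subtotal;
};
void ?{}( Adder & adder, int row[], int cols, long long int * subtotal ) {
    adder.[ row, cols, subtotal ] = [ row, cols, subtotal ];
}
void main( Adder & adder ) with( adder ) {
    *subtotal = 0;
    for ( c; cols ) *subtotal += row[c];  // starts with all inputs, never blocks
}   // thread ends; joined implicitly when the Adder is destroyed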


in 2.2, in what way is a "request" fundamental to "synchronization"?

I assume you are referring to the last bullet.

Synchronization must be able to control the service order of requests
including prioritizing selection from different kinds of outstanding requests,
and postponing a request for an unspecified time while continuing to accept
new requests.

Habermann states on page 173

Looking at deposit we see that sender and receiver should be synchronized with
respect to buffer overflow: deposit of another message must be delayed if
there is no empty message frame, and such delay should be removed when the
first empty frame becomes available.  This is programmed as
synchronization(frame) : deposit is preceded by wait(frame), accept is
followed by signal(frame), and the constant C[frame] is set equal to
"bufsize."

Here synchronization is controlling the service order of requests: when the
buffer is full, requests for insert (sender) are postponed until a request to
remove (receiver) occurs.  Without the ability to control service order among
requests, the producer/consumer problem with a bounded buffer cannot be
solved. Hence, this capability is fundamental.
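
This service-order control is what the paper's bounded buffer (Figure 13)
expresses; a condensed sketch, assuming the waitfor syntax discussed later,
with initialization and details elided:

monitor Buffer { int elems[10], front, back, count; };
void insert( Buffer & mutex buf, int elem );
int  remove( Buffer & mutex buf );

void insert( Buffer & mutex buf, int elem ) with( buf ) {
    if ( count == 10 ) waitfor( remove : buf );  // full: postpone insert until a remove occurs
    elems[back] = elem;  back = (back + 1) % 10;  count += 1;
}
int remove( Buffer & mutex buf ) with( buf ) {
    if ( count == 0 ) waitfor( insert : buf );   // empty: postpone remove until an insert occurs
    int elem = elems[front];  front = (front + 1) % 10;  count -= 1;
    return elem;
}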


and the "implicitly" versus "explicitly" point needs stating as elsewhere,
with a concrete example (e.g. Java built-in mutexes versus
java.util.concurrent).

Fixed.

MES must be available implicitly in language constructs, \eg Java built-in
monitors, as well as explicitly for specialized requirements, \eg
@java.util.concurrent@, because requiring programmers to build MES using
low-level locks often leads to incorrect programs.


section 6: 6.2 omits the most important facts in preference for otherwise
inscrutable detail: "identify the kind of parameter" (first say *that there
are* kinds of parameter, and what "kinds" means!); "mutex parameters are
documentation" is misleading (they are also semantically significant!) and
fails to say *what* they mean; the most important thing is surely that
'mutex' is a language feature for performing lock/unlock operations at
function entry/exit. So say it!

These sections have been rewritten to address the comments.


The meanings of examples f3 and f4 remain unclear.

Rewrote the paragraph.


Meanwhile in 6.3, "urgent" is not introduced (we are supposed to infer its
meaning from Figure 12,

Defined Hoare's urgent list at the start of the paragraph.


but that Figure is incomprehensible to me), and we are told of "external
scheduling"'s long history in Ada but not clearly what it actually means;

We do not know how to address your comment because the description is clear to
us and other non-reviewers who have read the paper. I had forgotten an
important citation to prior work on this topic, which is now referenced:

Buhr Peter A., Fortier Michel, Coffin Michael H.. Monitor
Classification. ACM Computing Surveys. 1995; 27(1):63-107.

This citation is a 45 page paper expanding on the topic of internal scheduling.
I also added a citation to Hoare's monitor paper where signal_block is defined:

When a process signals a condition on which another process is waiting, the
signalling process must wait until the resumed process permits it to
proceed.

External scheduling from Ada was subsequently added to uC++ and Cforall.
Furthermore, Figure 20 shows a direct comparison of CFA procedure-call
external-scheduling with Go channel external-scheduling. So external scheduling
exists in languages beyond Ada.


6.4's description of "waitfor" tells us it is different from an if-else
chain but tries to use two *different* inputs to tell us that the behavior
is different; tell us an instance where *the same* values of C1 and C2 give
different behavior (I even wrote out a truth table and still don't see the
semantic difference)

Again, it is unclear what the problem is. For the if-statement, if C1 is true,
it only waits for a call to mem1, even if C2 is true and there is an
outstanding call to mem2. For the waitfor, if both C1 and C2 are true and there
is a call to mem2 but not mem1, it accepts the call to mem2, which cannot
happen with the if-statement. So for all true when clauses, any outstanding
call is immediately accepted. If there are no outstanding calls, the waitfor
blocks until the next call to any of the true when clauses occurs. I added the
following sentence to further clarify.

Hence, the @waitfor@ has parallel semantics, accepting any true @when@ clause.

Note, the same parallel semantics exists for the Go select statement with
respect to waiting for a set of channels to receive data. While Go select does
not have a when clause, it would be trivial to add it, making the select more
expressive.
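
To spell out the difference in code, a sketch assuming the extended
waitfor/when syntax of Section 6.4, for a monitor m with mutex functions mem1
and mem2:

// if-statement: sequential test; with C1 and C2 both true, only mem1 can be
// accepted, even if a caller to mem2 is already waiting
if ( C1 ) waitfor( mem1 : m );
else if ( C2 ) waitfor( mem2 : m );

// waitfor: parallel semantics; with C1 and C2 both true, an outstanding call
// to either mem1 or mem2 is accepted immediately
when ( C1 ) waitfor( mem1 : m );
or when ( C2 ) waitfor( mem2 : m );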


The authors frequently use bracketed phrases, and sometimes slashes "/", in
ways that are confusing and/or detrimental to readability.  Page 13 line
2's "forward (backward)" is one particularly egregious example.  In general
I would recommend the authors try to limit their use of parentheses and
slashes as a means of forcing a clearer wording to emerge.

Many of the slashes and parentheticals have been removed. Some are retained to
express joined concepts: call/return, suspend/resume, resume/resume, I/O.


Also, the use of "eg." is often cursory and does not explain the examples
given, which are frequently a one- or two-word phrase of unclear referent.

A few of these are fixed.


Considering the revision more broadly, none of the more extensive or
creative rewrites I suggested in my previous review have been attempted,
nor any equivalent efforts to improve its readability.

If you reread the previous response, we addressed all of your suggestions except:

An expositional idea occurs: start the paper with a strawman
naive/limited realisation of coroutines -- say, Simon Tatham's popular
"Coroutines in C" web page -- and identify point by point what the
limitations are and how C\/ overcomes them. Currently the presentation
is often flat (lacking motivating contrasts) and backwards (stating
solutions before problems). The foregoing approach might fix both of
these.

We prefer the current structure of our paper and believe the paper does
explain basic coding limitations and how they are overcome by using
high-level control-flow mechanisms.

and we have addressed readability issues in this version.


The hoisting of the former section 5.9 is a good idea, but the newly added
material accompanying it (around Table 1) suffers fresh deficiencies in
clarity. Overall the paper is longer than before, even though (as my
previous review stated), I believe a shorter paper is required in order to
serve the likely purpose of publication. (Indeed, the authors' letter
implies that a key goal of publication is to build community and gain
external users.)

This comment is the referee's opinion, which we do not agree with. Our
previous 35 page SP&E paper on Cforall:

Moss Aaron, Schluntz Robert, Buhr Peter A.. Cforall: Adding Modern
Programming Language Features to C. Softw. Pract. Exper. 2018;
48(12):2111-2146.

has a similar structure and style to this paper, and it received an award from
John Wiley & Sons:

Software: Practice & Experience for articles published between January 2017
and December 2018, "most downloads in the 12 months following online
publication showing the work generated immediate impact and visibility,
contributing significantly to advancement in the field".

So we have demonstrated an ability to build community and gain external users.


Given this trajectory, I no longer see a path to an acceptable revision of
the present submission. Instead I suggest the authors consider splitting
the paper in two: one half about coroutines and stack management, the other
about mutexes, monitors and the runtime. (A briefer presentation of the
runtime may be helpful in the first paper also, and a brief recap of the
generator and coroutine support is obviously needed in the second too.)

Again we disagree with the referee's suggestion to vastly restructure the
paper. What advantage is there in presenting exactly the same material across
two papers, which will likely end up longer than a single paper? The current
paper has a clear theme that fundamental execution properties generate a set of
basic language mechanisms, and we then proceed to show how these mechanisms can
be designed into the programming language Cforall.


I do not buy the authors' defense of the limited practical experience or
"non-micro" benchmarking presented. Yes, gaining external users is hard and
I am sympathetic on that point. But building something at least *somewhat*
substantial with your own system should be within reach, and without it the
"practice and experience" aspects of the work have not been explored.
Clearly C\/ is the product of a lot of work over an extended period, so it
is a surprise that no such experience is readily available for inclusion.

Understood. There are no agreed-upon concurrency benchmarks, which is why
micro-benchmarks are often used. Currently, the entire Cforall runtime is
written in Cforall (10,000+ lines of code (LOC)). This runtime is designed to
be thread safe, automatically detects the use of concurrency features at link
time, and bootstraps into a threaded runtime almost immediately at program
startup so threads can be declared as global variables and may run to
completion before the program main starts. The concurrent core of the runtime
is 3,500 LOC and bootstraps from low-level atomic primitives into Cforall locks
and high-level features. In other words, the concurrent core uses itself as
quickly as possible to start using high-level concurrency. There are 12,000+
LOC in the Cforall test-suite used to verify language features, which are run
nightly. Of these, there are 2000+ LOC running standard concurrent tests, such
as aggressively testing each language feature, and classical examples such as
bounded buffer, dating service, matrix summation, quickSort, binary insertion
sort, etc.  More experience will be available soon, based on ongoing work in
the "future works" section. Specifically, non-blocking I/O is working with the
new Linux io_uring interface and a new high-performance ready-queue is under
construction to take into account this change. With non-blocking I/O, it will
be possible to write applications like high-performance web servers, as is now
done in Rust and Go. Also completed are Java-style executors for work-based
concurrent programming and futures. Under construction is a high-performance
actor system.


It does not seem right to state that a stack is essential to Von Neumann
architectures -- since the earliest Von Neumann machines (and indeed early
Fortran) did not use one.

Reference Manual Fortran II for the IBM 704 Data Processing System, 1958 IBM, page 2
https://archive.computerhistory.org/resources/text/Fortran/102653989.05.01.acc.pdf

Since a subprogram may call for other subprograms to any desired depth, a
particular CALL statement may be defined by a pyramid of multi-level
subprograms.

I think we may be differing on the meaning of stack. You may be imagining a
modern stack that grows and shrinks dynamically, whereas early Fortran
preallocated a stack frame for each function, like Python allocates a frame for
a generator.  Within each preallocated Fortran function is a frame for local
variables and a pointer to store the return value for a call.  The Fortran
call/return mechanism uses these frames to build a traditional call stack
linked by the return pointer. The only restriction is that a function stack
frame can only be used once, implying no direct or indirect recursion.  Hence,
without a stack mechanism, there can be no call/return to "any desired depth",
where the maximum desired depth is limited by the number of functions. So
call/return requires some form of a stack. Virtually all programming languages
have call/return, past and present, and these languages run on Von Neumann
machines that do not distinguish between program and memory space, have mutable
state, and the concept of a pointer to data or code.


To elaborate on something another reviewer commented on: it is a surprise
to find a "Future work" section *after* the "Conclusion" section. A
"Conclusions and future work" section often works well.

Done.



Reviewing: 3

but it remains really difficult to have a good sense of which idea I should
use and when. This applies in different ways to different features from the
language:

* coroutines/generators/threads: here there is some discussion, but it can
be improved.
* internal/external scheduling: I didn't find any direct comparison between
these features, except by way of example.

See changes below.


I would have preferred something more like a table or a few paragraphs
highlighting the key reasons one would pick one construct or the other.

Section 2.1 is changed to address this suggestion.


The discussion of clusters and pre-emption in particular feels quite rushed.

We believe a brief introduction to the Cforall runtime structure is important
because clustering within a user-level versus distributed system is unusual.
Furthermore, the explanation of preemption is important because several new
languages, like Go and Rust tokio, are not preemptive. Rust threads are
preemptive only because Rust uses kernel threads, which UNIX preempts.


* Recommend to shorten the comparison on coroutine/generator/threads in
Section 2 to a paragraph with a few examples, or possibly a table
explaining the trade-offs between the constructs

Not done, see below.


* Recommend to clarify the relationship between internal/external
scheduling -- is one more general but more error-prone or low-level?

Done, see below.


There is obviously a lot of overlap between these features, and in
particular between coroutines and generators. As noted in the previous
review, many languages have chosen to offer *only* generators, and to
build coroutines by stacks of generators invoking one another.

Coroutines built from stacks of generators have problems, such as no symmetric
control-flow. Furthermore, stacks of generators have a problem with the
following programming pattern: logger is a library function called from a
function or a coroutine, where the doit called by the coroutine version
suspends. With stacks of generators, there has to be a function and generator
version of logger to support these two scenarios. If logger is a library
function, it may be impossible to create the generator logger because the
logger function is opaque.

#include <fstream.hfa>
#include <coroutine.hfa>

forall( otype T | { void doit( T ); } )
void logger( T & t ) {
    doit( t );
}

coroutine C {};
void main( C & c ) with( c ) {
    void doit( C & c ) { suspend; }
    logger( c );
}
void mem( C & c ) {
    resume( c );
}

int main() {
    C c;
    mem( c );
    mem( c );

    struct S {};
    S s;
    void doit( S & s ) {}
    logger( s );
}

Additional text has been added to the start of Section 3.2 addressing this
comment.


In fact, the end of Section 2.1 (on page 5) contains a particular paragraph
that embodies this "top down" approach. It starts, "programmers can now
answer three basic questions", and thus gives some practical advice for
which construct you should use and when. I think giving some examples of
specific applications that this paragraph, combined with some examples of
cases where each construct was needed, would be a better approach.

I don't think this comparison needs to be very long. It seems clear enough
that one would

* prefer generators for simple computations that yield up many values,
* prefer coroutines for more complex processes that have significant
internal structure,

We do not believe the number of values yielded is an important factor in
choosing between a generator or coroutine; either form can receive (input) or
return (output) millions of values. As well, simple versus complex computations
is also not an absolute criterion for selection, as both forms can be simple or
complex.

As stated in the paper, a coroutine brings generality to the implementation
because of the additional stack, whereas generators have restrictions on
standard software-engineering practices: variable placement, no helper
functions without creating an explicit generator stack, and no symmetric
control-flow.  It is true that simple generators probably suffer less from
these restrictions. Also, simplicity means the amount of work done is small, so
the cost of the suspend/resume becomes a factor, where experiments show
suspend/resume for a generator is 10 times faster than for a coroutine.


* prefer threads for cases where parallel execution is desired or needed.

Agreed, but this description does not mention mutual exclusion and
synchronization, which are essential in any meaningful concurrent program.

Our point here is to illustrate that a casual "top down" explanation is
insufficient to explain the complexity of the underlying execution properties.
We presented some rules of thumb at the end of Section 2, but programmers must
understand all the underlying mechanisms and their interactions to exploit the
execution properties to their fullest, and to understand when a programming
language does or does not provide a desired mechanism.


I did appreciate the comparison in Section 2.3 between async-await in
JS/Java and generators/coroutines. I agree with its premise that those
mechanisms are a poor replacement for generators (and, indeed, JS has a
distinct generator mechanism, for example, in part for this reason).  I
believe I may have asked for this in a previous review, but having read it,
I wonder if it is really necessary, since those mechanisms are so different
in purpose.

Given that you asked about this before, I believe other readers might also ask
the same question because async-await is very popular. So I think this section
does help to position the work in the paper among other work, and hence, it is
appropriate to keep it in the paper.


I find the motivation for supporting both internal and external scheduling
to be fairly implicit. After several reads through the section, I came to
the conclusion that internal scheduling is more expressive than external
scheduling, but sometimes less convenient or clear. Is this correct? If
not, it'd be useful to clarify where external scheduling is more
expressive.

I would find it very interesting to try and capture some of the properties
that make internal vs external scheduling the better choice.

For example, it seems to me that external scheduling works well if there
are only a few "key" operations, but that internal scheduling might be
better otherwise, simply because it would be useful to have the ability to
name a signal that can be referenced by many methods.

To address this point, the last paragraph on page 22 (now page 23) has been
augmented to the following:

Given external and internal scheduling, what guidelines can a programmer use
to select between them?  In general, external scheduling is easier to
understand and code because only the next logical action (mutex function(s))
is stated, and the monitor implicitly handles all the details.  Therefore,
there are no condition variables, and hence, no wait and signal, which reduces
coding complexity and synchronization errors.  If external scheduling is
simpler than internal, why not use it all the time?  Unfortunately, external
scheduling cannot be used if: scheduling depends on parameter value(s), or
scheduling must block across an unknown series of calls on a condition
variable (\ie internal scheduling).  For example, the dating service cannot be
written using external scheduling.  First, scheduling requires knowledge of
calling parameters to make matching decisions, and parameters of calling
threads are unavailable within the monitor.  Specifically, a thread within the
monitor cannot examine the @ccode@ of threads waiting on the calling queue to
determine if there is a matching partner.  (Similarly, if the bounded buffer
or readers/writer are restructured with a single interface function with a
parameter denoting producer/consumer or reader/writer, they cannot be solved
with external scheduling.)  Second, a scheduling decision may be delayed
across an unknown number of calls when there is no immediate match, so the
thread in the monitor must block on a condition.  Specifically, if a thread
determines there is no opposite calling thread with the same @ccode@, it must
wait an unknown period until a matching thread arrives.  For complex
synchronization, both external and internal scheduling can be used to take
advantage of the best properties of each.
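
To make the dating-service case concrete, a rough sketch of its internal
scheduling (condition, constant, and helper names approximate, using
signal_block, which is discussed below):

monitor DatingService {
    condition girls[CCODES], boys[CCODES];   // one condition per compatibility code
    int girlPhoneNo, boyPhoneNo;
};
int girl( DatingService & mutex ds, int phoneNo, int ccode ) with( ds ) {
    if ( empty( boys[ccode] ) ) {      // no matching boy: block for an unknown period
        wait( girls[ccode] );
        girlPhoneNo = phoneNo;
    } else {
        girlPhoneNo = phoneNo;
        signal_block( boys[ccode] );   // restart matching boy, wait until he proceeds
    }
    return boyPhoneNo;                 // the boy function is symmetric
}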


Consider the bounded buffer from Figure 13: if it had multiple methods for
removing elements, and not just `remove`, then the `waitfor(remove)` call
in `insert` might not be sufficient.

Section 6.4 Extended waitfor shows the waitfor is very powerful and can handle
your request:

waitfor( remove : buffer ); or waitfor( remove2 : buffer );

and its shorthand form (not shown in the paper)

waitfor( remove, remove2 : buffer );

A call to either of these remove functions satisfies the waitfor (exact
selection details are discussed in Section 6.4).


The same is true, I think, of the `signal_block` function, which I
have not encountered before;

In Tony Hoare's seminal paper on monitors, "Monitors: An Operating System
Structuring Concept", it states on page 551:

When a process signals a condition on which another process is waiting, the
signalling process must wait until the resumed process permits it to
proceed. We therefore introduce for each monitor a second semaphore "urgent"
(initialized to 0), on which signalling processes suspend themselves by the
operation P(urgent).

Hence, the original definition of signal is in fact signal_block, i.e., the
signaller blocks. Later implementations of monitors switched to signaller
nonblocking because most signals occur before returns, which allows the
signaller to continue execution, exit the monitor, and run concurrently with
the signalled thread that restarts in the monitor. When the signaller is not
going to exit immediately, signal_block is appropriate.
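
A minimal sketch of the two signalling forms (illustrative names):

monitor M { condition c; };
void waiter( M & mutex m ) with( m ) {
    wait( c );            // block until signalled
}
void signal_nonblocking( M & mutex m ) with( m ) {
    signal( c );          // signaller continues; waiter restarts after signaller exits
    // ... more monitor work before exit ...
}
void signal_blocking( M & mutex m ) with( m ) {
    signal_block( c );    // waiter runs first; signaller blocks until waiter permits it
    // ... resumes here after the waiter proceeds ...
}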


it seems like its behavior can be modeled with multiple condition
variables, but that's clearly more complex.

Yes. Buhr, Fortier and Coffin show in Monitor Classification, ACM Computing
Surveys, 27(1):63-107, that all extant monitors with different signalling
semantics can be transformed into each other. However, some transformations are
complex and runtime expensive.


One question I had about `signal_block`: what happens if one signals
but no other thread is waiting? Does it block until some other thread
waits? Or is that user error?

On page 20, it states:

Signalling is unconditional because signalling an empty condition queue does
nothing.

To the best of our knowledge, all monitors have the same semantics for
signalling an empty condition queue, regardless of the kind of signal, i.e.,
signal or signal_block.


I believe that one difference between the Go program and the Cforall
equivalent is that the Goroutine has an associated queue, so that
multiple messages could be enqueued, whereas the Cforall equivalent is
effectively a "bounded buffer" of length 1. Is that correct?

Actually, the buffer length is 0 for the Cforall call and the Go unbuffered
send, so both are synchronous communication.


I think this should be stated explicitly. (Presumably, one could modify the
Cforall program to include an explicit vector of queued messages if
desired, but you would also be reimplementing the channel abstraction.)

Fixed, by adding the following sentences:

The difference between call and channel send occurs for buffered channels,
making the send asynchronous.  In \CFA, asynchronous call and multiple
buffers are provided using an administrator and worker
threads~\cite{Gentleman81} and/or futures (not discussed).


Also, in Figure 20, I believe that there is a missing `mutex` keyword.

Fixed.


I was glad to see that the paper acknowledged that Cforall still had
low-level atomic operations, even if their use is discouraged in favor of
higher-level alternatives.

There was never an attempt to avoid acknowledging that Cforall has low-level
atomic operations. The original version of the paper stated:

6.6 Low-level Locks
For completeness and efficiency, Cforall provides a standard set of low-level
locks: recursive mutex, condition, semaphore, barrier, etc., and atomic
instructions: fetchAssign, fetchAdd, testSet, compareSet, etc.

and that section is still in the paper. In fact, we use these low-level
mechanisms to build all of the high-level concurrency constructs in Cforall.
|  | 842 |  | 
|---|


However, I still feel that the conclusion overstates the value of the
contribution here when it says that "Cforall high-level race-free monitors
and threads provide the core mechanisms for mutual exclusion and
synchronization, without the need for volatile and atomics". I feel
confident that Java programmers, for example, would be advised to stick
with synchronized methods whenever possible, and it seems to me that they
offer similar advantages -- but they sometimes wind up using volatiles for
performance reasons.

I think we are agreeing violently. 99.9% of Java/Cforall/Go/Rust concurrent
programs can achieve very good performance without volatile or atomics because
the runtime system has already used these low-level capabilities to build an
efficient set of high-level concurrency constructs.

0.1% of the time programmers need to build their own locks and synchronization
mechanisms. This need also occurs for storage management. Both of these
mechanisms are allowed in Cforall but are fraught with danger and should be
discouraged. Specifically, it takes a 7th dan Black Belt programmer to
understand fencing for a WSO memory model, such as on the ARM. It doesn't help
that the C++ atomics are baroque and incomprehensible. I'm sure Hans Boehm,
Doug Lea, Dave Dice, and I would agree that 99% of hand-crafted locks created
by programmers are broken and/or non-portable.


I was also confused by the term "race-free" in that sentence. In
particular, I don't think that Cforall has any mechanisms for preventing
*data races*, and it clearly doesn't prevent "race conditions" (which would
bar all sorts of useful programs). I suppose that "race free" here might be
referring to the improvements such as removing barging behavior.

We use the term "race free" to mean the same as Boehm/Adve's "data-race
freedom" in:

Hans-J. Boehm and Sarita V. Adve. You Don't Know Jack About Shared Variables
or Memory Models. Communications of the ACM, 55(2):48-54, 2012.
https://queue.acm.org/detail.cfm?id=2088916

which is cited in the paper. Furthermore, we never said that Cforall has
mechanisms for preventing *all* data races. We said Cforall high-level
race-free monitors and threads[, when used with mutex access functions] (added
to the paper) imply no data races within these constructs, unless a
programmer directly publishes shared state. This approach is exactly what
Boehm/Adve advocate for the vast majority of concurrent programming.
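
Concretely, the guarantee and its limit can be seen in a small sketch (our own
example): fields accessed only through mutex functions are race free, while
returning a pointer to a field publishes the shared state and voids the
guarantee.

monitor Counter { int count; };
void inc( Counter & mutex c ) { c.count += 1; }      // serialized by the monitor: no data race
int  get( Counter & mutex c ) { return c.count; }    // serialized by the monitor: no data race
int * leak( Counter & mutex c ) { return &c.count; } // publishes shared state: callers may now race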


It would perhaps be more interesting to see a comparison built using [tokio] or
[async-std], two of the more prominent user-space threading libraries that
build on Rust's async-await feature (which operates quite differently than
Javascript's async-await, in that it doesn't cause every async function call to
schedule a distinct task).

Done.


Several figures used the `with` keyword. I deduced that `with(foo)` permits
one to write `bar` instead of `foo.bar`. It seems worth introducing.
Apologies if this is stated in the paper; if so, I missed it.

Page 6, footnote F states:

The Cforall "with" clause opens an aggregate scope making its fields directly
accessible, like Pascal "with", but using parallel semantics; multiple
aggregates may be opened.
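
For example (our own illustration of the footnote's semantics; the types and
field names are arbitrary):

struct S { int i, j; };
struct T { int j, k; };
void f( S & s, T & t ) with( s, t ) {  // open both aggregates in parallel
    i = 1;      // s.i, unqualified
    k = 2;      // t.k, unqualified
    s.j = t.j;  // j occurs in both aggregates, so it must remain qualified
}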


On page 20, section 6.3, "external scheduling and vice versus" should be
"external scheduling and vice versa".

Fixed.


On page 5, section 2.3, the paper states "we content" but it should be "we
contend".

Fixed.


Page 1. I don't believe that it's fair to imply that Scala is a "research
vehicle", as it is used by major players, Twitter being the most prominent
example.

Fixed. Removed "research".


Page 15. Must Cforall threads start after construction (e.g., see your
example on page 15, line 21)?

Yes. Our experience in Java, uC++, and Cforall is that 95% of the time
programmers want threads to start immediately. (Most Java programs have no code
between a thread declaration and the call to start the thread.)  Therefore,
this semantic should be the default because (see page 13):

Alternatives, such as explicitly starting threads as in Java, are repetitive,
and forgetting to call start is a common source of errors.

To handle the other 5% of the cases, there is a trivial Cforall pattern
providing Java-style start/join. The additional cost for this pattern is small
in comparison to the work performed between the start and join.

thread T {};
void start( T & mutex ) {}  // any function name
void join( T & mutex ) {}   // any function name
void main( T & t ) {
    sout | "start";
    waitfor( start : t );   // wait to be started
    sout | "restart";       // perform work
    waitfor( join : t );    // wait to be joined
    sout | "join";
}

int main() {
    T t[3];                 // threads start and delay
    sout | "need to start";
    for ( i; 3 ) start( t[i] );
    sout | "need to join";
    for ( i; 3 ) join( t[i] );
    sout | "threads stopped";
} // threads deleted from stack

$ a.out
need to start
start
start
start
need to join
restart
restart
restart
threads stopped
join
join
join


Page 18, line 17: is using

Fixed.
|---|