Index: doc/papers/concurrency/Paper.tex
===================================================================
--- doc/papers/concurrency/Paper.tex	(revision 686f731009686a6516b3765a330ea882719321f4)
+++ doc/papers/concurrency/Paper.tex	(revision eb28d7ea94b43621e0966e9820a806a677e2e038)
@@ -241,14 +241,15 @@
 \corres{*Peter A. Buhr, Cheriton School of Computer Science, University of Waterloo, 200 University Avenue West, Waterloo, ON, N2L 3G1, Canada. \email{pabuhr{\char`\@}uwaterloo.ca}}
 
-\fundingInfo{Natural Sciences and Engineering Research Council of Canada}
+% \fundingInfo{Natural Sciences and Engineering Research Council of Canada}
 
 \abstract[Summary]{
-\CFA is a modern, polymorphic, non-object-oriented, backwards-compatible extension of the C programming language.
-This paper discusses some advanced control-flow and concurrency/parallelism features in \CFA, along with the supporting runtime.
-These features are created from scratch because they do not exist in ISO C, or are low-level and/or unimplemented, so C programmers continue to rely on library features, like C pthreads.
-\CFA introduces language-level control-flow mechanisms, like coroutines, user-level threading, and monitors for mutual exclusion and synchronization.
-A unique contribution of this work is allowing multiple monitors to be safely acquired \emph{simultaneously} (deadlock free), while integrating this capability with monitor synchronization mechanisms.
+\CFA is a polymorphic, non-object-oriented, concurrent, backwards-compatible extension of the C programming language.
+This paper discusses the design philosophy and implementation of its advanced control-flow and concurrency/parallelism features, along with the supporting runtime.
+These features are created from scratch as ISO C has only low-level and/or unimplemented concurrency, so C programmers continue to rely on library features, like C pthreads.
+\CFA introduces modern language-level control-flow mechanisms, like coroutines, user-level threading, and monitors for mutual exclusion and synchronization.
+The design contributions provide significant programmer simplification and safety by eliminating spurious wakeup and barging in monitors.
+As well, multiple monitors can be safely acquired \emph{simultaneously} (deadlock free), which is fully integrated with the monitor synchronization mechanisms.
 These features also integrate with the \CFA polymorphic type-system and exception handling, while respecting the expectations and style of C programmers.
-Experimental results show comparable performance of the new features with similar mechanisms in other concurrent programming-languages.
+Experimental results show comparable performance of the new features with similar (weaker) mechanisms in other concurrent programming-languages.
 }%
 
@@ -264,5 +265,5 @@
 \section{Introduction}
 
-This paper discusses the design of language-level control-flow and concurrency/parallelism extensions in \CFA and its runtime.
+This paper discusses the design philosophy and implementation of advanced language-level control-flow and concurrency/parallelism extensions in \CFA and its runtime.
 \CFA is a modern, polymorphic, non-object-oriented\footnote{
 \CFA has features often associated with object-oriented programming languages, such as constructors, destructors, virtuals and simple inheritance.
@@ -275,5 +276,5 @@
 no high-level language concurrency features are defined.
 Interestingly, almost a decade after publication of the \Celeven standard, neither gcc-8, clang-8 nor msvc-19 (most recent versions) support the \Celeven include @threads.h@, indicating little interest in the C11 concurrency approach.
-Finally, while the \Celeven standard does not state a concurrent threading-model, the historical association with pthreads suggests implementations would adopt kernel-level threading (1:1)~\cite{ThreadModel}.
+Finally, while the \Celeven standard does not state a threading model, the historical association with pthreads suggests implementations would adopt kernel-level threading (1:1)~\cite{ThreadModel}.
 
 In contrast, there has been a renewed interest during the past decade in user-level (M:N, green) threading in old and new programming languages.
@@ -281,5 +282,5 @@
 Kernel threading was chosen, largely because of its simplicity and fit with the simpler operating systems and hardware architectures at the time, which gave it a performance advantage~\cite{Drepper03}.
 Libraries like pthreads were developed for C, and the Solaris operating-system switched from user (JDK 1.1~\cite{JDK1.1}) to kernel threads.
-As a result, languages like Java, Scala~\cite{Scala}, Objective-C~\cite{obj-c-book}, \CCeleven~\cite{C11}, and C\#~\cite{Csharp} adopted the 1:1 kernel-threading model, with a variety of presentation mechanisms.
+As a result, languages like Java, Scala~\cite{Scala}, Objective-C~\cite{obj-c-book}, \CCeleven~\cite{C11}, and C\#~\cite{Csharp} adopt the 1:1 kernel-threading model, with a variety of presentation mechanisms.
 From 2000 onwards, languages like Go~\cite{Go}, Erlang~\cite{Erlang}, Haskell~\cite{Haskell}, D~\cite{D}, and \uC~\cite{uC++,uC++book} have championed the M:N user-threading model, and many user-threading libraries have appeared~\cite{Qthreads,MPC,BoostThreads}, including putting green threads back into Java~\cite{Quasar}.
 The main argument for user-level threading is that they are lighter weight than kernel threads (locking and context switching do not cross the kernel boundary), so there is less restriction on programming styles that encourage large numbers of threads performing smaller work-units to facilitate load balancing by the runtime~\cite{Verch12}.
@@ -287,21 +288,22 @@
 Finally, performant user-threading implementations (both time and space) are largely competitive with direct kernel-threading implementations, while achieving the programming advantages of high concurrency levels and safety.
 
-A further effort over the past decade is the development of language memory-models to deal with the conflict between certain language features and compiler/hardware optimizations.
-This issue can be rephrased as: some language features are pervasive (language and runtime) and cannot be safely added via a library to prevent invalidation by sequential optimizations~\cite{Buhr95a,Boehm05}.
-The consequence is that a language must be cognizant of these features and provide sufficient tools to program around any safety issues.
-For example, C created the @volatile@ qualifier to provide correct execution for @setjmp@/@logjmp@ (concurrency came later).
-The common solution is to provide a handful of complex qualifiers and functions (e.g., @volatile@ and atomics) allowing programmers to write consistent/race-free programs, often in the sequentially-consistent memory-model~\cite{Boehm12}.
-
-While having a sufficient memory-model allows sound libraries to be constructed, writing these libraries can quickly become awkward and error prone, and using these low-level libraries has the same issues.
-Essentially, using low-level explicit locks is the concurrent equivalent of assembler programming.
-Just as most assembler programming is replaced with high-level programming, explicit locks can be replaced with high-level concurrency in a programming language.
-Then the goal is for the compiler to check for correct usage and follow any complex coding conventions implicitly.
-The drawback is that language constructs may preclude certain specialized techniques, therefore introducing inefficiency or inhibiting concurrency.
-For most concurrent programs, these drawbacks are insignificant in comparison to the speed of composition, and subsequent reliability and maintainability of the high-level concurrent program.
-(The same is true for high-level programming versus assembler programming.)
-Only very rarely should it be necessary to drop down to races and/or explicit locks to apply a specialized technique to achieve maximum speed or concurrency.
-As stated, this observation applies to non-concurrent forms of complex control-flow, like exception handling and coroutines.
-
-Adapting the programming language to these features also allows matching the control-flow model with the programming-language style, versus adopting one general (sound) library/paradigm.
+A further effort over the past two decades is the development of language memory-models to deal with the conflict between language features and compiler/hardware optimizations, i.e., some language features are unsafe in the presence of aggressive sequential optimizations~\cite{Buhr95a,Boehm05}.
+The consequence is that a language must provide sufficient tools to program around safety issues, as inline and library code is all sequential to the compiler.
+One solution is low-level qualifiers and functions (e.g., @volatile@ and atomics) allowing \emph{programmers} to explicitly write safe (race-free~\cite{Boehm12}) programs.
+A safer solution is high-level language constructs so the \emph{compiler} knows the optimization boundaries, and hence, provides implicit safety.
+This problem is best know with respect to concurrency, but applies to other complex control-flow, like exceptions\footnote{
+\CFA exception handling will be presented in a separate paper.
+The key feature that dovetails with this paper is non-local exceptions allowing exceptions to be raised across stacks, with synchronous exceptions raised among coroutines and asynchronous exceptions raised among threads, similar to that in \uC~\cite[\S~5]{uC++}
+} and coroutines.
+Finally, solutions in the language allows matching constructs with language paradigm, i.e., imperative and functional languages have different presentations of the same concept.
+
+Finally, it is important for a language to provide safety over performance \emph{as the default}, allowing careful reduction of safety for performance when necessary.
+Two concurrency violations of this philosophy are \emph{spurious wakeup} and \emph{barging}, i.e., random wakeup~\cite[\S~8]{Buhr05a} and signalling-as-hints~\cite[\S~8]{Buhr05a}, where one begats the other.
+If you believe spurious wakeup is a foundational concurrency property, than unblocking (signalling) a thread is always a hint.
+If you \emph{do not} believe spurious wakeup is foundational, than signalling-as-hints is a performance decision.
+Most importantly, removing spurious wakeup and signals-as-hints makes concurrent programming significantly safer because it removes local non-determinism.
+Clawing back performance where the local non-determinism is unimportant, should be an option not the default.
+
+\begin{comment}
 For example, it is possible to provide exceptions, coroutines, monitors, and tasks as specialized types in an object-oriented language, integrating these constructs to allow leveraging the type-system (static type-checking) and all other object-oriented capabilities~\cite{uC++}.
 It is also possible to leverage call/return for blocking communication via new control structures, versus switching to alternative communication paradigms, like channels or message passing.
@@ -321,14 +323,22 @@
 Hence, rewriting and retraining costs for these languages, even \CC, are prohibitive for companies with a large C software-base.
 \CFA with its orthogonal feature-set, its high-performance runtime, and direct access to all existing C libraries circumvents these problems.
-
-We present comparative examples so the reader can judge if the \CFA control-flow extensions are equivalent or better than those in or proposed for \Celeven, \CC and other concurrent, imperative programming languages, and perform experiments to show the \CFA runtime is competitive with other similar mechanisms.
-The detailed contributions of this work are:
+\end{comment}
+
+\CFA embraces user-level threading, language extensions for advanced control-flow, and safety as the default.
+We present comparative examples so the reader can judge if the \CFA control-flow extensions are better and safer than those in or proposed for \Celeven, \CC and other concurrent, imperative programming languages, and perform experiments to show the \CFA runtime is competitive with other similar mechanisms.
+The main contributions of this work are:
 \begin{itemize}
 \item
-allowing multiple monitors to be safely acquired \emph{simultaneously} (deadlock free), while seamlessly integrating this capability with all monitor synchronization mechanisms.
+expressive language-level coroutines and user-level threading, which respect the expectations of C programmers.
 \item
-all control-flow features respect the expectations of C programmers, with statically type-safe interfaces that integrate with the \CFA polymorphic type-system and other language features.
+monitor synchronization without barging.
 \item
-experimental results show comparable performance of the new features with similar mechanisms in other concurrent programming-languages.
+safely acquiring multiple monitors \emph{simultaneously} (deadlock free), while seamlessly integrating this capability with all monitor synchronization mechanisms.
+\item
+providing statically type-safe interfaces that integrate with the \CFA polymorphic type-system and other language features.
+\item
+a runtime system with no spurious wakeup.
+\item
+experimental results showing comparable performance of the new features with similar mechanisms in other concurrent programming-languages.
 \end{itemize}